Use a default chunk size of 4 MB rather than 16 MB in Arrow's rounding policy. #2729

@loserwang1024

Search before asking

  • I searched in the issues and found nothing similar.

Description:

Apache Arrow currently calculates its chunk size on the assumption that Netty's PooledByteBufAllocatorL uses a 16 MB chunk size. If the requested size is below 16 MB, Arrow rounds it up to the next power of two:

  public long getRoundedSize(long requestSize) {
    return requestSize < chunkSize ? CommonUtil.nextPowerOfTwo(requestSize) : requestSize;
  }

However, Netty's default chunk size has been reduced from 16 MB to 4 MB in recent versions (see Netty PR #12108). This mismatch leads to memory inefficiency when request sizes fall between 4 MB and 16 MB.

Problem Details

For example, if a Fluss batch is 4.1 MB, Arrow's getRoundedSize rounds the request up to 8 MB, so almost 50% of the allocation is wasted.
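
To make the arithmetic concrete, here is a minimal sketch of the rounding behaviour. It is not Arrow's actual code: the class name `RoundingSketch`, the `roundedSize` method with an explicit `chunkSize` parameter, and the local `nextPowerOfTwo` helper (standing in for `CommonUtil.nextPowerOfTwo`, positive inputs only) are all hypothetical names used for illustration.

```java
// Illustrative sketch only; names are hypothetical and not part of Arrow.
final class RoundingSketch {

    // Stand-in for CommonUtil.nextPowerOfTwo, valid for positive values.
    static long nextPowerOfTwo(long value) {
        long highest = Long.highestOneBit(value);
        return highest == value ? value : highest << 1;
    }

    // Same shape as getRoundedSize above, with the chunk size passed in
    // so that 16 MiB and 4 MiB can be compared side by side.
    static long roundedSize(long requestSize, long chunkSize) {
        return requestSize < chunkSize ? nextPowerOfTwo(requestSize) : requestSize;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long request = 41L * mib / 10L; // roughly 4.1 MiB

        for (long chunk : new long[] {16 * mib, 4 * mib}) {
            long rounded = roundedSize(request, chunk);
            double wastedPct = 100.0 * (rounded - request) / rounded;
            System.out.printf("chunk=%d MiB rounded=%d bytes wasted=%.1f%%%n",
                    chunk / mib, rounded, wastedPct);
        }
    }
}
```

With a 16 MiB chunk size the request is below the threshold and is rounded up to 8 MiB. With a 4 MiB chunk size the same request is at or above the threshold, so it is returned unchanged and nothing is over-allocated by the rounding step.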

Impact on Flink:

Flink's default off-heap memory is only 128 MB, and with multiple slots it is divided further. This memory is not used by Fluss alone (batch reading, decompression, and Netty network requests); the framework itself and other connectors draw on it as well.

In such constrained environments, Arrow's over-allocation exacerbates memory pressure and reduces throughput.

Willingness to contribute

  • I'm willing to submit a PR!
