PR1: add ScanOrder enum to ArrowScan.to_record_batches#1

Draft
sumedhsakdeo wants to merge 1 commit into fix/arrow-scan-batch-size-3036 from fix/arrow-scan-streaming-3036

Conversation

Owner

@sumedhsakdeo sumedhsakdeo commented Feb 14, 2026

Part of apache#3036

Summary

  • Add ScanOrder enum (TASK, ARRIVAL) to control batch ordering
  • ScanOrder.TASK (default): existing behavior unchanged — executor.map + list(), files materialized before yielding
  • ScanOrder.ARRIVAL: yields batches as PyArrow produces them, one file at a time, without materializing entire files into memory
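The difference between the two modes can be sketched with plain Python generators. This is only an illustration under assumptions, not the code in this PR: the `ScanOrder` values, `read_file_batches`, and this `to_record_batches` signature are stand-ins, and batches are represented as strings rather than PyArrow record batches.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum
from typing import Iterator, List


class ScanOrder(Enum):
    # Stand-in for the enum added by this PR; the member values are assumed.
    TASK = "task"
    ARRIVAL = "arrival"


produced: List[str] = []  # every batch the fake reader has generated so far


def read_file_batches(file_name: str, n_batches: int = 3) -> Iterator[str]:
    """Hypothetical per-file reader that lazily yields batch labels."""
    for i in range(n_batches):
        label = f"{file_name}#batch{i}"
        produced.append(label)
        yield label


def to_record_batches(files: List[str], order: ScanOrder = ScanOrder.TASK) -> Iterator[str]:
    if order is ScanOrder.TASK:
        with ThreadPoolExecutor(max_workers=2) as executor:
            # list(...) materializes a whole file before any batch is yielded.
            for file_batches in executor.map(lambda f: list(read_file_batches(f)), files):
                yield from file_batches
    else:
        # ARRIVAL: one file at a time, each batch handed over as it is produced.
        for f in files:
            yield from read_file_batches(f)
```

Pulling a single batch in ARRIVAL mode runs the reader for exactly one batch, while in TASK mode the first batch only becomes available after its whole file has been read.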

Ordering semantics

| Config | File ordering | Within-file ordering |
| --- | --- | --- |
| ScanOrder.TASK (default) | Grouped by file, submission order | Row order |
| ScanOrder.ARRIVAL | Grouped by file, sequential | Row order |
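"Submission order" for TASK follows from how `executor.map` behaves: results come back in the order the tasks were submitted, even when a later task finishes first. A small sketch of that property, with a hypothetical `read_file` helper and made-up delays standing in for real file reads:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def read_file(task: Tuple[str, float]) -> List[str]:
    """Hypothetical reader: sleeps to simulate I/O, then returns rows in row order."""
    name, delay = task
    time.sleep(delay)
    return [f"{name}:row{i}" for i in range(2)]


# file_a is deliberately slower than file_b, so file_b finishes first.
tasks = [("file_a", 0.05), ("file_b", 0.0)]

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map yields results in submission order, not completion order.
    task_order = [row for rows in executor.map(read_file, tasks) for row in rows]

# Reading the files one after another gives the same grouping and order.
sequential = [row for task in tasks for row in read_file(task)]
```

Both lists are grouped by file with rows in row order, which matches the two rows of the table.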

PR Stack

This is PR 1 in the stack (PR 0 through PR 3) for apache#3036:

  1. PR 0: batch_size forwarding
  2. PR 1 (this): ScanOrder enum — stop materializing entire files
  3. PR 2: concurrent_files — bounded concurrent reads in arrival order
  4. PR 3: benchmark

Are these changes tested?

Yes — test_task_order_produces_same_results, test_arrival_order_yields_all_batches, test_arrival_order_with_limit, test_arrival_order_file_ordering_preserved, positional delete tests for both modes

Are there any user-facing changes?

Yes — new order parameter (ScanOrder enum) on to_arrow_batch_reader() and new ScanOrder class exported from pyiceberg.table.
This addresses the OOM issue in apache#3036 for single-file-at-a-time streaming.

@sumedhsakdeo changed the title from "feat: add streaming flag to ArrowScan.to_record_batches" to "PR1: add streaming flag to ArrowScan.to_record_batches" on Feb 14, 2026
@sumedhsakdeo force-pushed the fix/arrow-scan-streaming-3036 branch 3 times, most recently from 55d68b8 to 444549f, on February 15, 2026 00:35
@sumedhsakdeo force-pushed the fix/arrow-scan-batch-size-3036 branch from 8f8a2d2 to 5ab0fd1 on February 15, 2026 02:07
@sumedhsakdeo force-pushed the fix/arrow-scan-streaming-3036 branch 3 times, most recently from 07287b6 to a0a29c8, on February 15, 2026 02:29
Introduce ScanOrder.TASK (default) and ScanOrder.ARRIVAL to control
batch ordering. TASK materializes each file before yielding; ARRIVAL
yields batches as produced for lower memory usage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sumedhsakdeo force-pushed the fix/arrow-scan-streaming-3036 branch from a0a29c8 to 2474b12 on February 17, 2026 05:07
@sumedhsakdeo changed the title from "PR1: add streaming flag to ArrowScan.to_record_batches" to "PR1: add ScanOrder enum to ArrowScan.to_record_batches" on Feb 17, 2026