[Parquet] Avoid fetching multiple pages when the predicate cache is disabled by nuno-faria · Pull Request #8554 · apache/arrow-rs

nuno-faria · 2025-10-05T12:02:18Z

Which issue does this PR close?

Closes [Parquet] Avoid fetching multiple pages when max_predicate_cache_size is 0 #8542.

Rationale for this change

When max_predicate_cache_size is set to 0, the predicate cache is disabled, so there is no need to select multiple data pages until batch_size is reached.

What changes are included in this PR?

Make ReaderFactory::compute_cache_projection return None if the cache is disabled, so the reader no longer retrieves multiple pages unnecessarily.
Add a unit test to confirm the new behavior.
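
The early return described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `ReaderFactorySketch`, `Projection`, and the field names are hypothetical stand-ins, and the rule that only columns shared by the filter and the output are cached is an assumption made for illustration.

```rust
/// Hypothetical stand-in for a column projection (not the crate's real type).
#[derive(Debug, Clone, PartialEq)]
struct Projection(Vec<usize>);

/// Hypothetical, simplified stand-in for the reader factory.
struct ReaderFactorySketch {
    /// Upper bound for the predicate cache; 0 means the cache is disabled.
    max_predicate_cache_size: usize,
    /// Indices of the columns read by the row filter's predicates.
    filter_columns: Vec<usize>,
}

impl ReaderFactorySketch {
    /// Returns the columns worth caching, or `None` when caching cannot help.
    fn compute_cache_projection(&self, output_columns: &[usize]) -> Option<Projection> {
        // Early exit: a disabled cache stores nothing, so there is no
        // projection to cache and no reason to select extra pages for it.
        if self.max_predicate_cache_size == 0 {
            return None;
        }
        // Assumption for this sketch: only columns used by both the filter
        // and the final output can be reused from the cache.
        let shared: Vec<usize> = self
            .filter_columns
            .iter()
            .copied()
            .filter(|c| output_columns.contains(c))
            .collect();
        if shared.is_empty() {
            None
        } else {
            Some(Projection(shared))
        }
    }
}
```

In this sketch, a factory with `max_predicate_cache_size: 0` returns `None` for any output columns, while a non-zero size returns only the columns shared between the filter and the output.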

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

alamb

Thank you @nuno-faria -- this looks great to me

alamb · 2025-10-06T14:43:45Z

parquet/src/arrow/async_reader/mod.rs

+            .unwrap();
+        let parquet_schema = metadata.file_metadata().schema_descr_ptr();
+
+        // the filter is not clone-able, so we use a lambda to simplify


yeah, this is something that makes the filters very tricky to handle internally. Nothing to change for this PR, I am just observing
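
As a rough illustration of the pattern the code comment refers to, the sketch below rebuilds a non-cloneable filter from a closure each time one is needed. All names here (`RowFilterSketch`, `make_filter`, `count_kept`) are hypothetical and are not the types used in the PR's test.

```rust
/// Hypothetical stand-in for a row filter: it owns boxed predicates,
/// so it cannot derive `Clone`.
struct RowFilterSketch {
    predicates: Vec<Box<dyn FnMut(i64) -> bool>>,
}

impl RowFilterSketch {
    /// Keeps a value only if every predicate accepts it.
    fn keep(&mut self, value: i64) -> bool {
        self.predicates.iter_mut().all(|p| p(value))
    }
}

/// Builds a fresh filter on every call, playing the role of the lambda.
fn make_filter(threshold: i64) -> RowFilterSketch {
    let predicate: Box<dyn FnMut(i64) -> bool> = Box::new(move |v: i64| v > threshold);
    RowFilterSketch { predicates: vec![predicate] }
}

/// Applies the same logical filter twice, constructing a new instance for
/// each run because an existing filter cannot be cloned.
fn count_kept(values: &[i64], build: impl Fn() -> RowFilterSketch) -> Vec<usize> {
    (0..2)
        .map(|_| {
            let mut filter = build();
            values.iter().filter(|&&v| filter.keep(v)).count()
        })
        .collect()
}
```

Calling `count_kept(&[1, 5, 10], || make_filter(4))` builds two independent filters from the same closure, which is what lets a test run one logical filter against several reader configurations.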

alamb · 2025-10-06T14:44:27Z

FYI @XiangpengHao

XiangpengHao

Looks good to me, thank you @nuno-faria

alamb · 2025-10-07T18:25:10Z

Thanks again @nuno-faria and @XiangpengHao

[Parquet] Avoid fetching multiple pages when the predicate cache is d…

21c1614

…isabled

github-actions bot added the parquet Changes to the parquet crate label Oct 5, 2025

nuno-faria mentioned this pull request Oct 5, 2025

[Parquet] Avoid fetching multiple pages when max_predicate_cache_size is 0 #8542

Closed

alamb approved these changes Oct 6, 2025

XiangpengHao approved these changes Oct 6, 2025

alamb merged commit 84a7e35 into apache:main Oct 7, 2025
16 checks passed

nuno-faria deleted the fix_disable_preficate_cache branch October 7, 2025 18:53

alamb mentioned this pull request Oct 30, 2025

Rewrite ParquetRecordBatchStream in terms of the PushDecoder #8159

Merged

3 tasks

alamb mentioned this pull request Nov 18, 2025

Parquet 56: encounter error: item_reader def levels are None when reading nested field with row filter #8657

Closed

alamb merged 1 commit into apache:main from nuno-faria:fix_disable_preficate_cache
