Fix truncated string statistics handling #79

Merged
WenyXu merged 1 commit into datafusion-contrib:main from Flyangz:bugfix/truncate-string-stats-missing-data on Mar 3, 2026

Flyangz commented Feb 27, 2026

Problem
When string statistics are truncated (long strings), the reader ignored lower_bound/upper_bound and defaulted to empty strings. This caused valid row groups to be incorrectly skipped during predicate pushdown.

Fix

  1. Prioritize Exact Min/Max: Use minimum/maximum first; fall back to lower_bound/upper_bound if they are missing.
  2. Handle Empty Rows: Return None for statistics immediately if number_of_values is 0.

These changes match the Java writer logic in https://github.com/apache/orc/blob/495c5b364aa7763bc36d08c2ca4c41e5db968d0b/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L797-L816
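
As a rough illustration of the two rules above, here is a minimal sketch, not the code from this PR. `RawStringStats`, `StringBounds`, and `decode_string_stats` are hypothetical names standing in for the protobuf message and the decoded statistics. Returning `None` when neither an exact value nor a bound is present is an assumption made here, chosen so the decoder never defaults to an empty string.

```rust
/// Hypothetical stand-in for the string statistics message read from the file.
#[derive(Debug, Default, Clone)]
pub struct RawStringStats {
    pub minimum: Option<String>,
    pub maximum: Option<String>,
    pub lower_bound: Option<String>,
    pub upper_bound: Option<String>,
}

/// Hypothetical decoded form: one value per side plus a flag saying
/// whether it is the exact min/max or only a bound.
#[derive(Debug, Clone, PartialEq)]
pub struct StringBounds {
    pub min: String,
    pub max: String,
    pub is_exact_min: bool,
    pub is_exact_max: bool,
}

pub fn decode_string_stats(number_of_values: u64, raw: &RawStringStats) -> Option<StringBounds> {
    // Rule 2: no values means there is nothing to describe.
    if number_of_values == 0 {
        return None;
    }
    // Rule 1: prefer the exact minimum, otherwise fall back to the lower bound.
    let (min, is_exact_min) = match (&raw.minimum, &raw.lower_bound) {
        (Some(m), _) => (m.clone(), true),
        (None, Some(lb)) => (lb.clone(), false),
        // Assumption: with neither field present, report no statistics
        // rather than inventing an empty string.
        (None, None) => return None,
    };
    // Same rule for the maximum and the upper bound.
    let (max, is_exact_max) = match (&raw.maximum, &raw.upper_bound) {
        (Some(m), _) => (m.clone(), true),
        (None, Some(ub)) => (ub.clone(), false),
        (None, None) => return None,
    };
    Some(StringBounds { min, max, is_exact_min, is_exact_max })
}
```

With this shape, a column whose statistics carry only `lower_bound`/`upper_bound` still yields a usable range, and the flags let later code know the range may be wider than the real data.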

Test
Added unit tests.

WenyXu commented Feb 27, 2026

Hi @Flyangz, the CI is failing.

Flyangz force-pushed the bugfix/truncate-string-stats-missing-data branch 2 times, most recently from 23a6358 to 7ee8a1b, on February 28, 2026 02:34
Flyangz force-pushed the bugfix/truncate-string-stats-missing-data branch from 7ee8a1b to 51b97ea on February 28, 2026 02:55
WenyXu requested a review from Copilot on March 2, 2026 02:47

Copilot AI left a comment

Pull request overview

Fixes predicate pushdown correctness when ORC string statistics are truncated by properly interpreting minimum/maximum vs lower_bound/upper_bound, and introduces explicit “exactness” indicators to avoid incorrect row-group pruning.

Changes:

  • Update string statistics decoding to prefer exact minimum/maximum and otherwise use lower_bound/upper_bound, tracking whether bounds are exact.
  • Extend row-group string predicate evaluation to incorporate exact-vs-bound semantics.
  • Update CLI/stat formatting outputs and golden expected output to include IsExactMin/IsExactMax.
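
To make the exact-vs-bound distinction concrete, here is a small sketch under stated assumptions; it is not the logic in `src/row_group_filter.rs`. `StringBounds` and the three helper functions are hypothetical, nulls are ignored, and strings are compared with Rust's default byte-wise ordering. The idea is that a bound only widens the range, so "might match" checks stay safe, while "every value matches" claims are trusted only when both sides are exact.

```rust
/// Hypothetical decoded string bounds for one row group.
#[derive(Debug, Clone)]
pub struct StringBounds {
    pub min: String,
    pub max: String,
    pub is_exact_min: bool,
    pub is_exact_max: bool,
}

/// True if the row group might contain a row with `col = value`.
/// A lower bound is <= the real minimum and an upper bound is >= the real
/// maximum, so this check is conservative whether or not the bounds are exact.
pub fn may_contain_eq(b: &StringBounds, value: &str) -> bool {
    b.min.as_str() <= value && value <= b.max.as_str()
}

/// True if the row group might contain a row with `col > value`.
pub fn may_contain_gt(b: &StringBounds, value: &str) -> bool {
    b.max.as_str() > value
}

/// True only if every non-null value is provably equal to `value`.
/// This conclusion needs exact min and max; with truncated bounds the
/// real values could differ from the stored strings.
pub fn all_non_null_eq(b: &StringBounds, value: &str) -> bool {
    b.is_exact_min && b.is_exact_max && b.min == value && b.max == value
}
```

For example, with truncated bounds `min = "app"` and `max = "apq"`, `may_contain_eq` keeps the row group for the literal `"apple"`. If both sides had defaulted to empty strings, as described in the problem statement, the same check would return false and the row group would be skipped incorrectly.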

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/bin/expected/stats.out | Updates golden output to include IsExactMin/IsExactMax lines for string stats. |
| src/statistics.rs | Decodes string stats using exact min/max when available, otherwise uses lower/upper bounds and records exactness flags; returns no type stats when number_of_values == 0. |
| src/row_group_filter.rs | Uses the new string bound/exactness fields in predicate evaluation and adds unit tests for string comparison logic. |
| src/bin/orc/stats.rs | Prints exactness flags for string statistics in the orc stats CLI output. |
| src/bin/orc/common.rs | Includes exactness flags in formatted stats output for CLI/common formatting. |

WenyXu commented Mar 3, 2026

@Flyangz Thanks!

WenyXu merged commit e8a9168 into datafusion-contrib:main on Mar 3, 2026
16 checks passed