fix: Do not assume missing nullcount stat means zero nullcount#9481
fix: Do not assume missing nullcount stat means zero nullcount#9481scovich wants to merge 4 commits intoapache:mainfrom
Conversation
| .map(|null_count| { | ||
| if null_count < 0 { | ||
| return Err(general_err!( | ||
| "Statistics null count is negative {}", |
There was a problem hiding this comment.
Indentation change only (review with whitespace ignored)
(again below)
Both those PR (by title at least) seem to relate only to the write path, where this one addresses the read path? The read path is where incorrect behavior arises, and fixing that can be independent of whether a given parquet writer produces stats -- if we buy the argument in #9451 that fake stats are worse than no stats. |
|
Ah -- #6257 just has an inaccurate title. The actual change indeed addresses the read path. And the documentation changes seem a lot better than what I had come up with here. |
|
@alamb -- Does this behavior change merit a |
|
Also -- do we know which arrow-rs version started storing nullcount=0 in the stats? |
It was fixed in #6490, which seems to be 53.1.0 https://github.com/apache/arrow-rs/releases/tag/53.1.0 |
|
Updated the doc comment. Please check wording. It speculatively refers to the fix landing in 58.1.0 -- we should double check that before merging. |
Which issue does this PR close?
null count = 0correctly #6256Rationale for this change
A reader might be annoyed (performance wart) if a parquet footer lacks nullcount stats, but inferring nullcount=0 for missing stats makes the stats untrustworthy and can lead to incorrect behavior.
What changes are included in this PR?
If a parquet footer nullcount stat is missing, surface it as None, reserving
Some(0)for known-no-null cases.Are these changes tested?
Fixed one unit test that broke, added a missing unit test that covers the other change site.
Are there any user-facing changes?
The stats API doesn't change signature, but there is a behavior change. The existing doc that called out the incorrect behavior has been removed to reflect that the incorrect behavior no longer occurs.