Skip to content

fix: Do not assume missing nullcount stat means zero nullcount#9481

Open
scovich wants to merge 4 commits intoapache:mainfrom
scovich:fix-missing-nullcount-stat
Open

fix: Do not assume missing nullcount stat means zero nullcount#9481
scovich wants to merge 4 commits intoapache:mainfrom
scovich:fix-missing-nullcount-stat

Conversation

@scovich
Copy link
Contributor

@scovich scovich commented Feb 25, 2026

Which issue does this PR close?

Rationale for this change

A reader might be annoyed (performance wart) if a parquet footer lacks nullcount stats, but inferring nullcount=0 for missing stats makes the stats untrustworthy and can lead to incorrect behavior.

What changes are included in this PR?

If a parquet footer nullcount stat is missing, surface it as None, reserving Some(0) for known-no-null cases.

Are these changes tested?

Fixed one unit test that broke, added a missing unit test that covers the other change site.

Are there any user-facing changes?

The stats API doesn't change signature, but there is a behavior change. The existing doc that called out the incorrect behavior has been removed to reflect that the incorrect behavior no longer occurs.

@scovich scovich requested review from alamb and etseidl February 25, 2026 22:42
@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 25, 2026
.map(|null_count| {
if null_count < 0 {
return Err(general_err!(
"Statistics null count is negative {}",
Copy link
Contributor Author

@scovich scovich Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation change only (review with whitespace ignored)

(again below)

@etseidl
Copy link
Contributor

etseidl commented Feb 25, 2026

Thanks @scovich, this has long been an issue. There is a fix for this already (#6257. The write side was fixed in #6490). I've given up on convincing @alamb, but perhaps the arguments in #9451 will provide the final push 😉

@scovich
Copy link
Contributor Author

scovich commented Feb 26, 2026

This has long been an issue. There is a fix for this already (#6257. The write side was fixed in #6490).

Both those PR (by title at least) seem to relate only to the write path, where this one addresses the read path?

The read path is where incorrect behavior arises, and fixing that can be independent of whether a given parquet writer produces stats -- if we buy the argument in #9451 that fake stats are worse than no stats.

@scovich
Copy link
Contributor Author

scovich commented Feb 26, 2026

Ah -- #6257 just has an inaccurate title. The actual change indeed addresses the read path. And the documentation changes seem a lot better than what I had come up with here.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is time -- let's fix this -- thank you @scovich

Can you please also bring the comments from #6257 into this PR too?

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich.

@scovich
Copy link
Contributor Author

scovich commented Feb 27, 2026

@alamb -- Does this behavior change merit a next-major-release label? Or should we just merge right away so it goes out with the next minor release? (that affects what arrow version the doc comments reference)

@scovich
Copy link
Contributor Author

scovich commented Feb 27, 2026

Also -- do we know which arrow-rs version started storing nullcount=0 in the stats?
(I get the impression that was fixed a while ago)?

@etseidl
Copy link
Contributor

etseidl commented Feb 27, 2026

Also -- do we know which arrow-rs version started storing nullcount=0 in the stats?

(I get the impression that was fixed a while ago)?

It was fixed in #6490, which seems to be 53.1.0 https://github.com/apache/arrow-rs/releases/tag/53.1.0

@scovich
Copy link
Contributor Author

scovich commented Feb 28, 2026

Updated the doc comment. Please check wording.

It speculatively refers to the fix landing in 58.1.0 -- we should double check that before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet Statistics::null_count_opt wrongly returns Some(0) when stats are missing parquet writer does not encode null count = 0 correctly

3 participants