
Refactor the code that infers the schema of a JSON file#9485

Closed
Rafferty97 wants to merge 7 commits into apache:main from Rafferty97:refactor-json-schema-inference

Conversation

Contributor

@Rafferty97 Rafferty97 commented Feb 26, 2026

Which issue does this PR close?

This PR fixes #9484, and also sets the groundwork for implementing #9482.

I have refactored the code that infers the schema of a JSON file. Specifically, this PR makes the following changes:

  • Simplify InferredType and use arena allocation for efficiency
  • Reduce the logic to plain recursive inference and unification
  • Remove scalar-to-array coercion that doesn't exist in the JSON reader itself
  • Move ValueIter into its own submodule and expose record_count via a getter method
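
To make "recursive inference and unification" concrete, here is a minimal sketch of the general technique. It is not the code in this PR: the `Json` and `Inferred` enums, the `infer`/`infer_all` functions, and the Int-to-Float widening rule are all hypothetical stand-ins chosen so that the example needs nothing beyond the standard library. The error text reuses the message format quoted under "Are there any user-facing changes?".

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for an inferred type (not the PR's `InferredType`).
#[derive(Debug, Clone, PartialEq)]
enum Inferred {
    /// Nothing known yet (only `null` seen, or the element type of an empty array).
    Any,
    Bool,
    Int,
    Float,
    Str,
    List(Box<Inferred>),
    Object(BTreeMap<String, Inferred>),
}

/// Hypothetical minimal JSON value, so the sketch does not depend on a parser.
#[derive(Debug, Clone)]
enum Json {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

impl Inferred {
    /// Combine two inferred types into one, or fail if they are incompatible.
    fn unify(self, other: Inferred) -> Result<Inferred, String> {
        use Inferred::*;
        match (self, other) {
            // `Any` carries no information, so the other side wins.
            (Any, t) | (t, Any) => Ok(t),
            // Assumed widening rule for this sketch: Int and Float unify to Float.
            (Int, Float) | (Float, Int) => Ok(Float),
            // Lists unify element-wise; a scalar never unifies with a list.
            (List(a), List(b)) => Ok(List(Box::new((*a).unify(*b)?))),
            // Objects unify field by field; fields seen on one side only are kept.
            (Object(mut a), Object(b)) => {
                for (key, ty) in b {
                    let merged = match a.remove(&key) {
                        Some(prev) => prev.unify(ty)?,
                        None => ty,
                    };
                    a.insert(key, merged);
                }
                Ok(Object(a))
            }
            (a, b) if a == b => Ok(a),
            (a, b) => Err(format!(
                "Incompatible type found during schema inference: {a:?} vs {b:?}"
            )),
        }
    }
}

/// Infer the type of a single value by recursing into arrays and objects.
fn infer(value: &Json) -> Result<Inferred, String> {
    Ok(match value {
        Json::Null => Inferred::Any,
        Json::Bool(_) => Inferred::Bool,
        Json::Int(_) => Inferred::Int,
        Json::Float(_) => Inferred::Float,
        Json::Str(_) => Inferred::Str,
        Json::Array(items) => {
            let mut elem = Inferred::Any;
            for item in items {
                elem = elem.unify(infer(item)?)?;
            }
            Inferred::List(Box::new(elem))
        }
        Json::Object(fields) => {
            let mut out = BTreeMap::new();
            for (key, v) in fields {
                out.insert(key.clone(), infer(v)?);
            }
            Inferred::Object(out)
        }
    })
}

/// Fold the per-record types of a whole file into one type.
fn infer_all(records: &[Json]) -> Result<Inferred, String> {
    records
        .iter()
        .try_fold(Inferred::Any, |acc, r| acc.unify(infer(r)?))
}
```

With these definitions, two records such as `{"x": 1}` and `{"x": [2]}` fail to unify instead of being promoted to a list, which mirrors the removal of scalar-to-array coercion described above.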

Rationale for this change

While working on #9482, I saw both a need and an opportunity to refactor the JSON schema inference code. I also discovered the bug detailed in #9484.

These changes not only make the code more readable and predictable by eliminating a lot of special-case handling, but also make it trivial to create a new inference function for "single field" JSON reading.

What changes are included in this PR?

  • An overhaul of arrow-json/src/reader/schema.rs
  • Removed mixed_arrays.json as it's no longer valid, and replaced mixed_arrays.json.gz with arrays.json.gz
  • Added a dependency on Bumpalo for arena allocation
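
For readers unfamiliar with arena allocation, the sketch below illustrates the general idea using only the standard library. It is deliberately not Bumpalo and not the code in this PR: it is an index-based arena backed by a `Vec`, whereas Bumpalo is a bump allocator that hands out references. The `TypeId`, `Node`, and `Arena` names are hypothetical. What the two approaches share is that nested type nodes live in one contiguous store and are released together, rather than each being boxed individually.

```rust
/// Hypothetical handle into the arena; cheap to copy and compare.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TypeId(usize);

/// Hypothetical type node. Children are referenced by handle, not by `Box`.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Scalar(&'static str),
    List(TypeId),
}

/// A minimal index-based arena: all nodes live in a single `Vec`.
#[derive(Default)]
struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    /// Store a node and return its handle.
    fn alloc(&mut self, node: Node) -> TypeId {
        self.nodes.push(node);
        TypeId(self.nodes.len() - 1)
    }

    /// Look up a node by handle.
    fn get(&self, id: TypeId) -> &Node {
        &self.nodes[id.0]
    }

    /// Render a type as text by following child handles recursively.
    fn describe(&self, id: TypeId) -> String {
        match self.get(id) {
            Node::Scalar(name) => (*name).to_string(),
            Node::List(inner) => format!("list<{}>", self.describe(*inner)),
        }
    }
}
```

Allocating a scalar node, wrapping it in a `Node::List`, and wrapping that again yields three entries in `nodes`, and dropping the `Arena` frees them all at once.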

Are these changes tested?

Yes. All existing unit tests pass, except for one that was intentionally removed because of the behaviour change related to #9484 (removing scalar-to-array promotion). There may be scope for adding more tests to improve coverage, but I haven't looked into that yet.

Are there any user-facing changes?

There are no API changes, except for the addition of the record_count method.

However, the error messages returned by infer_json_schema and its cousins will change significantly. They have all been condensed into a single message: "Incompatible type found during schema inference: {self:?} vs {other:?}".

Finally, some files that used to generate a valid schema will now return errors. This is desirable, because the actual JSON reader would have failed to read those files anyway: it does not support scalar-to-array promotion.

@github-actions github-actions bot added the arrow label (Changes to the arrow crate) Feb 26, 2026
@Rafferty97 Rafferty97 marked this pull request as draft February 27, 2026 00:08
@Rafferty97 Rafferty97 force-pushed the refactor-json-schema-inference branch 2 times, most recently from 4ebebb7 to 5844fb3 Compare February 27, 2026 23:53
@Rafferty97 Rafferty97 force-pushed the refactor-json-schema-inference branch from 743341f to 0993953 Compare February 28, 2026 00:54
@Rafferty97 Rafferty97 force-pushed the refactor-json-schema-inference branch from 0d58267 to f52d8cc Compare February 28, 2026 03:55
@Rafferty97 Rafferty97 closed this Feb 28, 2026
@Rafferty97 Rafferty97 deleted the refactor-json-schema-inference branch February 28, 2026 11:32

Development

Successfully merging this pull request may close these issues.

JSON reader doesn't support scalar-to-list promotion, even though schema inference does

1 participant