feat: two-stage builder API for async Avro reader by mzabaluev · Pull Request #9462 · apache/arrow-rs

mzabaluev · 2026-02-22T20:07:52Z

Which issue does this PR close?

Closes #9460 (Expose Avro writer schema when building the reader).

What changes are included in this PR?

Expose the read_header method in reader::async_reader::ReaderBuilder, returning another builder typestate that exposes the writer schema as it was read from the file header.
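To make the two-stage idea concrete, here is a minimal, synchronous sketch of a typestate builder. It is not the arrow-avro implementation: `HeaderSource`, `ReaderBuilderWithHeader`, `FakeFile`, the `String` error type, and the fields of `Header` are all hypothetical stand-ins, and the real API is async.

```rust
/// Hypothetical stand-in for a parsed Avro file header.
#[derive(Debug, Clone, PartialEq)]
struct Header {
    /// JSON text of the writer schema found in the file metadata.
    writer_schema: String,
    /// Byte offset just past the header, where the first data block starts.
    len: u64,
}

/// Hypothetical source trait standing in for an async file reader.
trait HeaderSource {
    fn parse_header(&mut self) -> Result<Header, String>;
}

/// Stage one: only options are known, nothing has been read yet.
struct ReaderBuilder<R> {
    reader: R,
    batch_size: usize,
}

/// Stage two: the header has been read, so the writer schema is available.
struct ReaderBuilderWithHeader<R> {
    reader: R,
    batch_size: usize,
    header: Header,
}

#[allow(dead_code)]
struct Reader<R> {
    reader: R,
    batch_size: usize,
    data_start: u64,
}

impl<R: HeaderSource> ReaderBuilder<R> {
    fn new(reader: R) -> Self {
        Self { reader, batch_size: 1024 }
    }

    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }

    /// Consumes the stage-one builder and returns the stage-two builder.
    fn read_header(mut self) -> Result<ReaderBuilderWithHeader<R>, String> {
        let header = self.reader.parse_header()?;
        Ok(ReaderBuilderWithHeader {
            reader: self.reader,
            batch_size: self.batch_size,
            header,
        })
    }

    /// One-step build, kept so existing callers are unaffected.
    fn build(self) -> Result<Reader<R>, String> {
        self.read_header()?.build()
    }
}

impl<R> ReaderBuilderWithHeader<R> {
    /// The writer schema exactly as it was read from the header.
    fn writer_schema(&self) -> &str {
        &self.header.writer_schema
    }

    fn build(self) -> Result<Reader<R>, String> {
        Ok(Reader {
            reader: self.reader,
            batch_size: self.batch_size,
            data_start: self.header.len,
        })
    }
}

/// Fake source that returns a fixed header, for illustration only.
struct FakeFile;

impl HeaderSource for FakeFile {
    fn parse_header(&mut self) -> Result<Header, String> {
        Ok(Header {
            writer_schema: r#"{"type":"string"}"#.to_string(),
            len: 42,
        })
    }
}
```

Because `writer_schema` exists only on the second type, a caller cannot ask for the schema before the header has been read, while the one-step `build` on the first stage keeps working as before.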

Are these changes tested?

Tests and doc tests to be added for the new API, showing possible use.

Are there any user-facing changes?

The new API augments the existing ReaderBuilder in a backward-compatible way.


EmilyMatt · 2026-02-22T21:10:52Z

arrow-avro/src/reader/async_reader/builder.rs


-impl<R: AsyncFileReader> ReaderBuilder<R> {
-    async fn read_header(&mut self) -> Result<(Header, u64), AvroError> {
+impl<R> ReaderBuilder<R>


I wonder, do we want to allow a with_header function as well, one that accepts a user's header directly?

Not the typical use case, but it makes the API more flexible.

What would be the behavior with this method? Skip reading the header from the file, and start decoding from...?

Presumably the range?
The behaviour would be exactly the same, since the header ends with the magic (the sync marker), I believe? We would start the actual decoding from the first magic we encounter.

So the use case would be, the application parses the header once (or just supplies their own), and then passes it to read ranges in the file on the object store, assuming the header stays the same?

Header is not currently public, but this could be just an oversight. Its interface looks public-ready.

> So the use case would be, the application parses the header once (or just supplies their own), and then passes it to read ranges in the file on the object store, assuming the header stays the same?

In the worst case (the range is 0..something), we scan the header bytes very quickly until we find the magic, with no decoding needed, and then start scanning normally.
In the best case, the range is middleOfFile..something and we don't need the first call to read the header at all, since the user provided it; we just scan until the first magic and party on.
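The scan described here can be sketched as a plain byte search for the 16-byte sync marker (the "magic" in this thread), which needs no Avro decoding. The function name and the in-memory slice are assumptions made for illustration; the reader in this PR works on ranges fetched asynchronously.

```rust
/// Length of an Avro sync marker in bytes.
const SYNC_LEN: usize = 16;

/// Returns the offset just past the first occurrence of `sync` in `buf`,
/// or `None` if the marker does not occur in the buffer.
/// Hypothetical helper, not part of arrow-avro.
fn offset_after_first_sync(buf: &[u8], sync: &[u8; SYNC_LEN]) -> Option<usize> {
    buf.windows(SYNC_LEN)
        .position(|window| window == &sync[..])
        .map(|start| start + SYNC_LEN)
}
```

Decoding would then begin at the returned offset, whether the buffer starts at byte 0 of the file or somewhere in the middle.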

> Header is not currently public, but this could be just an oversight. Its interface looks public-ready.

I also think so, but maybe it's better to do this in a separate PR; making things public has a tendency to bite back 😅

> Presumably the range?

What if the range is not given?

> The behaviour would be the exact same, since the header ends with the magic I believe?

The current behavior uses the discovered length of the header as it was parsed from the file.
If the application supplies its own header, the with_header method should also accept the header length, i.e. the offset past the header at which to start parsing the data. Alternatively, we could just scan for the magic from the start of the file (unless the range option directs otherwise), but I'm not sure this is bulletproof.
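The two options can be compared in a small sketch. Everything here is hypothetical (`data_start`, `file_with_colliding_metadata`, and the byte layout are made up for illustration), and the collision is constructed on purpose: it only shows why a scan from byte 0 could, in principle, stop early if the header bytes happen to contain the sync marker's value.

```rust
/// Hypothetical helper: where does block data start?
/// With `header_len` the caller states the offset; without it we scan for `sync`.
fn data_start(file: &[u8], sync: &[u8; 16], header_len: Option<usize>) -> Option<usize> {
    match header_len {
        // Trust the caller, but reject offsets past the end of the buffer.
        Some(len) => (len <= file.len()).then_some(len),
        // Fall back to a byte search for the first sync marker.
        None => file
            .windows(16)
            .position(|window| window == &sync[..])
            .map(|start| start + 16),
    }
}

/// Builds a made-up byte layout whose "metadata" region contains the same
/// 16 bytes as the sync marker. Not a valid Avro file.
fn file_with_colliding_metadata(sync: &[u8; 16]) -> Vec<u8> {
    let mut file = Vec::new();
    file.extend_from_slice(b"Obj\x01"); // bytes 0..4: Avro file magic
    file.extend_from_slice(sync); // bytes 4..20: metadata that collides with the marker
    file.extend_from_slice(b"meta"); // bytes 20..24: more metadata
    file.extend_from_slice(sync); // bytes 24..40: the real sync marker ending the header
    file.extend_from_slice(b"block"); // bytes 40..: start of block data
    file
}
```

With the explicit length the start offset is 40; the scan stops at 20, inside the header. Real sync markers are randomly generated, so such a collision should be very unlikely, but passing the length removes the ambiguity.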

If the range is not given, it is 0..EOF.
In that case, as I said, we scan the bytes quickly for the magic (which was provided in the header by the user), with no decoding, and then start decoding normally.

feat: two-stage builder API for async Avro reader

a59023d


github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels on Feb 22, 2026

EmilyMatt reviewed Feb 22, 2026

mzabaluev wants to merge 1 commit into apache:main from mzabaluev:avro-async-reader-builder-with-writer-schema

Comment timestamps, in page order: mzabaluev (Feb 22, 2026); EmilyMatt (Feb 22, 2026, two entries); mzabaluev (Feb 23, 2026); EmilyMatt (Feb 23, 2026); mzabaluev (Feb 26, 2026, two entries); EmilyMatt (Feb 26, 2026, two entries); mzabaluev (Feb 27, 2026); EmilyMatt (Feb 27, 2026).
