Optimization: Prune manifest in snapshot overwrite operations#3011
Optimization: Prune manifest in snapshot overwrite operations#3011gabeiglio wants to merge 1 commit intoapache:mainfrom
Conversation
e3aaa5e to
190bf4d
Compare
190bf4d to
7525c9c
Compare
| def snapshot_id(self) -> int: | ||
| return self._snapshot_id | ||
|
|
||
| def schema(self, schema_id: int | None = None) -> Schema: |
There was a problem hiding this comment.
nit: I don't see anywhere where we pass in the schema_id
There was a problem hiding this comment.
+1 on dropping. without that this would just become self._transaction.table_metadata.schema().
geruh
left a comment
There was a problem hiding this comment.
Thanks for raising this @gabeiglio! I went through the core behavior here and tested out some of the pruning logic. but ultimately, the existing integration tests like test_overwrite_partitioned_table, and test_delete_overwrite would catch regressions. Just a few inline comments from my side.
| group = partition_to_overwrite.setdefault(data_file.spec_id, set()) | ||
| group.add(data_file.partition) | ||
|
|
||
| for spec_id, data_files in partition_to_overwrite.items(): |
There was a problem hiding this comment.
nit:
| for spec_id, data_files in partition_to_overwrite.items(): | |
| for spec_id, partition_records in partition_to_overwrite.items(): |
| write_manifest_list, | ||
| ) | ||
| from pyiceberg.partitioning import ( | ||
| PartitionSpec, |
There was a problem hiding this comment.
This moved a bunch of the imports to multi line for some reason
| def snapshot_id(self) -> int: | ||
| return self._snapshot_id | ||
|
|
||
| def schema(self, schema_id: int | None = None) -> Schema: |
There was a problem hiding this comment.
+1 on dropping. without that this would just become self._transaction.table_metadata.schema().
Rationale for this change
Doing some performance tests for overwriting partitions, we noticed that PyIceberg took double the time it usually takes java based implementation, we noticed that
_exisiting_manifestsdoes not take advantage of manifest pruning before reading all Manifest EntriesIn this PR I:
Are these changes tested?
I believe current tests in tests/integration/test_writes.py cover all cases
Are there any user-facing changes?
Nope