Optimization: Prune manifest in snapshot overwrite operations by gabeiglio · Pull Request #3011 · apache/iceberg-python

gabeiglio · 2026-02-09T10:40:48Z

Rationale for this change

While running performance tests for overwriting partitions, we noticed that PyIceberg took twice as long as the Java-based implementation usually takes. We also noticed that _existing_manifests does not take advantage of manifest pruning before reading all manifest entries.

In this PR I:

Moved methods from _DeleteFiles to the _SnapshotProducer parent class so they can be shared with other classes (_OverwriteFiles)
Implemented manifest pruning over the partitions of all deleted files, so that manifests that do not match those partitions are not read
Refactored the method to iterate over all files only once (instead of multiple times)
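
To illustrate the pruning idea in the second item, here is a minimal, self-contained sketch. It is not the code in this PR and does not use PyIceberg's API: `FieldSummary`, `Manifest`, `may_contain`, and `prune` are hypothetical stand-ins, and partition values are assumed to be plain integers. The assumption is that each manifest records lower and upper bounds per partition field, so a manifest whose bounds exclude every partition being overwritten can be carried over without reading its entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSummary:
    """Hypothetical stand-in for a manifest's bounds on one partition field."""
    lower: int
    upper: int


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest-list entry (not PyIceberg's class)."""
    path: str
    spec_id: int
    summaries: tuple[FieldSummary, ...]


def may_contain(manifest: Manifest, partition: tuple[int, ...]) -> bool:
    # The manifest can hold files for this partition only if every
    # partition value falls inside the corresponding field's bounds.
    return all(s.lower <= v <= s.upper for s, v in zip(manifest.summaries, partition))


def prune(manifests, partitions_by_spec):
    """Split manifests into those that must be read and those left untouched."""
    to_read, untouched = [], []
    for manifest in manifests:
        # Only partitions written under the same spec are comparable.
        candidates = partitions_by_spec.get(manifest.spec_id, set())
        if any(may_contain(manifest, p) for p in candidates):
            to_read.append(manifest)
        else:
            untouched.append(manifest)
    return to_read, untouched
```

With this split, only the manifests in `to_read` need their entries scanned; the rest can be kept as they are in the new snapshot.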

Are these changes tested?

I believe current tests in tests/integration/test_writes.py cover all cases

Are there any user-facing changes?

Nope

yingjianwu98 · 2026-02-18T22:20:11Z

pyiceberg/table/update/snapshot.py

    def snapshot_id(self) -> int:
        return self._snapshot_id

+    def schema(self, schema_id: int | None = None) -> Schema:


nit: I don't see anywhere where we pass in the schema_id


geruh

Thanks for raising this @gabeiglio! I went through the core behavior here and tested out some of the pruning logic. Ultimately, though, the existing integration tests, such as test_overwrite_partitioned_table and test_delete_overwrite, would catch regressions. Just a few inline comments from my side.

geruh · 2026-02-19T02:11:46Z

pyiceberg/table/update/snapshot.py

+            group = partition_to_overwrite.setdefault(data_file.spec_id, set())
+            group.add(data_file.partition)
+
+        for spec_id, data_files in partition_to_overwrite.items():


nit:

Suggested change

-        for spec_id, data_files in partition_to_overwrite.items():
+        for spec_id, partition_records in partition_to_overwrite.items():
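
For context on the rename, a small sketch of the grouping step shown in the diff above. `DataFile` here is a hypothetical stand-in carrying only the two attributes the diff uses, and `group_partitions_by_spec` is an illustrative name, not a function from the PR. The point is that the dictionary values are sets of partition records rather than data files, which is why `partition_records` describes the loop variable more accurately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in with only the attributes used in the diff."""
    spec_id: int
    partition: tuple


def group_partitions_by_spec(data_files):
    # Single pass over the files: collect the distinct partitions per spec id.
    partition_to_overwrite: dict[int, set[tuple]] = {}
    for data_file in data_files:
        group = partition_to_overwrite.setdefault(data_file.spec_id, set())
        group.add(data_file.partition)
    return partition_to_overwrite
```

Because the values are sets, several files in the same partition collapse to one record, so any later per-partition work happens once per partition rather than once per file.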

geruh · 2026-02-19T02:14:18Z

pyiceberg/table/update/snapshot.py

    write_manifest_list,
 )
-from pyiceberg.partitioning import (
-    PartitionSpec,


This moved a bunch of the imports to multi-line for some reason.

geruh · 2026-02-19T02:26:26Z

pyiceberg/table/update/snapshot.py

    def snapshot_id(self) -> int:
        return self._snapshot_id

+    def schema(self, schema_id: int | None = None) -> Schema:


+1 on dropping. Without that, this would just become self._transaction.table_metadata.schema().

gabeiglio marked this pull request as ready for review February 9, 2026 12:47

gabeiglio force-pushed the perf-optimization branch from e3aaa5e to 190bf4d Compare February 9, 2026 17:14

prune manifests

7525c9c

gabeiglio force-pushed the perf-optimization branch from 190bf4d to 7525c9c Compare February 9, 2026 18:08

gabeiglio mentioned this pull request Feb 12, 2026

Speed up logical overwrites by pruning manifests in Snapshot._manifests #3039

Open

yingjianwu98 reviewed Feb 18, 2026


yingjianwu98 approved these changes Feb 19, 2026


geruh reviewed Feb 19, 2026


gabeiglio wants to merge 1 commit into apache:main from gabeiglio:perf-optimization


3 participants
