Expand SQLite3 data validation #23

Open
PimSanders wants to merge 6 commits into fox-it:main from PimSanders:improvement/expand-wal-validation

Conversation

@PimSanders
Contributor

This PR closes #16 by expanding the data validation capabilities in SQLite3.

The SQLite3 WAL file can store multiple versions of the same frame; when reading, only valid frames should be returned. The docs define a valid frame as follows:

A frame is considered valid if and only if the following conditions are true:

  1. The salt-1 and salt-2 values in the frame-header match salt values in the wal-header
  2. The checksum values in the final 8 bytes of the frame-header exactly match the checksum computed consecutively on the first 24 bytes of the WAL header and the first 8 bytes and the content of all frames up to and including the current frame.

The first check was already implemented; I have interpreted the second check as:

The checksum values in the final 8 bytes of the frame-header (checksum-1 and checksum-2) exactly match the computed checksum over:

  1. the first 24 bytes of the WAL header
  2. the first 8 bytes of each frame header (up to and including this frame)
  3. the page data of each frame (up to and including this frame)
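Under that interpretation, and assuming the checksum algorithm described in the SQLite file format documentation (a running sum over pairs of 32-bit words, big- or little-endian depending on the WAL magic), the validation could be sketched roughly as follows. Function and field names here are illustrative stand-ins, not the PR's actual code:

```python
import struct


def wal_checksum(data: bytes, s0: int = 0, s1: int = 0, big_endian: bool = True) -> tuple[int, int]:
    """Running SQLite WAL checksum over pairs of 32-bit words, modulo 2**32."""
    fmt = (">" if big_endian else "<") + f"{len(data) // 4}I"
    words = struct.unpack(fmt, data)
    for i in range(0, len(words), 2):
        s0 = (s0 + words[i] + s1) & 0xFFFFFFFF
        s1 = (s1 + words[i + 1] + s0) & 0xFFFFFFFF
    return s0, s1


def frame_is_valid(wal_header: bytes, frames: list[tuple[bytes, bytes]], index: int,
                   big_endian: bool = True) -> bool:
    """Check the cumulative checksum of frame `index`.

    `frames` is a list of (frame_header, page_data) pairs; the expected
    checksum lives in the final 8 bytes of the frame header.
    """
    # 1. the first 24 bytes of the WAL header
    s0, s1 = wal_checksum(wal_header[:24], big_endian=big_endian)
    for header, page in frames[: index + 1]:
        # 2. the frame header prefix of each frame up to and including this one
        s0, s1 = wal_checksum(header[:8], s0, s1, big_endian)
        # 3. the page data of each frame up to and including this one
        s0, s1 = wal_checksum(page, s0, s1, big_endian)
    expected = struct.unpack(">2I" if big_endian else "<2I", frames[index][0][-8:])
    return (s0, s1) == expected
```

Because the checksum chains through every earlier frame, a single frame can only be validated by walking the WAL from the start up to that frame, which is where the performance cost below comes from.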

When initializing a database, the validate_checksum option can be passed to enable the new validation. I have chosen to only validate the salts by default (just like before), as this will probably be good enough and is a lot faster. See the example below for the time impact:

In [1]: %timeit -n10 list(list(sqlite3.SQLite3(Path("./big.sqlite"), Path("./big.sqlite-wal"), validate_checksum=False).tables())[0].rows())
33 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %timeit -n10 list(list(sqlite3.SQLite3(Path("./big.sqlite"), Path("./big.sqlite-wal"), validate_checksum=True).tables())[0].rows())
1.05 s ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@codecov

codecov bot commented Feb 2, 2026

Codecov Report

❌ Patch coverage is 0% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (64ae6d8) to head (f4b6ffb).
⚠️ Report is 5 commits behind head on main.

Files with missing lines             Patch %  Lines
dissect/database/sqlite3/wal.py      0.00%    22 Missing ⚠️
dissect/database/sqlite3/sqlite3.py  0.00%    1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##            main     #23    +/-   ##
======================================
  Coverage   0.00%   0.00%            
======================================
  Files        146     150     +4     
  Lines       3881    4086   +205     
======================================
- Misses      3881    4086   +205     
Flag Coverage Δ
unittests 0.00% <0.00%> (ø)

@codspeed-hq

codspeed-hq bot commented Feb 2, 2026

Merging this PR will not alter performance

✅ 6 untouched benchmarks


Comparing PimSanders:improvement/expand-wal-validation (f4b6ffb) with main (6149d6f)¹


Footnotes

  1. No successful run was found on main (798cf10) during the generation of this report, so 6149d6f was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@Schamper (Member) left a comment


Maybe add a benchmark test too? I'll look at the actual checksum checking part later when I have a bit more time.

@PimSanders
Contributor Author

Take your time, I don't think I will be doing a whole lot of Dissect dev in the coming weeks ...

PimSanders and others added 2 commits February 18, 2026 21:28
Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>
Comment on lines 174 to 175
# Start seed with checksum over first 24 bytes of WAL header
seed = calculate_checksum(wal_hdr_bytes[:24], endian=self.wal.checksum_endian)
Member


Maybe we can cache this in the WAL object itself? wal._header_checksum or something?

Contributor Author


Would you also suggest calculating it on initialization? Or just to prevent it from being calculated on every loop.

Member


Can do a @cached_property, take the middle road.
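The @cached_property middle road could look something like this. WAL here is a stripped-down stand-in for the real class, and calculate_checksum is a placeholder for the module's own helper, not its actual signature:

```python
import io
import struct
from functools import cached_property


def calculate_checksum(data: bytes, s0: int = 0, s1: int = 0) -> tuple[int, int]:
    # Placeholder for the module's checksum helper (assumption): a running
    # sum over pairs of big-endian 32-bit words, modulo 2**32.
    words = struct.unpack(f">{len(data) // 4}I", data)
    for i in range(0, len(words), 2):
        s0 = (s0 + words[i] + s1) & 0xFFFFFFFF
        s1 = (s1 + words[i + 1] + s0) & 0xFFFFFFFF
    return s0, s1


class WAL:
    def __init__(self, fh):
        self.fh = fh

    @cached_property
    def _header_checksum(self) -> tuple[int, int]:
        # Computed on first access, then stored on the instance, so repeated
        # frame validations reuse it without paying the cost at __init__ time.
        self.fh.seek(0)
        return calculate_checksum(self.fh.read(24))
```

cached_property stores the result in the instance's __dict__ on first access, so later validations pay only an attribute lookup: nothing is computed for callers who never validate checksums, which is the "middle road" between eager computation and recomputing per frame.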

PimSanders and others added 3 commits February 19, 2026 07:33
Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>
@Schamper (Member) left a comment


Can you add a benchmark test too, so that we can track future changes to this algorithm?

exactly match the computed checksum over:

1. the first 24 bytes of the WAL header
2. the first 8 bytes of each frame header (up to and including this frame)
Member


Suggested change:
- 2. the first 8 bytes of each frame header (up to and including this frame)
+ 2. the first 16 bytes of each frame header (up to and including this frame)

first_frame_offset = len(c_sqlite3.wal_header)
offset = first_frame_offset

while offset <= self.offset:
Member


It seems wasteful to "throw away" the results for every frame we pass, while we may use them in the next frame's checksum calculation. But I can't think of a super nice way to keep them. Caching on the Frame object is a bit pointless since it's relatively short-lived; it's LRU cached, but then you might still lose cached checksum information and have to re-checksum half the WAL at some point. How large can WAL logs become? Otherwise we might be able to cache seeds for a given offset in the WAL object.

Do you know exactly how the checksumming works if at any point in the middle of the WAL a checksum fails? You'd think that everything after it can never have a matching checksum again, unless future frames just ignore this fact and "checksum" the bad data as part of their checksummed data?

If the former is true, might it be possible to just store the "highest offset" that we verified a good checksum of? Anything that is before that offset is an automatic return True, and anything that comes after that offset can just continue calculating from that offset. If at some point a checksum no longer matches, maybe a boolean can indicate that this is the final highest offset with a valid checksum, and all future checksum checks automatically just become an offset comparison.
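The "highest verified offset" idea above could be sketched like this. All names are hypothetical, and whether a failed frame really poisons every later checksum (the "former" case) still needs to be confirmed against the file format docs:

```python
class ChecksumState:
    """Track the highest WAL offset with a verified cumulative checksum,
    plus the running (s0, s1) checksum seed at that offset."""

    def __init__(self, seed: tuple[int, int]):
        self.verified_offset = 0   # end of the last frame that checked out
        self.seed = seed           # running checksum at verified_offset
        self.broken = False        # once one frame fails, later ones can't match

    def is_verified(self, offset: int) -> bool:
        # Anything at or before the high-water mark is an automatic pass
        return offset <= self.verified_offset

    def record(self, offset: int, seed: tuple[int, int], ok: bool) -> None:
        # Called after checksumming the frame ending at `offset`, continuing
        # from self.seed; advances the mark or latches the failure.
        if self.broken or offset <= self.verified_offset:
            return
        if ok:
            self.verified_offset, self.seed = offset, seed
        else:
            self.broken = True
```

With this, validating frame N only checksums the bytes between verified_offset and N's end, and re-validating any earlier frame is a single offset comparison.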

@PimSanders (Contributor, Author) commented Feb 19, 2026


Ooh I like the way you're thinking, definitely going to look into this. It seems like a good way to significantly reduce the time it takes to checksum.



Development

Successfully merging this pull request may close these issues.

Expand SQLite3 data validation when reading from WAL

2 participants