Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##            main     #23     +/-  ##
======================================
  Coverage   0.00%   0.00%
======================================
  Files        146     150      +4
  Lines       3881    4086    +205
======================================
- Misses      3881    4086    +205
```
Schamper left a comment:
Maybe add a benchmark test too? I'll look at the actual checksum checking part later when I have a bit more time.
Take your time, I don't think I will be doing a whole lot of Dissect dev in the coming weeks ...
Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>
`dissect/database/sqlite3/wal.py` (outdated)
```python
# Start seed with checksum over first 24 bytes of WAL header
seed = calculate_checksum(wal_hdr_bytes[:24], endian=self.wal.checksum_endian)
```
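`calculate_checksum` itself isn't shown in this diff. For context, SQLite documents the WAL checksum as a cumulative checksum over 32-bit words, taken two at a time, where the resulting pair can be carried forward as the seed for the next chunk. A minimal sketch (the `endian` parameter here takes a `struct` format character, which may differ from the signature in this PR):

```python
import struct


def calculate_checksum(data: bytes, endian: str = "<", s0: int = 0, s1: int = 0) -> tuple[int, int]:
    # Sketch of SQLite's documented WAL checksum: fold 32-bit words into
    # the running pair (s0, s1), two words per step. len(data) must be a
    # multiple of 8. Returning the pair lets callers chain chunks: the
    # result for the header becomes the seed for the first frame, etc.
    words = struct.unpack(f"{endian}{len(data) // 4}I", data)
    for i in range(0, len(words), 2):
        s0 = (s0 + words[i] + s1) & 0xFFFFFFFF
        s1 = (s1 + words[i + 1] + s0) & 0xFFFFFFFF
    return s0, s1


# Seed over the first 24 bytes of a (zeroed, purely illustrative) WAL header
seed = calculate_checksum(bytes(24))
```

Because the pair chains, checksumming the header and all frames in one pass is equivalent to feeding each chunk's result into the next call as the seed.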
Maybe we can cache this in the WAL object itself? `wal._header_checksum` or something?
Would you also suggest calculating it on initialization? Or just to prevent it from being calculated on every loop.
Can do a `@cached_property`, take the middle road.
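A sketch of that middle road, assuming a WAL class that holds the raw header bytes (the class and attribute names here are hypothetical, and the checksum is a dummy stand-in):

```python
from functools import cached_property


def calculate_checksum(data: bytes) -> tuple[int, int]:
    # Dummy stand-in for the real checksum function.
    return (sum(data) & 0xFFFFFFFF, len(data))


class WAL:
    # Hypothetical minimal stand-in for the real WAL class.

    def __init__(self, header_bytes: bytes):
        self.header_bytes = header_bytes
        self.calls = 0  # tracked only to demonstrate the caching

    @cached_property
    def _header_checksum(self) -> tuple[int, int]:
        # Computed lazily on first access, then stored on the instance:
        # neither eager work at init time nor recomputation in every loop.
        self.calls += 1
        return calculate_checksum(self.header_bytes[:24])
```

Every frame check can then read `wal._header_checksum`; only the first access actually computes it.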
Co-authored-by: Erik Schamper <1254028+Schamper@users.noreply.github.com>
Schamper left a comment:
Can you add a benchmark test too, so that we can track future changes to this algorithm?
```
exactly match the computed checksum over:

1. the first 24 bytes of the WAL header
2. the first 8 bytes of each frame header (up to and including this frame)
```
Suggested change:

```diff
-2. the first 8 bytes of each frame header (up to and including this frame)
+2. the first 16 bytes of each frame header (up to and including this frame)
```
```python
first_frame_offset = len(c_sqlite3.wal_header)
offset = first_frame_offset

while offset <= self.offset:
```
It seems wasteful to "throw away" the results for every frame we pass, while we may use them in the next frame's checksum calculation. But I can't think of a super nice way to keep them. Caching on the Frame object is a bit pointless since it's relatively short-lived. It's LRU cached, but then you might still lose cached checksum information and have to re-checksum half the WAL at some point. How large can WAL logs become? Otherwise we might be able to cache seeds for a given offset in the WAL object.
Do you know exactly how the checksumming works if at any point in the middle of the WAL a checksum fails? You'd think that everything after it can never have a matching checksum again, unless future frames just ignore this fact and "checksum" the bad data as part of their checksummed data?
If the former is true, might it be possible to just store the "highest offset" that we verified a good checksum of? Anything that is before that offset is an automatic return True, and anything that comes after that offset can just continue calculating from that offset. If at some point a checksum no longer matches, maybe a boolean can indicate that this is the final highest offset with a valid checksum, and all future checksum checks automatically just become an offset comparison.
Ooh I like the way you're thinking, definitely going to look into this. It seems like a good way to significantly reduce the time it takes to checksum.
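The high-water-mark idea above could be sketched roughly as follows. All names are hypothetical, and `toy_checksum` is a trivial stand-in for the real chained WAL checksum:

```python
class ChecksumState:
    # Hypothetical cache kept on the WAL object: remembers the highest
    # offset with a verified checksum and the running seed at that point,
    # so later checks resume there instead of re-checksumming the WAL.

    def __init__(self, initial_seed: int):
        self.verified_offset = 0   # highest offset known to be valid
        self.seed = initial_seed   # running checksum at verified_offset
        self.failed = False        # once one frame fails, all later ones do

    def check(self, offset: int, frame_data: bytes, expected: int, checksum) -> bool:
        if offset <= self.verified_offset:
            return True   # verified earlier: automatic True
        if self.failed:
            return False  # a frame before this one was already invalid
        seed = checksum(frame_data, self.seed)
        if seed == expected:
            self.verified_offset = offset
            self.seed = seed
            return True
        self.failed = True  # verified_offset is now the final valid offset
        return False


def toy_checksum(data: bytes, seed: int) -> int:
    # Trivial stand-in for the real chained checksum function.
    return (seed + sum(data)) & 0xFFFFFFFF
```

Note this assumes frames are checked in increasing offset order; checks at or below the high-water mark are answered from the cache without touching the data.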
This PR closes #16 by expanding the data validation capabilities in SQLite3.
The SQLite3 WAL file can store multiple versions of the same frame; when reading, only valid frames should be returned. The docs define a valid frame as follows:
The first check was already implemented; I have interpreted the second check as:
When initializing a database, the option `validate_checksum` can be passed to use the new validation. I have chosen to only calculate the salts by default (just like before), as this will probably be good enough and a lot faster. See the example below for the time impact: