A python implementation of multiple index hashing by Norouzi et al, based on a description in the threatexchange repository
Multiple Index Hashing (MIH) is a relatively lightweight means for accelerating lookups for fuzzy hashes (e.g. PhotoDNA and PDQ) within a pre-defined hamming distance. Instead of a linear search through every record, multiple indices are made for separate windows/slots within each hash. The threatexchange document (linked above) provides a good, 'plain English' description for those who (like me) struggle with mathematical terminology and notation.
PyMIH is available via pypi:
pip install pyMIHDefault values for MIHIndex constructor and train() method are set for PDQ hash, using 32 bit match threshold (30 being non divisible by 8).
from pyMIH import MIHIndex
# Defaults to a 256 bit hash size
x = MIHIndex()
# For alternate hash sizes, enter bit size at declaration
x = MIHIndex(512)
# Example - load hashes from file.
# Update function only accepts hex strings (4 bits per char)
PDQs = set()
with open('ignorable.PDQ') as fi:
for line in fi:
PDQs.add(line.replace('\n', ''))
# Add to index - hashes plus category name (non-string types should work, but aren't recommended)
x.update(PDQs, 'ignorable')
# Train index. Word length is internal slot/word size for individual indices (see afore mentioned documentation for more info)
# Threshold is maximum hamming distance (inclusive). Defaults to values shown here.
x.train(wordLength=16,threshold=32)
# To query, pass in a hex string
for y in x.query('358c86641a5269ab5b0db5f1b2315c1642cef9652c39b6ced9f646d91f071927'):
# hits are returned as a tuple of hash value and hamming distance.
print(y)
# ('358c86641a5269ab5b0db5f1b2315c1642cef9652c39b6ced9f646d91f071927', ['ignorable'], 0)
# ('358c86641a5269ab5b0db5f1b2315c1642cef9652c39b6ced9f646d91f071928', ['ignorable'], 4)This is released under an MIT licence. This project utilises bitarray, at time of writing, released under the Python Software Foundation License (PSF).