## Overview

HashPrep is a dataset profiler and debugger for machine learning. Think of it as "Pandas Profiling plus a linter for datasets" that runs before you train.

It scans your data, surfaces critical issues, explains why they matter for modeling, and suggests (or generates) concrete fixes. You can use it as a single CLI command or as a Python library inside your pipelines.
## Installation

### Using pip

```bash
pip install hashprep
```

### Using uv (recommended)

```bash
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install HashPrep
uv pip install hashprep

# Or from source
git clone https://github.com/cachevector/hashprep.git
cd hashprep
uv sync
```

After installation, the `hashprep` command is available directly in your terminal.
## Quickstart — CLI

### 1. Run a quick scan

Get a concise summary of dataset issues in your terminal.

```bash
hashprep scan dataset.csv
```

Common options:

- `--target COLUMN` — target column for ML-specific checks
- `--checks outliers,high_missing_values` — run only selected checks
- `--json` — JSON output for automation
- `--sample-size N` — restrict analysis to N rows

Example:

```bash
hashprep scan train.csv \
  --target Survived \
  --checks outliers,high_missing_values,class_imbalance
```

### 2. Detailed analysis

Drill into every issue HashPrep found.

```bash
hashprep details train.csv --target Survived
```
## Quickstart — Python

### Basic analysis

```python
import pandas as pd
from hashprep import DatasetAnalyzer

df = pd.read_csv("dataset.csv")

analyzer = DatasetAnalyzer(df)
summary = analyzer.analyze()

print("Critical issues:", summary["critical_count"])
print("Warnings:", summary["warning_count"])
```

### With a target column

```python
analyzer = DatasetAnalyzer(
    df,
    target_col="target",
)
summary = analyzer.analyze()
```

### Run specific checks

```python
analyzer = DatasetAnalyzer(
    df,
    selected_checks=["outliers", "high_missing_values", "class_imbalance"],
)
summary = analyzer.analyze()
```
## Reports

### Generate a report from the CLI

```bash
hashprep report dataset.csv --format html --theme minimal
```

Reports can be exported as `md`, `json`, `html`, or `pdf` using the `--format` flag. Use `--with-code` to generate companion Python scripts with cleaning logic and pipelines.

### Generate reports from Python

```python
from hashprep import DatasetAnalyzer
from hashprep.reports import generate_report

analyzer = DatasetAnalyzer(df, include_plots=True)
summary = analyzer.analyze()

generate_report(
    summary,
    format="html",
    full=True,
    output_file="dataset_hashprep_report.html",
    theme="minimal",
)
```
## CLI reference

### hashprep scan

Run a quick scan and print a compact summary of issues to the terminal. Ideal for a fast health check during development.

```bash
hashprep scan dataset.csv --target Survived
```

- `--target COLUMN` — target column for ML checks (class imbalance, leakage, etc.)
- `--checks LIST` — comma-separated list of checks to run
- `--critical-only` — hide warnings and show only critical issues
- `--json` — emit JSON instead of human-readable text
- `--quiet` — minimal output, useful in CI

### hashprep details

Produce a verbose, line-by-line description of each issue in the dataset, with statistics and recommendations.

```bash
hashprep details dataset.csv --target Survived
```

Accepts the same options as `scan` and is best used when you are actively debugging a dataset or deciding which columns to drop or transform.

### hashprep report

Generate a full report in HTML, PDF, Markdown, or JSON. Reports contain summaries, plots, and (optionally) auto-generated Python code.

```bash
hashprep report dataset.csv --format html --theme minimal --with-code
```

- `--format {md,json,html,pdf}` — output format (Markdown is the default)
- `--theme {minimal,neubrutalism}` — HTML theme
- `--with-code` — write companion `_fixes.py` and `_pipeline.py` files
- `--comparison FILE` — compare two datasets for drift (train vs. test, etc.)
- `--sample-size N` / `--no-sample` — control automatic sampling

### Exit codes

HashPrep is CI-friendly: non-zero exit codes indicate that critical issues were detected or an internal error occurred. You can wire this into pre-commit hooks or data-quality checks in your pipeline.
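As a sketch of how a CI or pre-commit gate might use that exit code — the `gate_on_critical_issues` helper and its skip-when-not-installed behavior are illustrative, not part of HashPrep:

```python
import shutil
import subprocess

def gate_on_critical_issues(dataset: str) -> int:
    """Return the exit code of `hashprep scan` (0 means no critical issues)."""
    if shutil.which("hashprep") is None:
        # hashprep is not installed in this environment; skip the gate.
        return 0
    result = subprocess.run(["hashprep", "scan", dataset, "--quiet"])
    return result.returncode

# In a CI step: raise SystemExit(gate_on_critical_issues("train.csv"))
```

Propagating the tool's own return code keeps the gate's semantics identical to running `hashprep scan` directly in a shell step.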
## Python API

### DatasetAnalyzer

The core entry point for programmatic usage. It accepts a pandas DataFrame and optional configuration for targets, sampling, comparison datasets, and which checks to run.

```python
from hashprep import DatasetAnalyzer

analyzer = DatasetAnalyzer(
    df,
    target_col="target",
    selected_checks=["outliers", "high_missing_values"],
    include_plots=True,
)
summary = analyzer.analyze()
```

The returned summary is a JSON-serializable dictionary with keys such as:

- `critical_count` / `warning_count`
- `issues` — list of individual issues with severity, column, check name, and message
- `summaries` — per-column statistics and optional plots
- `sampling_info` — information about any sampling that was applied
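For instance, a minimal sketch of post-processing the summary, assuming each entry in `summary["issues"]` carries severity, column, check name, and message fields as described above — the sample dict below is hand-made for illustration, not real analyzer output:

```python
# Hand-made stand-in for analyzer.analyze() output, for illustration only.
summary = {
    "critical_count": 1,
    "warning_count": 1,
    "issues": [
        {"severity": "critical", "column": "age",
         "check": "high_missing_values", "message": "62% of values are missing"},
        {"severity": "warning", "column": "id",
         "check": "high_cardinality", "message": "All values are unique"},
    ],
}

# Keep only critical issues, e.g. to decide which columns need fixing first.
critical = [i for i in summary["issues"] if i["severity"] == "critical"]
for issue in critical:
    print(f"{issue['column']}: {issue['check']} - {issue['message']}")
```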
## Sampling large datasets

For very large tables, you can control how many rows HashPrep inspects using a sampling configuration. This keeps runtimes predictable in production.

```python
from hashprep.utils.sampling import SamplingConfig

sampling = SamplingConfig(max_rows=10000)

analyzer = DatasetAnalyzer(
    df,
    sampling_config=sampling,
    auto_sample=True,
)
summary = analyzer.analyze()
```

When sampling is applied, `summary["sampling_info"]` contains details such as fractions and original row counts.
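If you want to log how much data was actually inspected, something like the following works; note that the keys shown for `sampling_info` here are an assumed shape, not a documented schema — check your actual output:

```python
# Assumed shape of summary["sampling_info"]; verify the exact keys in your output.
sampling_info = {"sampled": True, "original_rows": 1_000_000, "sampled_rows": 10_000}

if sampling_info.get("sampled"):
    # Fraction of the original table that the analysis actually saw.
    fraction = sampling_info["sampled_rows"] / sampling_info["original_rows"]
    print(f"Analyzed {fraction:.1%} of {sampling_info['original_rows']:,} rows")
```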
## Generating fix scripts and pipelines

Beyond reports, HashPrep can emit executable Python code that encodes suggested fixes and ML pipelines.

```python
from hashprep.checks.core import Issue
from hashprep.preparers.codegen import CodeGenerator
from hashprep.preparers.pipeline_builder import PipelineBuilder
from hashprep.preparers.suggestions import SuggestionProvider

issues = [Issue(**i) for i in summary["issues"]]
column_types = summary.get("column_types", {})

provider = SuggestionProvider(
    issues=issues,
    column_types=column_types,
    target_col="target",
)
suggestions = provider.get_suggestions()

codegen = CodeGenerator(suggestions)
fixes_code = codegen.generate_pandas_script()

builder = PipelineBuilder(suggestions)
pipeline_code = builder.generate_pipeline_code()
```

You can write these strings to disk (for example, `fixes.py` and `pipeline.py`) or load them dynamically in your tooling.
## Drift detection

To compare training and serving data, construct a DatasetAnalyzer with both a primary and comparison dataset and enable the `dataset_drift` check.

```python
analyzer = DatasetAnalyzer(
    train_df,
    comparison_df=test_df,
    selected_checks=["dataset_drift"],
)
summary = analyzer.analyze()
```
## Available checks

### Data quality

- `missing_values` — overall missingness patterns
- `high_missing_values` — columns with heavy missingness
- `duplicates` — duplicate rows
- `single_value_columns` — near-constant features

### Distribution

- `outliers` — IQR-based outlier detection
- `high_cardinality` — categorical columns with too many uniques
- `uniform_distribution` — uniformly distributed numeric columns
- `many_zeros` — features dominated by zeros

### ML-specific

- `class_imbalance` — target imbalance (requires `--target`)
- `feature_correlation` — highly correlated features
- `target_leakage` — features leaking target information
- `dataset_drift` — drift between train and test datasets
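To get a quick picture of which of these checks fired on your data, you can tally the summary's issues by check name; this assumes each issue dict exposes a `check` field, and the `issues` list below is hand-made for illustration:

```python
from collections import Counter

# Hand-made stand-in for summary["issues"], for illustration only.
issues = [
    {"check": "outliers", "column": "fare"},
    {"check": "outliers", "column": "age"},
    {"check": "class_imbalance", "column": "target"},
]

# Count how many issues each check produced, most frequent first.
counts = Counter(i["check"] for i in issues)
for check, n in counts.most_common():
    print(f"{check}: {n} issue(s)")
```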
## Contributing

HashPrep is open source and welcomes contributions. If you want to fix a bug, add a check, or improve the reports:

- Read the contribution guide
- Open an issue describing the change you'd like to make
- Create a pull request with clear motivation and tests where appropriate