Releases: cachevector/hashprep
v0.1.0b3
HashPrep v0.1.0b3
New features
- Config file loading (#69) Load analysis settings from YAML, TOML, or JSON via `--config`. Supports runtime threshold overrides (e.g. outlier, missingness, correlation) so you can tune checks without code changes.
- Mutual information and Shannon entropy (#68) New checks and summaries for feature-target and feature-feature mutual information, plus Shannon entropy for categorical columns. Helps spot low-information or redundant features.
- Normality and variance homogeneity tests (#67) Built-in normality tests (e.g. Shapiro-Wilk) and variance homogeneity (e.g. Levene) for numeric columns. Surfaces non-normal or heteroscedastic variables that may need transforms.
- First-class DateTime support (#66) Proper handling of datetime columns: inference, summaries, and checks (e.g. future dates, skew). Datetime columns are no longer treated as plain text.
- Edge-case tests and CI (#64) Broader test coverage for correlation, leakage, and other edge cases, plus GitHub Actions CI so regressions are caught automatically.
- Website UI and docs (#59) Updated hashprep.com with clearer UI and documentation (installation, CLI, Python API, checks).
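To make the entropy summary above concrete, here is a minimal standalone sketch of Shannon entropy for a categorical column. It is not HashPrep's implementation: the function names `shannon_entropy` and `normalized_entropy` are hypothetical, and the normalization step is an illustrative addition rather than something the release notes describe.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of a sequence of categorical values.

    Hypothetical helper for illustration; not part of HashPrep's API.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over categories of -p * log2(p)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def normalized_entropy(values):
    """Entropy divided by its maximum, log2(number of distinct categories)."""
    distinct = len(set(values))
    if distinct <= 1:
        # A constant (or empty) column carries no information.
        return 0.0
    return shannon_entropy(values) / math.log2(distinct)


balanced = ["a", "b", "c", "d"] * 25   # four equally frequent categories
constant = ["a"] * 100                 # a single category
skewed = ["a"] * 97 + ["b"] * 3        # one dominant category

balanced_bits = shannon_entropy(balanced)   # 2 bits: four equally likely outcomes
constant_bits = shannon_entropy(constant)   # 0 bits: no uncertainty
skewed_ratio = normalized_entropy(skewed)   # well below 1: low-information column
```

A column whose normalized entropy is close to 0 is dominated by one value, which is the kind of low-information feature the check is meant to surface.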
Fixes
- PDF reports in limited environments: PDF generation is optional. If WeasyPrint or system libs (e.g. libgobject) are missing, MD/JSON/HTML still work and the CLI reports a clear error for `--format pdf` instead of crashing.
- Docs page light mode (#70) Fixed syntax highlighting on the docs site in light theme (contrast and colors) so code blocks are readable.
- Mobile menu and routing (#60) Fixed mobile menu behavior, responsiveness, and routing issues on the website.
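The optional-PDF behavior described in the first fix can be sketched with a lazy import. This is an assumption about the general pattern, not HashPrep's actual code: `PdfUnavailableError`, `load_pdf_backend`, and `export_report` are hypothetical names. The only real API used is WeasyPrint's `HTML(string=...).write_pdf()`.

```python
import importlib


class PdfUnavailableError(RuntimeError):
    """Hypothetical error raised when PDF output is requested but unavailable."""


def load_pdf_backend(module_name="weasyprint"):
    """Import the PDF backend only when it is needed."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        # ImportError: the package is not installed.
        # OSError: the package is installed but a native library
        # (e.g. libgobject) could not be loaded.
        raise PdfUnavailableError(
            f"PDF output requires '{module_name}' and its system libraries "
            f"({exc}). Use another report format instead."
        ) from exc


def export_report(html, fmt, pdf_module="weasyprint"):
    """Hypothetical exporter: returns str for html, bytes for pdf."""
    if fmt == "html":
        # Non-PDF formats never touch the PDF backend.
        return html
    if fmt == "pdf":
        backend = load_pdf_backend(pdf_module)
        return backend.HTML(string=html).write_pdf()
    raise ValueError(f"unsupported format: {fmt}")
```

Because the import happens inside the PDF branch, a missing backend only affects callers that actually ask for PDF output, and they get one descriptive exception instead of a crash at import time.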
Refactors and quality
v0.1.0b1 - Beta Release
HashPrep v0.1.0b1 - Beta Release
This release marks HashPrep's graduation from alpha to beta status.
What's New
HashPrep is now feature-complete and ready for broader community testing. Core features are stable and the API is mature enough for real-world ML workflows.
Highlights
- 82 passing tests with comprehensive coverage across all features
- Stable APIs for both CLI and library usage
- Complete documentation with installation and usage guides
- Multiple report formats (HTML, PDF, Markdown, JSON)
- Production-ready code generation (fix scripts and sklearn pipelines)
Installation
pip install hashprep
Key Features
- Intelligent dataset profiling with ML-specific checks
- Automated data quality issue detection
- Context-aware preprocessing suggestions
- Rich report generation with modern themes
- Reproducible pipeline code generation
Documentation
See the README for complete usage instructions.
What Beta Means
- Core features are stable and tested
- APIs should remain stable (breaking changes will trigger a major version bump)
- Ready for community testing and feedback
- Minor bugs and edge cases may still exist
We encourage users to test HashPrep in their ML workflows and report any issues on GitHub.
v0.1.0a1
Improved correlation checks and reduced false positives in missing patterns
Improvements
- Refined correlation checks in `calculate_correlations`
  - Fixed type inference errors by iterating over `analyzer.column_types` instead of `analyzer.df`
  - Updated mixed-variable thresholds to `{'warning': 0.5, 'critical': 0.8}` for consistency with Cramer’s V
  - Ensured seamless integration with `run_checks`
- Reduced over-flagging in missing patterns detection
  - Introduced effect size thresholds:
    - Categorical: Cramer’s V > 0.1
    - Numeric: Cohen’s d > 0.2
  - Tightened p-value threshold to 0.01
  - Increased minimum samples per group to 10
  - Replaced ANOVA (`f_oneway`) with Mann-Whitney U test for better handling of skewed distributions
  - Added pattern grouping to summarize correlations per missing column (top 3 shown for conciseness)
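The numeric side of the missing-pattern filter can be illustrated as follows. This is a sketch under stated assumptions, not the project's code: `cohens_d` and `should_flag` are hypothetical names, Cohen's d is computed with the pooled standard deviation (one common definition), and the p-value is taken as an input. In practice it could come from a Mann-Whitney U test such as `scipy.stats.mannwhitneyu`.

```python
import statistics

# Thresholds taken from the release notes above.
MIN_GROUP_SIZE = 10   # minimum samples per group
MIN_COHENS_D = 0.2    # numeric effect-size threshold
MAX_P_VALUE = 0.01    # tightened significance threshold


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (an assumption)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    if pooled_var == 0:
        return 0.0
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_diff / pooled_var ** 0.5


def should_flag(values_when_missing, values_when_present, p_value):
    """Flag a pattern only if both groups are large enough, the test is
    significant, and the effect size is non-trivial."""
    smallest = min(len(values_when_missing), len(values_when_present))
    if smallest < MIN_GROUP_SIZE:
        return False
    if p_value >= MAX_P_VALUE:
        return False
    d = cohens_d(values_when_missing, values_when_present)
    return abs(d) > MIN_COHENS_D


present = [float(x) for x in range(10)]
shifted = [x + 5.0 for x in present]   # large shift between groups
nudged = [x + 0.1 for x in present]    # negligible shift

flag_large = should_flag(shifted, present, p_value=0.001)   # flagged
flag_tiny = should_flag(nudged, present, p_value=0.001)     # filtered: weak effect
```

Requiring an effect-size threshold on top of significance is what suppresses the spurious warnings: with enough rows, even a negligible difference can yield a small p-value.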
Fixes
- Corrected correlation dictionary iteration (`analyzer.column_types`)
- Prevented spurious warnings by filtering weak associations
v0.1.0a0
First alpha release of HashPrep