An NLP-based system for extracting structured birth history information from clinical audio recordings and text documents. The project combines speech-to-text transcription with natural language processing to extract fields such as delivery mode, conception method, birth weight, and other birth history details.
- Quick Start Guide - Get up and running in 5 minutes
- API Documentation - Detailed API reference
- Docker Guide - Docker deployment instructions
- Contributing - How to contribute to the project
- Changelog - Version history and changes
- Examples - Working code examples
LazyFormFill automates the extraction of birth history data from medical narratives, significantly reducing manual data entry time for healthcare professionals. The system uses:
- Speech-to-Text: Faster Whisper model for accurate audio transcription
- NLP Extraction: spaCy and medSpaCy for medical entity recognition
- Pattern Matching: Regex-based extraction for specific medical fields
- Audio Transcription: Convert medical audio recordings to text using Faster Whisper
- Text Extraction: Extract birth history from clinical text documents
- Medical Field Detection: Automatically identify and extract:
  - Conception mode (Natural/Assisted)
  - Delivery mode (NVD/LSCS)
  - Term (Preterm/Term)
  - Birth weight (in kg)
  - Crying at birth (Yes/No)
  - Pedigree information
  - Consanguinity status
  - Antenatal history
  - Perinatal history
  - Postnatal complications
  - Breastfeeding duration
- Multiple Extraction Methods: Uses both spaCy and medSpaCy approaches for improved accuracy
- Docker Support: Easy deployment with Docker Compose
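The regex-based pattern matching mentioned above can be pictured with a small sketch. This is a hypothetical illustration, not the code in `birth_weight.py`: the function name `extract_birth_weight_kg`, the pattern, and the gram-to-kilogram conversion are all assumptions.

```python
import re
from typing import Optional

# Hypothetical pattern: a number followed by a kilogram or gram unit.
_WEIGHT_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kilograms?|kgs?|grams?|gms?|g)\b",
    re.IGNORECASE,
)


def extract_birth_weight_kg(text: str) -> Optional[float]:
    """Return the first weight found in `text`, in kilograms, or None."""
    match = _WEIGHT_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    # Units starting with "k" are kilograms; everything else is grams.
    if not unit.startswith("k"):
        value /= 1000.0
    return round(value, 3)
```

A sketch like this takes the first weight it sees, so a narrative that mentions the mother's weight before the baby's would need extra context checks.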
- Python 3.9+
- Faster Whisper: Speech-to-text transcription
- spaCy: Natural language processing
- medSpaCy: Medical-specific NLP extensions
- NumPy: Numerical computations
- word2number: Converts spelled-out numbers to digits
- Docker: Containerization support
- Python 3.9 or higher
- Docker and Docker Compose (optional, for containerized deployment)
- Sufficient disk space for Whisper models (~500MB for small model)
# Clone the repository
git clone https://github.com/AchuAshwath/lazyFormFill.git
cd lazyFormFill
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
uv pip install -e .

# Clone the repository
git clone https://github.com/AchuAshwath/lazyFormFill.git
cd lazyFormFill
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

# Download the English language model
python -m spacy download en_core_web_sm

# Clone the repository
git clone https://github.com/AchuAshwath/lazyFormFill.git
cd lazyFormFill/docker-files
# Build and run with Docker Compose
docker-compose up -d

from main import extract_birth_history_from_audio
# Path to your audio file (supports .m4a, .mp3, .wav)
audio_file = "path/to/medical_recording.m4a"
# Extract birth history
birth_history = extract_birth_history_from_audio(audio_file)
print(birth_history)
# Output: {
# 'conception_mode': 'Natural',
# 'delivery_mode': 'LSCS',
# 'term': 'Term',
# 'cried_at_birth': 'Yes',
# 'birth_weight': 3.2,
# 'pedigree': 'family history of diabetes',
# 'consanguinity': None,
# 'antenatal_history': 'gestational diabetes',
# 'perinatal_history': None,
# 'postnatal_complications': None,
# 'breastfed_upto': '6 months'
# }

from main import extract_birth_history_from_dataset
# Your clinical text
text = """
A 4-year-old child with a family history of diabetes.
The child was conceived naturally and delivered by caesarean section at 38 weeks.
The baby cried immediately after birth. Birth weight was 3.2 kg.
There were no postnatal complications, and breastfeeding continued for 4 months.
"""
# Extract birth history
birth_history = extract_birth_history_from_dataset(text)
print(birth_history)

# Run the extraction on a sample audio file
python main.py

lazyFormFill/
├── main.py                             # Main extraction pipeline
├── pyproject.toml                      # Project dependencies
├── uv.lock                             # Dependency lock file
├── .gitignore                          # Git ignore rules
├── .python-version                     # Python version specification
│
├── src/                                # Source code
│   ├── birth_history_extractor/        # Birth history extraction modules
│   │   └── birth_weight.py             # Birth weight extraction logic
│   │
│   └── radio_extractor/                # Radio field extraction modules
│       ├── config.py                   # Keyword mappings and configuration
│       ├── spacy_extractor.py          # spaCy-based extraction
│       └── medspacy_extractor.py       # medSpaCy-based extraction
│
├── data/                               # Data files
│   └── dataset.py                      # Sample dataset for testing
│
├── tests/                              # Test files
│   ├── test_extractor.py               # Extraction benchmarking tests
│   ├── test_birth_history_examples.py  # Example test cases
│   └── birthHistory_testCases.py       # Birth history test cases
│
├── docker-files/                       # Docker configuration
│   ├── docker-compose.yml              # Docker Compose configuration
│   └── entrypoint.sh                   # Docker entrypoint script
│
└── dev/                                # Development files
    ├── whisper.py                      # Whisper model experiments
    ├── seellama_infer.py               # LLM inference experiments
    └── medsapcy.ipynb                  # medSpaCy experiments notebook
Run the test suite to verify the extraction accuracy:
# Run all tests
python -m pytest tests/
# Run specific test
python tests/test_extractor.py
# Run benchmarking tests
python tests/test_extractor.py

The benchmarking tests compare the accuracy and speed of different extraction methods (pure spaCy vs. medSpaCy).
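To make the accuracy-and-speed comparison concrete, here is a minimal sketch of a benchmarking helper. It is not the code in `tests/test_extractor.py`: the `benchmark` function, the `Record` type, and the field-level accuracy metric are assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def benchmark(
    extractor: Callable[[str], Record],
    samples: List[Tuple[str, Record]],
) -> Tuple[float, float]:
    """Return (field-level accuracy, elapsed seconds) for one extractor."""
    correct = 0
    total = 0
    start = time.perf_counter()
    for text, expected in samples:
        predicted = extractor(text)
        # Score every expected field independently.
        for field, expected_value in expected.items():
            total += 1
            if predicted.get(field) == expected_value:
                correct += 1
    elapsed = time.perf_counter() - start
    accuracy = correct / total if total else 0.0
    return accuracy, elapsed
```

Running the same `samples` through a spaCy-based and a medSpaCy-based extractor would give two `(accuracy, seconds)` pairs that can be compared side by side.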
The system uses keyword mappings defined in src/radio_extractor/config.py. You can customize these mappings to:
- Add new synonyms for existing fields
- Add support for different medical terminologies
- Adjust extraction rules
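As a rough illustration of how a synonym map of this kind might be applied at extraction time, consider the sketch below. The `lookup_fields` function and `EXAMPLE_MAP` are hypothetical, and apart from `"natural"` and `"ivf"` the map entries are assumptions; the project's extractors may work differently.

```python
import re
from typing import Dict, Optional

# Illustrative map in the same shape as KEYWORD_MAP; most entries are assumptions.
EXAMPLE_MAP: Dict[str, Dict[str, str]] = {
    "conception_mode": {"natural": "Natural", "naturally": "Natural", "ivf": "Assisted"},
    "delivery_mode": {"caesarean": "LSCS", "lscs": "LSCS", "normal vaginal": "NVD"},
}


def lookup_fields(
    text: str, keyword_map: Dict[str, Dict[str, str]]
) -> Dict[str, Optional[str]]:
    """Map each field to the label of the first synonym found in `text`."""
    lowered = text.lower()
    result: Dict[str, Optional[str]] = {}
    for field, synonyms in keyword_map.items():
        result[field] = None
        for keyword, label in synonyms.items():
            # Whole-word match so short keywords do not fire inside longer words.
            if re.search(rf"\b{re.escape(keyword)}\b", lowered):
                result[field] = label
                break
    return result
```

With a lookup like this, adding a synonym is a one-line change to the map rather than a change to the extraction logic.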
Example:
KEYWORD_MAP = {
    "conception_mode": {
        "natural": "Natural",
        "ivf": "Assisted",
        # Add more synonyms...
    },
    # Add more field mappings...
}

You can change the Whisper model size in main.py:

model_size = "small"  # Options: "tiny", "base", "small", "medium", "large"

Larger models provide better accuracy but require more resources.
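For context, the sketch below shows one way a model size could be passed to Faster Whisper. It is not a copy of `main.py`: the `transcribe` and `join_segments` helpers are hypothetical, and `device="cpu"` with `compute_type="int8"` is an assumed CPU-only setup.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Join the `.text` of transcription segments into one string."""
    return " ".join(segment.text.strip() for segment in segments)


def transcribe(audio_path: str, model_size: str = "small") -> str:
    """Transcribe an audio file with Faster Whisper on CPU."""
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions; a GPU setup would differ.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return join_segments(segments)
```

Calling `transcribe("path/to/medical_recording.m4a", model_size="medium")` would trade speed and memory for accuracy, as described above.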
The project includes Docker support for easy deployment:
cd docker-files
docker-compose up -d

This will:
- Pull the Faster Whisper Docker image
- Mount your project directory
- Configure the Whisper model (default: small)
- Expose the service on port 10300
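Once the container is up, a quick reachability check can confirm that something is listening on the exposed port. The helper below is a generic TCP check, not part of the project; the name `is_port_open` is an assumption, and it says nothing about whether the service behind the port is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False
```

For a local deployment, `is_port_open("localhost", 10300)` would be the call to try, assuming the service is published on the same machine.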
Contributions are welcome! Please feel free to submit a Pull Request. For major changes:
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Faster Whisper for efficient speech-to-text
- spaCy for NLP capabilities
- medSpaCy for medical text processing
For questions or feedback, please open an issue on GitHub.
- Web API for easy integration
- Support for more audio formats
- Multi-language support
- Real-time streaming transcription
- Enhanced medical entity recognition
- Database integration for storing extracted data
- GUI for easier interaction
Note: This is a testing/development version. Please validate all extracted information before using in production medical environments.
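One small aid to that manual validation is to list the fields the extractor left empty. The helper below is a hypothetical sketch, not part of the project; it only flags missing values and does not check whether populated fields are correct.

```python
from typing import Any, Dict, List


def fields_needing_review(record: Dict[str, Any]) -> List[str]:
    """Return the names of fields whose value is None or an empty string."""
    return [
        name
        for name, value in record.items()
        if value is None or value == ""
    ]
```

Passing the dictionary returned by `extract_birth_history_from_audio` to this function would give a reviewer a short list of fields to fill in by hand.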