A Python project for extracting structured data (such as invoice fields) from unstructured PDF documents using LLM-based parsing. This repository includes sample PDFs and a processing pipeline that exports the extracted data to Excel.
- Extracts structured fields (customer name, address, invoice number, date, amount due) from PDFs.
- LLM-based workflow (Llama) for high-quality parsing.
- Uses a locally hosted open-source LLM to mitigate the data privacy concerns associated with cloud-based APIs.
- Saves extracted results into an Excel file (extracted_invoice_data.xlsx).
- Easily extensible for other document types or models.
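The features above can be illustrated with a minimal sketch of the prompt-and-parse step. This is not the code in PDFdataExtraction_ollama.py; it assumes the default local Ollama REST endpoint (`http://localhost:11434/api/generate`) and uses only the standard library. The field key names and the helper functions `build_prompt`, `parse_response`, and `extract_fields` are hypothetical, and PDF text extraction is assumed to have happened already.

```python
import json
import urllib.request

# Field names follow the feature list above; the exact key spelling is an assumption.
INVOICE_FIELDS = ["customer_name", "address", "invoice_number", "date", "amount_due"]

# Default local Ollama endpoint and the model tag used in the setup steps.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.2:1b-instruct-q8_0"


def build_prompt(document_text, fields=INVOICE_FIELDS):
    """Ask the model for a JSON object containing exactly the requested fields."""
    return (
        "Extract the following fields from the document and reply with a single "
        "JSON object using exactly these keys: " + ", ".join(fields) + ". "
        "Use null for any field that is not present.\n\n"
        "Document:\n" + document_text
    )


def parse_response(raw, fields=INVOICE_FIELDS):
    """Parse the model's JSON reply, keeping only the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Missing fields become None; unexpected keys are dropped.
    return {field: data.get(field) for field in fields}


def extract_fields(document_text, fields=INVOICE_FIELDS):
    """Send one document to the local Ollama server (the server must be running)."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(document_text, fields),
        "format": "json",   # ask Ollama to constrain the output to valid JSON
        "stream": False,    # return one complete response instead of chunks
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return parse_response(body.get("response", ""), fields)
```

Passing a different `fields` list is one way such a pipeline could be adapted to other document types; a small model may still return incomplete or malformed output, which is why `parse_response` falls back to `None` values.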
.
├── Sample PDFs/
├── PDFdataExtraction_ollama.py
├── util.py
├── extracted_invoice_data.xlsx
├── requirements.txt
├── .gitignore
└── README.md
Clone the repository:
git clone https://github.com/AnandKri/structured-data-extraction.git
cd structured-data-extraction
Ensure Ollama is installed and the required model is pulled:
ollama pull llama3.2:1b-instruct-q8_0
Create a virtual environment:
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
Install dependencies:
pip install -r requirements.txt
Run the extraction script:
python PDFdataExtraction_ollama.py
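If the script fails to connect, it may help to confirm that the Ollama server is running and that the model has been pulled. The following is a small optional check, not part of the repository; it assumes the default local address and queries Ollama's `/api/tags` endpoint, which lists locally available models. The function names `model_available` and `check_ollama` are hypothetical.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default local Ollama address
MODEL = "llama3.2:1b-instruct-q8_0"  # model tag from the setup steps


def model_available(tags_payload, model=MODEL):
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [entry.get("name", "") for entry in tags_payload.get("models", [])]
    return model in names


def check_ollama(url=OLLAMA_TAGS_URL, model=MODEL):
    """Return False if the server is unreachable or the model is not pulled."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers invalid JSON.
        return False
    return model_available(payload, model)
```

Calling `check_ollama()` from a Python shell returns `True` only when the server responds and the model tag is listed.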