AnandKri/structured-data-extraction

Structured Data Extraction

A Python project for extracting structured data (such as invoice fields) from unstructured PDF documents using LLM-based parsing. The repository includes sample PDFs and a processing pipeline that exports the extracted data to Excel.


Features

  • Extracts structured fields (customer name, address, invoice number, date, amount due) from PDFs.
  • Uses an LLM-based workflow (Llama 3.2, run through Ollama) to parse document text.
  • Runs a locally hosted open-source LLM, so documents are not sent to cloud-based APIs and the associated data privacy concerns are reduced.
  • Saves extracted results to an Excel file (extracted_invoice_data.xlsx).
  • Easily extensible to other document types or models.
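
As an illustration of the field-extraction step, here is a minimal sketch of how a model reply could be turned into one record holding the five fields listed above. It is not the code in PDFdataExtraction_ollama.py or util.py: the `InvoiceRecord` class, the `parse_llm_reply` function, and the snake_case field names are all hypothetical, and the sketch assumes the model has been asked to answer with a JSON object.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical field names, chosen to mirror the fields listed in Features.
FIELDS = ("customer_name", "address", "invoice_number", "date", "amount_due")


@dataclass
class InvoiceRecord:
    """One extracted invoice; a field stays None when the model did not return it."""
    customer_name: Optional[str] = None
    address: Optional[str] = None
    invoice_number: Optional[str] = None
    date: Optional[str] = None
    amount_due: Optional[str] = None


def parse_llm_reply(reply: str) -> InvoiceRecord:
    """Pull a JSON object out of a model reply, tolerating text around it."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return InvoiceRecord()  # no JSON object found
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return InvoiceRecord()  # malformed JSON: return an empty record
    if not isinstance(data, dict):
        return InvoiceRecord()
    values = {}
    for name in FIELDS:
        value = data.get(name)
        # Normalise every present value to a stripped string.
        values[name] = None if value is None else str(value).strip()
    return InvoiceRecord(**values)


if __name__ == "__main__":
    reply = 'Here you go: {"customer_name": "Acme Ltd", "invoice_number": 1042}'
    print(asdict(parse_llm_reply(reply)))
```

Keeping missing fields as None rather than raising makes it possible to write a partially filled row and review it later. A list of `asdict(record)` dictionaries is a convenient shape for a spreadsheet writer, for example pandas, if that is among the project's dependencies (this README does not list them).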

Project Structure

.
├── Sample PDFs/
├── PDFdataExtraction_ollama.py
├── util.py
├── extracted_invoice_data.xlsx
├── requirements.txt
├── .gitignore
└── README.md


Installation

Clone the repository:

git clone https://github.com/AnandKri/structured-data-extraction.git
cd structured-data-extraction

Ensure Ollama is installed and the required model is pulled:

ollama pull llama3.2:1b-instruct-q8_0

Create a virtual environment:

python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows

Install dependencies:

pip install -r requirements.txt

Usage

Run the extraction script:

python PDFdataExtraction_ollama.py
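
For readers curious about what a call to the local model might look like, the following is a rough sketch that talks to Ollama's HTTP API using only the standard library. It is an assumption about the general approach, not a copy of the script: the `build_request` and `ask_model` names and the prompt wording are hypothetical, the script itself may use a different client, and the sketch assumes Ollama is listening on its default local port (11434).

```python
import json
import urllib.request

# Assumes Ollama's default local endpoint; adjust if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"
# The model pulled in the Installation section.
MODEL = "llama3.2:1b-instruct-q8_0"


def build_request(invoice_text: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    # Hypothetical prompt; the real script's wording is not shown in this README.
    prompt = (
        "Extract customer_name, address, invoice_number, date and amount_due "
        "from the invoice text below. Answer with a single JSON object.\n\n"
        + invoice_text
    )
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,   # return one complete JSON response
        "format": "json",  # ask Ollama to constrain the output to JSON
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_model(invoice_text: str, timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(invoice_text), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

Because the request goes to localhost, the invoice text never leaves the machine, which is the privacy property the Features section describes. `ask_model` only works while the Ollama server is running and the model has been pulled.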
