Structured Data Extraction

A Python project for extracting structured data (such as invoice fields) from unstructured PDF documents using LLM-based parsing. The repository includes sample PDFs and a processing pipeline that exports the extracted data to Excel.

Features

Extracts structured fields (customer name, address, invoice number, date, amount due) from PDFs.
Uses an LLM-based workflow (Llama 3.2, served through Ollama) for parsing.
Runs a locally hosted open-source LLM to mitigate the data privacy concerns associated with cloud-based APIs.
Saves extracted results to an Excel file (extracted_invoice_data.xlsx).
Easily extensible to other document types or models.
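
To make the field list above concrete, here is a minimal sketch of how a prompt could be built and a model reply turned into those fields. It is not taken from PDFdataExtraction_ollama.py or util.py: the JSON key names, `build_prompt`, and `parse_response` are all hypothetical, and the sketch assumes the model is asked to answer with a JSON object.

```python
import json

# Key names are an assumption derived from the Features list, not from the repository code.
FIELDS = ["customer_name", "address", "invoice_number", "date", "amount_due"]


def build_prompt(document_text: str) -> str:
    """Ask the model for one JSON object holding exactly the expected keys."""
    keys = ", ".join(FIELDS)
    return (
        "Extract the following fields from the invoice text and reply with "
        f"a single JSON object using exactly these keys: {keys}. "
        "Use null for any field that is missing.\n\n"
        f"Invoice text:\n{document_text}"
    )


def parse_response(raw: str) -> dict:
    """Parse the text between the first '{' and the last '}' of a model reply."""
    empty = {field: None for field in FIELDS}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty  # no JSON object in the reply
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return empty  # malformed JSON: report every field as missing
    # Keep only the expected keys; absent ones become None.
    return {field: data.get(field) for field in FIELDS}
```

Restricting the output to a fixed key list keeps every row the same shape, which matters once the results are written to a spreadsheet. Supporting another document type would then mostly mean swapping `FIELDS` and the prompt wording.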

Project Structure

.
├── Sample PDFs/
├── PDFdataExtraction_ollama.py
├── util.py
├── extracted_invoice_data.xlsx
├── requirements.txt
├── .gitignore
└── README.md

Installation

Clone the repository:

git clone https://github.com/AnandKri/structured-data-extraction.git
cd structured-data-extraction

Ensure Ollama is installed and the required model is pulled:

ollama pull llama3.2:1b-instruct-q8_0
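
If you want to confirm from Python that the pull succeeded, a small check along these lines may help. It is a sketch, not part of this repository: it assumes Ollama is serving on its default local address (http://localhost:11434) and that its /api/tags endpoint returns a JSON object with a "models" list whose entries carry a "name". The helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "llama3.2:1b-instruct-q8_0"


def model_is_listed(tags_payload: dict, model: str = MODEL) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [entry.get("name") for entry in tags_payload.get("models", [])]
    return model in names


def fetch_tags(host: str = "http://localhost:11434") -> dict:
    """Query the local Ollama server for its model list (host is an assumed default)."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        found = model_is_listed(fetch_tags())
        print("model available" if found else f"run: ollama pull {MODEL}")
    except OSError as exc:
        # Connection refused usually means the Ollama server is not running.
        print(f"could not reach Ollama: {exc}")
```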

Create a virtual environment:

python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows

Install dependencies:

pip install -r requirements.txt

Usage

Run the extraction script:

python PDFdataExtraction_ollama.py
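
The export step can be pictured roughly as follows. This is a sketch under assumptions, not the script's actual code: the column names and the `to_rows` and `save_to_excel` helpers are hypothetical, and it assumes pandas and openpyxl are among the installed dependencies.

```python
from pathlib import Path

# Column names are illustrative; the real spreadsheet may use different headers.
COLUMNS = ["source_file", "customer_name", "address",
           "invoice_number", "date", "amount_due"]


def to_rows(results: dict) -> list:
    """Turn {pdf_name: extracted_fields} into rows with a fixed column order."""
    rows = []
    for pdf_name in sorted(results):
        fields = results[pdf_name]
        row = {"source_file": pdf_name}
        for column in COLUMNS[1:]:
            row[column] = fields.get(column)  # missing fields become None
        rows.append(row)
    return rows


def save_to_excel(rows: list, path: str = "extracted_invoice_data.xlsx") -> Path:
    """Write the rows to an .xlsx file, one row per PDF."""
    import pandas as pd  # imported lazily; writing .xlsx also needs openpyxl

    pd.DataFrame(rows, columns=COLUMNS).to_excel(path, index=False)
    return Path(path)
```

With helpers like these, `save_to_excel(to_rows(results))` would write one row per PDF, with empty cells wherever the model returned no value.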
