Commit 3296458

first pass at structured data extraction article
1 parent ba87ccb commit 3296458

File tree

5 files changed (+140, -0 lines)

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -21,3 +21,5 @@ boc-*.tar
 
 # Temporary files, for example, from tests.
 /tmp/
+
+.DS_Store
```
New file: 92 additions & 0 deletions
---
title: "How I Use LLMs to Extract Structured Data"
publication_date: "2025-09-12"
tags: ["llm", "ai", "ocr", "document parsing", "structured data extraction"]
---
# How I Use LLMs to Extract Structured Data
Large language models (aka AI) are a hot topic right now. Some consider LLMs the greatest thing that ever existed, while others think they are the doom of society as we know it and everything will go downhill from here. Bold statements, neither packing a lot of good sense. It seems outrageous just to write them down, but I've seen some crazy things out there, kids. Crazy things! Let's focus on what we know: it's pretty incredible how far we've come and the things we can do with these generative models.
One of the things I learned they do well is understanding text and transforming it into a structured format that can be used by our beloved software systems. I'll show you how to extract text from PDF documents, even documents with many different formats and layouts, and turn it into JSON.
Think about a PDF document: it could be the receipt you got this morning from your preferred coffee shop, an invoice for your Netflix subscription, or a quote for repairing your car. You probably see a few of these every day; now imagine a large company, which handles hundreds, if not thousands, of them every day.
These documents come in many shapes and forms: PDFs, sometimes images, maybe embedded in emails, or maybe spreadsheets. That's a lot of variety, but if you can turn a document into text, there is a big chance you can use an LLM to turn it into a structured format.
There are out-of-the-box solutions out there, but for the expense documents I've worked with, the LLM approach worked better, even more so when the document layout is uncommon.
There are many ways to extract data with LLMs depending on the model provider you are using. Here I'll share what worked for me on a relatively low budget and how you can accomplish something similar. It's a simple two-step approach: first turn your document into plain text, then feed that text to an LLM along with instructions describing the format you want back.
Let's walk through the first step...
## Text Extraction
This step consists of, as the title says, extracting content from the documents. Since LLMs understand text, you need to figure out a way of transforming your PDF, image, or spreadsheet into text so that the model can process it.
Depending on the model you are using, this step can be skipped: most current models accept images or documents directly, so you don't even have to extract the text; the model will do that for you. This was definitely not the case when I started using this approach for data extraction.
Even when the model lets you just send a file, handling the extraction yourself can still be useful if you want to massage the data before you send it, to avoid noise. Depending on your business needs it might be required; if not, just skip it.
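As an illustration, here's a minimal Python sketch of that kind of pre-cleaning. The specific cleanup rules (dropping blank lines and page-number lines, collapsing whitespace runs) are hypothetical examples of "massaging"; your own noise will depend on your documents:

```python
import re

def clean_extracted_text(text: str) -> str:
    """Trim common noise from extracted text before sending it to the LLM."""
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop empty lines and page-number-only lines (typical PDF extraction noise).
    lines = [
        l for l in lines
        if l.strip() and not re.fullmatch(r"Page \d+( of \d+)?", l.strip())
    ]
    # Collapse long runs of spaces/tabs that layout-preserving extraction leaves behind.
    return "\n".join(re.sub(r"[ \t]{2,}", "  ", l) for l in lines)
```

Fewer junk tokens means a cheaper prompt and less for the model to get distracted by.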
The common practice is to use OCR (Optical Character Recognition) for text extraction. OCR is the technique where an algorithm "looks" at an image and, based on the pixel arrangement, figures out which letters are present and where. There are many well-known paid services and open-source tools you can use for that; I'll leave a list of popular choices at the end.
The OCR route is generally more reliable, and it's required if you are dealing with images or scanned PDFs. For other loosely structured formats like HTML and most text-based PDFs, we can use simple tools like [`pdftotext`](https://www.xpdfreader.com/pdftotext-man.html) to get what we need, and for simplicity's sake that's what we'll do.
```bash
pdftotext -layout file_name.pdf -
```
The command above is all you need to extract text from a regular PDF. There are also options to extract only parts of the document or to save the output to a text file on disk. Since we are running it from Elixir, I want the whole extracted text spit out to stdout (the trailing `-`) so I can send it along to the LLM.
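Shelling out looks much the same in any language; here's a hedged Python sketch wrapping that command (it assumes `pdftotext` is on your PATH):

```python
import subprocess

def pdftotext_args(path: str) -> list[str]:
    # -layout preserves the visual layout; the trailing "-" writes to stdout.
    return ["pdftotext", "-layout", path, "-"]

def pdf_to_text(path: str) -> str:
    # check=True raises CalledProcessError if pdftotext fails
    # (missing file, password-protected PDF, and so on).
    result = subprocess.run(
        pdftotext_args(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

In Elixir the equivalent would be roughly `System.cmd("pdftotext", ["-layout", path, "-"])`.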
Once we have the text, it's time for the transformation prompt...
## Transformation Prompt
This step consists of getting the extracted text into an LLM prompt so that the chosen model can perform the transformation.
Here is a simplified example of a system prompt for extracting basic data from an invoice.
````
Extract invoice data and return ONLY valid JSON in this exact format:

```json
{
  "total": 0.00,
  "items": [
    {
      "item_id": "string or null",
      "description": "string",
      "price": 0.00
    }
  ]
}
```

Rules:
- Extract the invoice total and all line items
- Use null if item_id is not present
- Format all amounts with 2 decimal places
- Return ONLY the JSON object - no explanations or markdown
````
As you can see in the example, we have to instruct the model on what to extract. It's sometimes effective to describe what the value you are trying to extract looks like and what it means; this triggers the LLM's "neurons" to make connections to the data point you want and return better results. The prompt will vary depending on the model, so test it out and see what works best for you. Take a look at this [example](/public/articles/structured-data-extraction-llms/prompt.txt) for a more comprehensive extraction prompt.
In the old days (six months ago) you would need to explicitly ask for JSON format in the prompt, sometimes more than once. Currently most APIs have an option to explicitly ask for JSON output. OpenAI, for example, has the `response_format` option: pass `{"type": "json_object"}` and it will force the model to skip the explanations and just return JSON.
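As a sketch, this is roughly what a Chat Completions request body with that option looks like. The model name and the prompt wiring here are illustrative (the actual HTTP call and API key handling are omitted):

```python
def build_extraction_request(system_prompt: str, document_text: str) -> dict:
    """Assemble an OpenAI Chat Completions request body for extraction."""
    return {
        "model": "gpt-4.1-mini",  # illustrative; use whatever model you have access to
        "response_format": {"type": "json_object"},  # force JSON-only output
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": document_text},
        ],
    }

# The reply's message content is then plain JSON you can parse directly, e.g.:
# data = json.loads(response["choices"][0]["message"]["content"])
```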
This extraction technique can work well with smaller models like Claude Sonnet and GPT-4.1 mini. I've even had some success with small local models like Qwen2.5 and Phi-4 running on Ollama.
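Whichever model you use, it pays to validate the returned JSON against the shape you asked for before trusting it. A minimal sketch (the specific checks are my own illustration, matching the simplified prompt above):

```python
import json

def parse_invoice_json(raw: str) -> dict:
    """Parse and sanity-check the model's reply against the expected shape."""
    data = json.loads(raw)  # raises ValueError if the model returned non-JSON
    if not isinstance(data.get("total"), (int, float)):
        raise ValueError("missing or non-numeric 'total'")
    items = data.get("items")
    if not isinstance(items, list):
        raise ValueError("'items' must be a list")
    for item in items:
        if not isinstance(item.get("description"), str):
            raise ValueError("each item needs a string 'description'")
    return data
```

On a validation failure you can retry the request or route the document to a human, depending on your needs.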
## Closing
This extraction technique can do wonders when dealing with data in multiple formats. I've used it in a few cases: ingesting orders from pictures of receipts, parsing inbound quotes in a procurement system, and reading parts manuals to make the parts data available in the system. I know some ATS (applicant tracking) systems use this same technique to extract candidate data from resumes.
This technique works surprisingly well and keeps improving as the models get better.
Check out this [example with Elixir](https://gist.github.com/robsonperassoli/cc1ed743a99132f1b42ee5dbfd0b05a9) and OpenAI. I've tested it with a couple of sample invoices; the invoice files are linked in the gist comments for you to try it out. Feel free to change the prompt to extract the invoice number, item quantities, and so on if you feel like it.
### List of OCR tools

- Google Vision AI
- AWS Textract Document Analysis
- [DocTR](https://github.com/mindee/doctr)
- Tesseract
- EasyOCR
- RapidOCR
- olmOCR
2 binary files not shown.
New file: 46 additions & 0 deletions
# Invoice Data Extraction System Prompt

You are an invoice data extraction assistant. Your task is to analyze invoice files and extract key information, returning ONLY valid JSON in the specified format.

## Instructions:
1. Extract all relevant data from the provided invoice
2. Return ONLY valid JSON - no explanations, comments, or additional text
3. Ensure all monetary values are formatted as decimals with two decimal places
4. If an item ID is not present, use null for item_id
5. Extract the complete item description as written on the invoice
6. Calculate or extract the total amount for the entire invoice

## Required JSON Format:
```json
{
  "total": 0.00,
  "items": [
    {
      "item_id": "string or null",
      "description": "string",
      "price": 0.00
    }
  ]
}
```

## Field Definitions:
- **total**: The grand total amount of the invoice (number with 2 decimal places)
- **items**: Array of all line items on the invoice
- **item_id**: Product code, SKU, or item number if available (string or null)
- **description**: Full description of the item/service (string)
- **price**: Unit price or line total for the item (number with 2 decimal places)

## Extraction Rules:
- Extract ALL items listed on the invoice
- If quantity and unit price are separate, use the line total as the price
- Include service charges, fees, or additional costs as separate items
- Do not include tax as a separate item unless it's itemized as a service
- The total should match the invoice's final amount including all taxes and fees
- Preserve the original description text without abbreviation

## Error Handling:
- If a field cannot be determined, use null for item_id, empty string for description, or 0.00 for numeric values
- If the document is not an invoice or is unreadable, return: {"total": 0.00, "items": []}

Remember: Output ONLY the JSON object. No markdown formatting, no code blocks, no explanations.
