Abstract

As AI developers, we're always looking for ways to make data more accessible and queryable through natural language. While Retrieval-Augmented Generation (RAG) has revolutionized how we interact with unstructired textual documents, it falls short when dealing with structured data. The RAG approach is so powerful that users or even early stage AI developers may fall in the illusion that it can be applied to any kind of data, including structured data like Excel files. However, this is a misconception that can lead to frustration and inefficiency. One of most ubiquitous kind of file asset across all organization is the Excel file format, which could also be considered as structured or "semi-structured" at least. Anyone who has tryed to process an Excel file using the standard Rag approach, quickly realized there is no real value with processing excel files the same way as PDFs.

I've built a system that combines some prompting techniques to create a powerful Excel analysis tool based on SQL.

The Solution: LLM-Powered Excel-to-SQL Pipeline

Instead of trying to force RAG into a structured data world, I've built a system that embraces the structured nature of Excel files and uses LLMs to convert the excel data into a SQL database schema. This allows us to leverage the power of SQL for querying and analyzing the data.

Here's the architecture:

graph TD;
    A[Excel File Upload] --> B[LLM Metadata Analysis];
    B --> C[Column Type Detection];
    C --> D[SQL Schema Generation];
    D --> E[Data Insertion];
    E --> F[Ready for Queries];
    
    G[Natural Language Query] --> H[LLM SQL Generation];
    H --> I[Query Execution];
    I --> J[Results & Visualization];
    
    F --> H;
    
    style A fill:#e1f5fe;
    style F fill:#e8f5e8;
    style J fill:#fff3e0;

As for the DB, I used SQLite for simplicity, but this architecture can be adapted to any SQL database. As for the LLM, I used OpenAI's gpt-4.1-mini, but you can use any comparable LLM.

System Components

The pipeline consists of four main components:

1. Metadata Analyzer

Uses an LLM to analyze sheet names and column headers, inferring the purpose and structure of the data:

# Example LLM prompt for metadata analysis
"""
Analyze the following Excel metadata:
Sheet Name: Portfolio_Holdings
Columns: Ticker, Company_Name, Market_Value, Weight_Percent, Sector

Based on column names, suggest:
- Appropriate table name
- Data description
- Primary key candidates
- Data category
"""

2. Type Detection Engine

Combines LLM analysis with statistical sampling to determine the correct SQL data types:

graph LR
    A[Sample Data] --> B[LLM Analysis]
    A --> C[Statistical Analysis]
    B --> D[Final Type Decision]
    C --> D
    D --> E[SQL Schema]

At the end of this process, the system generates a SQL schema that accurately represents the data types and relationships in the Excel file and executes it to create the table in the database.

3. SQL Generator

Converts natural language questions into SQL queries using the database schema as context:

sql_prompt = PromptTemplate(
    input_variables=["question", "schema", "sample_data"],
    template="""
    Generate an SQL query to answer the following question:
    
    Question: {question}
    
    Database schema:
    {schema}
    
    Sample data:
    {sample_data}
    
    Generate ONLY the SQL query without any additional explanations.
    Use standard SQLite syntax.
    """
)

4. Query Executor

Executes the generated SQL and formats results for presentation.

Real-World Example: ETF Portfolio Analysis

Let me walk you through a concrete example using an XTrackers ETF holdings composition as example file. The Excel file is pretty simple and contains the breakdown of the "Xtrackers MSCI World ex USA UCITS" ETF, with related underlying stocks, their market value, weight in the portfolio, and sector classification.

Input Excel File Structure

ID	Name	ISIN	Country	Currency	Exchange	Type of Security	Rating	Industry Classification	Weighting
1	SAP	DE0007164600	Germany	EUR	XETRA	Equity	-	Information Technology	1.47%
2	ASML HOLDING NV	NL0010273215	Netherlands	EUR	Euronext Amsterdam	Equity	Baa2	Information Technology	1.46%
3	NESTLE SA	CH0038863350	Switzerland	CHF	Scoach Switzerland	Equity	Aa2	Consumer Staples	1.22%
4	NOVARTIS AG	CH0012005267	Switzerland	CHF	Scoach Switzerland	Equity	Aa3	Health Care	1.08%
5	ROCHE HOLDING PAR AG	CH0012032048	Switzerland	CHF	Scoach Switzerland	Equity	A1	Health Care	1.06%
6	...	.....	....	...	...	......	..	....	...%

System Processing Steps

Metadata Analysis
- LLM identifies the dataset as portfolio holdings data
- Suggests table name: securities_list
- Identifies ID as primary key candidate
Type Detection
- ID: NUMBER (sequence number)
- Name: TEXT (Company Name)
- ISIN: TEXT (Security Identifier)
- Country: TEXT (Country of origin)
- Currency: TEXT (Currency of the security)
- Exchange: TEXT (Trading exchange)
- Type of Security: TEXT (e.g., Equity, Bond)
- Rating: TEXT (Credit rating)
- Industry Classification: TEXT (Sector classification)
- Weighting: REAL (Percentage weight in the portfolio)
SQL Schema Generation Automatically generated DDL for the table:

   CREATE TABLE securities_list (
    id INTEGER NOT NULL,
    name TEXT NOT NULL,
    isin TEXT NOT NULL,
    country TEXT,
    currency TEXT NOT NULL,
    exchange TEXT,
    type_of_security TEXT NOT NULL,
    rating TEXT,
    industry_classification TEXT,
    weighting REAL
);

Data Insertion Here, the LLM generates automatically the SQL INSERT statements to populate the table with data from the Excel file:
- Handles format conversion (B for billions, % for percentages)
- Validates data integrity
- Inserts all holdings records

Query Examples

Once processed, users can ask natural language questions:

Let's start with a straightforward question:

Query: "How many rows are there in total?"

Generated SQL

SELECT COUNT(*) as N FROM securities_list;

Output

N
796

Ok, now le'ts see a more complex query that requires aggregation and understanding of the data structure:

Query: "Can you show me the weight of the portfolio for each Country and Sector?"

Generated SQL

SELECT country, industry_classification AS sector, SUM(weighting) AS total_weight
FROM securities_list
GROUP BY country, industry_classification;

Output

country	industry_classification	total_weight
-	unknown	0.00349444
Australia	Communication Services	0.00139149
Australia	Consumer Discretionary	0.00439915
Australia	Consumer Staples	0.00200737
Australia	Energy	0.00214571
Australia	Financials	0.0250382
Australia	Health Care	0.00195356
Australia	Industrials	0.00295084
Australia	Information Technology	0.000675708
....	....	....

Now, let's take it a step further and apply some where conditions:

Query: "Show me the top 5 Non-European holdings by weight"

Generated SQL

SELECT name, country, weighting
FROM securities_list
WHERE country NOT IN ('Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom')
ORDER BY weighting DESC
LIMIT 5;

Output

name	Country	weighting
COMMONWEALTH BANK OF AUSTRALIA	Australia	0.00917773
ROYAL BANK OF CANADA	Canada	0.00869856
TOYOTA MOTOR CORP	Japan	0.00762924
MITSUBISHI UFJ FINANCIAL GROUP INC	Japan	0.00730266
SONY GROUP CORP	Japan	0.00721053

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
E0006WW1TQ4.xlsx		E0006WW1TQ4.xlsx
ExcelAnalyzer.py		ExcelAnalyzer.py
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Abstract

The Solution: LLM-Powered Excel-to-SQL Pipeline

System Components

1. Metadata Analyzer

2. Type Detection Engine

3. SQL Generator

4. Query Executor

Real-World Example: ETF Portfolio Analysis

Input Excel File Structure

System Processing Steps

Query Examples

Query: "How many rows are there in total?"

Generated SQL

Output

Query: "Can you show me the weight of the portfolio for each Country and Sector?"

Generated SQL

Output

Query: "Show me the top 5 Non-European holdings by weight"

Generated SQL

Output

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Abstract

The Solution: LLM-Powered Excel-to-SQL Pipeline

System Components

1. Metadata Analyzer

2. Type Detection Engine

3. SQL Generator

4. Query Executor

Real-World Example: ETF Portfolio Analysis

Input Excel File Structure

System Processing Steps

Query Examples

Query: "How many rows are there in total?"

Generated SQL

Output

Query: "Can you show me the weight of the portfolio for each Country and Sector?"

Generated SQL

Output

Query: "Show me the top 5 Non-European holdings by weight"

Generated SQL

Output

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages