This project demonstrates how to build a lightweight AI-assisted data analysis pipeline using Python and the OpenAI API.
Instead of sending the full dataset to the AI model, the system generates a structured metadata summary (schema, statistics, missing values, etc.) and sends only that summary for analysis. This approach:
- Reduces token usage
- Improves performance
- Minimizes unnecessary data exposure
- Maintains scalability
The project simulates an AI-powered data analyst that reviews dataset structure and produces business insights.
Source: Kaggle
Dataset: Synthetic Mobile Sales 2025
The dataset contains simulated mobile device sales data, including:
- Product information
- Sales quantities
- Revenue
- Transaction dates
- Regions
- Additional transactional attributes
The workflow follows this structure:
- Install dependencies
- Download dataset from Kaggle
- Load dataset into Pandas
- Generate automated metadata summary
- Send structured summary to OpenAI
- Receive AI-generated business insights
Only summarized metadata is transmitted to the model.
Install the required dependencies (the extras specifier is quoted so shells like zsh do not expand the brackets):

```shell
pip install "kagglehub[pandas-datasets]" openai pandas numpy
```

Import the required libraries:
```python
import os
import json
import pandas as pd
import numpy as np
import kagglehub
from getpass import getpass
from openai import OpenAI
```

Securely load your OpenAI API key:
```python
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = OpenAI()
```

Important: never hard-code API keys into a public repository.
Programmatically download the dataset:

```python
path = kagglehub.dataset_download("syedaeman2212/mobile-sales-data")
```

Load the CSV file into a Pandas DataFrame:

```python
df = pd.read_csv(os.path.join(path, "synthetic_mobile_sales_2025.csv"))
```

Preview the dataset:

```python
df.head()
df.info()
```

Instead of sending raw records, build a structured metadata summary:
```python
summary = {
    "shape": df.shape,
    "columns": df.columns.tolist(),
    "dtypes": df.dtypes.astype(str).to_dict(),
    "missing": df.isnull().sum().to_dict(),
    "numeric_summary": df.describe().to_dict(),
}
```

Convert the summary to JSON for API transmission. The `default=str` argument is needed because NumPy scalars (such as the `int64` missing-value counts) are not natively JSON-serializable:

```python
summary_json = json.dumps(summary, indent=2, default=str)
```

This ensures:
- Reduced token usage
- Faster API responses
- Improved security
- Scalability
Construct a prompt that instructs the AI to analyze the dataset summary:

```python
prompt = f"""
You are a senior data analyst. Based on the following dataset metadata, provide:
1. Key insights
2. Observed trends
3. Data quality concerns
4. Potential business recommendations
Dataset Summary:
{summary_json}
"""
```

Send the request to the API:
```python
response = client.responses.create(
    model="gpt-4.1-mini",
    input=prompt,
)
analysis_output = response.output_text
print(analysis_output)
```

The AI returns:
- Business insights
- Identified patterns
- Anomaly detection suggestions
- Data quality observations
- Strategic recommendations
This simulates an AI-powered analytical review.
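The prompt construction and API call can be combined into a reusable helper. This is a minimal sketch, not part of the original code: the names `build_prompt` and `analyze_metadata` are hypothetical, and the client is injected as a parameter so the functions work with `openai.OpenAI()` or any stub exposing the same `responses.create(...)` interface:

```python
PROMPT_TEMPLATE = """\
You are a senior data analyst. Based on the following dataset metadata, provide:
1. Key insights
2. Observed trends
3. Data quality concerns
4. Potential business recommendations
Dataset Summary:
{summary_json}
"""


def build_prompt(summary_json: str) -> str:
    """Fill the analyst prompt with the serialized metadata summary."""
    return PROMPT_TEMPLATE.format(summary_json=summary_json)


def analyze_metadata(client, summary_json: str, model: str = "gpt-4.1-mini") -> str:
    """Send the metadata summary to the model and return its text output."""
    response = client.responses.create(model=model, input=build_prompt(summary_json))
    return response.output_text
```

Injecting the client also makes the pipeline unit-testable without real API calls, which matters once this runs in production.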
Sending only metadata rather than full datasets:
- Reduces API token costs
- Prevents unnecessary data sharing
- Improves performance
- Maintains governance standards
- Enables production scalability
This architecture is well-suited for enterprise AI analytics pipelines.
The project can be expanded to include:
- Automated visualizations (Matplotlib, Seaborn, Plotly)
- Correlation matrix analysis
- Outlier detection
- Feature engineering
- Machine learning model training
- Automated report generation (PDF or HTML)
- Interactive dashboard (Streamlit)
- User-uploaded dataset support
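One of the listed extensions, outlier detection, could be sketched with Tukey's IQR fences. The helper name and the sample revenue values are illustrative, not from the project:

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical revenue column with one obvious outlier.
revenue = pd.Series([100, 120, 110, 105, 115, 5000])
print(revenue[iqr_outliers(revenue)].tolist())  # → [5000]
```

Flagged rows could then be reported in the metadata summary (e.g. an outlier count per numeric column) so the model sees data-quality signals without seeing the raw records.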
Key concepts demonstrated by the project:
- Data ingestion and preprocessing
- Automated exploratory data analysis
- Structured metadata engineering
- API integration with OpenAI
- Efficient token management
- AI-assisted analytics pipeline design
- Reproducible Python workflows
For real-world deployment:
- Add environment variable management
- Implement error handling
- Add structured logging
- Implement API rate limit handling
- Containerize with Docker
- Deploy via a cloud platform (AWS, Azure, GCP)
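Rate-limit handling is typically implemented as retries with exponential backoff. A minimal, dependency-free sketch follows; in production the `retry_on` tuple would be narrowed to transient errors such as `openai.RateLimitError` rather than the bare `Exception` used here:

```python
import random
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying on retry_on with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base, 2x, 4x, ... with proportional jitter
            # so concurrent clients do not retry in lockstep.
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))


# Usage (hypothetical):
# with_retries(lambda: client.responses.create(model="gpt-4.1-mini", input=prompt))
```

Keeping the retry policy in one wrapper also gives structured logging a single place to record attempt counts and delays.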
This project demonstrates how to build an AI-assisted analytics system capable of reviewing datasets and generating actionable insights without exposing raw data.
It is suitable for:
- AI-driven reporting systems
- Internal analytics assistants
- Enterprise data governance workflows
- Portfolio demonstration of applied AI integration