Quaero
Data analysts spend a large amount of time preparing data before analysis.
Typical problems:
- messy CSV exports
- inconsistent schemas
- missing documentation
- manual data cleaning
- lack of reproducible pipelines
As a result:
- analytics work becomes slow
- dashboards are difficult to reproduce
- KPI definitions vary across projects
Create a system that converts raw datasets into clean analytics-ready datasets, analytical marts, and project-level analysis outputs while assisting analysts in identifying meaningful KPIs.
The current interaction model is question-driven: the user should be able to run the pipeline with only a dataset, question, source name, and project root, while the system infers the metric column, dimension column, and aggregation intent automatically.
The framework also supports config-driven project execution, where a project may contain one dataset or many datasets under a shared project folder.
In config-driven mode, the top-level project question remains part of the core interaction model. It guides the primary analytical narrative while the framework still generates its generic mart outputs.
This interaction must remain available through:
- the CLI
- a lightweight local UI
- an importable Python entrypoint for app integration
Deterministic inference remains the default behavior. An optional LLM fallback may assist ambiguous inference, but it must remain opt-in, non-breaking, schema-validated, and secondary to the deterministic path.
Primary users:
- Data Analysts
- BI Developers
- Independent consultants
- Open data researchers
The platform helps analysts move from:
Raw messy data
→
Clean structured dataset
→
Suggested KPIs
→
Dashboard-ready analytical marts
Supported sources:
- CSV upload
- Parquet upload
- direct CSV URL
- direct Parquet URL
- config-driven
project_config.json
Automatic generation of:
- schema
- null analysis
- distribution statistics
- potential identifiers
Standardization tasks:
- column normalization
- type inference
- date parsing
- duplicate detection
- missing value handling
Different pipeline types based on the business question.
Examples:
Trend Analysis
Benchmark Comparison
Release Impact Analysis
Cohort Analysis
Provider Performance
The framework supports both:
- single-dataset projects using
raw/ -> staging/ -> marts/ - multi-dataset projects using
raw/ -> staging/ -> integrated/ -> marts/when integration is required
AI analyzes:
- column names
- dataset schema
- user question
Outputs:
- primary KPI
- secondary KPIs
- suggested dimensions
- suggested filters
Outputs ready for dashboards and BI tools.
Deliverables:
- analytical marts
- KPI notes
- profiling and quality metadata
- project README documentation
- question-aware dashboard suggestions
- final summary metadata, including dataset coverage when applicable
Analytical marts are the primary output layer for BI tools.
First version should include:
- CSV ingestion
- dataset profiling
- cleaning pipeline
- mart builder
- KPI suggestion
- dashboard-ready analytical marts
Preferred usage:
python -m app.main <dataset> --question "..." --source-name "..." --project-root .The pipeline performs deterministic semantic inference for:
- metric column
- dimension column
- aggregation intent
Optional manual overrides remain available through:
--metric-column--dimension-column
Config-driven project usage:
python -m app.main --config projects/<project_name>/project_config.json --project-root .Config mode supports one or more datasets in the same project.
For config-driven projects, the top-level question must be propagated into:
final_summary_report.json- the generated project README
- dashboard suggestions metadata
The project question should influence:
- primary mart selection
- README narrative emphasis
- dashboard suggestion emphasis
Example projects/<project_name>/project_config.json:
{
"project": "project_name",
"question": "What should this project analyze?",
"datasets": [
{
"name": "dataset_a",
"path": "path/to/dataset_a.csv",
"time_column": "year"
}
]
}The product isolates generated artifacts per project so that multiple runs do not mix outputs in shared folders. If an artifact depends on a dataset, mart, metric column, business question, or pipeline run, it must be stored under the project folder:
projects/{normalized_source_name}/
Canonical structure:
projects/
<project_name>/
raw/
staging/
integrated/ # optional, only for multi-dataset projects
marts/
metadata/
README.md
This keeps raw snapshots, staging data, optional integrated datasets, analytical marts, and metadata isolated per project. The CLI automatically creates the project folder before execution.
For config-driven projects, per-dataset coverage metadata should be recorded when a usable time column exists so BI authors can see each source's minimum and maximum year without inspecting Parquet files directly.
The analytical outputs are BI-tool agnostic and are intended to support tools such as Power BI, Tableau, and Looker Studio by loading the generated mart tables directly.
Large real datasets should not be committed to the repository. The
sample_data/ folder should contain only tiny demo assets or lightweight
documentation.
For larger datasets:
- use local files or direct file URLs
- keep the data out of git history
- share external download links and local setup instructions instead of committing the files
Not included initially:
- enterprise authentication
- multi-tenant SaaS
- full AI autonomous pipelines
- large scale distributed computing
The product is successful if:
- pipelines are reproducible
- dataset preparation time is reduced
- KPI suggestions are relevant
- output datasets work directly in dashboard and BI tools
Features:
- CLI-first interface
- CSV ingestion
- profiling report
- cleaning pipeline
- mart creation
- dashboard-ready analytical marts
Add:
- config-driven dataset loaders
- GitHub automation
- improved metadata tracking
- better KPI prompt models
Add:
- Azure Fabric Lakehouse
- OneLake integration
- pipeline orchestration
- larger dataset support
Single-dataset project:
1 ingestion
2 profiling
3 cleaning
4 mart generation
5 KPI suggestion
6 dashboard-ready analytical outputs
Multi-dataset project:
1 config-driven ingestion
2 per-dataset profiling
3 per-dataset cleaning
4 optional integration into a shared project dataset
5 mart generation from the integrated layer
6 project README and metadata outputs
Example public demo: global_economic_indicators
Datasets:
- World GDP
- World inflation
- World unemployment
User question: How do global GDP, inflation, and unemployment evolve over time and how are they related?
Pipeline result:
1 config-driven ingestion
2 per-dataset profiling
3 per-dataset cleaning
4 integration into a shared economic indicators dataset
5 trend and correlation mart generation
6 project README and metadata outputs
7 dashboard-ready marts consumed by BI tools
Output:
integrated/master_dataset.parquetmart_gdp_trendmart_gdp_summary_statsmart_inflation_trendmart_inflation_summary_statsmart_unemployment_trendmart_unemployment_summary_statsmart_metric_correlationfinal_summary_report.jsonwith question, primary mart, and dataset coverage
This project demonstrates skills in:
- analytics engineering
- BI preparation
- data quality management
- AI assisted analysis
- reproducible pipelines