name: Good First Issue
about: A beginner-friendly task perfect for first-time contributors
title: '[GOOD FIRST ISSUE] Enrich Console Output with Summary Statistics'
labels: 'good first issue, enhancement, user-experience'
assignees: ''
Welcome! 👋
This is a beginner-friendly issue perfect for first-time contributors to the Intugle project. We've designed this task to help you get familiar with our codebase while making a meaningful contribution.
Task Description
Enhance the console output during SemanticModel.build() to display rich summary statistics at each stage. Currently, the output shows basic progress messages, but users would benefit from seeing detailed statistics about what was processed.
Current output is basic:
Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.
Starting link prediction stage...
--- Comparing 'patients' <=> 'claims' ---
Found 2 potential link(s).
Link prediction complete.
We want informative summaries:
Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.
📊 Profiling Summary
╭────────────────────────────────────╮
│ Tables Profiled: 2                 │
│ Total Columns: 45                  │
│ Data Types Identified: 45          │
│                                    │
│ Distribution:                      │
│   • Dimensions: 28 (62%)           │
│   • Measures: 17 (38%)             │
│                                    │
│ Primary Keys Found: 2              │
╰────────────────────────────────────╯
Starting link prediction stage...
--- Comparing 'patients' <=> 'claims' ---
Found 2 potential link(s).
Link prediction complete.
🔗 Link Prediction Summary
╭────────────────────────────────────╮
│ Links Predicted: 2                 │
│ Links Validated: 2                 │
│ Success Rate: 100%                 │
│                                    │
│ Relationships:                     │
│   • patients → claims (1-to-many)  │
│   • claims → encounters (many-to-1)│
╰────────────────────────────────────╯
This is just an example; feel free to make the statistics richer if you have better ideas.
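A panel like the profiling one above could be built roughly as follows. This is a minimal sketch, not Intugle's actual implementation: the `tables` mapping, the `category` and `is_primary_key` keys, and all three function names are assumptions made for illustration, since the real profiling results live on the SemanticModel's datasets. Only `Console`, `Panel`, and the `title` and `expand` arguments come from Rich.

```python
from collections import Counter


def summarize_profile(tables):
    """Aggregate per-table column metadata into summary counts.

    `tables` is a hypothetical stand-in for the profiling results:
    {table_name: [{"name": ..., "category": ..., "is_primary_key": ...}, ...]}
    """
    columns = [col for cols in tables.values() for col in cols]
    categories = Counter(col["category"] for col in columns)
    return {
        "tables": len(tables),
        "columns": len(columns),
        "categories": dict(categories),
        "primary_keys": sum(1 for col in columns if col.get("is_primary_key")),
    }


def format_profile_lines(stats):
    """Turn the aggregated counts into the text lines shown inside the panel."""
    total = stats["columns"]
    lines = [
        f"Tables Profiled: {stats['tables']}",
        f"Total Columns: {total}",
        "",
        "Distribution:",
    ]
    for category, count in sorted(stats["categories"].items()):
        share = count / total if total else 0.0  # avoid dividing by zero
        lines.append(f"  • {category.title()}s: {count} ({share:.1%})")
    lines += ["", f"Primary Keys Found: {stats['primary_keys']}"]
    return lines


def print_profile_panel(stats):
    """Render the lines in a Rich panel."""
    # Imported lazily so the pure helpers above work without Rich installed.
    from rich.console import Console
    from rich.panel import Panel

    body = "\n".join(format_profile_lines(stats))
    Console().print(Panel(body, title="📊 Profiling Summary", expand=False))
```

Keeping the counting and the formatting in plain functions means the numbers can be checked in a unit test, and only the last function needs a terminal.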
Why This Matters
- User Feedback: Users see what's happening under the hood
- Quality Assurance: Statistics help users verify results
- Debugging: Summary info helps identify issues
- Professional: Rich output looks polished and informative
- Transparency: Users understand what the AI models are doing
What You'll Learn
- Using the Rich library for beautiful console output
- Working with Rich Tables, Panels, and formatting
- Aggregating statistics from data structures
- Calculating percentages and distributions
- Formatting numbers and creating visual summaries
Step-by-Step Guide
Prerequisites
- Python 3.10+ installed
- Git basics (clone, commit, push, pull request)
- Read our CONTRIBUTING.md guide
- Familiarity with the Rich library (optional but helpful)
Setup Instructions
- Fork and clone the repository
    git clone https://github.com/YOUR_USERNAME/data-tools.git
    cd data-tools
- Create a virtual environment
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies
    pip install -e ".[dev]"
- Create a new branch
    git checkout -b feat/enrich-console-output
- Run a notebook to see current output
    jupyter notebook notebooks/quickstart_healthcare.ipynb
    # Run through the sm.build() cell to see current output
Implementation Steps
Part 1: Add Profiling Summary
- Open src/intugle/semantic_model.py
- After line 70 (end of the profile() method), add a summary:
    def profile(self, force_recreate: bool = False):
        """Run profiling, datatype identification, and key identification for all datasets."""
        console.print(
            "Starting profiling and key identification stage...", style="yellow"
        )
        for dataset in self.datasets.values():
            # ... existing code ...
        console.print(
            "Profiling and key identification complete.", style="bold green"
        )
        # NEW: Add profiling summary
        self._print_profiling_summary()

    def _print_profiling_summary(self):
        """Display a summary of profiling results."""
        ...
Part 2: Add Link Prediction Summary
- After line 85 (end of the predict_links() method), add:
    def predict_links(self, force_recreate: bool = False):
        """Run link prediction across all datasets."""
        # ... existing code ...
        console.print("Link prediction complete.", style="bold green")
        # NEW: Add link prediction summary
        if hasattr(self, 'link_predictor') and self.links:
            self._print_link_prediction_summary()

    def _print_link_prediction_summary(self):
        """Display a summary of link prediction results."""
        ...
Part 3: Add Glossary Generation Summary
- After line 102 (end of the generate_glossary() method), add:
    def generate_glossary(self, force_recreate: bool = False):
        """Generate business glossary for all datasets."""
        # ... existing code ...
        console.print("Business glossary generation complete.", style="bold green")
        # NEW: Add glossary summary
        self._print_glossary_summary()

    def _print_glossary_summary(self):
        """Display a summary of business glossary generation."""
        ...
Part 4: Add Overall Build Summary (30 min)
- At the end of the build() method (after line 118), add a final summary:
    def build(self, force_recreate: bool = False):
        """Run the full end-to-end knowledge building pipeline."""
        # ... existing code ...
        # NEW: Add final build summary
        self._print_build_summary()
        return self

    def _print_build_summary(self):
        """Display overall build summary."""
        ...
Files to Modify
- File: src/intugle/semantic_model.py
- Change: Add 4 new methods for summary display
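One way to keep the four new methods small is to let each one delegate to a helper that only builds text lines, so the numbers can be tested without a console. Below is a rough sketch for the link prediction summary. `LinkRecord` and `link_summary_lines` are hypothetical names invented for this example, and the real objects in `self.links` may expose different attributes, so adapt the field access to whatever the codebase actually provides.

```python
from dataclasses import dataclass


@dataclass
class LinkRecord:
    """Hypothetical stand-in for one predicted link; not Intugle's real type."""

    source: str
    target: str
    cardinality: str
    validated: bool = True


def link_summary_lines(links):
    """Build the text lines for a link prediction summary."""
    predicted = len(links)
    validated = sum(1 for link in links if link.validated)
    # With no predictions there is no meaningful rate, so show "n/a"
    # instead of dividing by zero.
    rate = f"{validated / predicted:.1%}" if predicted else "n/a"
    lines = [
        f"Links Predicted: {predicted}",
        f"Links Validated: {validated}",
        f"Success Rate: {rate}",
    ]
    if links:
        lines += ["", "Relationships:"]
        lines += [
            f"  • {link.source} → {link.target} ({link.cardinality})"
            for link in links
        ]
    return lines
```

The returned lines can be joined with newlines and wrapped in a Rich Panel inside `_print_link_prediction_summary()`.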
Testing Your Changes
- Run a notebook and check output:
    jupyter notebook notebooks/quickstart_healthcare.ipynb
    # Execute the sm.build() cell and observe rich output
- Test with different datasets:
    # Try with different numbers of tables
    python -c "
    from intugle import SemanticModel
    datasets = {
        'patients': {'path': 'sample_data/healthcare/patients.csv', 'type': 'csv'},
        'claims': {'path': 'sample_data/healthcare/claims.csv', 'type': 'csv'},
    }
    sm = SemanticModel(datasets, domain='Healthcare')
    sm.build()
    "
    # Check that statistics are correct
- Verify calculations:
  - Count tables/columns manually
  - Verify percentages add up correctly
  - Check link counts match reality
- Run tests:
    pytest tests/
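A note on the "percentages add up" check: when each share is rounded to one decimal place on its own, the displayed values can sum to 99.9% or 100.1% (three equal categories give 33.3% three times). If that matters for your summary, one possible fix is the largest remainder method, sketched below. The function name and the `counts` mapping are assumptions for illustration; nothing like this exists in the project yet as far as this issue states.

```python
def rounded_percentages(counts, decimals=1):
    """Round shares so they sum to exactly 100 (largest remainder method).

    `counts` maps a label to a non-negative integer count.
    """
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}

    scale = 10 ** decimals
    units = 100 * scale  # e.g. 1000 tenths of a percent for one decimal

    # Integer division keeps the arithmetic exact: each label gets its
    # floored share plus the remainder that was cut off.
    floored, remainders = {}, {}
    for key, count in counts.items():
        floored[key], remainders[key] = divmod(count * units, total)

    # Hand the missing units to the labels that lost the most when flooring.
    leftover = units - sum(floored.values())
    by_remainder = sorted(counts, key=lambda key: remainders[key], reverse=True)
    for key in by_remainder[:leftover]:
        floored[key] += 1

    return {key: floored[key] / scale for key in counts}
```

For the 28 dimensions and 17 measures used in the examples, this gives 62.2 and 37.8, matching the one-decimal output shown in this issue.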
Example Output
Before:
Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.
After (Just an example):
Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.
╭─────────────── 📊 Profiling Summary ───────────────╮
│ Tables Profiled: 2                                 │
│ Total Columns: 45                                  │
│ Data Types Identified: 45                          │
│                                                    │
│ Distribution:                                      │
│   • Dimensions: 28 (62.2%)                         │
│   • Measures: 17 (37.8%)                           │
│                                                    │
│ Primary Keys Found: 2                              │
╰────────────────────────────────────────────────────╯
Submitting Your Work
- Commit your changes
    git add src/intugle/semantic_model.py
    git commit -m "feat: Add rich console summaries for profiling, links, and glossary"
- Push to your fork
    git push origin feat/enrich-console-output
- Create a Pull Request
  - Go to the original repository
  - Click "Pull Requests" → "New Pull Request"
  - Select your branch
  - Fill out the PR template
  - Include screenshots showing the rich output
  - Reference this issue with "Fixes #ISSUE_NUMBER"
Expected Outcome
After running sm.build(), users should see:
- ✅ Rich formatted summary panels
- ✅ Accurate statistics about profiling (tables, columns, types)
- ✅ Data type distribution with percentages
- ✅ Link prediction results with success rate
- ✅ Glossary generation coverage
- ✅ Final build summary with next steps
- ✅ Beautiful formatting using the Rich library
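"Glossary generation coverage" is not defined further in this issue. One reasonable reading is the share of columns that ended up with a non-empty description. The sketch below assumes exactly that; the `descriptions` mapping and the function name are hypothetical, and how descriptions are actually stored on the datasets needs to be checked in the codebase.

```python
def glossary_coverage(descriptions):
    """Compute how many columns received a non-empty glossary description.

    `descriptions` is a hypothetical mapping of "table.column" to the
    generated description text (or None when nothing was generated).
    """
    # Treat None, empty strings, and whitespace-only text as missing.
    missing = sorted(
        name for name, text in descriptions.items() if not (text or "").strip()
    )
    total = len(descriptions)
    described = total - len(missing)
    ratio = described / total if total else 0.0
    return {
        "total": total,
        "described": described,
        "coverage": f"{ratio:.1%}",  # one decimal place, e.g. "50.0%"
        "missing": missing,
    }
```

Listing the missing columns by name, and not only the percentage, gives users something they can act on.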
Definition of Done
- Profiling summary added with statistics
- Link prediction summary added with relationship info
- Glossary summary added with coverage metrics
- Final build summary added with next steps
- All statistics calculated correctly
- Percentages formatted with one decimal place
- Rich panels used for formatting
- All tests pass
- Screenshots included in PR
- Pull request submitted
Bonus Enhancements (Optional)
If you want to go further:
- Add emoji indicators (✓, ✗, ⚠) for different states
- Use Rich Tables for more complex summaries
- Add color coding based on quality metrics (green for high coverage, yellow for medium, etc.)
- Show data type breakdown by category (text, numeric, datetime, etc.)
- Add execution time for each stage
- Show cardinality information for relationships
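Two of these bonus ideas, color coding and per-stage execution time, need very little code. Here is a minimal sketch using only the standard library. The thresholds (90% and 60%), the function names, and the `timings` dictionary are all assumptions chosen for the example; `green`, `yellow`, and `red` are standard Rich color names that can be passed as `style=` to `console.print()`.

```python
import time
from contextlib import contextmanager


def coverage_style(ratio, high=0.9, medium=0.6):
    """Map a 0..1 coverage ratio to a Rich color name.

    The default thresholds are illustrative, not project policy.
    """
    if ratio >= high:
        return "green"
    if ratio >= medium:
        return "yellow"
    return "red"


@contextmanager
def timed_stage(name, timings):
    """Record the wall-clock duration of a stage in the `timings` dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the stage raises, so a failed run still reports time.
        timings[name] = time.perf_counter() - start
```

A stage could then be wrapped as `with timed_stage("profiling", timings): self.profile()`, and the build summary could print each entry with something like `f"{seconds:.1f}s"`.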
Resources
Need Help?
Don't hesitate to ask questions! We're here to help you succeed.
- Comment below with your questions
- Join our Discord for real-time support
- Tag maintainers: @raphael-intugle (if specific help needed)
Skills You'll Use
- Python basics
- Git and GitHub
- Rich library for terminal output
- Data aggregation and statistics
- Calculating percentages
- String formatting and layout
Thank you for contributing to Intugle!
Tips for Success:
- Start with Part 1 (profiling) as it's the easiest
- Test after each part to verify statistics are correct
- Use console.print() with Rich markup for colors
- Take screenshots to show the before/after difference
- Make the output informative but not overwhelming
- Have fun making beautiful terminal output!
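For the Rich markup tip, a small helper can build the markup string so the coloring logic stays in one place. This is only a sketch: `status_line` and `print_status_lines` are made-up names, and in semantic_model.py you would reuse the module's existing `console` object instead of creating a new `Console`.

```python
def status_line(label, ok):
    """Return a Rich markup string with a colored ✓ or ✗ in front of `label`."""
    icon, color = ("✓", "green") if ok else ("✗", "red")
    return f"[{color}]{icon}[/{color}] {label}"


def print_status_lines(checks):
    """Print one status line per (label, ok) pair."""
    # Imported lazily so status_line() can be tested without Rich installed.
    from rich.console import Console

    console = Console()
    for label, ok in checks:
        console.print(status_line(label, ok))
```

If a label can contain square brackets (some table or column names do), pass it through `rich.markup.escape()` first so Rich does not try to interpret it as markup.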