[GOOD FIRST ISSUE] Enrich Console Output with Summary Statistics #134

@raphael-intugle

name: Good First Issue
about: A beginner-friendly task perfect for first-time contributors
title: '[GOOD FIRST ISSUE] Enrich Console Output with Summary Statistics'
labels: 'good first issue, enhancement, user-experience'
assignees: ''

Welcome! 👋

This is a beginner-friendly issue perfect for first-time contributors to the Intugle project. We've designed this task to help you get familiar with our codebase while making a meaningful contribution.

Task Description

Enhance the console output during SemanticModel.build() to display rich summary statistics at each stage. Currently, the output shows basic progress messages, but users would benefit from seeing detailed statistics about what was processed.

Current output is basic:

Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.

Starting link prediction stage...
--- Comparing 'patients' <=> 'claims' ---
Found 2 potential link(s).
Link prediction complete.

We want informative summaries:

Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.

📊 Profiling Summary
╭────────────────────────────────────╮
│ Tables Profiled: 2                 │
│ Total Columns: 45                  │
│ Data Types Identified: 45          │
│                                    │
│ Distribution:                      │
│   • Dimensions: 28 (62.2%)         │
│   • Measures: 17 (37.8%)           │
│                                    │
│ Primary Keys Found: 2              │
╰────────────────────────────────────╯

Starting link prediction stage...
--- Comparing 'patients' <=> 'claims' ---
Found 2 potential link(s).
Link prediction complete.

🔗 Link Prediction Summary
╭────────────────────────────────────╮
│ Links Predicted: 2                 │
│ Links Validated: 2                 │
│ Success Rate: 100.0%               │
│                                    │
│ Relationships:                     │
│   • patients → claims (1-to-many)  │
│   • claims → encounters (many-to-1)│
╰────────────────────────────────────╯

This is just an example. Feel free to make the stats richer if you have better ideas.

Why This Matters

  • User Feedback: Users see what's happening under the hood
  • Quality Assurance: Statistics help users verify results
  • Debugging: Summary info helps identify issues
  • Professional: Rich output looks polished and informative
  • Transparency: Users understand what the AI models are doing

What You'll Learn

  • Using the Rich library for beautiful console output
  • Working with Rich Tables, Panels, and formatting
  • Aggregating statistics from data structures
  • Calculating percentages and distributions
  • Formatting numbers and creating visual summaries

Step-by-Step Guide

Prerequisites

  • Python 3.10+ installed
  • Git basics (clone, commit, push, pull request)
  • Read our CONTRIBUTING.md guide
  • Familiarity with the Rich library (optional but helpful)

Setup Instructions

  1. Fork and clone the repository

    git clone https://github.com/YOUR_USERNAME/data-tools.git
    cd data-tools
  2. Create a virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies

    pip install -e ".[dev]"
  4. Create a new branch

    git checkout -b feat/enrich-console-output
  5. Run a notebook to see current output

    jupyter notebook notebooks/quickstart_healthcare.ipynb
    # Run through the sm.build() cell to see current output

Implementation Steps

Part 1: Add Profiling Summary

  1. Open src/intugle/semantic_model.py

  2. After line 70 (end of profile() method), add a summary:

def profile(self, force_recreate: bool = False):
    """Run profiling, datatype identification, and key identification for all datasets."""
    console.print(
        "Starting profiling and key identification stage...", style="yellow"
    )
    for dataset in self.datasets.values():
        ...  # existing per-dataset profiling code
    
    console.print(
        "Profiling and key identification complete.", style="bold green"
    )
    
    # NEW: Add profiling summary
    self._print_profiling_summary()

def _print_profiling_summary(self):
    """Display a summary of profiling results."""
    ...
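
One possible shape for the body of `_print_profiling_summary()` is sketched below. The per-column records (`table`, `category`, `is_primary_key`) are a hypothetical stand-in, since this issue does not show how profiling results are stored on the datasets, and the helper names are illustrative. Only `Console` and `Panel` come from Rich itself.

```python
from collections import Counter


def pct(part: int, total: int) -> str:
    """Format part/total as a percentage with one decimal place."""
    return f"{100 * part / total:.1f}%" if total else "n/a"


def profiling_summary_lines(columns: list[dict]) -> list[str]:
    """Build the text lines of the profiling panel.

    `columns` is a hypothetical flat list of per-column records; in the
    real method the data would come from the profiled datasets.
    """
    total = len(columns)
    tables = {col["table"] for col in columns}
    categories = Counter(col["category"] for col in columns)
    primary_keys = sum(1 for col in columns if col.get("is_primary_key"))

    lines = [
        f"Tables Profiled: {len(tables)}",
        f"Total Columns: {total}",
        "",
        "Distribution:",
    ]
    for name, count in categories.most_common():
        lines.append(f"  • {name.title()}s: {count} ({pct(count, total)})")
    lines += ["", f"Primary Keys Found: {primary_keys}"]
    return lines


def print_profiling_summary(columns: list[dict]) -> None:
    """Render the summary lines inside a Rich panel."""
    # Imported lazily so the statistics helpers above work without Rich.
    from rich.console import Console
    from rich.panel import Panel

    body = "\n".join(profiling_summary_lines(columns))
    Console().print(Panel(body, title="📊 Profiling Summary", expand=False))
```

Keeping the statistics in a plain function that returns strings makes the numbers easy to unit test, while the Rich call stays a thin rendering layer.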

Part 2: Add Link Prediction Summary

  1. After line 85 (end of predict_links() method), add:

def predict_links(self, force_recreate: bool = False):
    """Run link prediction across all datasets."""
    # ... existing code ...
    
    console.print("Link prediction complete.", style="bold green")
    
    # NEW: Add link prediction summary
    if hasattr(self, 'link_predictor') and self.links:
        self._print_link_prediction_summary()

def _print_link_prediction_summary(self):
    """Display a summary of link prediction results."""
    ...
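
As a rough sketch of the statistics behind `_print_link_prediction_summary()`, the code below counts predicted and validated links and guards the success rate against an empty result. The `Link` dataclass and its fields are assumptions made for illustration; the real link objects in the codebase may look different, and what counts as "validated" is not defined in this issue.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical stand-in for one predicted link."""

    source: str
    target: str
    cardinality: str
    validated: bool


def link_summary_lines(links: list[Link]) -> list[str]:
    """Build the text lines of the link prediction panel."""
    predicted = len(links)
    validated = sum(link.validated for link in links)
    # Guard against division by zero when no links were predicted.
    rate = f"{100 * validated / predicted:.1f}%" if predicted else "n/a"

    lines = [
        f"Links Predicted: {predicted}",
        f"Links Validated: {validated}",
        f"Success Rate: {rate}",
    ]
    if links:
        lines += ["", "Relationships:"]
        lines += [
            f"  • {link.source} → {link.target} ({link.cardinality})"
            for link in links
        ]
    return lines
```

The returned lines can be joined with newlines and passed to a Rich `Panel`, the same way as any other summary.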

Part 3: Add Glossary Generation Summary

  1. After line 102 (end of generate_glossary() method), add:

def generate_glossary(self, force_recreate: bool = False):
    """Generate business glossary for all datasets."""
    # ... existing code ...
    
    console.print("Business glossary generation complete.", style="bold green")
    
    # NEW: Add glossary summary
    self._print_glossary_summary()

def _print_glossary_summary(self):
    """Display a summary of business glossary generation."""
    ...
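
For the glossary summary, one plausible coverage metric is the share of columns that ended up with a non-empty description. The sketch below assumes a hypothetical `{table: {column: description}}` mapping; how glossary text is actually stored is not shown in this issue, so treat the input shape and function names as illustrative.

```python
from typing import Optional


def glossary_coverage(
    glossary: dict[str, dict[str, Optional[str]]],
) -> dict[str, tuple[int, int]]:
    """Return {table: (described_columns, total_columns)}."""
    coverage = {}
    for table, columns in glossary.items():
        # Empty strings, whitespace, and None all count as "not described".
        described = sum(1 for text in columns.values() if text and text.strip())
        coverage[table] = (described, len(columns))
    return coverage


def glossary_summary_lines(
    glossary: dict[str, dict[str, Optional[str]]],
) -> list[str]:
    """Build the text lines of the glossary panel."""
    coverage = glossary_coverage(glossary)
    described = sum(d for d, _ in coverage.values())
    total = sum(t for _, t in coverage.values())
    overall = f"{100 * described / total:.1f}%" if total else "n/a"

    lines = [f"Columns Described: {described}/{total} ({overall})", ""]
    for table, (d, t) in coverage.items():
        share = f"{100 * d / t:.1f}%" if t else "n/a"
        lines.append(f"  • {table}: {d}/{t} ({share})")
    return lines
```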

Part 4: Add Overall Build Summary (30 min)

  1. At the end of build() method (after line 118), add a final summary:

def build(self, force_recreate: bool = False):
    """Run the full end-to-end knowledge building pipeline."""
    # ... existing code ...
    
    # NEW: Add final build summary
    self._print_build_summary()
    
    return self

def _print_build_summary(self):
    """Display overall build summary."""
    ...
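
If the final build summary should also report how long each stage took (one of the optional enhancements), a small timer like the following could collect the durations. `StageTimer` is a hypothetical helper, not something that exists in the codebase; it relies only on the standard library.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations for named pipeline stages."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and record it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so partial builds still report.
            self.durations[name] = time.perf_counter() - start

    def summary_lines(self) -> list[str]:
        """Return one line per stage plus a total."""
        lines = [f"{name}: {secs:.2f}s" for name, secs in self.durations.items()]
        lines.append(f"Total: {sum(self.durations.values()):.2f}s")
        return lines
```

Inside `build()`, each stage call could then be wrapped as `with timer.stage("profiling"): self.profile(...)`, and `timer.summary_lines()` would feed the final panel.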

Files to Modify

  • File: src/intugle/semantic_model.py
    • Change: Add 4 new summary methods and call them from profile(), predict_links(), generate_glossary(), and build()

Testing Your Changes

  1. Run a notebook and check output:

    jupyter notebook notebooks/quickstart_healthcare.ipynb
    # Execute the sm.build() cell and observe rich output
  2. Test with different datasets:

    # Try with different numbers of tables
    python -c "
    from intugle import SemanticModel
    datasets = {
        'patients': {'path': 'sample_data/healthcare/patients.csv', 'type': 'csv'},
        'claims': {'path': 'sample_data/healthcare/claims.csv', 'type': 'csv'},
    }
    sm = SemanticModel(datasets, domain='Healthcare')
    sm.build()
    "
    # Check that statistics are correct
  3. Verify calculations:

    • Count tables/columns manually
    • Verify percentages add up correctly
    • Check link counts match reality
  4. Run tests:

    pytest tests/
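
The manual checks in step 3 can also be captured as small pytest tests. The sketch below uses a hypothetical `distribution()` helper and the numbers from the example output in this issue (28 dimensions and 17 measures out of 45 columns); the names are illustrative and not part of the existing test suite.

```python
def distribution(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of the total as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100 * count / total for name, count in counts.items()}


def test_distribution_matches_issue_example():
    shares = distribution({"dimension": 28, "measure": 17})
    assert f"{shares['dimension']:.1f}" == "62.2"
    assert f"{shares['measure']:.1f}" == "37.8"
    # Compare with a tolerance: floating-point shares rarely sum to exactly 100.
    assert abs(sum(shares.values()) - 100) < 1e-9


def test_distribution_handles_no_columns():
    shares = distribution({"dimension": 0, "measure": 0})
    assert shares == {"dimension": 0.0, "measure": 0.0}
```

Note that percentages rounded to one decimal place can add up to 99.9 or 100.1 with three or more categories, so it is safer to check the unrounded shares, as the first test does.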

Example Output

Before:

Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.

After (just an example):

Starting profiling and key identification stage...
Processing dataset: patients
Processing dataset: claims
Profiling and key identification complete.

╭─────────────── 📊 Profiling Summary ───────────────╮
│ Tables Profiled: 2                                 │
│ Total Columns: 45                                  │
│ Data Types Identified: 45                          │
│                                                    │
│ Distribution:                                      │
│   • Dimensions: 28 (62.2%)                         │
│   • Measures: 17 (37.8%)                           │
│                                                    │
│ Primary Keys Found: 2                              │
╰────────────────────────────────────────────────────╯

Submitting Your Work

  1. Commit your changes

    git add src/intugle/semantic_model.py
    git commit -m "feat: Add rich console summaries for profiling, links, and glossary"
  2. Push to your fork

    git push origin feat/enrich-console-output
  3. Create a Pull Request

    • Go to the original repository
    • Click "Pull Requests" → "New Pull Request"
    • Select your branch
    • Fill out the PR template
    • Include screenshots showing the rich output
    • Reference this issue with "Fixes #134"

Expected Outcome

After running sm.build(), users should see:

  • ✅ Rich formatted summary panels
  • ✅ Accurate statistics about profiling (tables, columns, types)
  • ✅ Data type distribution with percentages
  • ✅ Link prediction results with success rate
  • ✅ Glossary generation coverage
  • ✅ Final build summary with next steps
  • ✅ Beautiful formatting using Rich library

Definition of Done

  • Profiling summary added with statistics
  • Link prediction summary added with relationship info
  • Glossary summary added with coverage metrics
  • Final build summary added with next steps
  • All statistics calculated correctly
  • Percentages formatted with one decimal place
  • Rich panels used for formatting
  • All tests pass
  • Screenshots included in PR
  • Pull request submitted

Bonus Enhancements (Optional)

If you want to go further:

  • Add emoji indicators (✓, ✗, ⚠) for different states
  • Use Rich Tables for more complex summaries
  • Add color coding based on quality metrics (green for high coverage, yellow for medium, etc.)
  • Show data type breakdown by category (text, numeric, datetime, etc.)
  • Add execution time for each stage
  • Show cardinality information for relationships
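
For the color-coding idea, one lightweight approach is to wrap the formatted percentage in Rich console markup chosen by threshold. The thresholds below (80 and 50) and the function name are arbitrary assumptions for illustration; pick values that make sense for the metric being shown.

```python
def coverage_markup(percent: float, high: float = 80.0, medium: float = 50.0) -> str:
    """Wrap a percentage in Rich console markup colored by threshold."""
    if percent >= high:
        color = "green"
    elif percent >= medium:
        color = "yellow"
    else:
        color = "red"
    # Rich interprets [color]...[/color] tags when the string is printed
    # through Console.print().
    return f"[{color}]{percent:.1f}%[/{color}]"
```

The result is an ordinary string, so it can be dropped into any line of a panel body, for example `console.print(f"Glossary coverage: {coverage_markup(92.5)}")`.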

Resources

Need Help?

Don't hesitate to ask questions! We're here to help you succeed.

  • Comment below with your questions
  • Join our Discord for real-time support
  • Tag maintainers: @raphael-intugle (if specific help needed)

Skills You'll Use

  • Python basics
  • Git and GitHub
  • Rich library for terminal output
  • Data aggregation and statistics
  • Calculating percentages
  • String formatting and layout

Thank you for contributing to Intugle!

Tips for Success:

  • Start with Part 1 (profiling) as it's the easiest
  • Test after each part to verify statistics are correct
  • Use console.print() with Rich markup for colors
  • Take screenshots to show the before/after difference
  • Make the output informative but not overwhelming
  • Have fun making beautiful terminal output!
