GSoC Progress Tracker: Complete project timeline and accountability document used throughout GSoC 2025 (View Progress Tracker)
PyPI · Google Summer of Code · Google DeepMind · Python · MIT License
A Google DeepMind GSoC 2025 Project | Powered by Google Gemini AI | Built with Python
Mentors: Paige Bailey | Program: Google Summer of Code | Organization: Google DeepMind | Student: Aryan Saboo | Email: aryansaboo2005@gmail.com | Duration: May 2025 – September 2025
Processing long-form transcripts with LLMs like Gemini is computationally expensive, redundant, and error-prone. Traditional approaches suffer from:
- High API costs due to naive chunking and excessive token use
- Poor scalability for transcripts exceeding context length
- Weak context preservation leading to fragmented answers
- Lack of intelligent re-use of previously processed content
- Design a smart chunking engine that segments transcripts based on semantic boundaries, keywords, and named entities.
- Create a weighted retrieval system balancing semantic similarity, entropy (information density), and recency of use.
- Implement a caching system for both chunks and Q&A pairs to eliminate redundant processing.
- Build an interactive graph UI showing chunk usage, prompt–answer links, and similarity relations.
- Deliver a production-ready Python package with CLI and PyPI distribution.
- LLM-guided chunking: Gemini converts transcripts into structured JSON `{title, content, keywords, entities, timestamps}`.
- Resilient fallback splitter: Ensures progress even if model output is malformed.
- Local embeddings: Sentence-transformers support local vectorization.
- Weighted scoring: Combines semantic similarity, entropy, and recency.
- Entropy estimator: Uses lexical diversity, NER density, and semantic variance.
- Recency-aware ranking: Chunks recently used in answers get priority.
- Overfetching & normalization: Improves selection by broad candidate sampling.
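The retrieval behavior described above can be condensed into a single weighted scoring function. The sketch below is illustrative only: the specific weights, field names, and the exponential recency decay are assumptions, not EchoGem's actual parameters.

```python
import time

def score_chunk(similarity, entropy, last_used_ts, now=None,
                w_sim=0.6, w_ent=0.25, w_rec=0.15, half_life_s=3600.0):
    """Combine semantic similarity, information density (entropy), and
    recency of use into one weighted score. Weights are illustrative."""
    now = time.time() if now is None else now
    # Exponential decay: a chunk last used half_life_s ago scores 0.5 on recency.
    age = max(0.0, now - last_used_ts)
    recency = 0.5 ** (age / half_life_s)
    return w_sim * similarity + w_ent * entropy + w_rec * recency

def select_top_k(candidates, k=5, overfetch=4):
    """Overfetch a broad candidate pool by raw similarity, min-max
    normalize similarity within the pool, then keep the k best by
    weighted score."""
    pool = sorted(candidates, key=lambda c: c["similarity"], reverse=True)[:k * overfetch]
    sims = [c["similarity"] for c in pool]
    lo, hi = min(sims), max(sims)
    span = (hi - lo) or 1.0
    scored = [
        (score_chunk((c["similarity"] - lo) / span, c["entropy"], c["last_used"]), c)
        for c in pool
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Overfetching a pool of `k * overfetch` candidates before re-ranking lets entropy and recency reorder chunks that plain similarity search would have missed or over-ranked.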
- Usage cache: CSV-based system with rich metadata (title, keywords, usage_count, last_used).
- Prompt–Answer memory: Stores Q&A pairs with chunk IDs in Pinecone.
- Automatic upgrades: Legacy CSV schemas auto-migrate.
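A minimal sketch of how such a CSV schema auto-migration can work, assuming a plain header-based format; the `chunk_id` column name and the default values here are hypothetical, not EchoGem's actual schema:

```python
import csv
import io

# Current cache schema; legacy files may lack the newer columns.
CACHE_FIELDS = ["chunk_id", "title", "keywords", "usage_count", "last_used"]
DEFAULTS = {"usage_count": "0", "last_used": ""}

def migrate_cache(csv_text: str) -> str:
    """Upgrade a legacy usage-cache CSV: keep known columns, fill any
    missing ones with defaults, and rewrite the header in the current
    schema order. Unknown legacy columns are dropped."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=CACHE_FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        for field in CACHE_FIELDS:
            row.setdefault(field, DEFAULTS.get(field, ""))
        writer.writerow(row)
    return out.getvalue()
```

Keying migration off the header row means old caches upgrade transparently on first load, with no separate migration step for users.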
- Interactive UI: Built with pygame for exploring chunks and Q&A relationships.
- Node types: Chunk nodes and prompt–answer nodes.
- Edge types: Similarity, temporal adjacency, and usage connections.
- Layouts: Force-directed, circular, and hierarchical modes.
- Processor orchestrator: High-level API exposing `process_transcript` and `query`.
- CLI commands: `process`, `query`, and `graph` for quick interaction.
- Cross-platform compatibility: Verified on Windows, macOS, and Linux.
| Component | Status | Details |
|---|---|---|
| Core Chunking | ✅ Complete | LLM-guided with fallback |
| Weighted Retrieval | ✅ Complete | Semantic + entropy + recency |
| Usage Cache | ✅ Complete | CSV store with upgrades |
| Q&A Memory | ✅ Complete | Pinecone namespace |
| Graph UI | ✅ Complete | Interactive pygame visualizer |
| CLI | ✅ Complete | Process, query, visualize |
| PyPI Package | ✅ Complete | Published as echogem |
| Cross-Platform | ✅ Complete | Works on Win/macOS/Linux |
- Transcript ingestion & chunking
- Weighted chunk retrieval & scoring
- Q&A memory and caching
- Interactive chunk/answer visualization
- CLI and Python API
- PyPI Package: https://pypi.org/project/echogem/
- Install: `pip install echogem`
- Repository: GitHub (public)
- License: MIT
- Coherence-aware selection: Ensure diverse, non-redundant chunk sets.
- Context reuse: Prioritize chunks from successful past answers.
- Adaptive normalization: Smarter entropy weighting.
- Streaming integration: Handle live transcripts.
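For the coherence-aware selection item above, one standard technique is maximal marginal relevance (MMR), which trades query relevance against redundancy with already-selected chunks. This is a sketch of the general idea under that assumption, not planned EchoGem code:

```python
def mmr_select(query_sims, pairwise_sims, k=3, lam=0.7):
    """Maximal marginal relevance: greedily pick chunks that are relevant
    to the query but dissimilar to chunks already selected, yielding a
    diverse, non-redundant set.

    query_sims[i]       -- similarity of chunk i to the query
    pairwise_sims[i][j] -- similarity between chunks i and j
    lam                 -- relevance/diversity trade-off in [0, 1]
    """
    remaining = set(range(len(query_sims)))
    selected = []
    while remaining and len(selected) < k:
        def mmr(i):
            # Redundancy = worst-case overlap with anything already picked.
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam` near 1 this degenerates to plain similarity ranking; lowering it penalizes near-duplicate chunks that would waste context-window tokens.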
- Thread-safety limitations in Pinecone/Sylvan: the cache and async logic had to be designed carefully around them.
- Entropy heuristics: NER density + lexical diversity proved more reliable than token length.
- CLI vs Processor mismatch: Adjusted expectations; Processor is the stable API.
- Graph rendering: Needed multiple layouts to handle large transcript visualizations.
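The entropy lesson above can be illustrated with a toy estimator. This sketch approximates the real heuristic: it uses type-token ratio for lexical diversity and a naive capitalized-word proxy in place of true NER density, and omits semantic variance entirely; all thresholds and weights are assumptions.

```python
import re

def estimate_entropy(text: str) -> float:
    """Crude information-density score in [0, 1], combining lexical
    diversity (type-token ratio) with a naive named-entity proxy
    (share of capitalized non-initial tokens). A real estimator would
    use an NER model and embedding variance instead."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        return 0.0
    # Lexical diversity: distinct lowercased words / total words.
    diversity = len({t.lower() for t in tokens}) / len(tokens)
    # Naive entity proxy: capitalized words that don't start the text.
    entity_like = sum(1 for t in tokens[1:] if t[0].isupper())
    ner_density = entity_like / len(tokens)
    return 0.5 * diversity + 0.5 * min(1.0, 3.0 * ner_density)
```

Even this crude version separates entity-dense prose from repetitive filler, which is why such signals beat raw token length as a density measure.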
| Metric | Traditional | EchoGem | Improvement |
|---|---|---|---|
| Context reuse | None | Cached Q&A pairs | Major |
| Chunk selection | Similarity only | Similarity + entropy + recency | Higher precision |
| Scalability | Linear | Overfetch + normalization | Efficient |
| Visualization | None | Interactive graph | Debuggable, transparent |
Phase 1: Foundation (May 2025)
- ✅ Repository setup & API key config
- ✅ Basic chunking + embeddings

Phase 2: Core Systems (June 2025)
- ✅ Weighted retrieval + entropy estimator
- ✅ Usage cache + Q&A store

Phase 3: Visualization & CLI (July 2025)
- ✅ Graph visualizer
- ✅ CLI commands (process, query, graph)

Phase 4: Release (August 2025)
- ✅ PyPI publication (echogem)
- ✅ Documentation & testing
- ✅ Final code review & GSoC submission
| Deliverable | Description | Status |
|---|---|---|
| Core Package | Production-ready Python package | ✅ `echogem/` |
| PyPI Release | Published worldwide | ✅ PyPI: echogem |
| Documentation | Full guides, API reference | ✅ `docs/` |
| Graph Visualizer | Interactive chunk/answer explorer | ✅ `graphe.py` |
| CLI | Command-line interface | ✅ `cli.py` |
| Progress Tracker | Week-by-week logs | ✅ Completed |
If you use EchoGem in research, please cite:
```bibtex
@software{saboo2025echogem,
  author    = {Saboo, Aryan},
  title     = {EchoGem: Teaching Gemini to Think in Batches by Prioritizing What Matters},
  year      = {2025},
  publisher = {Google DeepMind},
  journal   = {Google Summer of Code 2025},
  url       = {https://github.com/aryansaboo/echogem}
}
```

Installation & setup:

```bash
pip install echogem
export PINECONE_API_KEY=...
export GOOGLE_API_KEY=...
```

Python API:

```python
from echogem.processor import Processor

p = Processor()
p.process_transcript("transcript.txt", persist=True)
ans = p.query("What did the speaker say about scaling?", k=5)
print(ans.answer)
```

CLI:

```bash
echogem process --transcript transcript.txt --persist
echogem query --question "Key takeaways?"
echogem graph
```

License: MIT
Developed by: Aryan Saboo during Google Summer of Code 2025 at Google DeepMind
Repository: https://github.com/aryansaboo/echogem
Acknowledgments:
- Google Summer of Code program
- Google DeepMind for mentorship
- Gemini Team for API access
- Open Source Community for embeddings, Pinecone, and pygame
GSoC 2025 Success Story: From research proposal to production-ready PyPI package, EchoGem makes long-context retrieval practical, efficient, and transparent.