Saivineeth147/LLM-Compass

Welcome to LLM Compass

The ultimate collection of resources for building, evaluating, and mastering Large Language Models.


📚 Libraries & Frameworks

  • Haystack – Production-ready framework for building search engines, RAG systems, and question-answering applications.
  • Hugging Face Transformers – Hugely popular NLP library providing thousands of pre-trained models for text generation, classification, translation, and fine-tuning.
  • LangChain – Flexible framework for building real-world LLM-powered applications such as RAG, agents, and pipelines.
  • LLaMA – Meta’s family of open-weight LLMs that provide strong performance for research and downstream tasks.
  • llama.cpp – Highly efficient inference engine for LLaMA models on CPU, optimized for local deployment.
  • OpenAI GPT API – Official API for integrating GPT models into apps, chatbots, and workflows with robust support.
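As a starting point for the OpenAI GPT API entry above, here is a minimal sketch of building a chat-completions request with only the standard library. It assumes the standard `https://api.openai.com/v1/chat/completions` endpoint, an `OPENAI_API_KEY` environment variable, and a model name (`gpt-4o-mini`) that you would swap for whichever model you use; the request is constructed but not sent, since sending requires a valid key and network access.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(prompt: str, model: str = "gpt-4o-mini") -> urllib.request.Request:
    """Construct (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        # The key is read from the environment; empty string if unset.
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

# Sending the request (requires a valid API key and network access):
# with urllib.request.urlopen(build_request("Say hello")) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

In practice most projects use the official `openai` Python package instead of raw HTTP, but the request shape above is what that client builds under the hood.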

🧪 Evaluation & Testing Tools

  • FastChat – Open platform for training, serving, and evaluating LLM-based chatbots.
  • HELM – Stanford’s holistic evaluation suite for analyzing accuracy, robustness, calibration, and fairness of LLMs.
  • llm-testlab – Comprehensive toolkit for evaluating LLM responses on hallucinations, consistency, safety, and semantic similarity.
  • OpenAI Evals – Framework for creating, sharing, and running benchmarks to track LLM performance across tasks.
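At their core, the benchmark frameworks above score model outputs against references. This toy sketch shows the simplest such metric, exact-match accuracy (one of the basic graders in frameworks like OpenAI Evals); the `echo` stub model is a made-up stand-in for a real LLM call.

```python
from typing import Callable, Iterable

def exact_match_accuracy(model: Callable[[str], str],
                         cases: Iterable[tuple[str, str]]) -> float:
    """Fraction of prompts whose model output equals the reference,
    after trimming whitespace and lowercasing."""
    cases = list(cases)
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return hits / len(cases) if cases else 0.0

# Stub "model" that just echoes the last token of the prompt:
echo = lambda prompt: prompt.split()[-1]
score = exact_match_accuracy(echo, [
    ("The capital of France is Paris", "Paris"),
    ("2 + 2 =", "4"),
])
# score == 0.5: only the first case matches.
```

Real harnesses add model-graded rubrics, semantic-similarity scoring, and safety checks on top of this pattern.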

📊 Datasets

  • Dolly 15k – High-quality open dataset of instruction-following examples by Databricks.
  • HelpSteer – Human preference dataset for guiding LLMs toward helpful, safe, and ethical outputs.
  • OpenWebText – Open-source reproduction of the WebText dataset used to train GPT models.
  • The Pile – Massive 825 GB dataset covering diverse domains for training robust large-scale models.
  • RedPajama – Large-scale dataset replicating the training data for state-of-the-art LLMs.
  • Stanford Alpaca – Instruction-following dataset built on LLaMA for research in alignment and fine-tuning.
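Several of the instruction datasets above (Alpaca in particular) ship as JSON records with `instruction`/`input`/`output` fields. This sketch renders one such record into the Alpaca-style training prompt; the exact template wording follows the Stanford Alpaca release, but treat it as illustrative rather than canonical for the other datasets.

```python
import json

def to_prompt(record: dict) -> str:
    """Render one Alpaca-style record (instruction/input/output keys)
    into a single training prompt string."""
    if record.get("input"):
        return (f"### Instruction:\n{record['instruction']}\n\n"
                f"### Input:\n{record['input']}\n\n"
                f"### Response:\n{record['output']}")
    # Records without an input use the shorter two-part template.
    return (f"### Instruction:\n{record['instruction']}\n\n"
            f"### Response:\n{record['output']}")

# A record shaped like the Alpaca release:
sample = json.loads(
    '{"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}'
)
prompt = to_prompt(sample)
```

Dolly 15k uses different field names (`context`/`response`), so a loader for that dataset would remap keys before applying a template like this.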

🎓 Tutorials & Guides


📄 Research Papers


🚀 Example Projects

  • Auto-GPT – Autonomous GPT-4 agent capable of planning and executing multi-step tasks automatically.
  • BabyAGI – Lightweight autonomous agent using LLMs for iterative goal-setting and task execution.
  • ChatGPT-Next-Web – Self-hosted ChatGPT-like web app with customizable UI and backend.
  • GPT Engineer – Tool for generating complete codebases from natural language project descriptions.
  • PrivateGPT – Privacy-focused tool for chatting with documents locally without internet or cloud access.

🌍 Communities


🏆 Top LLMs & Benchmarks (2025)

  • Claude Opus 4 (Anthropic) – Strengths: Advanced reasoning, coding, and multimodal capabilities | Benchmarks: GPQA Science 79.6%, LiveCodeBench 72%, USAMO 21.7%, HMMT 58.3%, AIME 75.5%, ARC-AGI-2 8.6% | Notes: Anthropic's most capable model yet, setting new standards in reasoning, coding, and complex math.
  • Claude Sonnet 4 (Anthropic) – Strengths: Efficient performance for everyday tasks | Benchmarks: GPQA Science 79.6%, LiveCodeBench 72%, USAMO 21.7%, HMMT 58.3%, AIME 75.5%, ARC-AGI-2 8.6% | Notes: Smart, efficient model for everyday use.
  • DeepSeek-V3.1 – Strengths: Coding and reasoning-focused tasks | Benchmarks: MMLU-Redux 91.8%, SWE-Bench 66% | Notes: Optimized for hybrid thinking and agentic workflows, strong in coding challenges.
  • Grok 4 (xAI) – Strengths: General reasoning and structured output | Benchmarks: GPQA Science 86.4%, LiveCodeBench 79%, USAMO 37.5%, HMMT 90%, AIME 91.7%, ARC-AGI-2 15.9% | Notes: Balanced model for math, reasoning, and coding.
  • Grok 4 Heavy w/ Python (xAI) – Strengths: Top coding, reasoning, and math performance | Benchmarks: GPQA Science 88.4%, LiveCodeBench 79.4%, USAMO 61.9%, HMMT 96.7%, AIME 100%, ARC-AGI-2 15.9% | Notes: Best-in-class Grok 4 variant optimized for Python-heavy tasks.
  • Grok 4 w/ Python (xAI) – Strengths: Strong coding and reasoning with Python | Benchmarks: GPQA Science 87.5%, LiveCodeBench 79.3%, USAMO 37.5%, HMMT 93.9%, AIME 98.8%, ARC-AGI-2 8.6% | Notes: Efficient for programming-intensive tasks.
  • GPT-5 (OpenAI) – Strengths: Exceptional reasoning, coding, and multimodal capabilities | Benchmarks: MMLU 91.2%, GPQA 79.3%, SWE-Bench 54.6% | Notes: OpenAI's latest flagship model with a large context window and advanced agentic capabilities.
  • Gemini 2.5 Pro (Google DeepMind) – Strengths: Multimodal reasoning, translation, and math | Benchmarks: GPQA Science 83.3%, LiveCodeBench 74.2%, USAMO 34.5%, HMMT 82.5%, AIME 88.9%, ARC-AGI-2 4.9% | Notes: Excels at complex interactive and reasoning tasks.
  • Llama 4 (Meta) – Strengths: Cost-efficient, local deployment, flexible fine-tuning | Benchmarks: MMLU 85%, GPQA 80%, SWE-Bench 69.4% | Notes: Open-source LLM ideal for research, local inference, and instruction-following.
  • o3 (OpenAI) – Strengths: Reasoning & math tasks | Benchmarks: GPQA Science 79.6%, LiveCodeBench 72%, USAMO 21.7%, HMMT 58.3%, AIME 88.9%, ARC-AGI-2 6.5% | Notes: Competitive math and reasoning model.
  • Qwen 3 (Alibaba) – Strengths: Coding, reasoning, and multilingual support | Benchmarks: SWE-Bench High, AIME 2025 93.3% | Notes: Designed for both language and multimodal tasks with strong domain versatility.

🤝 Contributing

Contributions welcome! If you find a valuable LLM resource or have an open-source project to share, open a PR.


📜 License

MIT License
