- Haystack – Production-ready framework for building search engines, RAG systems, and question-answering applications.
- Hugging Face Transformers – Hugely popular NLP library providing thousands of pre-trained models for text generation, classification, translation, and fine-tuning.
- LangChain – Flexible framework for building real-world LLM-powered applications such as RAG, agents, and pipelines.
- LLaMA – Meta’s family of open-source LLMs that provide strong performance for research and downstream tasks.
- llama.cpp – Highly efficient inference engine for LLaMA models on CPU, optimized for local deployment.
- OpenAI GPT API – Official API for integrating GPT models into apps, chatbots, and workflows with robust support.
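The frameworks above all center on the same retrieve-then-generate pattern. As a minimal sketch of that pattern in plain Python (the toy corpus, bag-of-words scoring, and prompt template here are illustrative assumptions, not any framework's actual API):

```python
# Minimal retrieve-then-generate (RAG) sketch: score documents against
# a query, pick the best match, and assemble a grounded prompt.
from collections import Counter

DOCS = [
    "The Transformer architecture relies on self-attention.",
    "RAG systems retrieve documents before generating an answer.",
    "LLaMA is a family of open-weight language models from Meta.",
]

def score(query: str, doc: str) -> int:
    """Count overlapping lowercase tokens between query and document."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k best-matching documents for the query."""
    return sorted(DOCS, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Assemble the prompt a RAG pipeline would send to an LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How do RAG systems work?"))
```

Real frameworks swap the word-overlap scorer for dense embeddings and the `print` for an LLM call, but the pipeline shape is the same.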
- FastChat – An open platform for training, serving, and evaluating LLM-based chatbots.
- HELM – Stanford's Holistic Evaluation of Language Models, a suite for analyzing accuracy, robustness, calibration, and fairness of LLMs.
- llm-testlab – Comprehensive toolkit for evaluating LLM responses on hallucinations, consistency, safety, and semantic similarity.
- OpenAI Evals – Framework for creating, sharing, and running benchmarks to track LLM performance across tasks.
- Dolly 15k – High-quality open dataset of instruction-following examples by Databricks.
- HelpSteer – Human preference dataset for guiding LLMs toward helpful, safe, and ethical outputs.
- OpenWebText – Open-source reproduction of the WebText dataset used to train GPT-2.
- The Pile – Massive 825 GB dataset covering diverse domains for training robust large-scale models.
- RedPajama – Large-scale open dataset replicating the LLaMA training data recipe.
- Stanford Alpaca – Instruction-following dataset used to fine-tune LLaMA for research in alignment and instruction tuning.
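Several of these datasets (Alpaca, Dolly 15k) store records as instruction/input/output triples. This sketch formats one record into a supervised training prompt; the template follows the Alpaca-style layout but is simplified for illustration:

```python
# Format one Alpaca-style instruction record into a training prompt.
import json

record_json = """{
  "instruction": "Summarize the text.",
  "input": "LLMs are neural networks trained on large text corpora.",
  "output": "LLMs are large neural text models."
}"""

def format_example(record: dict) -> str:
    """Join instruction, optional input, and response into one prompt."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)

print(format_example(json.loads(record_json)))
```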
- FreeCodeCamp Guide – Beginner’s guide to LLMs with practical examples and simple explanations.
- Google: Intro to LLMs – Accessible guide to understanding LLMs, transformers, and training basics.
- Hugging Face LLM Course – Practical, hands-on course to learn transformers, fine-tuning, and deployment.
- LangChain Tutorials – Official tutorials on building advanced LLM pipelines and AI applications.
- Microsoft Generative AI for Beginners – Beginner-friendly video series explaining generative AI concepts and use cases.
- mlabonne/llm-course – Open-source curriculum teaching LLM theory, fine-tuning, and applications.
- OpenAI Cookbook – Collection of examples, patterns, and recipes for leveraging GPT effectively.
- Stanford Lecture: Intro to LLMs – Detailed lecture explaining the architecture, training, and applications of LLMs.
- Attention Is All You Need – Seminal paper introducing the Transformer architecture that underpins modern LLMs.
- Language Models are Few-Shot Learners (GPT-3) – Landmark paper on GPT-3 demonstrating few-shot learning capabilities.
- LLM Evaluation Surveys – Comprehensive survey of evaluation strategies for large language models.
- RLHF: Training Language Models to Follow Instructions – Research introducing Reinforcement Learning from Human Feedback (RLHF) for alignment.
- Stanford Alpaca Paper – Study on fine-tuning LLaMA with lightweight instruction datasets.
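The core operation from "Attention Is All You Need" is scaled dot-product attention. Written out in pure Python for a single query over a handful of key/value vectors (the tiny vectors are made-up data, and real implementations batch this over matrices):

```python
# Scaled dot-product attention for one query:
# output = softmax(q . k / sqrt(d)) weighted sum of value vectors.
import math

def attention(query, keys, values):
    d = len(query)
    # Dot each key with the query, scaled by sqrt(d) for stability.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, keys, values))  # leans toward the first value vector
```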
- Auto-GPT – Autonomous GPT-4 agent capable of planning and executing multi-step tasks automatically.
- BabyAGI – Lightweight autonomous agent using LLMs for iterative goal-setting and task execution.
- ChatGPT-Next-Web – Self-hosted ChatGPT-like web app with customizable UI and backend.
- GPT Engineer – Tool for generating complete codebases from natural language project descriptions.
- PrivateGPT – Privacy-focused tool for chatting with documents locally without internet or cloud access.
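Agents like Auto-GPT and BabyAGI share one control loop: pop a task, execute it, and let the result enqueue follow-up tasks. A sketch of that loop with a hard-coded planner stub standing in for a model call:

```python
# Minimal autonomous-agent loop: execute tasks from a queue and
# enqueue the follow-ups a planner proposes for each one.
from collections import deque

def plan(task: str) -> list[str]:
    """Stub planner: a real agent would ask an LLM for follow-ups."""
    followups = {
        "research topic": ["draft outline"],
        "draft outline": ["write summary"],
    }
    return followups.get(task, [])

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    """Run tasks breadth-first until the queue empties or steps run out."""
    queue, done = deque([goal]), []
    while queue and len(done) < max_steps:
        task = queue.popleft()
        done.append(task)          # "execute" the task
        queue.extend(plan(task))   # enqueue proposed follow-ups
    return done

print(run_agent("research topic"))
```

The `max_steps` cap matters in practice: with an LLM planner the task list can grow without bound, so real agents budget steps or tokens.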
- Discord: AI Exchange – Discussion hub for generative AI trends, tools, and project showcases.
- Discord: EleutherAI – Research collective collaborating on open LLMs, datasets, and reproducibility.
- Discord: Hugging Face – Official Hugging Face server with channels for models, datasets, and developer support.
- Discord: LangChain – Active server for developers working with LangChain to share projects and solve issues.
- Discord: OpenAccess AI Collective – Group focused on democratizing AI and sharing open-source resources.
- Reddit: r/ArtificialInteligence – Active subreddit covering AI news, breakthroughs, and applications.
- Reddit: r/ChatGPT – Dedicated community discussing ChatGPT use cases, tips, and creative experiments.
- Reddit: r/LocalLLaMA – Focused community for running LLaMA and open-source models locally on personal hardware.
- Reddit: r/MachineLearning – One of the largest ML/AI research communities with discussions on models, papers, and breakthroughs.
- Claude Opus 4 (Anthropic) – Strengths: Advanced reasoning, coding, and multimodal capabilities | Benchmarks: GPQA Science 79.6%, LiveCodeBench 72%, USAMO 21.7%, HMMT 58.3%, AIME 75.5%, ARC-AGI-2 8.6% | Notes: Anthropic's most capable model yet, setting new standards in reasoning, coding, and complex math.
- Claude Sonnet 4 (Anthropic) – Strengths: Efficient performance for everyday tasks | Benchmarks: GPQA Science 79.6%, LiveCodeBench 72%, USAMO 21.7%, HMMT 58.3%, AIME 75.5%, ARC-AGI-2 8.6% | Notes: Smart, efficient model for everyday use.
- DeepSeek-V3.1 – Strengths: Coding and reasoning-focused tasks | Benchmarks: MMLU-Redux 91.8%, SWE-Bench 66% | Notes: Optimized for hybrid thinking and agentic workflows, strong in coding challenges.
- Grok 4 (xAI) – Strengths: General reasoning and structured output | Benchmarks: GPQA Science 86.4%, LiveCodeBench 79%, USAMO 37.5%, HMMT 90%, AIME 91.7%, ARC-AGI-2 15.9% | Notes: Balanced model for math, reasoning, and coding.
- Grok 4 Heavy w/ Python (xAI) – Strengths: Top coding, reasoning, and math performance | Benchmarks: GPQA Science 88.4%, LiveCodeBench 79.4%, USAMO 61.9%, HMMT 96.7%, AIME 100%, ARC-AGI-2 15.9% | Notes: Best-in-class Grok 4 variant optimized for Python-heavy tasks.
- Grok 4 w/ Python (xAI) – Strengths: Strong coding and reasoning with Python | Benchmarks: GPQA Science 87.5%, LiveCodeBench 79.3%, USAMO 37.5%, HMMT 93.9%, AIME 98.8%, ARC-AGI-2 8.6% | Notes: Efficient for programming-intensive tasks.
- GPT-5 (OpenAI) – Strengths: Exceptional reasoning, coding, and multimodal capabilities | Benchmarks: MMLU 91.2%, GPQA 79.3%, SWE-Bench 54.6% | Notes: OpenAI's latest flagship model with a large context window and advanced agentic capabilities.
- Gemini 2.5 Pro (Google DeepMind) – Strengths: Multimodal reasoning, translation, and math | Benchmarks: GPQA Science 83.3%, LiveCodeBench 74.2%, USAMO 34.5%, HMMT 82.5%, AIME 88.9%, ARC-AGI-2 4.9% | Notes: Excels at complex interactive and reasoning tasks.
- Llama 4 (Meta) – Strengths: Cost-efficient, local deployment, flexible fine-tuning | Benchmarks: MMLU 85%, GPQA 80%, SWE-Bench 69.4% | Notes: Open-source LLM ideal for research, local inference, and instruction-following.
- o3 (OpenAI) – Strengths: Reasoning & math tasks | Benchmarks: GPQA Science 79.6%, LiveCodeBench 72%, USAMO 21.7%, HMMT 58.3%, AIME 88.9%, ARC-AGI-2 6.5% | Notes: Competitive math and reasoning model.
- Qwen 3 (Alibaba) – Strengths: Coding, reasoning, and multilingual support | Benchmarks: SWE-Bench High, AIME 2025 93.3% | Notes: Designed for both language and multimodal tasks with strong domain versatility.
Contributions welcome! If you find a valuable LLM resource or maintain an open-source project, open a PR.
