The purpose of this curriculum is to help new Elicit employees build the machine learning background they need, with a focus on language models. I’ve tried to strike a balance between papers that are relevant for deploying ML in production and techniques that matter for longer-term scalability.
If you don’t work at Elicit yet, we’re hiring ML and software engineers.
Recommended reading order:
- Read “Tier 1” for all topics
- Read “Tier 2” for all topics
- Etc.
✨ marks items added after 2024/4/1
Fundamentals
Tier 1
- A short introduction to machine learning
- But what is a neural network?
- Gradient descent, how neural networks learn
Tier 2
- ✨ An intuitive understanding of backpropagation
- What is backpropagation really doing?
- An introduction to deep reinforcement learning
Tier 3
- The spelled-out intro to neural networks and backpropagation: building micrograd
- Backpropagation calculus
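If you want something concrete to run alongside the material above, here is a minimal sketch of the gradient-descent loop those resources explain, fitting a single-parameter model to made-up data (the data, learning rate, and step count are all illustrative):

```python
# Fit y = w * x to toy data by gradient descent on mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2x

w = 0.0    # single parameter, initialized at zero
lr = 0.01  # learning rate

for step in range(200):
    # Forward pass: predictions and mean squared error loss.
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

    # Backward pass: dloss/dw, derived by hand for this one-parameter model.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)

    # Gradient descent update: step against the gradient.
    w -= lr * grad

print(round(w, 3))  # converges toward 2.0
```

Backpropagation (and micrograd) is the machinery for computing that `grad` automatically when the model has millions of parameters instead of one.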
Transformers
Tier 1
- ✨ But what is a GPT? Visual intro to transformers
- ✨ Attention in transformers, visually explained
- ✨ Attention? Attention!
- The Illustrated Transformer
- The Illustrated GPT-2 (Visualizing Transformer Language Models)
Tier 2
- ✨ Let's build the GPT Tokenizer
- ✨ Neural Machine Translation by Jointly Learning to Align and Translate
- The Annotated Transformer
- Attention Is All You Need
Tier 3
- A Practical Survey on Faster and Lighter Transformers
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- A Mathematical Framework for Transformer Circuits
Tier 4+
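As a hands-on companion to the attention explanations above, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V, with a single head, no masking, and no learned projections (shapes and values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in 'Attention Is All You Need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # query-key similarity scores
    weights = softmax(scores, axis=-1)              # each query's weights over keys sum to 1
    return weights @ V                              # weighted average of value vectors

# Toy example: 3 tokens, model dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

A full transformer layer wraps this in learned query/key/value projections, multiple heads, residual connections, and an MLP, which the posts above walk through.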
Key language models
Tier 1
- Language Models are Unsupervised Multitask Learners (GPT-2)
- Language Models are Few-Shot Learners (GPT-3)
Tier 2
- ✨ LLaMA: Open and Efficient Foundation Language Models (LLaMA)
- ✨ Efficiently Modeling Long Sequences with Structured State Spaces (video) (S4)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
- Evaluating Large Language Models Trained on Code (OpenAI Codex)
- Training language models to follow instructions with human feedback (OpenAI Instruct)
Tier 3
- ✨ Mistral 7B (Mistral)
- ✨ Mixtral of Experts (Mixtral)
- ✨ Gemini: A Family of Highly Capable Multimodal Models (Gemini)
- ✨ Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba)
- Scaling Instruction-Finetuned Language Models (Flan)
Tier 4+
- ✨ Consistency Models
- ✨ Model Card and Evaluations for Claude Models (Claude 2)
- ✨ OLMo: Accelerating the Science of Language Models
- ✨ PaLM 2 Technical Report (Palm 2)
- ✨ Textbooks Are All You Need II: phi-1.5 technical report (phi 1.5)
- ✨ Visual Instruction Tuning (LLaVA)
- A General Language Assistant as a Laboratory for Alignment
- Finetuned Language Models Are Zero-Shot Learners (Google Instruct)
- Galactica: A Large Language Model for Science
- LaMDA: Language Models for Dialog Applications (Google Dialog)
- OPT: Open Pre-trained Transformer Language Models (Meta GPT-3)
- PaLM: Scaling Language Modeling with Pathways (PaLM)
- Program Synthesis with Large Language Models (Google Codex)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Gopher)
- Solving Quantitative Reasoning Problems with Language Models (Minerva)
- UL2: Unifying Language Learning Paradigms (UL2)
Training and fine-tuning
Tier 2
- ✨ Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- Learning to summarize with human feedback
- Training Verifiers to Solve Math Word Problems
Tier 3
- ✨ Pretraining Language Models with Human Preferences
- ✨ Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
- LoRA: Low-Rank Adaptation of Large Language Models
- Unsupervised Neural Machine Translation with Generative Language Models Only
Tier 4+
- ✨ Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
- ✨ Improving Code Generation by Training with Natural Language Feedback
- ✨ Language Modeling Is Compression
- ✨ LIMA: Less Is More for Alignment
- ✨ Learning to Compress Prompts with Gist Tokens
- ✨ Lost in the Middle: How Language Models Use Long Contexts
- ✨ QLoRA: Efficient Finetuning of Quantized LLMs
- ✨ Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- ✨ Reinforced Self-Training (ReST) for Language Modeling
- ✨ Solving olympiad geometry without human demonstrations
- ✨ Tell, don't show: Declarative facts influence how LLMs generalize
- ✨ Textbooks Are All You Need
- ✨ TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
- ✨ Training Language Models with Language Feedback at Scale
- ✨ Turing Complete Transformers: Two Transformers Are More Powerful Than One
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
- Diffusion-LM Improves Controllable Text Generation
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
- Efficient Training of Language Models to Fill in the Middle
- ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
- Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
- True Few-Shot Learning with Prompts -- A Real-World Perspective
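To make the parameter-efficient fine-tuning papers above (LoRA, QLoRA, prefix-tuning) concrete, here is a rough PyTorch sketch of the core LoRA idea: freeze a pretrained weight matrix and train only a low-rank update added to it. The class name and hyperparameters are illustrative, not taken from any paper's released code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen

        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(out_f, r))        # zero init: training starts at the base model
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrap an existing layer; only A and B receive gradients during fine-tuning.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 768])
```

QLoRA applies the same idea on top of a quantized base model, so the frozen weights also take far less memory.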
ML in practice
Tier 1
- Machine Learning in Python: Main developments and technology trends in data science, machine learning, and AI
- Machine Learning: The High Interest Credit Card of Technical Debt
Tier 2