A lot of people have asked how they can learn more about modern ML and LLMs, so I thought I'd compile an (incomplete) selection of papers for getting started.
1. Antebellum
- NNLM (Bengio 2003, UMontréal) - A Neural Probabilistic Language Model
- Linguistic Regularities (Mikolov 2013, Microsoft Research) - Linguistic Regularities in Continuous Space Word Representations
- Word2Vec (Mikolov 2013, Microsoft Research) - Distributed Representations of Words and Phrases and their Compositionality (see the analogy-arithmetic sketch after this list)
- GloVe (Pennington 2014, Stanford NLP) - GloVe: Global Vectors for Word Representation
- ELMo (Peters 2018, UWash) - Deep Contextualized Word Representations
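The Mikolov papers above are where the famous "king - man + woman ≈ queen" arithmetic comes from. A minimal sketch, assuming pretrained vectors (e.g. GloVe or word2vec) are loaded into a dict; the tiny random vectors here are only placeholders so the snippet runs standalone:

```python
# Analogy-by-vector-arithmetic sketch. Replace `vecs` with real pretrained embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
vecs = {w: rng.standard_normal(50) for w in vocab}  # stand-in for real embeddings

def nearest(query, exclude):
    """Vocabulary word whose embedding has the highest cosine similarity to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(vecs[w], query))

# With real embeddings, king - man + woman lands near "queen".
query = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))
```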
2. Early Transformers
- Transformers (Vaswani 2017, Google) - Attention Is All You Need (see the attention sketch after this list)
- GPT1 (Radford 2018, OpenAI) - Improving Language Understanding by Generative Pre-Training
- BERT (Devlin 2018, Google) - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT2 (Radford 2019, OpenAI) - Language Models are Unsupervised Multitask Learners
- RoBERTa (Liu 2019, FAIR) - RoBERTa: A Robustly Optimized BERT Pretraining Approach
- ALBERT (Lan 2019, Google) - ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- DistilBERT (Sanh 2019, Hugging Face) - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Lottery hypothesis (Frankle 2018, MIT) - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- | Magnitude Pruning (Han 2015, Stanford & NVIDIA) - Learning both Weights and Connections for Efficient Neural Networks
- | Sparse models (Gupta 2017, Google) - To prune, or not to prune: exploring the efficacy of pruning for model compression
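For the Transformer entries above, here is the core computation stripped down to a single head of scaled dot-product attention; the shapes, the causal mask, and the absence of multi-head projections, batching, and dropout are deliberate simplifications:

```python
# Single-head scaled dot-product attention, the building block of "Attention Is All You Need".
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=True):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq) similarity logits
    if causal:                               # GPT-style: a token cannot attend to the future
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    return softmax(scores) @ V               # weighted average of the values

seq, d_model = 5, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq, d_model)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 16)
```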
3. Scaling
- T5 (Raffel 2019, Google) - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Double descent (Nakkiran 2019, Harvard & OpenAI) - Deep Double Descent: Where Bigger Models and More Data Hurt
- Scaling Laws (Kaplan 2020, OpenAI) - Scaling Laws for Neural Language Models
- GPT-3 (Brown 2020, OpenAI) - Language Models are Few-Shot Learners
- | Locally Banded Sparse Attention (Child 2019, OpenAI) - Generating Long Sequences with Sparse Transformers
- | Gradient Noise Scale (McCandlish 2018, OpenAI) - An Empirical Model of Large-Batch Training
- Rotary Encodings (Su 2021, Zhuiyi) - RoFormer: Enhanced Transformer with Rotary Position Embedding
- Chinchilla (Hoffmann 2022, DeepMind) - Training Compute-Optimal Large Language Models
- MoE (Shazeer 2017, Google) - Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (see the gating sketch after this list)
- GShard (Lepikhin 2020, Google) - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Switch Transformer (Fedus 2021, Google) - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Transformer-XL (Dai 2019, CMU & Google) - Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Longformer (Beltagy 2020, AI2) - Longformer: The Long-Document Transformer
- Reformer (Kitaev 2020, Google) - Reformer: The Efficient Transformer
- BigBird (Zaheer 2020, Google) - Big Bird: Transformers for Longer Sequences
- Performer (Choromanski 2021, Google) - Rethinking Attention with Performers
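The MoE / GShard / Switch line above all hinges on a learned router that sends each token to a small number of experts. A minimal top-k gating sketch, using dense dispatch and omitting the capacity factors, load-balancing losses, and expert parallelism those papers depend on in practice; all sizes and k are illustrative:

```python
# Top-k softmax gating over a set of feed-forward experts (illustrative sizes).
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)              # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # dense loop; real systems dispatch sparsely
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```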
4. Alignment and Agentic
- RLHF (Ouyang 2022, OpenAI) - Training language models to follow instructions with human feedback
- DPO (Rafailov 2023, Stanford) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model (see the loss sketch after this list)
- GRPO (Shao 2024, DeepSeek) - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- Scratchpad (Nye 2021, MIT & Google) - Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Chain-of-Thought (Wei 2022, Google) - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Toolformer (Schick 2023, Meta AI) - Toolformer: Language Models Can Teach Themselves to Use Tools
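A hedged sketch of the DPO objective from the Rafailov paper above: given per-sequence log-probabilities of a preferred and a rejected response under the policy and a frozen reference model, the loss is a logistic loss on the difference of implicit rewards. Computing those sequence log-probs from actual models is omitted here, and `beta` and the toy batch are illustrative:

```python
# DPO loss on a batch of preference pairs, given precomputed sequence log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward for each response: beta * (log pi_theta - log pi_ref).
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Push the chosen reward above the rejected one via -log sigmoid(margin).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of 4 pairs; in practice these log-probs come from the actual models.
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```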
5. Engineering Tricks
- DeepSpeed (Rasley 2020, Microsoft) - ZeRO optimizer; trillion-scale feasibility
- NVLAMB Optimizer (Hsieh 2020, NVIDIA) - Large-batch training stabilizer
- Mesh-TensorFlow (Shazeer 2018, Google) - Multi-axis distributed tensor computation
- TPU v3 Scaling (Jouppi 2020, Google) - Domain-specific supercomputer design for training deep networks at scale
- FSDP / FairScale (Baines 2021, FAIR) - Sharded data-parallel training
- LoRA (Hu 2021, Microsoft) - LoRA: Low-Rank Adaptation of Large Language Models (see the adapter sketch after this list)
- Quantization (Dettmers 2022, UWash & FAIR) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- Quantization Scaling Laws (Dettmers 2022, UWash) - The case for 4-bit precision: k-bit Inference Scaling Laws
- Flash Attention (Dao 2022, Stanford) - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- | CUDA Programming Guide (NVIDIA)
- | Deep Learning Performance (NVIDIA)
- | Making Deep Learning Go Brrrr From First Principles (He 2022)
- Flash Attention 2 (Dao 2023, Stanford) - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Flash Attention 3 (Shah 2024, Colfax & Princeton) - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- GQA (Ainslie 2023, Google) - GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- KV Cache (Pope 2022, Google) - Efficiently Scaling Transformer Inference
- vLLM / Paged Attention (Kwon 2023, UC Berkeley) - Efficient Memory Management for Large Language Model Serving with PagedAttention
- Muon (Liu 2025, Moonshot AI & UCLA) - Muon is Scalable for LLM Training
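And since LoRA is probably the easiest of these to try at home, a minimal sketch of the adapter: freeze the pretrained weight and learn a low-rank update (alpha/r) * B @ A on top of it. Layer sizes, rank, and alpha are illustrative; in practice you'd wrap specific attention projections of a pretrained model:

```python
# LoRA adapter around a frozen linear layer (illustrative sizes and rank).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection, small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero init: update starts at 0
        self.scale = alpha / r                 # LoRA scaling factor

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```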