A lot of people have asked how they can learn more about modern ML and LLMs, so I thought I'd compile an (incomplete) selection of papers for getting started.
1. Antebellum
- NNLM (Bengio 2003, UMontréal) - A Neural Probabilistic Language Model
- Linguistic Regularities (Mikolov 2013, Microsoft Research) - Linguistic Regularities in Continuous Space Word Representations
- Word2Vec (Mikolov 2013, Google) - Distributed Representations of Words and Phrases and their Compositionality (skip-gram with negative sampling is sketched after this list)
- GloVe (Pennington 2014, Stanford NLP) - GloVe: Global Vectors for Word Representation
- ELMo (Peters 2018, UWash) - Deep Contextualized Word Representations
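
To make the word-embedding papers concrete, here is a minimal sketch of Word2Vec's skip-gram objective with negative sampling. The toy corpus, dimensions, and training loop are illustrative assumptions, not the reference implementation:

```python
import numpy as np

# Toy skip-gram with negative sampling; everything here is illustrative.
rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                       # vocab size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))           # center-word vectors
W_out = rng.normal(0, 0.1, (V, D))          # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, window, k = 0.05, 2, 3                  # learning rate, context window, negatives per pair
for epoch in range(200):
    for pos, word in enumerate(corpus):
        c = idx[word]
        for off in range(-window, window + 1):
            if off == 0 or not (0 <= pos + off < len(corpus)):
                continue
            o = idx[corpus[pos + off]]
            # one positive (center, context) pair plus k random negatives
            # (a negative may occasionally collide with o; fine for a toy)
            pairs = [(o, 1.0)] + [(int(n), 0.0) for n in rng.integers(0, V, size=k)]
            for target, label in pairs:
                v_in, v_out = W_in[c].copy(), W_out[target].copy()
                grad = sigmoid(v_in @ v_out) - label   # d(logistic loss)/d(score)
                W_out[target] -= lr * grad * v_in
                W_in[c] -= lr * grad * v_out

# Words appearing in similar contexts end up with similar vectors:
cat, dog = W_in[idx["cat"]], W_in[idx["dog"]]
print(float(cat @ dog / (np.linalg.norm(cat) * np.linalg.norm(dog))))
```

The takeaway is that a single logistic loss over (center, context) pairs is enough to pull related words toward similar vectors; everything in this first section is a variation on that idea.
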
2. Early Transformers
- Transformers (Vaswani 2017, Google) - Attention Is All You Need (scaled dot-product attention is sketched after this list)
- GPT1 (Radford 2018, OpenAI) - Improving Language Understanding by Generative Pre-Training
- BERT (Devlin 2018, Google) - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT2 (Radford 2019, OpenAI) - Language Models are Unsupervised Multitask Learners
- RoBERTa (Liu 2019, FAIR) - RoBERTa: A Robustly Optimized BERT Pretraining Approach
- ALBERT (Lan 2019, Google) - ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- DistilBERT (Sanh 2019, Hugging Face) - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- | Distillation (Hinton 2015, Google) - Distilling the Knowledge in a Neural Network
- Sparse models (Gupta 2017, Google) - To prune, or not to prune: exploring the efficacy of pruning for model compression
- | Optimal Brain Damage (Le Cun 1989, AT&T Bell Labs) - Optimal Brain Damage
- | Magnitude Pruning (Han 2015, Stanford, NVIDIA) - Learning both Weights and Connections for Efficient Neural Networks
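
Since almost everything after this point builds on it, here is a hedged numpy sketch of single-head scaled dot-product attention. Real models wrap this in multi-head projections, residual connections, and layer norm, which the sketch omits; shapes and the causal mask are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity logits
    if causal:                                      # GPT-style: position i cannot attend to j > i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Toy example: 5 tokens, head dimension 8 (sizes are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(x, x, x, causal=True)   # causal self-attention
print(out.shape)  # (5, 8)
```
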
3. Scaling
- T5 (Raffel 2019, Google) - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Double descent (Nakkiran 2019, Harvard & OpenAI) - Deep Double Descent: Where Bigger Models and More Data Hurt
- Scaling Laws (Kaplan 2020, OpenAI) - Scaling Laws for Neural Language Models
- GPT-3 (Brown 2020, OpenAI) - Language Models are Few-Shot Learners
- | Locally Banded Sparse Attention (Child 2019, OpenAI) - Generating Long Sequences with Sparse Transformers
- | Gradient Noise Scale (McCandlish 2018, OpenAI) - An Empirical Model of Large-Batch Training
- Rotary Encodings (Su 2021, Zhuiyi) - RoFormer: Enhanced Transformer with Rotary Position Embedding
- Chinchilla (Hoffmann 2022, DeepMind) - Training Compute-Optimal Large Language Models (the fitted loss is sketched after this list)
- MoE (Shazeer 2017, Google) - Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- GShard (Lepikhin 2020, Google) - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Switch Transformer (Fedus 2021, Google) - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Transformer-XL (Dai 2019, CMU & Google) - Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Longformer (Beltagy 2020, AI2) - Longformer: The Long-Document Transformer
- Reformer (Kitaev 2020, Google) - Reformer: The Efficient Transformer
- BigBird (Zaheer 2020, Google) - Big Bird: Transformers for Longer Sequences
- Performer (Choromanski 2021, Google) - Rethinking Attention with Performers
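
For the scaling-law entries, the thing worth internalizing is the fitted loss form L(N, D) = E + A/N^α + B/D^β and how it trades parameters N against tokens D under a fixed compute budget C ≈ 6ND. The sketch below plugs in the constants published in the Chinchilla paper; treat them as approximate and dataset-dependent:

```python
import numpy as np

# Parametric loss from Hoffmann et al. 2022 ("Chinchilla"): L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the paper's published fit; treat them as approximate.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N, D):
    """Predicted final loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C):
    """For a FLOP budget C ~ 6*N*D, grid-search the (N, D) split minimizing predicted loss."""
    N = np.logspace(7, 13, 20000)          # candidate parameter counts
    D = C / (6 * N)                        # tokens implied by spending the rest of the budget
    i = np.argmin(predicted_loss(N, D))
    return N[i], D[i]

# At roughly Gopher's budget (~5.76e23 FLOPs) this fit prefers a model several times smaller
# than Gopher (280B params / 300B tokens), trained on far more tokens.
N, D = compute_optimal(5.76e23)
print(f"N ~ {N:.3g} params, D ~ {D:.3g} tokens ({D/N:.0f} tokens per parameter)")
```
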
4. Alignment
- LoRA (Hu 2021, Microsoft & CMU) - LoRA: Low-Rank Adaptation of Large Language Models
- PPO (Schulman 2017, OpenAI) - Proximal Policy Optimization Algorithms
- InstructGPT: RLHF (Ouyang 2022, OpenAI) - Training language models to follow instructions with human feedback
- DPO (Rafailov 2023, Stanford) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model (the loss is written out after this list)
- DeepSeekMath: GRPO (Shao 2024, DeepSeek) - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek-R1: RLVR (Guo 2025, DeepSeek) - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Scaling RL Compute (Khatri 2025, Meta) - The Art of Scaling Reinforcement Learning Compute for LLMs
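
As a concrete anchor for the preference-optimization entries, the DPO objective fits in a few lines: given summed token log-probs of a chosen and a rejected response under the policy and under a frozen reference model, the loss is -log sigmoid(beta * ((logp_chosen - logp_ref_chosen) - (logp_rejected - logp_ref_rejected))). The numbers below are made up for illustration:

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (Rafailov 2023).

    Each argument is the summed token log-probability of the full response
    under the policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log[pi(y_w|x) / pi_ref(y_w|x)]
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log[pi(y_l|x) / pi_ref(y_l|x)]
    logits = beta * (chosen_ratio - rejected_ratio)
    return -np.log(1.0 / (1.0 + np.exp(-logits)))              # -log sigmoid(logits)

# Illustrative numbers: the policy already slightly prefers the chosen response.
print(dpo_loss(policy_chosen_logp=-42.0, policy_rejected_logp=-47.0,
               ref_chosen_logp=-44.0, ref_rejected_logp=-45.0))
```

No reward model and no RL loop: the preference data is optimized against directly, which is the paper's whole point.
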
5. Agentic
- RAG (Lewis 2020, Meta) - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (a retrieve-then-condition sketch follows this list)
- Toolformer (Schick 2023, Meta) - Toolformer: Language Models Can Teach Themselves to Use Tools
- Scratchpad (Nye 2021, MIT & Google) - Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Chain-of-Thought (Wei 2022, Google) - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Coconut (Hao 2024, Meta) - Training Large Language Models to Reason in a Continuous Latent Space
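
The agentic entries mostly come down to wiring the model to external context and tools; the RAG recipe reduces to "retrieve, then condition". Below is a minimal sketch in which the `embed` and `retrieve` helpers, the corpus, and the prompt format are all stand-ins of mine, not anything from the paper; a real system uses a trained dense encoder and a vector index:

```python
import numpy as np

# Stand-in embedder: hash words into a fixed-size bag-of-words vector.
# A real RAG system would use a trained dense encoder instead.
def embed(text, dim=256):
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

documents = [
    "The Transformer architecture was introduced in the paper Attention Is All You Need.",
    "Chinchilla showed that many large models were trained on too few tokens.",
    "PagedAttention manages the KV cache in fixed-size blocks, like virtual memory pages.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=1):
    """Return the top-k documents by cosine similarity to the query."""
    scores = doc_vectors @ embed(query)
    return [documents[i] for i in np.argsort(-scores)[:k]]

query = "Which paper introduced the Transformer?"
context = "\n".join(retrieve(query, k=1))
# The retrieved passage is prepended to the prompt before calling the generator model.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

Toolformer, scratchpads, and chain-of-thought all follow the same shape: put extra tokens (retrieved text, tool results, intermediate reasoning) into the context and let the model condition on them.
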
6. ML Systems
- FlashAttention (Dao 2022, Stanford) - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- | CUDA Programming Guide (NVIDIA)
- | Learning CUDA by optimizing softmax (Pandya 2025)
- | Basic idea behind flash attention
- | Making Deep Learning Go Brrrr From First Principles (He 2022)
- | What Shapes Do Matrix Multiplications Like? (He 2024)
- | NVIDIA Deep Learning Performance (NVIDIA)
- | Building Machine Learning Systems for a Trillion Trillion FLOPs (He 2024)
- | Basic facts about GPUs
- | Tips for Optimizing GPU Performance Using Tensor Cores (NVIDIA)
- FlashAttention2 (Dao 2023, Stanford) - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention3 (Shah 2024, Colfax & Meta & NVIDIA & Princeton & Together AI) - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- BFloat16 (Wang 2019, Google) - BFloat16: The secret to high performance on Cloud TPUs
- Microscaling Formats (Rouhani 2023, Microsoft & AMD & Intel & Meta & NVIDIA & Qualcomm) - Microscaling Data Formats for Deep Learning
- LLM.int8() (Dettmers 2022, UWash & Meta & HuggingFace) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- SmoothQuant (Xiao 2022, MIT & NVIDIA) - SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- GPTQ (Frantar 2022, ETH Zurich) - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- LLM-QAT (Liu 2023, Meta) - LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
- AWQ (Lin 2023, MIT) - AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- QLoRA (Dettmers 2023, UWash) - QLoRA: Efficient Finetuning of Quantized LLMs
- BitNet (Wang 2023, Microsoft) - BitNet: Scaling 1-bit Transformers for Large Language Models
- 1.58 BitNet (Ma 2024, Microsoft) - The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- Quantization Scaling Laws (Dettmers 2022, UWash) - The case for 4-bit precision: k-bit Inference Scaling Laws
- KIVI (Liu 2024, Rice) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
- Intro to 8-bit Matrix Multiplication (Hugging Face blog) - A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale (absmax quantization is sketched after this list)
- Paged Attention / vLLM (Kwon 2023, Berkeley et al.) - Efficient Memory Management for Large Language Model Serving with PagedAttention
- Multi-Head Latent Attention (Liu 2024, DeepSeek) - DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- | How Attention Got So Efficient
- | Understanding Multi-Head Latent Attention
- DeepSeek Sparse Attention / Lightning Indexer (Liu 2025, DeepSeek) - DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention
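
Most of the quantization papers above build on one primitive: scale a tensor by its absolute maximum (per tensor, per row, or per block), round onto a low-bit integer grid, and dequantize on the fly. Here is a minimal row-wise absmax int8 round trip; the outlier handling in LLM.int8(), the second-order weight updates in GPTQ, and the activation-aware scaling in AWQ are refinements on top of this:

```python
import numpy as np

def quantize_rowwise_int8(W):
    """Row-wise absmax quantization: each row is scaled into [-127, 127] and rounded."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0   # one floating-point scale per row
    scale = np.where(scale == 0, 1.0, scale)               # avoid division by zero for all-zero rows
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(512, 512)).astype(np.float32)   # toy weight matrix
W_q, scale = quantize_rowwise_int8(W)
W_hat = dequantize(W_q, scale)

# ~4x smaller storage (int8 vs fp32) at the cost of a small reconstruction error.
print("max abs error:", float(np.abs(W - W_hat).max()))
print("bytes:", W.nbytes, "->", W_q.nbytes + scale.astype(np.float32).nbytes)
```
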
7. Optimizers
- SGD (Robbins 1951) - A Stochastic Approximation Method
- Momentum (Polyak 1964) - Some methods of speeding up the convergence of iteration methods
- AdaGrad (Duchi 2011) - Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
- RMSProp (Hinton 2012, Coursera) - rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude
- Adam (Kingma 2015, OpenAI) - Adam: A Method for Stochastic Optimization
- AdamW (Loshchilov 2019) - Decoupled Weight Decay Regularization
- Shampoo (Gupta 2018, Google) - Shampoo: Preconditioned Stochastic Tensor Optimization
- Lion (Chen 2023, Google) - Symbolic Discovery of Optimization Algorithms
- Muon (Jordan 2024) - Muon: An Optimizer for Hidden Layers in Neural Networks
- | Newton-Schulz - Iterative Newton-Schulz Orthogonalization (the cubic iteration is sketched after this list)
- | Deriving Muon (Bernstein 2025) - Deriving Muon
- | Whitening optimizer (Frans 2025, BAIR) - What really matters in matrix-whitening optimizers?
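
The step that separates Muon from the Adam family is orthogonalizing the matrix-shaped momentum update before applying it. Here is a hedged sketch using the classic cubic Newton-Schulz iteration; Muon itself uses a tuned quintic polynomial and runs in bfloat16, and `muon_like_step` below is an illustrative name of mine, not the real optimizer:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=15):
    """Approximate the orthogonal polar factor U V^T of G via cubic Newton-Schulz.

    X <- 1.5*X - 0.5*X X^T X converges when the singular values of X lie in (0, sqrt(3)),
    so the input is first normalized by its Frobenius norm.
    """
    X = G / (np.linalg.norm(G) + 1e-7)        # Frobenius norm => all singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                            # iterate on the "wide" orientation for cheaper X X^T
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transposed else X

def muon_like_step(W, momentum, grad, lr=0.02, beta=0.95):
    """One illustrative update: accumulate momentum, orthogonalize it, apply it."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return W - lr * update, momentum

# Toy check: the orthogonalized update has singular values pushed toward 1.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
momentum = np.zeros_like(W)
W, momentum = muon_like_step(W, momentum, grad=rng.normal(size=W.shape))
O = newton_schulz_orthogonalize(rng.normal(size=(64, 128)))
print(np.round(np.linalg.svd(O, compute_uv=False)[:5], 3))
```

The whitening-optimizer post above is a good companion read: several of these optimizers can be viewed as different approximations to the same "make the update spectrally well-conditioned" idea.
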