A lot of people have asked how they can learn more about modern ML and LLMs, so I thought I'd compile an (incomplete) selection of papers for getting started.
1. Antebellum
- NNLM (Bengio 2003, UMontréal) - A Neural Probabilistic Language Model
- Linguistic Regularities (Mikolov 2013, Microsoft Research) - Linguistic Regularities in Continuous Space Word Representations (the analogy arithmetic is sketched after this list)
- Word2Vec (Mikolov 2013, Microsoft Research) - Efficient Estimation of Word Representations in Vector Space
- Word2Vec (Mikolov 2013, Microsoft Research) - Distributed Representations of Words and Phrases and their Compositionality
- GloVe (Pennington 2014, Stanford NLP) - GloVe: Global Vectors for Word Representation
- ELMo (Peters 2018, UWash) - Deep Contextualized Word Representations
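
To get a feel for what these embedding papers buy you, here is the analogy arithmetic from the Mikolov papers as a minimal sketch. It assumes pretrained vectors loaded into a plain `vectors: dict[str, np.ndarray]`; the loader and names are illustrative, not from any of the papers.

```python
import numpy as np

def nearest(query: np.ndarray, vectors: dict, exclude: set) -> str:
    """Word whose embedding has the highest cosine similarity to `query`."""
    q = query / np.linalg.norm(query)
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# The classic "king - man + woman ~ queen" test from Mikolov 2013:
# vectors = ...  # load word2vec or GloVe embeddings here
# target = vectors["king"] - vectors["man"] + vectors["woman"]
# print(nearest(target, vectors, exclude={"king", "man", "woman"}))
```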
2. Early Transformers
- Transformers (Vaswani 2017, Google) - Attention Is All You Need (a minimal attention sketch follows this section)
- GPT1 (Radford 2018, OpenAI) - Improving Language Understanding by Generative Pre-Training
- BERT (Devlin 2018, Google) - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT2 (Radford 2019, OpenAI) - Language Models are Unsupervised Multitask Learners
- RoBERTa (Liu 2019, FAIR) - RoBERTa: A Robustly Optimized BERT Pretraining Approach
- ALBERT (Lan 2019, Google) - ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- DistilBERT (Sanh 2019, Hugging Face) - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- | Distillation (Hinton 2015, Google) - Distilling the Knowledge in a Neural Network
- Sparse models (Gupta 2017, Google) - To prune, or not to prune: exploring the efficacy of pruning for model compression
- | Optimal Brain Damage (Le Cun 1989, AT&T Bell Labs) - Optimal Brain Damage
- | Magnitude Pruning (Han 2015, Stanford, NVIDIA) - Learning both Weights and Connections for Efficient Neural Networks
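
If you implement one thing from this section, make it the scaled dot-product attention at the heart of Vaswani 2017. A minimal single-head sketch (no masking, no multi-head projections; shapes chosen arbitrarily):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V for one head; Q, K, V: (seq, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq, seq) similarity logits
    return softmax(scores, axis=-1) @ V    # weighted average of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```

Multi-head attention just runs several copies of this on learned projections of the same inputs and concatenates the results.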
3. Scaling
3.0 Scaling Theory
- Lottery Ticket Hypothesis (Frankle 2018, MIT) - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Double descent (Nakkiran 2019, Harvard, OpenAI) - Deep Double Descent: Where Bigger Models and More Data Hurt
- Grokking (Power 2022, OpenAI) - Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets
- Scaling Laws (Kaplan 2020, OpenAI) - Scaling Laws for Neural Language Models
- mu-p (Yang 2022, Microsoft) - Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- | The Practitioner's Guide to the Maximal Update Parameterization
- | Some Math behind Neural Tangent Kernel
- Chinchilla (Hoffmann 2022, DeepMind) - Training Compute-Optimal Large Language Models
- How to Scale
- T5 (Raffel 2019, Google) - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- GPT3 (Brown 2020, OpenAI) - Language Models are Few-Shot Learners
- | Locally Banded Sparse Attention (Child 2019, OpenAI) - Generating Long Sequences with Sparse Transformers
- | Gradient Noise Scale (McCandlish 2018, OpenAI) - An Empirical Model of Large-Batch Training
- Codex (Chen 2021, OpenAI) - Evaluating Large Language Models Trained on Code
- Transformer-XL (Dai 2019, CMU & Google) - Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- | Effective Receptive Field (Khandelwal 2018) - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
- | Character-Level Transformers (Al-Rfou 2018) - Character-Level Language Modeling with Deeper Self-Attention
- Longformer (Beltagy 2020, AI2) - Longformer: The Long-Document Transformer
- Reformer (Kitaev 2020, Google) - Reformer: The Efficient Transformer
- BigBird (Zaheer 2020, Google) - Big Bird: Transformers for Longer Sequences
- Performer (Choromanski 2021, Google) - Rethinking Attention with Performers
- MoE (Shazeer 2017, Google) - Outrageously Large Neural Networks: The Sparsely-Gated Mixture-Of-Experts Layer (a toy top-k router is sketched after this section)
- GShard (Lepikhin 2020, Google) - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Switch Transformer (Fedus 2021, Google) - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
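
The MoE papers above all revolve around one mechanism: a learned gate picks a few experts per token and mixes their outputs. A toy dense sketch in the spirit of Shazeer 2017 (real systems add load-balancing losses and capacity limits, omitted here; all names are illustrative):

```python
import numpy as np

def moe_forward(x, W_gate, experts, k=2):
    """Route each token to its top-k experts, mixing outputs by gate weight.
    x: (tokens, d); W_gate: (d, n_experts); experts: list of (d, d) matrices."""
    logits = x @ W_gate                        # (tokens, n_experts) gate scores
    top = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k largest gates
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = np.exp(logits[t, top[t]])
        g /= g.sum()                           # softmax over selected experts only
        for weight, e in zip(g, top[t]):
            out[t] += weight * (x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
d, n_exp = 8, 4
x = rng.normal(size=(5, d))
W_gate = rng.normal(size=(d, n_exp))
experts = [rng.normal(size=(d, d)) for _ in range(n_exp)]
print(moe_forward(x, W_gate, experts).shape)  # (5, 8)
```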
4. Alignment, RL, Reasoning
4.1 Traditional RL
- MDP framework (Bellman 1957) - A Markovian Decision Process
- Bellman equations (Bellman 1957) - Dynamic Programming
- Q-learning (Watkins 1989) - Learning from Delayed Rewards
- DQN (Mnih 2013, DeepMind) - Playing Atari with Deep Reinforcement Learning
- REINFORCE (Williams 1992) - Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
- Actor-Critic (Konda 2000, MIT) - Actor-Critic Algorithms
- A3C (Mnih 2016, Google) - Asynchronous Methods for Deep Reinforcement Learning
- A2C (Wu 2017, OpenAI) - Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
- PPO (Schulman 2017, OpenAI) - Proximal Policy Optimization Algorithms
- SAC (Haarnoja 2018, UC Berkeley) - Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- AlphaStar (Vinyals 2019, DeepMind) - Grandmaster level in StarCraft II using multi-agent reinforcement learning
- LoRA (Hu 2021, Microsoft) - LoRA: Low-Rank Adaptation of Large Language Models
- RLHF (InstructGPT) (Ouyang 2022, OpenAI) - Training language models to follow instructions with human feedback
- DPO (Rafailov 2023, Stanford) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Let's verify step by step (Lightman 2023, OpenAI) - Let's Verify Step by Step
- GRPO (DeepSeekMath) (Shao 2024, DeepSeek) - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (its group-relative advantage is sketched after this section)
- ScaleRL (Khatri 2025, Meta) - The Art of Scaling Reinforcement Learning Compute for LLMs
- APRIL (Zhou 2025, CMU) - APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation
- PODS (Xu 2025, CMU) - Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
- SPICE (Liu 2025, Meta) - SPICE: Self-Play In Corpus Environments Improves Reasoning
- SPIRAL (Liu 2025) - SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
- Absolute Zero (Zhao 2025, Tsinghua) - Absolute Zero: Reinforced Self-play Reasoning with Zero Data
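
A good mental anchor for the recent RL items is GRPO's critic-free advantage estimate: sample a group of completions per prompt and z-score the rewards within the group. A minimal sketch of just that step (the PPO-style clipped policy update it feeds into is omitted):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: z-score each reward within its prompt's group.
    rewards: (n_prompts, group_size) scalar rewards for sampled completions."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)   # eps guards all-equal groups

# e.g. 2 prompts, 4 sampled completions each, verifier rewards in {0, 1}
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```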
5. Agentic, Inference-time compute
Thinking
- Chain-of-Thought (Wei 2022, Google) - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Scratchpad (Nye 2021, MIT & Google) - Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Let's think step by step (Kojima 2022, Google) - Large Language Models are Zero-Shot Reasoners
- Coconut (Hao 2024, Meta) - Training Large Language Models to Reason in a Continuous Latent Space
- STaR (Zelikman 2022, Stanford) - STaR: Bootstrapping Reasoning With Reasoning
- RLVR (DeepSeek-R1) (Guo 2025, DeepSeek) - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- RAG (Lewis 2020, Meta) - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (a toy retrieval step is sketched after this section)
- Toolformer (Schick 2023, Meta) - Toolformer: Language Models Can Teach Themselves to Use Tools
- Agent Swarm (Kimi K2.5) (Bai 2025, Moonshot AI) - Kimi K2.5: Visual Agentic Intelligence
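
The RAG and tool-use papers share a simple skeleton: retrieve relevant text, splice it into the prompt, then generate. A toy dense-retrieval step, assuming `doc_vecs` rows and the query embedding are unit-normalized (nothing here is a specific paper's API):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Return the k docs whose embeddings best match the query."""
    sims = doc_vecs @ query_vec               # cosine similarity for unit vectors
    top = np.argsort(sims)[-k:][::-1]         # best-first indices
    return [docs[i] for i in top]

def build_prompt(question, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# usage sketch, assuming embed() returns unit-norm vectors:
# passages = retrieve(embed(q), doc_vecs, docs)
# prompt = build_prompt(q, passages)  # then hand `prompt` to the generator
```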
6. Research Revival
- Engram (Cheng 2025, DeepSeek) - Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
- Attention Residuals (Chen 2026, Moonshot AI) - Attention Residuals
- | mHC: Manifold-Constrained Hyper-Connections (Xie 2025, DeepSeek) - mHC: Manifold-Constrained Hyper-Connections
ML Systems
- FlashAttention (Dao 2022, Stanford) - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- | CUDA Programming Guide (NVIDIA)
- | Learning CUDA by optimizing softmax (Pandya 2025)
- | Basic idea behind flash attention
- | Making Deep Learning Go Brrrr From First Principles (He 2022)
- | What Shapes Do Matrix Multiplications Like? (He 2024)
- | NVIDIA Deep Learning Performance (NVIDIA)
- | Building Machine Learning Systems for a Trillion Trillion FLOPs (He 2024)
- | Basic facts about GPUs
- | Tips for Optimizing GPU Performance Using Tensor Cores (NVIDIA)
- FlashAttention2 (Dao 2023, Stanford) - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention3 (Shah 2024, Colfax & Meta & NVIDIA & Princeton) - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- BFloat16 (Wang 2019, Google) - BFloat16: The secret to high performance on Cloud TPUs
- Microscaling Formats (Rouhani 2023, Microsoft & AMD & Intel & Meta & NVIDIA & Qualcomm) - Microscaling Data Formats for Deep Learning
- LLM.int8() (Dettmers 2022, UWash & Meta & HuggingFace) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- SmoothQuant (Xiao 2022, MIT & NVIDIA) - SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- GPTQ (Frantar 2022, IST Austria) - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- LLM-QAT (Liu 2023, Meta) - LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
- AWQ (Lin 2023, MIT) - AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- QLoRA (Dettmers 2023, UWash) - QLoRA: Efficient Finetuning of Quantized LLMs
- BitNet (Wang 2023, Microsoft) - BitNet: Scaling 1-bit Transformers for Large Language Models
- 1.58 BitNet (Ma 2024, Microsoft) - The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- Quantization Scaling Laws (Dettmers 2022, UWash) - The case for 4-bit precision: k-bit Inference Scaling Laws
- KIVI (Liu 2024, Rice) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
- TurboQuant (Zandieh 2025, Google) - TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
- 8-bit Matmul Intro (Hugging Face) - A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale
- Paged Attention / vLLM (Kwon 2023, Berkeley et al.) - Efficient Memory Management for Large Language Model Serving with PagedAttention
- Deepseek-V2: Multi-Head Latent Attention (Liu 2024, DeepSeek) - DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- | How Attention Got So Efficient
- | Understanding Multi-Head Latent Attention
- DeepSeek-V3.2: DSA Lightning Indexer (Liu 2025, DeepSeek) - DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention
- Orca (Yu 2022, SNU & FriendliAI) - Orca: A Distributed Serving System for Transformer-Based Generative Models
- Survey of efficient inference for LLMs (Zhou 2024) - A Survey on Efficient Inference for Large Language Models
- Speculative Decoding (Leviathan 2022, Google) - Fast Inference from Transformers via Speculative Decoding (its accept/reject rule is sketched after this section)
- | Looking back on speculative decoding (Google) - Looking back on speculative decoding
- Speculative speculative decoding (Kumar 2026, Google) - Speculative Speculative Decoding
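
The core of Leviathan 2022 is an accept/reject rule that keeps the target model's output distribution exact: accept a drafted token x with probability min(1, p_target(x)/p_draft(x)), otherwise resample from the normalized residual max(p_target - p_draft, 0). A distilled sketch over explicit next-token distributions (real systems get these from two models' logits and draft several tokens per step):

```python
import numpy as np

def speculative_step(p_target, p_draft, rng):
    """One draft-token step: returns (token, accepted); preserves p_target exactly."""
    x = rng.choice(len(p_draft), p=p_draft)            # cheap draft model proposes x
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x, True                                 # accept the drafted token
    residual = np.maximum(p_target - p_draft, 0.0)     # resample where target > draft
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False

rng = np.random.default_rng(0)
p_t = np.array([0.5, 0.3, 0.2])   # target model's next-token distribution
p_d = np.array([0.2, 0.2, 0.6])   # draft model's distribution
print(speculative_step(p_t, p_d, rng))
```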
Vision / Multimodal
- Vision Transformers (ViT) (Dosovitskiy 2020, Google) - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (patch embedding is sketched after this list)
- Data-efficient Image Transformers (DeiT) (Touvron 2021, Meta) - Training data-efficient image transformers & distillation through attention
- LLaVA (Liu 2023, UW-Madison & Microsoft) - Visual Instruction Tuning
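
The ViT front end really is "an image is worth 16x16 words": slice the image into patches, flatten each, and linearly project to the model width. A minimal patch-embedding sketch (the learned projection is stubbed with a random matrix):

```python
import numpy as np

def patchify(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """(H, W, C) image -> (num_patches, patch*patch*C) flattened patches."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (Hp, Wp, patch, patch, C)
    return x.reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
patches = patchify(img)                       # (196, 768) for 14x14 patches
W_embed = rng.normal(size=(768, 512))         # stand-in for the learned projection
tokens = patches @ W_embed                    # (196, 512) patch tokens for the transformer
print(patches.shape, tokens.shape)
```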
Optimizers
- SGD (Robbins 1951) - A Stochastic Approximation Method
- Momentum (Polyak 1964) - Some methods of speeding up the convergence of iteration methods
- AdaGrad (Duchi 2011) - Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
- RMSProp (Hinton 2012, Coursera) - rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude
- Adam (Kingma 2015, OpenAI) - Adam: A Method for Stochastic Optimization
- AdamW (Loshchilov 2019) - Decoupled Weight Decay Regularization
- Shampoo (Gupta 2018, Google) - Shampoo: Preconditioned Stochastic Tensor Optimization
- Lion (Chen 2023, Google) - Symbolic Discovery of Optimization Algorithms
- Muon (Jordan 2024) - Muon: An Optimizer for Hidden Layers in Neural Networks
- | Muon is Scalable for LLM Training
- | Newton-Schulz - Iterative Newton-Schulz Orthogonalization (sketched after this list)
- | Deriving Muon (Bernstein 2025) - Deriving Muon
- | Whitening optimizer (Frans 2025, BAIR) - What really matters in matrix-whitening optimizers?
- MuonClip (Kimi K2) (Bai 2025, Moonshot AI) - Kimi K2: Open Agentic Intelligence
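
Muon's signature move is orthogonalizing each weight matrix's momentum-averaged gradient with a few Newton-Schulz iterations instead of an exact SVD. A sketch of that orthogonalization using the quintic coefficients from Jordan's post (illustrative only, not a drop-in optimizer):

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Approximately replace G's singular values with 1 (i.e. return ~ U V^T)."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic coefficients from the Muon post
    transpose = G.shape[0] > G.shape[1]
    X = G.T if transpose else G               # iterate on the wide orientation
    X = X / (np.linalg.norm(X) + eps)         # spectral norm <= Frobenius norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # polynomial pushes singular values to ~1
    return X.T if transpose else X

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 128))
O = newton_schulz_orth(G)
print(np.linalg.svd(O, compute_uv=False)[:4])  # singular values driven toward 1
```

A handful of matmuls suffices because the iteration only reshapes singular values while keeping the singular vectors, which is exactly the "whitening" the follow-up posts above analyze.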
Footnotes
- Byte-pair tokenization (Sennrich 2016, UEdinburgh) - Neural Machine Translation of Rare Words with Subword Units
- Unembedding Tying (Inan 2016, Stanford NLP) - Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
- Unembedding Tying (Press 2016, UTel-Aviv) - Using the Output Embedding to Improve Language Models
- RoPE (Su 2021) - RoFormer: Enhanced Transformer with Rotary Position Embedding (a minimal implementation is sketched at the end of this list)
- | You could have designed state of the art positional encoding
- Reasoning is fake (Pfau 2024, NYU) - Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
- Model Nondeterminism (He 2025, Thinking Machines) - Defeating Nondeterminism in LLM Inference
- Streaming LLM Attention Sinks (Xiao 2023, MIT) - Efficient Streaming Language Models with Attention Sinks
- Circuits (Ameisen 2025, Anthropic) - Circuit Tracing: Revealing Computational Graphs in Language Models
- H-neurons (Gao 2025, Tsinghua) - H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs
- Emotions (Sofroniew 2026, Anthropic) - Emotion Concepts and their Function in a Large Language Model
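
Closing with one footnote worth implementing: RoPE is easier to grok in code than in the paper's notation. Pair up feature dimensions and rotate each pair by a position-dependent angle; the dot product of two rotated vectors then depends only on their relative position. A minimal sketch (base 10000 as in Su 2021):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding. x: (seq, d) with d even."""
    seq, d = x.shape
    pos = np.arange(seq)[:, None]                  # (seq, 1) token positions
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    theta = pos * freqs                            # (seq, d/2) rotation angles
    x1, x2 = x[:, 0::2], x[:, 1::2]                # split dims into rotation pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[:, 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.ones((4, 8))
print(rope(q)[1, :4])  # identical vectors get position-dependent rotations
```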