A lot of people have asked how they can learn more about modern ML and LLMs, so I thought I'd compile an (incomplete) selection of papers for getting started.
1. Antebellum
- NNLM (Bengio 2003, UMontréal) - A Neural Probabilistic Language Model
- Linguistic Regularities (Mikolov 2013, Microsoft Research) - Linguistic Regularities in Continuous Space Word Representations
- Word2Vec (Mikolov 2013, Microsoft Research) - Distributed Representations of Words and Phrases and their Compositionality (see the analogy-arithmetic sketch after this list)
- GloVe (Pennington 2014, Stanford NLP) - GloVe: Global Vectors for Word Representation
- ELMo (Peters 2018, UWash) - Deep Contextualized Word Representations
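The Mikolov papers above are where the famous "king - man + woman ≈ queen" arithmetic comes from. A minimal sketch, assuming pretrained vectors (e.g. GloVe or word2vec) are loaded into a dict; the tiny random vectors here are only placeholders so the snippet runs standalone:

```python
# Analogy-by-vector-arithmetic sketch. Replace `vecs` with real pretrained embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
vecs = {w: rng.standard_normal(50) for w in vocab}  # stand-in for real embeddings

def nearest(query, exclude):
    """Vocabulary word whose embedding has the highest cosine similarity to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(vecs[w], query))

# With real embeddings, king - man + woman lands near "queen".
query = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))
```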
2. Early Transformers
- Transformers (Vaswani 2017, Google) - Attention Is All You Need (see the attention sketch after this list)
- GPT1 (Radford 2018, OpenAI) - Improving Language Understanding by Generative Pre-Training
- BERT (Devlin 2018, Google) - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT2 (Radford 2019, OpenAI) - Language Models are Unsupervised Multitask Learners
- RoBERTa (Liu 2019, FAIR) - RoBERTa: A Robustly Optimized BERT Pretraining Approach
- ALBERT (Lan 2019, Google) - ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- DistilBERT (Sanh 2019, Hugging Face) - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Lottery hypothesis (Frankle 2018, MIT) - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- | Magnitude Pruning (Han 2015, Stanford & NVIDIA) - Learning both Weights and Connections for Efficient Neural Networks
- | Sparse models (Gupta 2017, Google) - To prune, or not to prune: exploring the efficacy of pruning for model compression
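For the Transformer entries above, here is the core computation stripped down to a single head of scaled dot-product attention; the shapes, the causal mask, and the absence of multi-head projections, batching, and dropout are deliberate simplifications:

```python
# Single-head scaled dot-product attention, the building block of "Attention Is All You Need".
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=True):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq) similarity logits
    if causal:                               # GPT-style: a token cannot attend to the future
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    return softmax(scores) @ V               # weighted average of the values

seq, d_model = 5, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq, d_model)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 16)
```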
3. Scaling
- T5 (Raffel 2019, Google) - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Double descent (Nakkiran 2019, Harvard & OpenAI) - Deep Double Descent: Where Bigger Models and More Data Hurt
- Scaling Laws (Kaplan 2020, OpenAI) - Scaling Laws for Neural Language Models
- GPT-3 (Brown 2020, OpenAI) - Language Models are Few-Shot Learners
- | Locally Banded Sparse Attention (Child 2019, OpenAI) - Generating Long Sequences with Sparse Transformers
- | Gradient Noise Scale (McCandlish 2018, OpenAI) - An Empirical Model of Large-Batch Training
- Rotary Encodings (Su 2021, Zhuiyi) - RoFormer: Enhanced Transformer with Rotary Position Embedding
- Chinchilla (Hoffmann 2022, DeepMind) - Training Compute-Optimal Large Language Models
- MoE (Shazeer 2017, Google) - Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (see the gating sketch after this list)
- GShard (Lepikhin 2020, Google) - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Switch Transformer (Fedus 2021, Google) - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Transformer-XL (Dai 2019, CMU & Google) - Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Longformer (Beltagy 2020, AI2) - Longformer: The Long-Document Transformer
- Reformer (Kitaev 2020, Google) - Reformer: The Efficient Transformer
- BigBird (Zaheer 2020, Google) - Big Bird: Transformers for Longer Sequences
- Performer (Choromanski 2021, Google) - Rethinking Attention with Performers
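The MoE / GShard / Switch line above all hinges on a learned router that sends each token to a small number of experts. A minimal top-k gating sketch, using dense dispatch and omitting the capacity factors, load-balancing losses, and expert parallelism those papers depend on in practice; all sizes and k are illustrative:

```python
# Top-k softmax gating over a set of feed-forward experts (illustrative sizes).
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)              # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # dense loop; real systems dispatch sparsely
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```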
4. Alignment and Agentic
- RLHF (Ouyang 2022, OpenAI) - Training language models to follow instructions with human feedback
- DPO (Rafailov 2023, Stanford) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model (see the loss sketch after this list)
- GRPO (Shao 2024, DeepSeek) - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- Scratchpad (Nye 2021, MIT & Google) - Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Chain-of-Thought (Wei 2022, Google) - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Toolformer (Schick 2023, Meta AI) - Toolformer: Language Models Can Teach Themselves to Use Tools
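A hedged sketch of the DPO objective from the Rafailov paper above: given per-sequence log-probabilities of a preferred and a rejected response under the policy and a frozen reference model, the loss is a logistic loss on the difference of implicit rewards. Computing those sequence log-probs from actual models is omitted here, and `beta` and the toy batch are illustrative:

```python
# DPO loss on a batch of preference pairs, given precomputed sequence log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward for each response: beta * (log pi_theta - log pi_ref).
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Push the chosen reward above the rejected one via -log sigmoid(margin).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of 4 pairs; in practice these log-probs come from the actual models.
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```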
5. Engineering Tricks
- DeepSpeed (Rasley 2020, Microsoft) - ZeRO optimizer; trillion-scale feasibility
- NVLAMB Optimizer (Hsieh 2020, NVIDIA) - Large-batch training stabilizer
- Mesh-TensorFlow (Shazeer 2018, Google) - Multi-axis distributed tensor computation
- TPU v3 Scaling (Jouppi 2020, Google) - Domain-specific supercomputer design for training deep networks at scale
- FSDP / FairScale (Baines 2021, FAIR) - Sharded data-parallel training
- LoRA (Hu 2021, Microsoft) - LoRA: Low-Rank Adaptation of Large Language Models (see the adapter sketch after this list)
- Quantization (Dettmers 2022, UWash & FAIR) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- Quantization Scaling Laws (Dettmers 2022, UWash) - The case for 4-bit precision: k-bit Inference Scaling Laws
- Flash Attention (Dao 2022, Stanford) - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- | CUDA Programming Guide (NVIDIA)
- | Deep Learning Performance (NVIDIA)
- | Making Deep Learning Go Brrrr From First Principles (He 2022)
- Flash Attention 2 (Dao 2023, Stanford) - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Flash Attention 3 (Shah 2024, Colfax & Princeton) - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- GQA (Ainslie 2023, Google) - GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- KV Cache (Pope 2022, Google) - Efficiently Scaling Transformer Inference
- vLLM / Paged Attention (Kwon 2023, UC Berkeley) - Efficient Memory Management for Large Language Model Serving with PagedAttention
- Muon (Liu 2025, Moonshot AI & UCLA) - Muon is Scalable for LLM Training
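And since LoRA is probably the easiest of these to try at home, a minimal sketch of the adapter: freeze the pretrained weight and learn a low-rank update (alpha/r) * B @ A on top of it. Layer sizes, rank, and alpha are illustrative; in practice you'd wrap specific attention projections of a pretrained model:

```python
# LoRA adapter around a frozen linear layer (illustrative sizes and rank).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection, small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero init: update starts at 0
        self.scale = alpha / r                 # LoRA scaling factor

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```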