Commit 85ca8fb

committed

add diffusion LM notes

1 parent 853fb8a commit 85ca8fb

1 file changed: +25 -6 lines changed

`‎_notes/research_ovws/ovw_llms.md`

Lines changed: 25 additions & 6 deletions

Original file line number	Diff line number	Diff line change
`@@ -273,17 +273,36 @@ over time, ML has bounced from feature-engineering -> *architecture engineerin`
`273`	`273`	`- Waveformer: Linear-Time Attention with Forward and Backward Wavelet Transform ([zhuang...shang, 2022](https://arxiv.org/abs/2210.01989))`
`274`	`274`	`- White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? ([yaodong yu...yi ma, 2023](https://arxiv.org/abs/2311.13110))`
`275`	`275`
`276`		`-## diffusion models (text)`
	`276`	`+## diffusion language models (DLMs)`
	`277`	`+`
	`278`	`+- Diffusion-LM Improves Controllable Text Generation ([lisa li, thickstun, gulrajani, liang, & hashimoto, 2022](https://arxiv.org/abs/2205.14217)) - continuous word vectors are progressively denoised from Gaussian noise`
	`279`	`+ - number of word vectors is fixed`
`277`	`280`
`278`		`-- Diffusion-LM Improves Controllable Text Generation ([lisa li, thickstun, gulrajani, liang, & hashimoto, 2022](https://arxiv.org/abs/2205.14217)) - continuous embeddings`
`279`		`-- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution ([lou, meng, & ermon, 2024](https://arxiv.org/abs/2310.16834)) - model $p(\text{altered text}) / p(\text{orig text})$, and make alterations using word swaps at individual locations`
`280`		`- - From Denoising Diffusions to Denoising Markov Models ([benton...doucet, 2024](https://arxiv.org/abs/2211.03595))`
`281`		`- - Not clear that these are better than just iteratively masking/replacing a word with BERT`
`282`	`281`	`- Energy-Based Diffusion Language Models for Text Generation ([xu...leskovec, ermon, & vahdat, 2024](https://arxiv.org/abs/2410.21357))`
	`282`	`+ - Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution ([lou, meng, & ermon, 2024](https://arxiv.org/abs/2310.16834)) - model $p(\text{altered text}) / p(\text{orig text})$, and make alterations using word swaps at individual locations`
	`283`	`+ - From Denoising Diffusions to Denoising Markov Models ([benton...doucet, 2024](https://arxiv.org/abs/2211.03595))`
	`284`	`+ - Not clear that these are better than just iteratively masking/replacing a word with BERT`
	`285`	`+`
	`286`	`+- DLM capabilities`
	`287`	`+ - DLMs exhibit promising capabilities in intermediate token correction ([ye...kong, 2024](https://arxiv.org/abs/2402.07754))`
	`288`	`+ - PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model ([zhang...jaitly, 2023](https://proceedings.neurips.cc/paper_files/paper/2023/hash/fdba5e0a9b57fce03e89cc0cad0a24e9-Abstract-Conference.html))`
`283`	`289`	`- LLaDA: Large Language Diffusion Models ([nie, ..., li, 2025](https://arxiv.org/abs/2502.09992)) - effectively using masked language modeling`
`284`		`-- DiffuLLaMA ([gong...jiawei han, kong, 2025](https://openreview.net/pdf?id=j1tSLYKwg8)) - adapt LM by first removing causal mask then shifting logits to become a diffusion model`
	`290`	`+ - $t \in (0, 1)$: each token is masked with prob $t$, and the model iteratively predicts masked tokens as $t$ moves from 1 to 0 (simultaneously predicting all masked tokens)`
	`291`	`+ - Simple and Effective Masked Diffusion Language Models ([sahoo...rush, kuleshov, 2024](https://proceedings.neurips.cc/paper_files/paper/2024/hash/eb0b13cc515724ab8015bc978fdde0ad-Abstract-Conference.html))`
	`292`	`+`
	`293`	`+- DiffuLLaMA ([gong...jiawei han, kong, 2025](https://openreview.net/pdf?id=j1tSLYKwg8)) - adapt LM by annealing the causal mask during training, then slowly predicting a masked token's label rather than the next token (minor point about shifting: still have each head predict the label of the next token rather than the current token, since it's more similar to what the original model was trained for)`
	`294`	`+ - Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning ([ye...quanquan gu, 2023](https://arxiv.org/abs/2308.12219)) - adapt LLaMA to DLM via masked language modeling, but lose skills during adaptation`
	`295`	`+`
`285`	`296`	`- Esoteric Language Models ([sahoo...vahdat, 2025](https://arxiv.org/abs/2506.01928)) - bridge AR and masked diffusion model (MDM) paradigms + introduce KV-caching for MDMs`
`286`	`297`	`- Accelerating Diffusion LLMs via Adaptive Parallel Decoding ([israel, van den broeck, grover, 2025](https://arxiv.org/abs/2506.00413)) - dynamically adjusts the number of tokens sampled in parallel, using a small autoregressive model to help (kind of like the opposite of speculative decoding)`
	`298`	`+ - DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models ([gong...kong, 2023](https://arxiv.org/abs/2310.05793)) - parallel text generation`
	`299`	`+`
	`300`	`+- Theory`
	`301`	`+ - Simplified and Generalized Masked Diffusion for Discrete Data ([shi...titsias, 2024](https://arxiv.org/abs/2406.04329))`
	`302`	`+ - Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow ([liu, gong, & liu, 2022](https://arxiv.org/abs/2209.03003))`
	`303`	`+ - Mean Flows for One-step Generative Modeling ([geng...kolter, he, 2025](https://arxiv.org/abs/2505.13447))`
	`304`	`+ - Fisher Flow Matching for Generative Modeling over Discrete Data ([davis...bronstein, bose, 2024](https://arxiv.org/abs/2405.14664))`
	`305`	`+`
`287`	`306`
`288`	`307`	`## mixture of experts (MoE) / routing`
`289`	`308`
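
The masking process described in the LLaDA note above (each token masked with probability $t$, then all masked tokens predicted at once as $t$ moves from 1 to 0) can be illustrated with a toy sketch. This is not LLaDA's implementation: `toy_predict` is a hypothetical stand-in for a trained mask predictor (it simply copies from a known target), and the linear schedule and random re-masking are assumptions chosen for simplicity.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from any library


def forward_mask(tokens, t, rng):
    """Forward process: independently mask each token with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def toy_predict(seq, target):
    """Hypothetical stand-in for a trained mask predictor.

    Fills every masked position at once by copying from `target`;
    a real model would output a distribution over the vocabulary.
    """
    return [target[i] if tok == MASK else tok for i, tok in enumerate(seq)]


def reverse_sample(target, steps, rng):
    """Reverse process: start fully masked (t = 1) and move t toward 0."""
    n = len(target)
    seq = [MASK] * n
    for step in range(steps, 0, -1):
        t_next = (step - 1) / steps  # assumed linear schedule
        pred = toy_predict(seq, target)  # predict all masked tokens together
        # Re-mask a random subset so about t_next * n positions stay masked.
        masked_positions = [i for i, tok in enumerate(seq) if tok == MASK]
        rng.shuffle(masked_positions)
        remask = set(masked_positions[: int(round(t_next * n))])
        seq = [MASK if i in remask else pred[i] for i in range(n)]
    return seq


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "the cat sat on the mat".split()
    print(forward_mask(tokens, 0.5, rng))   # a partially masked copy
    print(reverse_sample(tokens, 4, rng))   # progressively unmasked sequence
```

Because the toy predictor always returns the target token, the sampler recovers the input exactly; with a learned predictor, the re-masking rule (random here) becomes a real design choice.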

0 commit comments
