Denoising Diffusion Probabilistic Models - notes

Posted at 2025-09-04 # ML/AI

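The decisive paper that established diffusion models as a leading approach alongside VAEs and GANs. Ultimately it is about how to implement a diffusion model so that it works well, but its real innovation lies in recognizing the connection between diffusion models and denoising score matching with Langevin dynamics, and in presenting a method designed around that connection.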

Diffusion models (Sohl-Dickstein et al., 2015)

Generative distribution $p_θ\left(\mathbf{x}_0\right) ≔ ∫p_θ\left(\mathbf{x}_{0:T}\right)\mathrm{d}\mathbf{x}_{1:T}$
Reverse process $\begin{align} p_θ\left(\mathbf{x}_{0:T}\right) &≔ p\left(\mathbf{x}_T\right)∏_{t=1}^Tp_θ\left(\mathbf{x}_{t-1}|\mathbf{x}_t\right) \\ p_θ\left(\mathbf{x}_{t-1}|\mathbf{x}_t\right) &≔ \mathcal{N}\left(\mathbf{x}_{t-1};\mathbf{μ}_θ\left(\mathbf{x}_t,t\right),\mathbf{Σ}_θ\left(\mathbf{x}_t,t\right)\right) \\ p\left(\mathbf{x}_T\right) &≔ \mathcal{N}\left(\mathbf{x}_T;\mathbf{0},\mathbf{I}\right) \end{align}$
Forward (diffusion) process $\begin{align} q\left(\mathbf{x}_{1:T}|\mathbf{x}_0\right) &≔ ∏_{t=1}^Tq\left(\mathbf{x}_t|\mathbf{x}_{t-1}\right) \\ q\left(\mathbf{x}_t|\mathbf{x}_{t-1}\right) &≔ \mathcal{N}\left(\mathbf{x}_t;\sqrt{1-β_t}\mathbf{x}_{t-1},β_t\mathbf{I}\right) \\ q\left(\mathbf{x}_t|\mathbf{x}_0\right) &= \mathcal{N}\left(\mathbf{x}_t;\sqrt{\bar{α}_t}\mathbf{x}_0,\left(1-\bar{α}_t\right)\mathbf{I}\right) \\ \text{where }&\bar{α}_t≔∏_{s=1}^tα_s\text{ and }α_s≔1-β_s \end{align}$
Training (minimizing the variational upper bound $L$ on the negative log-likelihood) $\begin{align} &\space\space\space\space\mathbb{E}_q\left[-\log p_{θ}\left(\mathbf{x}_0\right)\right] \\ &≤ \mathbb{E}_q\left[-\log\frac{p_{θ}\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T}|\mathbf{x}_0\right)}\right] \\ &= \mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t=1}^T\log\frac{p_{θ}\left(\mathbf{x}_{t-1}|\mathbf{x}_t\right)}{q\left(\mathbf{x}_t|\mathbf{x}_{t-1}\right)}\right] \\ &= \mathbb{E}_q\left[\underbrace{D_{\text{KL}}\left(q\left(\mathbf{x}_T|\mathbf{x}_0\right)\|p\left(\mathbf{x}_T\right)\right)}_{L_T}+\sum_{t=2}^T\underbrace{D_{\text{KL}}\left(q\left(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0\right)\|p_θ\left(\mathbf{x}_{t-1}|\mathbf{x}_t\right)\right)}_{L_{1:T}}\underbrace{-\log p_θ\left(\mathbf{x}_0|\mathbf{x}_1\right)}_{L_0}\right] \\ &≔ L \\ &\space\space\space\space q\left(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0\right) = \mathcal{N}\left(\mathbf{x}_{t-1};\tilde{\mathbf{μ}}_t\left(\mathbf{x}_t,\mathbf{x}_0\right),\tilde{β}_t\mathbf{I}\right) \\ &\text{where }\tilde{\mathbf{μ}}_t\left(\mathbf{x}_t,\mathbf{x}_0\right)≔\frac{\sqrt{\bar{α}_{t-1}}β_t}{1-\bar{α}_t}\mathbf{x}_0+\frac{\sqrt{α_t}\left(1-\bar{α}_{t-1}\right)}{1-\bar{α}_t}\mathbf{x}_t\text{ and }\tilde{β}_t≔\frac{1-\bar{α}_{t-1}}{1-\bar{α}_t}β_t \end{align}$
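As a quick sanity check on the closed form $q\left(\mathbf{x}_t|\mathbf{x}_0\right)$, the NumPy sketch below compares it with iterating the single-step kernel $q\left(\mathbf{x}_t|\mathbf{x}_{t-1}\right)$. The linear schedule from $10^{-4}$ to $0.02$ with $T=1000$ is the experimental setting of the DDPM paper and is an assumption here; the function names `q_sample` and `q_sample_iterative` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear beta_t, T = 1000 (not fixed by the notes above).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t for t = 1..T

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 1-indexed)."""
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def q_sample_iterative(x0, t):
    """Draw x_t by applying q(x_s | x_{s-1}) for s = 1..t."""
    x = x0.copy()
    for s in range(1, t + 1):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(alphas[s - 1]) * x + np.sqrt(betas[s - 1]) * noise
    return x

x0 = np.full(50_000, 0.7)  # many copies of one scalar "pixel"
t = 300
closed = q_sample(x0, t, rng.standard_normal(x0.shape))
stepped = q_sample_iterative(x0, t)

print("closed form :", closed.mean(), closed.std())
print("step by step:", stepped.mean(), stepped.std())
print("theory      :", np.sqrt(alpha_bars[t - 1]) * 0.7,
      np.sqrt(1.0 - alpha_bars[t - 1]))
```

The two empirical means and standard deviations should agree with each other and with $\sqrt{\bar{α}_t}x_0$ and $\sqrt{1-\bar{α}_t}$ up to Monte Carlo error.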

Diffusion models and denoising autoencoders (DAE)

The forward process and $L_T$

This concerns $q\left(\mathbf{x}_T|\mathbf{x}_0\right) = \mathcal{N}\left(\mathbf{x}_T;\sqrt{\bar{α}_T}\mathbf{x}_0,\left(1-\bar{α}_T\right)\mathbf{I}\right)$ , $\mathrm{i.e.}$ , the choice of the schedule $β_1,\ldots,β_T$.

In general the diffusion rates $β_t$ can be learned, but here they are held constant. Then $q\left(\mathbf{x}_T|\mathbf{x}_0\right)$ has no learnable parameters, so $L_T$ is a constant during training and can be ignored.
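To get a feel for the size of the constant $L_T$, the sketch below evaluates the per-coordinate KL divergence between $q\left(\mathbf{x}_T|\mathbf{x}_0\right)$ and $\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)$ in closed form. It assumes the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$ used in the DDPM paper and data scaled to $[-1,1]$; the function name `kl_to_standard_normal` is hypothetical.

```python
import numpy as np

# Assumed schedule: linear beta_t, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar_T = np.prod(1.0 - betas)

def kl_to_standard_normal(x0, alpha_bar):
    """Per-coordinate KL( N(sqrt(ab) x0, 1 - ab) || N(0, 1) ) in nats."""
    mean_sq = alpha_bar * x0 ** 2
    var = 1.0 - alpha_bar
    return 0.5 * (var + mean_sq - 1.0 - np.log(var))

x0 = np.linspace(-1.0, 1.0, 5)  # a few scaled pixel values
L_T_per_dim = kl_to_standard_normal(x0, alpha_bar_T)

print("alpha_bar_T      :", alpha_bar_T)
print("L_T per dim, nats:", L_T_per_dim)
```

No $θ$ appears anywhere in this computation, which is why $L_T$ drops out of training, and under this schedule the value is also numerically tiny.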

The reverse process and $L_{1:T}$

This concerns $p_θ\left(\mathbf{x}_{t-1}|\mathbf{x}_t\right) ≔ \mathcal{N}\left(\mathbf{x}_{t-1};\mathbf{μ}_θ\left(\mathbf{x}_t,t\right),\mathbf{Σ}_θ\left(\mathbf{x}_t,t\right)\right)\text{ for }2≤t≤T$ , $\mathrm{i.e.}$ , the choice of $\mathbf{μ}_θ\left(\mathbf{x}_t,t\right)$ and $\mathbf{Σ}_θ\left(\mathbf{x}_t,t\right)$.

Variance $\mathbf{Σ}_θ\left(\mathbf{x}_t,t\right)$

Set $\mathbf{Σ}_θ\left(\mathbf{x}_t,t\right)=σ_t^2\mathbf{I}$ , a constant that depends only on time and is not learned.

There are two extreme choices, listed below. The paper reports similar results for both, so in practice the exact choice between them appears to matter little.

$σ_t^2=β_t$ , the variance of the forward process.

Optimal when $\mathbf{x}_0∼\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)$ ; corresponds to the upper bound on the reverse-process entropy evaluated in Sohl-Dickstein et al. (2015).

$σ_t^2=\tilde{β}_t$ , the variance of the posterior.

Optimal when $\mathbf{x}_0$ is deterministically a single point; corresponds to the lower bound on the reverse-process entropy evaluated in Sohl-Dickstein et al. (2015).
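The gap between the two choices can be inspected numerically. The sketch below assumes the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$ and the convention $\bar{α}_0=1$ (the empty product); neither is fixed by the notes above.

```python
import numpy as np

# Assumed schedule: linear beta_t, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)
# alpha_bar_{t-1}, using the assumed convention alpha_bar_0 = 1.
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))

# Posterior variance beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t.
beta_tilde = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
ratio = beta_tilde / betas

for t in (1, 2, 10, 100, 1000):
    print(f"t={t:4d}  beta_t={betas[t - 1]:.6f}  "
          f"beta_tilde_t={beta_tilde[t - 1]:.6f}  ratio={ratio[t - 1]:.4f}")
```

Under these assumptions $\tilde{β}_t≤β_t$ for every $t$, with $\tilde{β}_1=0$ and the ratio approaching 1 for large $t$, so the two choices differ mainly in the last few denoising steps.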

Mean $\mathbf{μ}_θ\left(\mathbf{x}_t,t\right)$

Reparameterizing $\mathbf{x}_t$ with noise $\mathbf{ϵ}∼\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)$ as $\mathbf{x}_t\left(\mathbf{x}_0,\mathbf{ϵ}\right)=\sqrt{\bar{α}_t}\mathbf{x}_0+\sqrt{1-\bar{α}_t}\mathbf{ϵ}$ , we can choose $\mathbf{μ}_θ\left(\mathbf{x}_t,t\right)$ as follows.

$$\mathbf{μ}_θ\left(\mathbf{x}_t,t\right) = \frac{1}{\sqrt{α_t}}\left(\mathbf{x}_t-\frac{β_t}{\sqrt{1-\bar{α}_t}}\mathbf{ϵ}_θ\left(\mathbf{x}_t,t\right)\right)$$

This choice rests on the following insight: if the noise is identified with the score, training a diffusion model becomes equivalent to denoising score matching (DSM) over multiple noise scales indexed by $t$.

$$\begin{align} &\space\space\space\space L_{1:T}(θ)-\mathrm{Const.} \\ &= \mathbb{E}_q\left[\frac{1}{2σ_t^2}\left\|\tilde{\mathbf{μ}}_t\left(\mathbf{x}_t,\mathbf{x}_0\right)-\mathbf{μ}_θ\left(\mathbf{x}_t,t\right)\right\|^2\right] \\ &= \mathbb{E}_{\mathbf{x}_0,\mathbf{ϵ}}\left[\frac{1}{2σ_t^2}\left\|\tilde{\mathbf{μ}}_t\left(\mathbf{x}_t\left(\mathbf{x}_0,\mathbf{ϵ}\right),\frac{1}{\sqrt{\bar{α}_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0,\mathbf{ϵ}\right)-\sqrt{1-\bar{α}_t}\mathbf{ϵ}\right)\right)-\mathbf{μ}_θ\left(\mathbf{x}_t\left(\mathbf{x}_0,\mathbf{ϵ}\right),t\right)\right\|^2\right] \\ &= \mathbb{E}_{\mathbf{x}_0,\mathbf{ϵ}}\left[\frac{1}{2σ_t^2}\left\|\frac{1}{\sqrt{α_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0,\mathbf{ϵ}\right)-\frac{β_t}{\sqrt{1-\bar{α}_t}}\mathbf{ϵ}\right)-\mathbf{μ}_θ\left(\mathbf{x}_t\left(\mathbf{x}_0,\mathbf{ϵ}\right),t\right)\right\|^2\right] \\ &= \mathbb{E}_{\mathbf{x}_0,\mathbf{ϵ}}\left[\frac{β_t^2}{2σ_t^2α_t\left(1-\bar{α}_t\right)}\left\|\mathbf{ϵ}-\mathbf{ϵ}_θ\left(\sqrt{\bar{α}_t}\mathbf{x}_0+\sqrt{1-\bar{α}_t}\mathbf{ϵ},t\right)\right\|^2\right] \\ &= \mathbb{E}_{\mathbf{x}_0,\mathbf{ϵ}}\left[w_t\left\|\mathbf{ϵ}-\mathbf{ϵ}_θ\left(\sqrt{\bar{α}_t}\mathbf{x}_0+\sqrt{1-\bar{α}_t}\mathbf{ϵ},t\right)\right\|^2\right]\text{ where }w_t≔\frac{β_t^2}{2σ_t^2α_t\left(1-\bar{α}_t\right)} \end{align}$$
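The key algebraic step in this derivation is that the posterior mean $\tilde{\mathbf{μ}}_t$, with $\mathbf{x}_0$ expressed through $\mathbf{x}_t$ and $\mathbf{ϵ}$, collapses to $\frac{1}{\sqrt{α_t}}\left(\mathbf{x}_t-\frac{β_t}{\sqrt{1-\bar{α}_t}}\mathbf{ϵ}\right)$. The sketch below checks that identity numerically. It assumes the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$ and the convention $\bar{α}_0=1$; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear beta_t, T = 1000, with alpha_bar_0 = 1.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))

def posterior_mean(x_t, x0, t):
    """mu_tilde_t(x_t, x_0), the mean of q(x_{t-1} | x_t, x_0)."""
    i = t - 1
    coef_x0 = np.sqrt(alpha_bars_prev[i]) * betas[i] / (1.0 - alpha_bars[i])
    coef_xt = np.sqrt(alphas[i]) * (1.0 - alpha_bars_prev[i]) / (1.0 - alpha_bars[i])
    return coef_x0 * x0 + coef_xt * x_t

def mean_from_noise(x_t, eps, t):
    """The same mean written in terms of the noise eps."""
    i = t - 1
    return (x_t - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])

x0 = rng.uniform(-1.0, 1.0, size=8)
eps = rng.standard_normal(8)

max_gap = 0.0
for t in (1, 2, 500, T):
    ab = alpha_bars[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps  # reparameterized x_t
    gap = np.max(np.abs(posterior_mean(x_t, x0, t) - mean_from_noise(x_t, eps, t)))
    max_gap = max(max_gap, gap)

print("largest difference between the two forms:", max_gap)
```

The two expressions should agree to floating-point precision, since the coefficient of $\mathbf{x}_t$ simplifies via $β_t+α_t-\bar{α}_t=1-\bar{α}_t$.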

Accordingly, sampling, which resembles Langevin dynamics with $\mathbf{ϵ}_θ$ playing the role of a learned score, is given by the following.

$$\mathbf{x}_{t-1}=\frac{1}{\sqrt{α_t}}\left(\mathbf{x}_t-\frac{β_t}{\sqrt{1-\bar{α}_t}}\mathbf{ϵ}_θ\left(\mathbf{x}_t,t\right)\right)+σ_t\mathbf{z}\space\space\text{ where }\mathbf{z}∼\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)$$
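A minimal sketch of this update applied from $t=T$ down to $t=1$ is shown below. To keep it self-contained, the trained network is replaced by the analytic conditional expectation $\mathbb{E}\left[\mathbf{ϵ}|\mathbf{x}_t\right]$ for one-dimensional Gaussian toy data, which is an assumption made purely for illustration. Also assumed are the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$, the choice $σ_t^2=β_t$, and $\mathbf{z}=\mathbf{0}$ at $t=1$ as in the paper's sampling algorithm. The names `eps_model` and `p_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear beta_t, T = 1000, sigma_t^2 = beta_t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
sigmas = np.sqrt(betas)

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy data distribution N(2, 0.5^2)

def eps_model(x_t, t):
    """Analytic E[eps | x_t] for the Gaussian toy data.

    Stands in for a trained eps_theta; a real model would be a neural network.
    """
    ab = alpha_bars[t - 1]
    var_t = ab * DATA_STD ** 2 + (1.0 - ab)  # marginal variance of x_t
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * DATA_MEAN) / var_t

def p_sample(x_t, t):
    """One reverse step x_t -> x_{t-1}."""
    i = t - 1
    mean = (x_t - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_model(x_t, t)) / np.sqrt(alphas[i])
    # No noise is added in the final step (t = 1).
    z = rng.standard_normal(x_t.shape) if t > 1 else np.zeros_like(x_t)
    return mean + sigmas[i] * z

x = rng.standard_normal(20_000)  # x_T ~ N(0, I)
for t in range(T, 0, -1):
    x = p_sample(x, t)
samples = x

print("sample mean:", samples.mean(), " sample std:", samples.std())
```

With an exact noise predictor, the samples should recover the toy data statistics (mean close to 2, standard deviation close to 0.5) up to Monte Carlo and discretization error.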

Data scaling, the reverse process decoder, and $L_0$

Image data with elements in $\lbrace 0,1,\ldots,255\rbrace$ are scaled linearly to $\left\lbrace-1,-1+\frac{2}{255},\ldots,1\right\rbrace⊂[-1,1]$ , which ensures that the reverse process, starting from the standard normal prior $p\left(\mathbf{x}_T\right)$ , operates on consistently scaled inputs. The final step ( $t=1$ ) then needs special treatment: the following independent decoder, which converts the continuous Gaussian into a probability distribution over discrete image data.

$$\begin{align} p_θ\left(\mathbf{x}_0|\mathbf{x}_1\right) &= ∏_{i=1}^D∫_{δ_-\left(x_0^i\right)}^{δ_+\left(x_0^i\right)}\mathcal{N}\left(x;μ_θ^i\left(\mathbf{x}_1,1\right),σ_1^2\right)\mathrm{d}x \\ δ_+\left(x\right) &= \begin{cases} ∞ & \text{if }x=1 \\ x+\frac{1}{255} & \text{if }x<1 \end{cases} \\ δ_-\left(x\right) &= \begin{cases} -∞ & \text{if }x=-1 \\ x-\frac{1}{255} & \text{if }x>-1 \end{cases} \end{align}$$
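The decoder integral is a difference of Gaussian CDFs over each pixel's bin, which the sketch below evaluates with the standard library's `math.erf`. The function names are hypothetical, and the values of `mu` and `sigma` are illustrative assumptions ($σ_1=\sqrt{β_1}=0.01$ would correspond to $β_1=10^{-4}$).

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x; math.erf also accepts +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pixel_probability(k, mu, sigma):
    """Probability assigned to integer pixel level k in {0, ..., 255}."""
    x = -1.0 + 2.0 * k / 255.0  # scaled value in [-1, 1]
    upper = math.inf if k == 255 else x + 1.0 / 255.0   # delta_+
    lower = -math.inf if k == 0 else x - 1.0 / 255.0    # delta_-
    return normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)

def decoder_log_likelihood(levels, mus, sigma):
    """log p(x_0 | x_1) as a sum over independent coordinates.

    math.log raises ValueError if a bin probability underflows to zero.
    """
    return sum(math.log(pixel_probability(k, mu, sigma))
               for k, mu in zip(levels, mus))

mu, sigma = 0.3, 0.01  # illustrative predicted mean and sigma_1
total = sum(pixel_probability(k, mu, sigma) for k in range(256))
best = max(range(256), key=lambda k: pixel_probability(k, mu, sigma))
ll = decoder_log_likelihood([166, 10], [0.3, -0.92], sigma)

print("total probability over all levels:", total)
print("most likely pixel level:", best)
print("log-likelihood of a 2-pixel example:", ll)
```

Because the edge bins extend to $±∞$, the 256 bin probabilities sum to 1, so the decoder defines a proper discrete distribution for each coordinate.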

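Simplifying $L_{1:T}$ and $L_0$

Dropping the weight $w_t$ from $L_{1:T}$ and sampling the step uniformly, $t∼U\lbrace 1,\ldots,T\rbrace$ , the variational upper bound including $L_0$ simplifies as follows.

$$L_{\text{simple}} ≔ \mathbb{E}_{t,\mathbf{x}_0,\mathbf{ϵ}}\left[\left\|\mathbf{ϵ}-\mathbf{ϵ}_θ\left(\sqrt{\bar{α}_t}\mathbf{x}_0+\sqrt{1-\bar{α}_t}\mathbf{ϵ},t\right)\right\|^2\right]$$

The $t=1$ case corresponds to $L_0$ , with the integral in the discrete decoder approximated by the Gaussian density times the bin width, ignoring $σ_1^2$ and edge effects. For $t≥2$ , dropping $w_t$ makes the weights of the early steps, where the noise is small, smaller relative to those of the later steps, where the noise is large. The authors argue that this lets the model devote more of its capacity to the harder large-noise denoising tasks, and that sample quality improves as a result.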
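To see what dropping $w_t$ does, the sketch below evaluates $w_t=\frac{β_t^2}{2σ_t^2α_t\left(1-\bar{α}_t\right)}$ at a few steps. It assumes the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$ and the choice $σ_t^2=β_t$; other schedules or variance choices would give different numbers.

```python
import numpy as np

# Assumed schedule: linear beta_t, T = 1000, sigma_t^2 = beta_t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
sigma_sq = betas

# Weight of the step-t term in the variational bound.
w = betas ** 2 / (2.0 * sigma_sq * alphas * (1.0 - alpha_bars))

# L_simple uses weight 1 for every t, so step t is scaled by 1 / w_t
# relative to the bound.
relative_scale = 1.0 / w

for t in (1, 2, 10, 100, 500, 1000):
    print(f"t={t:4d}  w_t={w[t - 1]:.5f}  1/w_t={relative_scale[t - 1]:.1f}")
```

Under these assumptions $w_1$ is about 0.5 while $w_T$ is about 0.01, so replacing every $w_t$ by 1 scales up the large-noise terms far more than the small-noise ones.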

In summary, the algorithms finally presented are as follows. (Figure: algorithm)
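As a rough sketch of the training side, the loop below repeats the steps of the paper's training algorithm: draw $\mathbf{x}_0$, draw $t$ uniformly, draw $\mathbf{ϵ}$, and take a gradient step on $\left\|\mathbf{ϵ}-\mathbf{ϵ}_θ\left(\mathbf{x}_t,t\right)\right\|^2$. Everything specific here is an assumption made for illustration: the one-dimensional Gaussian toy data, the per-step linear model $\mathbf{ϵ}_θ\left(x,t\right)=a_tx+b_t$ in place of a neural network, plain SGD, and the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linear beta_t, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Hypothetical toy model: eps_theta(x, t) = a[t-1] * x + b[t-1].
a = np.zeros(T)
b = np.zeros(T)

def train_step(a, b, batch=256, lr=0.01):
    """One SGD step on L_simple; updates a and b in place, returns the batch loss."""
    x0 = rng.normal(2.0, 0.5, size=batch)   # x_0 ~ q(x_0), toy data
    idx = rng.integers(0, T, size=batch)    # t - 1 with t ~ U{1, ..., T}
    eps = rng.standard_normal(batch)        # eps ~ N(0, I)
    ab = alpha_bars[idx]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    resid = a[idx] * x_t + b[idx] - eps     # eps_theta(x_t, t) - eps
    # Gradient of the summed squared error, accumulated per timestep.
    grad_a = np.zeros(T)
    grad_b = np.zeros(T)
    np.add.at(grad_a, idx, 2.0 * resid * x_t)
    np.add.at(grad_b, idx, 2.0 * resid)
    a -= lr * grad_a
    b -= lr * grad_b
    return float(np.mean(resid ** 2))

losses = [train_step(a, b) for _ in range(3000)]
initial_loss = float(np.mean(losses[:10]))
final_loss = float(np.mean(losses[-200:]))

print("loss at start:", initial_loss, " loss at end:", final_loss)
```

The loss starts near 1, the value for a predictor that always outputs zero, and should fall well below that as the large-noise steps become predictable. A real implementation would use a neural network shared across $t$ and an optimizer such as Adam.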

Koji Higasa Blog
