Most floating-point operations in model training are matrix multiplications. For an <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi><mo>×</mo><mi>n</mi></mrow><annotation encoding="application/x-tex">m \times n</annotation></semantics></math> matrix <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi></mrow><annotation encoding="application/x-tex">A</annotation></semantics></math> and an <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>×</mo><mi>p</mi></mrow><annotation encoding="application/x-tex">n \times p</annotation></semantics></math> matrix <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>B</mi></mrow><annotation encoding="application/x-tex">B</annotation></semantics></math>, computing <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi><mo>×</mo><mi>B</mi></mrow><annotation encoding="application/x-tex">A \times B</annotation></semantics></math> takes <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi><mo>×</mo><mi>n</mi><mo>×</mo><mi>p</mi></mrow><annotation encoding="application/x-tex">m \times n \times p</annotation></semantics></math> multiplications and roughly <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi><mo>×</mo><mi>n</mi><mo>×</mo><mi>p</mi></mrow><annotation encoding="application/x-tex">m \times n \times p</annotation></semantics></math> additions, i.e. about <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>m</mi><mi>n</mi><mi>p</mi></mrow><annotation encoding="application/x-tex">2mnp</annotation></semantics></math> FLOPs.
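The counting rule above can be written as a one-line helper. This is only a sketch: the function name `matmul_flops` is made up for illustration, and it counts every multiply and every add as one FLOP.

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """Approximate FLOPs of an (m x n) @ (n x p) matrix product.

    Each of the m * p output entries needs n multiplies and about n adds,
    so the total is roughly 2 * m * n * p.
    """
    return 2 * m * n * p


# Example: a 1024 x 4096 activation matrix times a 4096 x 4096 weight matrix.
print(matmul_flops(1024, 4096, 4096))
```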
<h2>FLOPs of the Transformer Architecture</h2>
<img src="/transformer.png" alt="alt text">
<img src="/attention.png" alt="alt text">
<h3>Attention</h3>
All counts are for a single transformer layer, where B is the batch size, s is the sequence length, and h is the hidden size.
Q, K, V transformation: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>3</mn><mo>×</mo><mn>2</mn><mi>B</mi><mi>s</mi><msup><mi>h</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">3 \times 2Bsh^2</annotation></semantics></math>
<math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><annotation encoding="application/x-tex">QK^T</annotation></semantics></math>: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>B</mi><msup><mi>s</mi><mn>2</mn></msup><mi>h</mi></mrow><annotation encoding="application/x-tex">2Bs^2h</annotation></semantics></math>
attention over values: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>B</mi><msup><mi>s</mi><mn>2</mn></msup><mi>h</mi></mrow><annotation encoding="application/x-tex">2Bs^2h</annotation></semantics></math>
post-attention linear projection: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>B</mi><mi>s</mi><msup><mi>h</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">2Bsh^2</annotation></semantics></math>
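As a rough check, the four attention terms can be tallied in a few lines. This is a minimal sketch rather than a profiler: the function `attention_flops` and its dictionary keys are hypothetical names, and softmax, scaling, and other element-wise work are ignored because they are small next to the matrix products.

```python
def attention_flops(B: int, s: int, h: int) -> dict:
    """Forward FLOPs of one attention block, broken down by matrix product.

    B = batch size, s = sequence length, h = hidden size.
    """
    flops = {
        "qkv_transformation": 3 * 2 * B * s * h * h,  # three (Bs x h) @ (h x h) products
        "qk_t": 2 * B * s * s * h,                    # attention scores Q K^T
        "attention_over_values": 2 * B * s * s * h,   # scores applied to V
        "output_projection": 2 * B * s * h * h,       # post-attention linear layer
    }
    flops["total"] = sum(flops.values())
    return flops


breakdown = attention_flops(B=1, s=2048, h=4096)
for name, value in breakdown.items():
    print(f"{name}: {value:.3e}")
```

The total comes to 8Bsh^2 + 4Bs^2h, so the terms quadratic in s only become significant once s is comparable to h.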
<h3>Feed Forward Network</h3>
linear h->4h: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>8</mn><mi>B</mi><mi>s</mi><msup><mi>h</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">8Bsh^2</annotation></semantics></math>
linear 4h->h: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>8</mn><mi>B</mi><mi>s</mi><msup><mi>h</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">8Bsh^2</annotation></semantics></math>
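Both feed-forward terms follow from the 2mnp rule with m = Bs. The sketch below makes that explicit; `ffn_flops` and its `expansion` parameter are illustrative names, with the default of 4 matching the h->4h->h layout used here.

```python
def ffn_flops(B: int, s: int, h: int, expansion: int = 4) -> int:
    """Forward FLOPs of the two linear layers in one feed-forward block."""
    tokens = B * s
    up = 2 * tokens * h * (expansion * h)    # linear h -> 4h: 8Bsh^2
    down = 2 * tokens * (expansion * h) * h  # linear 4h -> h: 8Bsh^2
    return up + down


print(ffn_flops(B=1, s=2048, h=4096))
```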
<h3>Total</h3>
forward: <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">(</mo><mn>6</mn><mo>+</mo><mn>2</mn><mo>+</mo><mn>8</mn><mo>+</mo><mn>8</mn><mo stretchy="false">)</mo><mi>B</mi><mi>s</mi><msup><mi>h</mi><mn>2</mn></msup><mo>+</mo><mo stretchy="false">(</mo><mn>2</mn><mo>+</mo><mn>2</mn><mo stretchy="false">)</mo><mi>B</mi><msup><mi>s</mi><mn>2</mn></msup><mi>h</mi><mo>=</mo><mn>24</mn><mi>B</mi><mi>s</mi><msup><mi>h</mi><mn>2</mn></msup><mo>+</mo><mn>4</mn><mi>B</mi><msup><mi>s</mi><mn>2</mn></msup><mi>h</mi></mrow><annotation encoding="application/x-tex">(6 + 2 + 8 + 8)Bsh^2 + (2 + 2)Bs^2h = 24Bsh^2 + 4Bs^2h</annotation></semantics></math>
The backward pass costs roughly twice as many FLOPs as the forward pass, so forward + backward comes to roughly <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>72</mn><mi>B</mi><mi>s</mi><msup><mi>h</mi><mn>2</mn></msup><mo>+</mo><mn>12</mn><mi>B</mi><msup><mi>s</mi><mn>2</mn></msup><mi>h</mi></mrow><annotation encoding="application/x-tex">72Bsh^2 + 12Bs^2h</annotation></semantics></math>
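To get a feel for the magnitudes, the per-layer totals can be evaluated for a concrete configuration. This is a sketch under stated assumptions: the function names are hypothetical, the configuration values are arbitrary illustrative numbers, and scaling by a layer count `num_layers` leaves out the embedding and output logit layers, which this section does not cover.

```python
def layer_forward_flops(B: int, s: int, h: int) -> int:
    """Forward FLOPs of one transformer layer: 24Bsh^2 + 4Bs^2h."""
    return 24 * B * s * h**2 + 4 * B * s**2 * h


def layer_train_flops(B: int, s: int, h: int) -> int:
    """Forward + backward FLOPs, taking backward as 2x forward."""
    return 3 * layer_forward_flops(B, s, h)


# Illustrative configuration (assumed values, not taken from the text above).
B, s, h, num_layers = 1, 2048, 12288, 96

per_layer = layer_train_flops(B, s, h)
print(f"per layer, forward + backward: {per_layer:.3e} FLOPs")
print(f"all layers, forward + backward: {num_layers * per_layer:.3e} FLOPs")
```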
See the appendix of <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM 2</a>.
