my pretrain history
pretrain
[INFO|trainer.py:2134] 2024-10-02 22:46:14,378 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-10-02 22:46:14,378 >> Num examples = 16,527,707
[INFO|trainer.py:2136] 2024-10-02 22:46:14,378 >> Num Epochs = 3
[INFO|trainer.py:2137] 2024-10-02 22:46:14,378 >> Instantaneous batch size per device = 8
[INFO|trainer.py:2140] 2024-10-02 22:46:14,378 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2141] 2024-10-02 22:46:14,378 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2142] 2024-10-02 22:46:14,378 >> Total optimization steps = 6,197,892
[INFO|trainer.py:2143] 2024-10-02 22:46:14,380 >> Number of trainable parameters = 1,235,814,400
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
Total optimization steps: ceil(16,527,707 / 8) × 3 = 2,065,964 × 3 = 6,197,892
25 GB
[INFO|trainer.py:2134] 2024-10-04 08:58:31,915 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-10-04 08:58:31,915 >> Num examples = 826,279
[INFO|trainer.py:2136] 2024-10-04 08:58:31,915 >> Num Epochs = 3
[INFO|trainer.py:2137] 2024-10-04 08:58:31,915 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2024-10-04 08:58:31,915 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2141] 2024-10-04 08:58:31,915 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-10-04 08:58:31,915 >> Total optimization steps = 309,852
[INFO|trainer.py:2143] 2024-10-04 08:58:31,916 >> Number of trainable parameters = 1,235,814,400
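A minimal sketch of how both step counts can be reproduced from the logged settings (the rounding, ceil over the dataloader and floor over gradient accumulation, is an assumption that happens to match both runs):

```python
import math

def estimate_total_steps(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Rough estimate of total optimization steps; the rounding behavior is assumed."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    update_steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return update_steps_per_epoch * epochs

# Run 1: batch 8, no accumulation -> 6,197,892 (matches the first log)
print(estimate_total_steps(16_527_707, 8, 1, 1, 3))
# Run 2: batch 1, accumulation 8 -> 309,852 (matches the second log)
print(estimate_total_steps(826_279, 1, 8, 1, 3))
```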
Output format
Training resource usage
100M model / V100
The V100 has about 15 TFLOPS of FP32 compute.
- When cutoff_len is 1, compute is far from saturated; the run is memory-bound.
- When cutoff_len is 8192, compute should be roughly saturated. The estimate is 6 (FLOPs per parameter) × 100M (parameters) × 8192 (tokens) × 2.34 (it/s) ≈ 11 TFLOPS, close to the 15 TFLOPS peak. (Tensor Cores are not used; with such a long seq_len, attention adds extra compute on top of this estimate.)
- When cutoff_len is 1024 and gradient_accumulation_steps = 8, the speed should in theory match the previous case (the same 8192 tokens per optimizer step), but in practice it is somewhat slower, possibly because of the 7 extra rounds of memory access.
- When cutoff_len is 1024 and batch_size is 8, the shorter seq_length means less attention compute. The estimate 6 × 100M × 1024 × 8 × 3.3 ≈ 16 TFLOPS is slightly above the 15 TFLOPS FP32 peak (see the sketch below).
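A minimal sketch of the 6 × params × tokens × speed estimate used in the bullets above; the 100M parameter count is the assumed model size, and the attention term is ignored:

```python
# Achieved-throughput estimate: ~6 FLOPs per parameter per trained token
# (forward + backward), ignoring the O(seq_len^2) attention term.
PARAMS = 100e6  # assumed ~100M-parameter model

def achieved_tflops(tokens_per_step, it_per_s, params=PARAMS):
    return 6 * params * tokens_per_step * it_per_s / 1e12

# cutoff_len=8192, batch 1, 2.34 it/s -> ~11.5 TFLOPS (vs. 15 TFLOPS FP32 peak on V100)
print(achieved_tflops(8192 * 1, 2.34))
# cutoff_len=1024, batch 8, 3.3 it/s -> ~16 TFLOPS (above FP32 peak, so Tensor Cores must help)
print(achieved_tflops(1024 * 8, 3.3))
```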
task | fp16 | speed | num_gpu | batch size per gpu | gradient_accumulation_steps | cutoff_len | gpu memory usage (GB) | memory bandwidth util. |
---|---|---|---|---|---|---|---|---|
pretrain | true | 12.9 it/s | 1 | 1 | 1 | 1 | 2.579 | |
pretrain | true | 12.7 it/s | 1 | 1 | 1 | 64 | 2.617 | |
pretrain | true | 12.4 it/s | 1 | 1 | 1 | 128 | 2.729 | |
pretrain | true | 12.4 it/s | 1 | 1 | 1 | 256 | 2.951 | |
pretrain | true | 12.4 it/s | 1 | 1 | 1 | 512 | 3.183 | |
pretrain | true | 12.3 it/s | 1 | 1 | 1 | 1024 | 4.5 | |
pretrain | true | 9.03 it/s | 1 | 1 | 1 | 2048 | 6.583 | |
pretrain | true | 5 it/s | 1 | 1 | 1 | 4096 | 9.647 | |
pretrain | true | 2.34 it/s | 1 | 1 | 1 | 8192 | 18.067 | |
==pretrain== | ==true== | ==12.3 it/s== | ==1== | ==1== | ==1== | ==1024== | ==4.5== | |
==pretrain== | ==true== | ==1.82 it/s== | ==1== | ==1== | ==8== | ==1024== | ==4.529== | |
==pretrain== | ==true== | ==1.2 s/it== | ==1== | ==1== | ==16== | ==1024== | ==4.529== | |
pretrain | true | 2.26 it/s | 1 | 12 | 1 | 1024 | 26.353 | |
pretrain | true | 3.3 it/s | 1 | 8 | 1 | 1024 | 18.8 | 70% |
pretrain | true | 2.5 s/it | 1 | 8 | 8 | 1024 | 18.0 | |
pretrain | true | 4.8 s/it | 1 | 8 | 16 | 1024 | 18.0 | |
A100:
- "Training with DataParallel so batch size has been adjusted to: 8", so the batch size cannot be set to 16 or 32; raising gradient_accumulation_steps to increase parallelism did not help.
- The estimated compute is 6 × 100M × 1024 × 8 × 6.99 ≈ 34 TFLOPS. The A100's FP32 engine peaks at 19.5 TFLOPS, which is not enough, so most of the work must run on the Tensor Cores.
- It looks memory-bound: 50% of the memory bandwidth is about 1 TB/s, and the number of tokens computed per second is 1024 × 8 × 6.99 ≈ 57,000, i.e. roughly 17.5 MB of memory traffic per token, or ≈ 143 GB per iteration, several times the 18.8 GB resident footprint (see the sketch after this list). That footprint is only reached with the full 1024 × 8 batch, and a single token should not need to touch the memory associated with all the other tokens. Is the backward pass really that heavy on memory traffic? How would one estimate it?
- For a relatively small model, gradient_accumulation_steps should probably be set to 1. The main point of a larger value is to reduce the number of weight updates? But for a small model, the memory writes for weight updates are small to begin with.
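A minimal sketch of the bandwidth arithmetic above; the ~2 TB/s A100 peak bandwidth is an assumption, the 50% utilization and 6.99 it/s come from the table, and the rest is derived:

```python
# Memory-traffic-per-token estimate from observed bandwidth utilization.
PEAK_BW_GBPS = 2000            # assumed A100 HBM peak bandwidth, ~2 TB/s
utilization = 0.50             # reported bandwidth utilization
it_per_s = 6.99                # measured iterations per second
tokens_per_it = 1024 * 8       # cutoff_len * batch size per gpu

achieved_bw = PEAK_BW_GBPS * utilization       # ~1000 GB/s
tokens_per_s = tokens_per_it * it_per_s        # ~57,000 tokens/s
gb_per_token = achieved_bw / tokens_per_s      # ~0.0175 GB, i.e. ~17.5 MB per token
gb_per_iter = achieved_bw / it_per_s           # ~143 GB of traffic per iteration
print(gb_per_token, gb_per_iter)
```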
task | fp16 | speed | num_gpu | batch size per gpu | gradient_accumulation_steps | cutoff_len | gpu memory usage (GB) | memory bandwidth util. | tensor core util. | FP32 engine util. |
---|---|---|---|---|---|---|---|---|---|---|
pretrain | true | 6.99 it/s | 1 | 8 | 1 | 1024 | 18.8 | 50% | 12.9% | 3% |
pretrain | true | 1.21 s/it | 1 | 8 | 8 | 1024 | 18.8 | 50% | 12.9% | 3% |
1B model / V100 × 6
num_gpu | batch size per gpu | gradient_accumulation_steps | cutoff_len | gpu memory usage (GB) | compute (TFLOPS) |
---|---|---|---|---|---|
6 | 16 | 1 | 1024 | | |
1B model / A100
task | speed (s/it) | num_gpu | batch size per gpu | gradient_accumulation_steps | cutoff_len | gpu memory usage (GB) | memory bandwidth util. | FP32 engine util. | FP32 compute (TFLOPS) |
---|---|---|---|---|---|---|---|---|---|
pretrain | 8.9 | 1 | 16 | 1 | 1024 | 42.73 | 10% (200 GB/s) | 90% | 17.6 |

Forward compute: 2.4 GFLOP/token × 16 × 1024 ≈ 38.4 TFLOP per iteration, about 2 s at 17.6 TFLOPS. With gradient_accumulation_steps set to 8, that would be roughly 19 s (see the sketch below).
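A minimal sketch of the forward-pass estimate in the note above, assuming ≈2 FLOPs per parameter per token; the parameter count is taken from the trainer log and the 17.6 TFLOPS figure from the table:

```python
# Forward-pass compute and time estimate for the 1B model on the A100.
PARAMS = 1_235_814_400          # trainable parameters, from the trainer log
ACHIEVED_TFLOPS = 17.6          # measured FP32 throughput from the table

tokens_per_it = 16 * 1024                       # batch size per gpu * cutoff_len
fwd_tflop = 2 * PARAMS * tokens_per_it / 1e12   # ~40 TFLOP/iteration (note above rounds to ~38.4)
fwd_seconds = fwd_tflop / ACHIEVED_TFLOPS       # ~2.3 s per forward pass
print(fwd_tflop, fwd_seconds)

# With gradient_accumulation_steps = 8, the forward passes alone would take ~8x longer, ~19 s.
print(8 * fwd_seconds)
```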