002. LLM的模型架构和资源需求

LLM concepts and architecture

Attention is all you needs

Attention

MHA

多头与不多头不同之处在于，每个k*q都会产生一个seq_len * seq_len的矩阵。如果没有多头的话，只有一个seq_len * seq_len的矩阵。

@startuml
skinparam sequence {
    ArrowColor #0079BF
    LifeLineBorderColor #0079BF
    LifeLineBackgroundColor #E0F2FF
    ParticipantBorderColor #0079BF
    ParticipantBackgroundColor #E0F2FF
    ArrowThickness 2
}
skinparam backgroundColor #F5F5F5
skinparam title "多头注意力机制（Multi-Head Attention）流程图"


actor "输入序列 (Query, Key, Value)" as input_seq
rectangle "线性变换层" as linear_layer {
    component "Query 线性变换" as query_linear
    component "Key 线性变换" as key_linear
    component "Value 线性变换" as value_linear
}
rectangle "多头拆分" as split_heads {
    component "拆分 Query" as split_query
    component "拆分 Key" as split_key
    component "拆分 Value" as split_value
}
rectangle "注意力计算" as attention_calculation {
    component "计算 Query 和 Key 的点积" as dot_product
    component "缩放操作" as scale
    component "Softmax 归一化" as softmax_op
    component "与 Value 相乘并求和" as multiply_sum
}
rectangle "多头合并" as merge_heads {
    component "合并多头结果" as merge_result
}
rectangle "最终线性变换" as final_linear_layer {
    component "最终线性变换" as final_linear
}
actor "输出序列" as output_seq

input_seq --> query_linear : Query
input_seq --> key_linear : Key
input_seq --> value_linear : Value

query_linear --> split_query : 变换后的 Query
key_linear --> split_key : 变换后的 Key
value_linear --> split_value : 变换后的 Value

split_query --> dot_product : 多头 Query
split_key --> dot_product : 多头 Key

dot_product --> scale : 点积结果
scale --> softmax_op : 缩放后的分数
softmax_op --> multiply_sum : 归一化分数
split_value --> multiply_sum : 多头 Value
multiply_sum --> merge_result : 多头加权结果

merge_result --> final_linear : 合并后的结果
final_linear --> output_seq : 输出

@enduml

拆分过程：

@startuml
skinparam sequence {
    ArrowColor #0079BF
    LifeLineBorderColor #0079BF
    LifeLineBackgroundColor #E0F2FF
    ParticipantBorderColor #0079BF
    ParticipantBackgroundColor #E0F2FF
    ArrowThickness 2
}
skinparam backgroundColor #F5F5F5
skinparam title "多头注意力机制中拆分与合并过程 (dim=4096, heads=8)"


actor "线性变换后的 Query [batch_size, seq_length, 4096]" as query_transformed
actor "线性变换后的 Key [batch_size, seq_length, 4096]" as key_transformed
actor "线性变换后的 Value [batch_size, seq_length, 4096]" as value_transformed

rectangle "拆分操作" as split_operation {
    component "拆分 Query 为 8 个头" as split_query
    component "拆分 Key 为 8 个头" as split_key
    component "拆分 Value 为 8 个头" as split_value
}

actor "8 个头的 Query [batch_size, seq_length, 8, 512]" as multi_head_query
actor "8 个头的 Key [batch_size, seq_length, 8, 512]" as multi_head_key
actor "8 个头的 Value [batch_size, seq_length, 8, 512]" as multi_head_value

rectangle "注意力计算" as attention_calculation {
    component "每个头计算注意力" as attn_calc
}

actor "8 个头的输出 [batch_size, seq_length, 8, 512]" as multi_head_output

rectangle "合并操作" as merge_operation {
    component "合并 8 个头的输出" as merge_output
}

actor "合并后的输出 [batch_size, seq_length, 4096]" as final_output

query_transformed --> split_query : 线性变换后的 Query
split_query --> multi_head_query : 8 个头的 Query

key_transformed --> split_key : 线性变换后的 Key
split_key --> multi_head_key : 8 个头的 Key

value_transformed --> split_value : 线性变换后的 Value
split_value --> multi_head_value : 8 个头的 Value

multi_head_query --> attn_calc : 8 个头的 Query
multi_head_key --> attn_calc : 8 个头的 Key
multi_head_value --> attn_calc : 8 个头的 Value

attn_calc --> multi_head_output : 8 个头的输出

multi_head_output --> merge_output : 8 个头的输出
merge_output --> final_output : 合并后的输出

@enduml

MQA

GQA

MLA

Normalization

LayerNorm

定义

Layer Normalization（层归一化）是一种常用的归一化方法，其计算公式如下：

对于一个形状为$(N, D)$的输入张量$x$，其中$N$是批量大小，$D$是特征维度。

首先，计算每个样本在特征维度上的均值$\mu$和方差$\sigma^{2}$：

$\mu=\frac{1}{D}\sum_{i = 1}^{D}x_{i}$
$\sigma^{2}=\frac{1}{D}\sum_{i = 1}^{D}(x_{i}-\mu)^{2}$

然后，对输入进行归一化：

$\hat{x}=\frac{x_{i}-\mu}{\sqrt{\sigma^{2}+\epsilon}}$

其中，$\epsilon$是一个很小的常数，通常取$1e - 5$或$1e - 8$，用于防止分母为零。

最后，通过可学习的参数$\gamma$和$\beta$对归一化后的结果进行缩放和平移：

$y=\gamma\hat{x}+\beta$

$\gamma$和$\beta$是可学习的参数，通过训练来调整，以使得模型能够更好地学习数据的特征。

BatchNorm对比

层归一化（Layer Normalization）和批量归一化（Batch Normalization）有以下区别：

归一化对象

Batch Normalization：对一个批次的数据在特征维度上进行归一化，即对不同样本的同一特征进行归一化。
Layer Normalization：对单个样本在所有特征维度上进行归一化，是在一个样本的内部对其所有特征进行操作。

计算方式

Batch Normalization：计算一个批次数据的均值和方差，然后对该批次内的所有样本进行归一化。
Layer Normalization：独立计算每个样本的均值和方差，然后对该样本进行归一化。

应用场景

Batch Normalization：适用于计算机视觉等领域，在处理图像数据时，能够有效加速模型收敛，减少梯度消失和爆炸问题。
Layer Normalization：在自然语言处理中表现较好，对于变长序列数据，能够更好地适应不同长度的输入，对每个样本独立归一化，不受批次中其他样本的影响。

对模型训练的影响

Batch Normalization：由于依赖批次统计信息，在训练和推理时的行为有所不同，可能需要进行一些额外的处理来保证模型的稳定性。
Layer Normalization：在训练和推理时的行为一致，因为它只依赖于单个样本的统计信息，不需要进行特殊的处理来适应不同的阶段。

超参数敏感性

Batch Normalization：对批次大小较为敏感，批次大小的变化可能会影响归一化的效果和模型的性能。
Layer Normalization：对批次大小不敏感，更适合于批次大小较小或者动态变化的情况。

与RMS Norm对比

RMS Norm（Root Mean Square Normalization）即均方根归一化，是一种归一化方法，与Layer Norm和Batch Norm有相似之处，但也有不同。以下是其相关介绍：

计算公式对输入张量$x$，其计算公式为$y=\frac{x}{\sqrt{E[x^{2}]+\epsilon}}$，其中$E[x^{2}]$是$x$元素平方的均值，$\epsilon$是一个小常数，防止分母为零。

与其他归一化方法的区别

计算方式：Layer Norm计算均值和方差时是在特征维度上对所有元素进行操作，Batch Norm是在一个批次内对不同样本的同一特征进行计算，而RMS Norm主要关注元素平方的均值，计算相对更简单，不涉及减去均值的操作。
对数据的影响：Layer Norm和Batch Norm会将数据归一化到均值为0、方差为1的分布，而RMS Norm主要是对数据的尺度进行调整，使其元素的均方根值处于一定范围，不一定改变数据的均值。
应用场景：RMS Norm在一些自然语言处理任务，如Transformer架构中有所应用，能在一定程度上提高模型的稳定性和泛化能力。与Layer Norm类似，它也适用于处理序列数据，但在一些具体任务中的表现可能与Layer Norm有所不同，具体使用哪种需要根据实际情况进行实验和选择。

使用场景

Pre-input Layer Normalization: (1)缓解输入数据波动较大的情况，（2）输入数据分布在重要信息也会被平滑掉一些。 Post-output Layer Normalization：稳定输出，对模型能力影响不大，主要是对分类、回归任务进行前处理。 Inter-Layer Normalization：（1）稳定数据，防止梯度爆炸或消失，（2）提高模型泛化能力，（3）计算开销大。

各模型应用情况：

GPT 在Attention和FFN之间使用Layer Norm
BERT 在每一层的输入输出使用Layer Norm
LLama在Attention和FFN之间使用RMS Norm
千问2, post_attetion, input使用RMS Norm
Whisper使用attention_ln, mlp_ln, ln_post(最后一层的后面)
Gemma，有input_layernorm, post_attention, pre_ffn, post_ffn，都是RMS Norm。
minicpm，在q, kv分别使用layer norm， attenttion, 和input。都是RMS Norm。

Activation functions

Common Activation Functions

在语言模型（LM）和大型语言模型（LLM）中，常用的激活函数有以下几种：

ReLU（Rectified Linear Unit）
- 公式：$f(x)=\max(0,x)$。
- 特点：计算简单高效，能有效缓解梯度消失问题，加快模型收敛速度。在处理自然语言中的稀疏数据时表现良好，能使模型自动学习哪些特征是重要的，将不重要的特征置为0，起到特征选择的作用。但它在输入小于0时梯度为0，可能导致部分神经元在训练过程中"死亡"，即不再被激活。
- 应用：广泛应用于各种LLM的神经网络架构中，如Transformer的前馈神经网络部分。
GELU（Gaussian Error Linear Unit）
- 公式：$GELU(x)=x\Phi(x)$，其中$\Phi(x)$是标准正态分布的累积分布函数。
- 特点：它是一种平滑的非线性激活函数，比ReLU更加柔和。GELU考虑了输入的整体分布情况，能根据输入的概率分布来调整输出，具有更好的正则化效果，有助于提高模型的泛化能力。
- 应用：在许多现代LLM中被广泛使用，如BERT、GPT - 3等都采用了GELU激活函数，以提高模型的性能和稳定性。
Swish(也叫SiLU)
- 公式：$Swish(x)=x\times\sigma(x)$，其中$sigma(x)$是Sigmoid函数。
- 特点：Swish是一种自门控激活函数，它结合了Sigmoid函数的门控机制和线性函数的特性。具有平滑、非单调的特性，在不同的输入范围内表现出不同的行为，能够更好地适应复杂的语言数据分布。同时，它在训练过程中能够保持较好的梯度流，有助于模型的收敛。
- 应用：在一些LLM的研究和实践中也有应用，例如在一些对模型性能有较高要求的自然语言处理任务中，Swish激活函数能够帮助模型更好地学习语言的复杂模式。
Softmax
- 公式：$Softmax(x_i)=\frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}$，用于将一个数值向量转换为表示各个类别概率的概率分布向量。
- 特点：它将输入值转换为0到1之间的概率值，且所有概率值之和为1，能够很好地表示分类任务中各个类别的可能性。
- 应用：通常用于LLM的输出层，将模型的输出转换为各个可能词汇或标签的概率分布，以便进行分类或生成任务。例如，在语言生成任务中，通过Softmax函数可以得到下一个单词的概率分布，从而根据概率采样或选择概率最高的单词作为生成结果。

从图中可以看出，GELU和SiLU(Swish)差不多，但是GELU计算量会小一些。

Adaptation By Models

以下是这些模型常用的激活函数：

Qwen2、Llama3：在MLP中使用SiLU。
Gemma2：PytorchGELUTanh

FFN

在Transformer架构里，前馈神经网络（FFN）有时会采用门控机制，像GLU（Gated Linear Unit）这种形式，此时会涉及到gate、up、down矩阵。接下来我会详细说明计算过程，并且借助mermaid为你绘制计算流程示意图。

计算过程

假设输入为向量 $x$，其维度是 $d_{model}$。FFN里的门控机制一般包含以下计算步骤：

线性变换：对输入 $x$ 分别做三次线性变换，得到 gate、up、down 矩阵的中间结果。这里会用到三个不同的权重矩阵 $W_{gate}$、$W_{up}$ 和 $W_{down}$，以及对应的偏置向量 $b_{gate}$、$b_{up}$ 和 $b_{down}$。
- $gate = \text{Linear}{gate}(x)=W{gate}x + b_{gate}$
- $up = \text{Linear}{up}(x)=W{up}x + b_{up}$
- $down = \text{Linear}{down}(x)=W{down}x + b_{down}$
门控操作：使用激活函数（例如Sigmoid）对 gate 进行处理，然后和 up 逐元素相乘，这一过程起到了门控的作用，能够控制信息的流通。
- $gated = \sigma(gate)\odot up$ 其中，$\sigma$ 代表激活函数，$\odot$ 表示逐元素相乘。
最终输出：把门控操作的结果和 down 相加，就得到了FFN的最终输出 (y)。
- $y = \text{ReLU}(gated + down)$

示意图

此图展示了带有门控机制的FFN的计算流程：

输入 (x) 经过线性变换得到 gate、up、down。
gate 经过Sigmoid激活后和 up 逐元素相乘，得到 gated。
gated 和 down 相加，再经过ReLU激活，最终得到输出 (y)。 ```plantuml @startuml skinparam monochrome true skinparam backgroundColor #EEEEEE skinparam sequence { ArrowColor DeepSkyBlue LifeLineBorderColor DeepSkyBlue LifeLineBackgroundColor #A9DCDF ParticipantBorderColor DeepSkyBlue ParticipantBackgroundColor #E5F6FF }

actor "输入 x" as input rectangle "线性变换" as linear { component "gate = W_gate * x + b_gate" as gate component "up = W_up * x + b_up" as up component "down = W_down * x + b_down" as down } rectangle "门控操作" as gate_op { component "Sigmoid激活" as sigmoid component "逐元素相乘" as mul component "gated = σ(gate) ⊙ up" as gated } rectangle "最终输出" as final { component "相加" as add component "ReLU激活" as relu component "输出 y" as output }

input –> gate : 线性变换 input –> up : 线性变换 input –> down : 线性变换 gate –> sigmoid : Sigmoid激活 sigmoid –> mul : σ(gate) up –> mul : up mul –> gated : gated gated –> add : gated down –> add : down add –> relu : gated + down relu –> output : ReLU激活

@enduml

## MoE

## LoRA

## ways to dive into model architecture and flops

### print(model)

```python
print(model)

whisper model

Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 512, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(512, 512, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-5): 6 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=512, out_features=512, bias=True)
          (key): Linear(in_features=512, out_features=512, bias=False)
          (value): Linear(in_features=512, out_features=512, bias=True)
          (out): Linear(in_features=512, out_features=512, bias=True)
        )
        (attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=512, out_features=2048, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (mlp_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TextDecoder(
    (token_embedding): Embedding(51865, 512)
    (blocks): ModuleList(
      (0-5): 6 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=512, out_features=512, bias=True)
          (key): Linear(in_features=512, out_features=512, bias=False)
          (value): Linear(in_features=512, out_features=512, bias=True)
          (out): Linear(in_features=512, out_features=512, bias=True)
        )
        (attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (cross_attn): MultiHeadAttention(
          (query): Linear(in_features=512, out_features=512, bias=True)
          (key): Linear(in_features=512, out_features=512, bias=False)
          (value): Linear(in_features=512, out_features=512, bias=True)
          (out): Linear(in_features=512, out_features=512, bias=True)
        )
        (cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=512, out_features=2048, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (mlp_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
)
  

vqgan_imagenet_f16_1024

参数量：89,623,492

VQModel(
  (encoder): Encoder(
    (conv_in): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (down): ModuleList(
      (0-1): 2 x Module(
        (block): ModuleList(
          (0-1): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (downsample): Downsample(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2))
        )
      )
      (2): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1))
          )
          (1): ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (downsample): Downsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2))
        )
      )
      (3): Module(
        (block): ModuleList(
          (0-1): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (downsample): Downsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2))
        )
      )
      (4): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1))
          )
          (1): ResnetBlock(
            (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
            (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList(
          (0-1): 2 x AttnBlock(
            (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
            (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
          )
        )
      )
    )
    (mid): Module(
      (block_1): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (attn_1): AttnBlock(
        (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
        (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
      )
      (block_2): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (norm_out): GroupNorm(32, 512, eps=1e-06, affine=True)
    (conv_out): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (decoder): Decoder(
    (conv_in): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (mid): Module(
      (block_1): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (attn_1): AttnBlock(
        (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
        (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
      )
      (block_2): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (up): ModuleList(
      (0): Module(
        (block): ModuleList(
          (0-2): 3 x ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
      )
      (1): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
          )
          (1-2): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (upsample): Upsample(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
      (2): Module(
        (block): ModuleList(
          (0-2): 3 x ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (upsample): Upsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
      (3): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
            (conv1): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
          )
          (1-2): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (upsample): Upsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
      (4): Module(
        (block): ModuleList(
          (0-2): 3 x ResnetBlock(
            (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
            (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList(
          (0-2): 3 x AttnBlock(
            (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
            (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
          )
        )
        (upsample): Upsample(
          (conv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
    )
    (norm_out): GroupNorm(32, 128, eps=1e-06, affine=True)
    (conv_out): Conv2d(128, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (loss): VQLPIPSWithDiscriminator(
    (perceptual_loss): LPIPS(
      (scaling_layer): ScalingLayer()
      (net): vgg16(
        (slice1): Sequential(
          (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): ReLU(inplace=True)
          (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (3): ReLU(inplace=True)
        )
        (slice2): Sequential(
          (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (6): ReLU(inplace=True)
          (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (8): ReLU(inplace=True)
        )
        (slice3): Sequential(
          (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (11): ReLU(inplace=True)
          (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (13): ReLU(inplace=True)
          (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (15): ReLU(inplace=True)
        )
        (slice4): Sequential(
          (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (18): ReLU(inplace=True)
          (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (20): ReLU(inplace=True)
          (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (22): ReLU(inplace=True)
        )
        (slice5): Sequential(
          (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (25): ReLU(inplace=True)
          (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (27): ReLU(inplace=True)
          (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (29): ReLU(inplace=True)
        )
      )
      (lin0): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin1): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(128, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin2): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin3): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin4): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
    )
    (discriminator): NLayerDiscriminator(
      (main): Sequential(
        (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (1): LeakyReLU(negative_slope=0.2, inplace=True)
        (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (4): LeakyReLU(negative_slope=0.2, inplace=True)
        (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): LeakyReLU(negative_slope=0.2, inplace=True)
        (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1), bias=False)
        (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (10): LeakyReLU(negative_slope=0.2, inplace=True)
        (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      )
    )
  )
  (quantize): VectorQuantizer2(
    (embedding): Embedding(1024, 256)
  )
  (quant_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
  (post_quant_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
)

  

vqgan_imagenet_f16_16384

参数量：91,453,380

VQModel(
  (encoder): Encoder(
    (conv_in): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (down): ModuleList(
      (0-1): 2 x Module(
        (block): ModuleList(
          (0-1): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (downsample): Downsample(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2))
        )
      )
      (2): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1))
          )
          (1): ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (downsample): Downsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2))
        )
      )
      (3): Module(
        (block): ModuleList(
          (0-1): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (downsample): Downsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2))
        )
      )
      (4): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1))
          )
          (1): ResnetBlock(
            (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
            (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList(
          (0-1): 2 x AttnBlock(
            (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
            (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
          )
        )
      )
    )
    (mid): Module(
      (block_1): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (attn_1): AttnBlock(
        (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
        (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
      )
      (block_2): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (norm_out): GroupNorm(32, 512, eps=1e-06, affine=True)
    (conv_out): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (decoder): Decoder(
    (conv_in): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (mid): Module(
      (block_1): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (attn_1): AttnBlock(
        (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
        (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
      )
      (block_2): ResnetBlock(
        (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
    )
    (up): ModuleList(
      (0): Module(
        (block): ModuleList(
          (0-2): 3 x ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
      )
      (1): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
          )
          (1-2): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 128, eps=1e-06, affine=True)
            (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 128, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (upsample): Upsample(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
      (2): Module(
        (block): ModuleList(
          (0-2): 3 x ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (upsample): Upsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
      (3): Module(
        (block): ModuleList(
          (0): ResnetBlock(
            (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
            (conv1): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (nin_shortcut): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
          )
          (1-2): 2 x ResnetBlock(
            (norm1): GroupNorm(32, 256, eps=1e-06, affine=True)
            (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 256, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList()
        (upsample): Upsample(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
      (4): Module(
        (block): ModuleList(
          (0-2): 3 x ResnetBlock(
            (norm1): GroupNorm(32, 512, eps=1e-06, affine=True)
            (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (norm2): GroupNorm(32, 512, eps=1e-06, affine=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
        (attn): ModuleList(
          (0-2): 3 x AttnBlock(
            (norm): GroupNorm(32, 512, eps=1e-06, affine=True)
            (q): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (k): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (v): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
            (proj_out): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
          )
        )
        (upsample): Upsample(
          (conv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        )
      )
    )
    (norm_out): GroupNorm(32, 128, eps=1e-06, affine=True)
    (conv_out): Conv2d(128, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (loss): VQLPIPSWithDiscriminator(
    (perceptual_loss): LPIPS(
      (scaling_layer): ScalingLayer()
      (net): vgg16(
        (slice1): Sequential(
          (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): ReLU(inplace=True)
          (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (3): ReLU(inplace=True)
        )
        (slice2): Sequential(
          (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (6): ReLU(inplace=True)
          (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (8): ReLU(inplace=True)
        )
        (slice3): Sequential(
          (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (11): ReLU(inplace=True)
          (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (13): ReLU(inplace=True)
          (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (15): ReLU(inplace=True)
        )
        (slice4): Sequential(
          (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (18): ReLU(inplace=True)
          (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (20): ReLU(inplace=True)
          (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (22): ReLU(inplace=True)
        )
        (slice5): Sequential(
          (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (25): ReLU(inplace=True)
          (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (27): ReLU(inplace=True)
          (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (29): ReLU(inplace=True)
        )
      )
      (lin0): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin1): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(128, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin2): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin3): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
      (lin4): NetLinLayer(
        (model): Sequential(
          (0): Dropout(p=0.5, inplace=False)
          (1): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
        )
      )
    )
    (discriminator): NLayerDiscriminator(
      (main): Sequential(
        (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (1): LeakyReLU(negative_slope=0.2, inplace=True)
        (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (4): LeakyReLU(negative_slope=0.2, inplace=True)
        (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1), bias=False)
        (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): LeakyReLU(negative_slope=0.2, inplace=True)
        (8): Conv2d(256, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      )
    )
  )
  (quantize): VectorQuantizer2(
    (embedding): Embedding(16384, 256)
  )
  (quant_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
  (post_quant_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
)
  

dalle_encoder

参数量：53,786,240

Encoder(
  (blocks): Sequential(
    (input): Conv2d(n_in=3, n_out=256, kw=7, use_float16=True, device=device(type='cpu'), requires_grad=False)
    (group_1): Sequential(
      (block_1): EncoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=256, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=64, n_out=256, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): EncoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=256, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=64, n_out=256, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (group_2): Sequential(
      (block_1): EncoderBlock(
        (id_path): Conv2d(n_in=256, n_out=512, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=256, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=128, n_out=512, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): EncoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=512, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=128, n_out=512, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (group_3): Sequential(
      (block_1): EncoderBlock(
        (id_path): Conv2d(n_in=512, n_out=1024, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=512, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=256, n_out=1024, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): EncoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=1024, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=256, n_out=1024, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (group_4): Sequential(
      (block_1): EncoderBlock(
        (id_path): Conv2d(n_in=1024, n_out=2048, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=1024, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=512, n_out=2048, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): EncoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=2048, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=512, n_out=2048, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
    )
    (output): Sequential(
      (relu): ReLU()
      (conv): Conv2d(n_in=2048, n_out=8192, kw=1, use_float16=False, device=device(type='cpu'), requires_grad=False)
    )
  )
)
  

decoder_dalle

参数量:43,829,766

Decoder(
  (blocks): Sequential(
    (input): Conv2d(n_in=8192, n_out=128, kw=1, use_float16=False, device=device(type='cpu'), requires_grad=False)
    (group_1): Sequential(
      (block_1): DecoderBlock(
        (id_path): Conv2d(n_in=128, n_out=2048, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=128, n_out=512, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=512, n_out=2048, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): DecoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=2048, n_out=512, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=512, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=512, n_out=2048, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (upsample): Upsample(scale_factor=2.0, mode='nearest')
    )
    (group_2): Sequential(
      (block_1): DecoderBlock(
        (id_path): Conv2d(n_in=2048, n_out=1024, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=2048, n_out=256, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=256, n_out=1024, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): DecoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=1024, n_out=256, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=256, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=256, n_out=1024, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (upsample): Upsample(scale_factor=2.0, mode='nearest')
    )
    (group_3): Sequential(
      (block_1): DecoderBlock(
        (id_path): Conv2d(n_in=1024, n_out=512, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=1024, n_out=128, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=128, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): DecoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=512, n_out=128, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=128, n_out=128, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=128, n_out=512, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (upsample): Upsample(scale_factor=2.0, mode='nearest')
    )
    (group_4): Sequential(
      (block_1): DecoderBlock(
        (id_path): Conv2d(n_in=512, n_out=256, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=512, n_out=64, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=64, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
      (block_2): DecoderBlock(
        (id_path): Identity()
        (res_path): Sequential(
          (relu_1): ReLU()
          (conv_1): Conv2d(n_in=256, n_out=64, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_2): ReLU()
          (conv_2): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_3): ReLU()
          (conv_3): Conv2d(n_in=64, n_out=64, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
          (relu_4): ReLU()
          (conv_4): Conv2d(n_in=64, n_out=256, kw=3, use_float16=True, device=device(type='cpu'), requires_grad=False)
        )
      )
    )
    (output): Sequential(
      (relu): ReLU()
      (conv): Conv2d(n_in=256, n_out=6, kw=1, use_float16=True, device=device(type='cpu'), requires_grad=False)
    )
  )
)
  

torchview

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto
from torchview import draw_graph
import torch


from calflops import calculate_flops
from transformers import AutoModel
from transformers import AutoTokenizer

batch_size, max_seq_length = 1, 128

path="/mnt/bn/znzx-public/models/Qwen2-1.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype="auto",
    device_map="auto"
)


tokenizer = AutoTokenizer.from_pretrained(path)


flops, macs, params = calculate_flops(model=model,
                                      input_shape=(batch_size,max_seq_length),
                                      transformer_tokenizer=tokenizer)
print("FLOPs:%s   MACs:%s   Params:%s \n" %(flops, macs, params))

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# print(model.model.layers)
# help(model.model.layers)

model.model.layers = torch.nn.ModuleList(model.model.layers[0:2]) # 只保留两层，防止输出太长



model_graph = draw_graph(model, input_data=model_inputs.input_ids, device=device, save_graph=True)



generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)

#out = model(model_inputs.input_ids)
#make_dot(out)


#model_graph.visual_graph

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

  

calcflops

sum([param.nelement() for param in model.parameters()])
  

model = model1024.encoder
print([ name for name, item in model1024.named_children()])
num_params = sum([param.nelement() for param in model.parameters()])
print(f"参数量:{num_params}")
#print(model)
url = "/content/drive/MyDrive/images/IMG_0567.PNG"
#x_dalle = preprocess(PIL.Image.open(url)
x_vqgan = preprocess(PIL.Image.open(url), target_image_size=1024,
map_dalle=False)
#x_dalle = x_dalle.to(DEVICE)
x_vqgan = x_vqgan.to(DEVICE)
print(x_vqgan.shape)
from thop import profile,clever_format
flops,params = profile(model, inputs=(x_vqgan,), verbose=True)
flops,params = clever_format([flops, params], "%.3f")
print("flops:", flops, "params:", params)


from calflops import calculate_flops
flops, macs, params = calculate_flops(model=model, 
                                      input_shape=(1, 3, 1024,1024),
                                      output_as_string=True,
                                      output_precision=4)
print("FLOPs:%s   MACs:%s   Params:%s \n" %(flops, macs, params))
  

model architecture and flops

code to print architecture and flops

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

from calflops import calculate_flops
from transformers import AutoModel
from transformers import AutoTokenizer

batch_size, max_seq_length = 1, 128
#model_name = ""
#model_save = "../pretrain_models/" + model_name
path = 'openbmb/MiniCPM-2B-dpo-bf16'
model_save=path
#model = AutoModel.from_pretrained(model_save)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_save)

flops, macs, params = calculate_flops(model=model,
                                      input_shape=(batch_size,max_seq_length),
                                      transformer_tokenizer=tokenizer)
print("Bert(hfl/chinese-roberta-wwm-ext) FLOPs:%s   MACs:%s   Params:%s \n" %(flops, macs, params))
  

output examples(minicpm)

 python3 minicpm.py
/root/anaconda3/envs/minicpm/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/root/anaconda3/envs/minicpm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2654: FutureWarning: The `truncation_strategy` argument is deprecated and will be removed in a future version, use `truncation=True` to truncate examples to a max length. You can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to truncate to the maximal input size of the model (e.g. 512 for Bert).  If you have pairs of inputs, you can give a specific truncation strategy selected among `truncation='only_first'` (will only truncate the first sentence in the pairs) `truncation='only_second'` (will only truncate the second sentence in the pairs) or `truncation='longest_first'` (will iteratively remove tokens from the longest sentence in the pairs).
  warnings.warn(
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  2.72 B
fwd MACs:                                                               348.76 GMACs
fwd FLOPs:                                                              697.55 GFLOPS
fwd+bwd MACs:                                                           1.05 TMACs
fwd+bwd FLOPs:                                                          2.09 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss).
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

MiniCPMForCausalLM(
  2.72 B = 100% Params, 348.76 GMACs = 100% MACs, 697.55 GFLOPS = 100% FLOPs
  (model): MiniCPMModel(
    2.72 B = 100% Params, 312.56 GMACs = 89.62% MACs, 625.15 GFLOPS = 89.62% FLOPs
    (embed_tokens): Embedding(282.82 M = 10.38% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 122753, 2304)
    (layers): ModuleList(
      (0-39): 40 x MiniCPMDecoderLayer(
        61.05 M = 2.24% Params, 7.81 GMACs = 2.24% MACs, 15.63 GFLOPS = 2.24% FLOPs
        (self_attn): MiniCPMFlashAttention2(
          21.23 M = 0.78% Params, 2.72 GMACs = 0.78% MACs, 5.44 GFLOPS = 0.78% FLOPs
          (q_proj): Linear(5.31 M = 0.19% Params, 679.48 MMACs = 0.19% MACs, 1.36 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (k_proj): Linear(5.31 M = 0.19% Params, 679.48 MMACs = 0.19% MACs, 1.36 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (v_proj): Linear(5.31 M = 0.19% Params, 679.48 MMACs = 0.19% MACs, 1.36 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (o_proj): Linear(5.31 M = 0.19% Params, 679.48 MMACs = 0.19% MACs, 1.36 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (rotary_emb): MiniCPMRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): MiniCPMMLP(
          39.81 M = 1.46% Params, 5.1 GMACs = 1.46% MACs, 10.19 GFLOPS = 1.46% FLOPs
          (gate_proj): Linear(13.27 M = 0.49% Params, 1.7 GMACs = 0.49% MACs, 3.4 GFLOPS = 0.49% FLOPs, in_features=2304, out_features=5760, bias=False)
          (up_proj): Linear(13.27 M = 0.49% Params, 1.7 GMACs = 0.49% MACs, 3.4 GFLOPS = 0.49% FLOPs, in_features=2304, out_features=5760, bias=False)
          (down_proj): Linear(13.27 M = 0.49% Params, 1.7 GMACs = 0.49% MACs, 3.4 GFLOPS = 0.49% FLOPs, in_features=5760, out_features=2304, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 737.28 KFLOPS = 0% FLOPs)
        )
        (input_layernorm): MiniCPMRMSNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (post_attention_layernorm): MiniCPMRMSNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): MiniCPMRMSNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(282.82 M = 10.38% Params, 36.2 GMACs = 10.38% MACs, 72.4 GFLOPS = 10.38% FLOPs, in_features=2304, out_features=122753, bias=False)
)
---------------------------------------------------------------------------------------------------
Bert(hfl/chinese-roberta-wwm-ext) FLOPs:697.55 GFLOPS   MACs:348.76 GMACs   Params:2.72 B
  

figures explanation

统计的实际是计算128个token的flops，所以平均一个token是697/128=5.445 GFLOPS，即2.72GMACs，与参数个数基本一致。也就是说平均一个参数参与一个乘加的计算。

从图中也可以看出参数量分布：

| 名称 | 参数量 | 份数 | 总数 | | —————– | ——- | — | —— | | embedding | 282.82M | 1 | 0.282G | | transformer block | 61.05M | 40 | 2.442G | | lm_head | 282.82M | 1 | 0.282G | | 总数 | – | – | 2.724G | minicpm是embedding和lm head共享权重的。

如果我们把seq length设置成4096呢？

/root/anaconda3/envs/minicpm/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/root/anaconda3/envs/minicpm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2654: FutureWarning: The `truncation_strategy` argument is deprecated and will be removed in a future version, use `truncation=True` to truncate examples to a max length. You can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to truncate to the maximal input size of the model (e.g. 512 for Bert).  If you have pairs of inputs, you can give a specific truncation strategy selected among `truncation='only_first'` (will only truncate the first sentence in the pairs) `truncation='only_second'` (will only truncate the second sentence in the pairs) or `truncation='longest_first'` (will iteratively remove tokens from the longest sentence in the pairs).
  warnings.warn(
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  2.72 B
fwd MACs:                                                               11.16 TMACs
fwd FLOPs:                                                              22.32 TFLOPS
fwd+bwd MACs:                                                           33.48 TMACs
fwd+bwd FLOPs:                                                          66.96 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss).
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

MiniCPMForCausalLM(
  2.72 B = 100% Params, 11.16 TMACs = 100% MACs, 22.32 TFLOPS = 100% FLOPs
  (model): MiniCPMModel(
    2.72 B = 100% Params, 10 TMACs = 89.62% MACs, 20 TFLOPS = 89.62% FLOPs
    (embed_tokens): Embedding(282.82 M = 10.38% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 122753, 2304)
    (layers): ModuleList(
      (0-39): 40 x MiniCPMDecoderLayer(
        61.05 M = 2.24% Params, 250.05 GMACs = 2.24% MACs, 500.12 GFLOPS = 2.24% FLOPs
        (self_attn): MiniCPMFlashAttention2(
          21.23 M = 0.78% Params, 86.97 GMACs = 0.78% MACs, 173.95 GFLOPS = 0.78% FLOPs
          (q_proj): Linear(5.31 M = 0.19% Params, 21.74 GMACs = 0.19% MACs, 43.49 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (k_proj): Linear(5.31 M = 0.19% Params, 21.74 GMACs = 0.19% MACs, 43.49 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (v_proj): Linear(5.31 M = 0.19% Params, 21.74 GMACs = 0.19% MACs, 43.49 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (o_proj): Linear(5.31 M = 0.19% Params, 21.74 GMACs = 0.19% MACs, 43.49 GFLOPS = 0.19% FLOPs, in_features=2304, out_features=2304, bias=False)
          (rotary_emb): MiniCPMRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): MiniCPMMLP(
          39.81 M = 1.46% Params, 163.07 GMACs = 1.46% MACs, 326.17 GFLOPS = 1.46% FLOPs
          (gate_proj): Linear(13.27 M = 0.49% Params, 54.36 GMACs = 0.49% MACs, 108.72 GFLOPS = 0.49% FLOPs, in_features=2304, out_features=5760, bias=False)
          (up_proj): Linear(13.27 M = 0.49% Params, 54.36 GMACs = 0.49% MACs, 108.72 GFLOPS = 0.49% FLOPs, in_features=2304, out_features=5760, bias=False)
          (down_proj): Linear(13.27 M = 0.49% Params, 54.36 GMACs = 0.49% MACs, 108.72 GFLOPS = 0.49% FLOPs, in_features=5760, out_features=2304, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 23.59 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MiniCPMRMSNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (post_attention_layernorm): MiniCPMRMSNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): MiniCPMRMSNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(282.82 M = 10.38% Params, 1.16 TMACs = 10.38% MACs, 2.32 TFLOPS = 10.38% FLOPs, in_features=2304, out_features=122753, bias=False)
)
---------------------------------------------------------------------------------------------------
Bert(hfl/chinese-roberta-wwm-ext) FLOPs:22.32 TFLOPS   MACs:11.16 TMACs   Params:2.72 B
  

算出来也是2.72GMACs。

real models

phi-3-mini

Phi3ForCausalLM(
  3.82 B = 100% Params, 479.69 GMACs = 100% MACs, 959.42 GFLOPS = 100% FLOPs
  (model): Phi3Model(
    3.72 B = 97.42% Params, 467.08 GMACs = 97.37% MACs, 934.21 GFLOPS = 97.37% FLOPs
    (embed_tokens): Embedding(98.5 M = 2.58% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        113.25 M = 2.96% Params, 14.6 GMACs = 3.04% MACs, 29.19 GFLOPS = 3.04% FLOPs
        (self_attn): Phi3Attention(
          37.75 M = 0.99% Params, 4.93 GMACs = 1.03% MACs, 9.87 GFLOPS = 1.03% FLOPs
          (o_proj): Linear(9.44 M = 0.25% Params, 1.21 GMACs = 0.25% MACs, 2.42 GFLOPS = 0.25% FLOPs, in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(28.31 M = 0.74% Params, 3.62 GMACs = 0.76% MACs, 7.25 GFLOPS = 0.76% FLOPs, in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3MLP(
          75.5 M = 1.98% Params, 9.66 GMACs = 2.01% MACs, 19.33 GFLOPS = 2.01% FLOPs
          (gate_up_proj): Linear(50.33 M = 1.32% Params, 6.44 GMACs = 1.34% MACs, 12.88 GFLOPS = 1.34% FLOPs, in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(25.17 M = 0.66% Params, 3.22 GMACs = 0.67% MACs, 6.44 GFLOPS = 0.67% FLOPs, in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.05 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): Phi3RMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (resid_attn_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): Phi3RMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(98.5 M = 2.58% Params, 12.61 GMACs = 2.63% MACs, 25.22 GFLOPS = 2.63% FLOPs, in_features=3072, out_features=32064, bias=False)
)
---------------------------------------------------------------------------------------------------
 FLOPs:959.42 GFLOPS   MACs:479.69 GMACs   Params:3.82 B 
  

phi-3.5-mini

架构没有变化

Total Training Params:                                                  3.82 B  
fwd MACs:                                                               479.69 GMACs
fwd FLOPs:                                                              959.42 GFLOPS
fwd+bwd MACs:                                                           1.44 TMACs
fwd+bwd FLOPs:                                                          2.88 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Phi3ForCausalLM(
  3.82 B = 100% Params, 479.69 GMACs = 100% MACs, 959.42 GFLOPS = 100% FLOPs
  (model): Phi3Model(
    3.72 B = 97.42% Params, 467.08 GMACs = 97.37% MACs, 934.21 GFLOPS = 97.37% FLOPs
    (embed_tokens): Embedding(98.5 M = 2.58% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        113.25 M = 2.96% Params, 14.6 GMACs = 3.04% MACs, 29.19 GFLOPS = 3.04% FLOPs
        (self_attn): Phi3Attention(
          37.75 M = 0.99% Params, 4.93 GMACs = 1.03% MACs, 9.87 GFLOPS = 1.03% FLOPs
          (o_proj): Linear(9.44 M = 0.25% Params, 1.21 GMACs = 0.25% MACs, 2.42 GFLOPS = 0.25% FLOPs, in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(28.31 M = 0.74% Params, 3.62 GMACs = 0.76% MACs, 7.25 GFLOPS = 0.76% FLOPs, in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3MLP(
          75.5 M = 1.98% Params, 9.66 GMACs = 2.01% MACs, 19.33 GFLOPS = 2.01% FLOPs
          (gate_up_proj): Linear(50.33 M = 1.32% Params, 6.44 GMACs = 1.34% MACs, 12.88 GFLOPS = 1.34% FLOPs, in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(25.17 M = 0.66% Params, 3.22 GMACs = 0.67% MACs, 6.44 GFLOPS = 0.67% FLOPs, in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.05 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): Phi3RMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (resid_attn_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): Phi3RMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(98.5 M = 2.58% Params, 12.61 GMACs = 2.63% MACs, 25.22 GFLOPS = 2.63% FLOPs, in_features=3072, out_features=32064, bias=False)
)
---------------------------------------------------------------------------------------------------
 FLOPs:959.42 GFLOPS   MACs:479.69 GMACs   Params:3.82 B 
  

qwen2-1.5B

Total Training Params:                                                  1.54 B  
fwd MACs:                                                               197.58 GMACs
fwd FLOPs:                                                              395.19 GFLOPS
fwd+bwd MACs:                                                           592.73 GMACs
fwd+bwd FLOPs:                                                          1.19 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Qwen2ForCausalLM(
  1.54 B = 100% Params, 197.58 GMACs = 100% MACs, 395.19 GFLOPS = 100% FLOPs
  (model): Qwen2Model(
    1.54 B = 100% Params, 167.71 GMACs = 84.88% MACs, 335.44 GFLOPS = 84.88% FLOPs
    (embed_tokens): Embedding(233.37 M = 15.12% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        46.8 M = 3.03% Params, 5.99 GMACs = 3.03% MACs, 11.98 GFLOPS = 3.03% FLOPs
        (self_attn): Qwen2SdpaAttention(
          5.51 M = 0.36% Params, 704.64 MMACs = 0.36% MACs, 1.41 GFLOPS = 0.36% FLOPs
          (q_proj): Linear(2.36 M = 0.15% Params, 301.99 MMACs = 0.15% MACs, 603.98 MFLOPS = 0.15% FLOPs, in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear(393.47 K = 0.03% Params, 50.33 MMACs = 0.03% MACs, 100.66 MFLOPS = 0.03% FLOPs, in_features=1536, out_features=256, bias=True)
          (v_proj): Linear(393.47 K = 0.03% Params, 50.33 MMACs = 0.03% MACs, 100.66 MFLOPS = 0.03% FLOPs, in_features=1536, out_features=256, bias=True)
          (o_proj): Linear(2.36 M = 0.15% Params, 301.99 MMACs = 0.15% MACs, 603.98 MFLOPS = 0.15% FLOPs, in_features=1536, out_features=1536, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Qwen2MLP(
          41.29 M = 2.67% Params, 5.28 GMACs = 2.67% MACs, 10.57 GFLOPS = 2.67% FLOPs
          (gate_proj): Linear(13.76 M = 0.89% Params, 1.76 GMACs = 0.89% MACs, 3.52 GFLOPS = 0.89% FLOPs, in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear(13.76 M = 0.89% Params, 1.76 GMACs = 0.89% MACs, 3.52 GFLOPS = 0.89% FLOPs, in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear(13.76 M = 0.89% Params, 1.76 GMACs = 0.89% MACs, 3.52 GFLOPS = 0.89% FLOPs, in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.15 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): Qwen2RMSNorm(1.54 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm(1.54 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1536,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm(1.54 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1536,), eps=1e-06)
  )
  (lm_head): Linear(233.37 M = 15.12% Params, 29.87 GMACs = 15.12% MACs, 59.74 GFLOPS = 15.12% FLOPs, in_features=1536, out_features=151936, bias=False)
)
---------------------------------------------------------------------------------------------------
FLOPs:395.19 GFLOPS   MACs:197.58 GMACs   Params:1.54 B 
  

phi-3-small

  
------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  7.39 B  
fwd MACs:                                                               945.97 GMACs
fwd FLOPs:                                                              1.89 TFLOPS
fwd+bwd MACs:                                                           2.84 TMACs
fwd+bwd FLOPs:                                                          5.68 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Phi3SmallForCausalLM(
  7.39 B = 100% Params, 945.97 GMACs = 100% MACs, 1.89 TFLOPS = 100% FLOPs
  (model): Phi3SmallModel(
    7.39 B = 100% Params, 893.35 GMACs = 94.44% MACs, 1.79 TFLOPS = 94.44% FLOPs
    (embed_tokens): Embedding(411.04 M = 5.56% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 100352, 4096)
    (embedding_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
    (layers): ModuleList(
      (0): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (1): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (2): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (3): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (4): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (5): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (6): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (7): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (8): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (9): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (10): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (11): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (12): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (13): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (14): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (15): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (16): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (17): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (18): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (19): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (20): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (21): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (22): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (23): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (24): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (25): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (26): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (27): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (28): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (29): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (30): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (_blocksparse_layer): BlockSparseAttentionLayer(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
      (31): Phi3SmallDecoderLayer(
        218.16 M = 2.95% Params, 27.92 GMACs = 2.95% MACs, 55.84 GFLOPS = 2.95% FLOPs
        (self_attn): Phi3SmallSelfAttention(
          41.95 M = 0.57% Params, 5.37 GMACs = 0.57% MACs, 10.74 GFLOPS = 0.57% FLOPs
          (query_key_value): Linear(25.17 M = 0.34% Params, 3.22 GMACs = 0.34% MACs, 6.44 GFLOPS = 0.34% FLOPs, in_features=4096, out_features=6144, bias=True)
          (dense): Linear(16.78 M = 0.23% Params, 2.15 GMACs = 0.23% MACs, 4.29 GFLOPS = 0.23% FLOPs, in_features=4096, out_features=4096, bias=True)
          (rotary_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3SmallMLP(
          176.19 M = 2.38% Params, 22.55 GMACs = 2.38% MACs, 45.1 GFLOPS = 2.38% FLOPs
          (up_proj): Linear(117.47 M = 1.59% Params, 15.03 GMACs = 1.59% MACs, 30.06 GFLOPS = 1.59% FLOPs, in_features=4096, out_features=28672, bias=True)
          (down_proj): Linear(58.72 M = 0.79% Params, 7.52 GMACs = 0.79% MACs, 15.03 GFLOPS = 0.79% FLOPs, in_features=14336, out_features=4096, bias=True)
          (dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.1, inplace=False)
        )
        (input_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
      )
    )
    (final_layernorm): LayerNorm(8.19 K = 0% Params, 0 MACs = 0% MACs, 2.62 MFLOPS = 0% FLOPs, (4096,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(411.04 M = 5.56% Params, 52.61 GMACs = 5.56% MACs, 105.23 GFLOPS = 5.56% FLOPs, in_features=4096, out_features=100352, bias=False)
)
---------------------------------------------------------------------------------------------------
minicpm FLOPs:1.89 TFLOPS   MACs:945.97 GMACs   Params:7.39 B
  

phi-3-medium

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  13.96 B 
fwd MACs:                                                               1.77 TMACs
fwd FLOPs:                                                              3.55 TFLOPS
fwd+bwd MACs:                                                           5.32 TMACs
fwd+bwd FLOPs:                                                          10.64 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Phi3ForCausalLM(
  13.96 B = 100% Params, 1.77 TMACs = 100% MACs, 3.55 TFLOPS = 100% FLOPs
  (model): Phi3Model(
    13.8 B = 98.82% Params, 1.75 TMACs = 98.81% MACs, 3.5 TFLOPS = 98.81% FLOPs
    (embed_tokens): Embedding(164.17 M = 1.18% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 32064, 5120, padding_idx=32000)
    (embed_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
    (layers): ModuleList(
      (0-39): 40 x Phi3DecoderLayer(
        340.8 M = 2.44% Params, 43.79 GMACs = 2.47% MACs, 87.58 GFLOPS = 2.47% FLOPs
        (self_attn): Phi3Attention(
          65.54 M = 0.47% Params, 8.56 GMACs = 0.48% MACs, 17.11 GFLOPS = 0.48% FLOPs
          (o_proj): Linear(26.21 M = 0.19% Params, 3.36 GMACs = 0.19% MACs, 6.71 GFLOPS = 0.19% FLOPs, in_features=5120, out_features=5120, bias=False)
          (qkv_proj): Linear(39.32 M = 0.28% Params, 5.03 GMACs = 0.28% MACs, 10.07 GFLOPS = 0.28% FLOPs, in_features=5120, out_features=7680, bias=False)
          (rotary_emb): Phi3RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Phi3MLP(
          275.25 M = 1.97% Params, 35.23 GMACs = 1.99% MACs, 70.47 GFLOPS = 1.99% FLOPs
          (gate_up_proj): Linear(183.5 M = 1.31% Params, 23.49 GMACs = 1.33% MACs, 46.98 GFLOPS = 1.33% FLOPs, in_features=5120, out_features=35840, bias=False)
          (down_proj): Linear(91.75 M = 0.66% Params, 11.74 GMACs = 0.66% MACs, 23.49 GFLOPS = 0.66% FLOPs, in_features=17920, out_features=5120, bias=False)
          (activation_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 2.29 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): Phi3RMSNorm(5.12 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (resid_attn_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm(5.12 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): Phi3RMSNorm(5.12 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(164.17 M = 1.18% Params, 21.01 GMACs = 1.19% MACs, 42.03 GFLOPS = 1.19% FLOPs, in_features=5120, out_features=32064, bias=False)
)
---------------------------------------------------------------------------------------------------
minicpm FLOPs:3.55 TFLOPS   MACs:1.77 TMACs   Params:13.96 B
  

gemma-2-27B

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  27.23 B 
fwd MACs:                                                               3.49 TMACs
fwd FLOPs:                                                              6.98 TFLOPS
fwd+bwd MACs:                                                           10.47 TMACs
fwd+bwd FLOPs:                                                          20.95 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Gemma2ForCausalLM(
  27.23 B = 100% Params, 3.49 TMACs = 100% MACs, 6.98 TFLOPS = 100% FLOPs
  (model): Gemma2Model(
    27.23 B = 100% Params, 3.34 TMACs = 95.67% MACs, 6.68 TFLOPS = 95.68% FLOPs
    (embed_tokens): Embedding(1.18 B = 4.33% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 256000, 4608, padding_idx=0)
    (layers): ModuleList(
      (0-45): 46 x Gemma2DecoderLayer(
        566.25 M = 2.08% Params, 72.61 GMACs = 2.08% MACs, 145.23 GFLOPS = 2.08% FLOPs
        (self_attn): Gemma2Attention(
          56.62 M = 0.21% Params, 7.38 GMACs = 0.21% MACs, 14.76 GFLOPS = 0.21% FLOPs
          (q_proj): Linear(18.87 M = 0.07% Params, 2.42 GMACs = 0.07% MACs, 4.83 GFLOPS = 0.07% FLOPs, in_features=4608, out_features=4096, bias=False)
          (k_proj): Linear(9.44 M = 0.03% Params, 1.21 GMACs = 0.03% MACs, 2.42 GFLOPS = 0.03% FLOPs, in_features=4608, out_features=2048, bias=False)
          (v_proj): Linear(9.44 M = 0.03% Params, 1.21 GMACs = 0.03% MACs, 2.42 GFLOPS = 0.03% FLOPs, in_features=4608, out_features=2048, bias=False)
          (o_proj): Linear(18.87 M = 0.07% Params, 2.42 GMACs = 0.07% MACs, 4.83 GFLOPS = 0.07% FLOPs, in_features=4096, out_features=4608, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Gemma2MLP(
          509.61 M = 1.87% Params, 65.23 GMACs = 1.87% MACs, 130.46 GFLOPS = 1.87% FLOPs
          (gate_proj): Linear(169.87 M = 0.62% Params, 21.74 GMACs = 0.62% MACs, 43.49 GFLOPS = 0.62% FLOPs, in_features=4608, out_features=36864, bias=False)
          (up_proj): Linear(169.87 M = 0.62% Params, 21.74 GMACs = 0.62% MACs, 43.49 GFLOPS = 0.62% FLOPs, in_features=4608, out_features=36864, bias=False)
          (down_proj): Linear(169.87 M = 0.62% Params, 21.74 GMACs = 0.62% MACs, 43.49 GFLOPS = 0.62% FLOPs, in_features=36864, out_features=4608, bias=False)
          (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 4.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): Gemma2RMSNorm(4.61 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4608,), eps=1e-06)
        (post_attention_layernorm): Gemma2RMSNorm(4.61 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4608,), eps=1e-06)
        (pre_feedforward_layernorm): Gemma2RMSNorm(4.61 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4608,), eps=1e-06)
        (post_feedforward_layernorm): Gemma2RMSNorm(4.61 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4608,), eps=1e-06)
      )
    )
    (norm): Gemma2RMSNorm(4.61 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4608,), eps=1e-06)
  )
  (lm_head): Linear(1.18 B = 4.33% Params, 150.99 GMACs = 4.33% MACs, 301.99 GFLOPS = 4.32% FLOPs, in_features=4608, out_features=256000, bias=False)
)
---------------------------------------------------------------------------------------------------
gemma-2-27b-it FLOPs:6.98 TFLOPS   MACs:3.49 TMACs   Params:27.23 B
  

glm4-9B

  
------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  9.4 B   
fwd MACs:                                                               1.12 TMACs
fwd FLOPs:                                                              2.25 TFLOPS
fwd+bwd MACs:                                                           3.37 TMACs
fwd+bwd FLOPs:                                                          6.74 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

ChatGLMForConditionalGeneration(
  9.4 B = 100% Params, 1.12 TMACs = 100% MACs, 2.25 TFLOPS = 100% FLOPs
  (transformer): ChatGLMModel(
    9.4 B = 100% Params, 1.12 TMACs = 100% MACs, 2.25 TFLOPS = 100% FLOPs
    (embedding): Embedding(
      620.76 M = 6.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
      (word_embeddings): Embedding(620.76 M = 6.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 151552, 4096)
    )
    (rotary_pos_emb): RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
    (encoder): GLMTransformer(
      8.16 B = 86.79% Params, 1.04 TMACs = 92.93% MACs, 2.09 TFLOPS = 92.93% FLOPs
      (layers): ModuleList(
        (0-39): 40 x GLMBlock(
          203.96 M = 2.17% Params, 26.11 GMACs = 2.32% MACs, 52.21 GFLOPS = 2.32% FLOPs
          (input_layernorm): RMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (self_attention): SelfAttention(
            35.66 M = 0.38% Params, 4.56 GMACs = 0.41% MACs, 9.13 GFLOPS = 0.41% FLOPs
            (query_key_value): Linear(18.88 M = 0.2% Params, 2.42 GMACs = 0.22% MACs, 4.83 GFLOPS = 0.21% FLOPs, in_features=4096, out_features=4608, bias=True)
            (core_attention): SdpaAttention(
              0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (attention_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
            )
            (dense): Linear(16.78 M = 0.18% Params, 2.15 GMACs = 0.19% MACs, 4.29 GFLOPS = 0.19% FLOPs, in_features=4096, out_features=4096, bias=False)
          )
          (post_attention_layernorm): RMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (mlp): MLP(
            168.3 M = 1.79% Params, 21.54 GMACs = 1.92% MACs, 43.09 GFLOPS = 1.92% FLOPs
            (dense_h_to_4h): Linear(112.2 M = 1.19% Params, 14.36 GMACs = 1.28% MACs, 28.72 GFLOPS = 1.28% FLOPs, in_features=4096, out_features=27392, bias=False)
            (dense_4h_to_h): Linear(56.1 M = 0.6% Params, 7.18 GMACs = 0.64% MACs, 14.36 GFLOPS = 0.64% FLOPs, in_features=13696, out_features=4096, bias=False)
          )
        )
      )
      (final_layernorm): RMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
    )
    (output_layer): Linear(620.76 M = 6.6% Params, 79.46 GMACs = 7.07% MACs, 158.91 GFLOPS = 7.07% FLOPs, in_features=4096, out_features=151552, bias=False)
  )
)
---------------------------------------------------------------------------------------------------
/mnt/bn/znzx-public/models/glm-4-9b-chat FLOPs:2.25 TFLOPS   MACs:1.12 TMACs   Params:9.4 B
  

llama 3.1-8B

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  8.03 B  
fwd MACs:                                                               960.6 GMACs
fwd FLOPs:                                                              1.92 TFLOPS
fwd+bwd MACs:                                                           2.88 TMACs
fwd+bwd FLOPs:                                                          5.76 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

LlamaForCausalLM(
  8.03 B = 100% Params, 960.6 GMACs = 100% MACs, 1.92 TFLOPS = 100% FLOPs
  (model): LlamaModel(
    7.5 B = 93.46% Params, 893.35 GMACs = 93% MACs, 1.79 TFLOPS = 93% FLOPs
    (embed_tokens): Embedding(525.34 M = 6.54% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        218.11 M = 2.72% Params, 27.92 GMACs = 2.91% MACs, 55.84 GFLOPS = 2.91% FLOPs
        (self_attn): LlamaSdpaAttention(
          41.94 M = 0.52% Params, 5.37 GMACs = 0.56% MACs, 10.74 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.21% Params, 2.15 GMACs = 0.22% MACs, 4.29 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.05% Params, 536.87 MMACs = 0.06% MACs, 1.07 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.05% Params, 536.87 MMACs = 0.06% MACs, 1.07 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.21% Params, 2.15 GMACs = 0.22% MACs, 4.29 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): LlamaMLP(
          176.16 M = 2.19% Params, 22.55 GMACs = 2.35% MACs, 45.1 GFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.73% Params, 7.52 GMACs = 0.78% MACs, 15.03 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.73% Params, 7.52 GMACs = 0.78% MACs, 15.03 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.73% Params, 7.52 GMACs = 0.78% MACs, 15.03 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.84 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): LlamaRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(525.34 M = 6.54% Params, 67.24 GMACs = 7% MACs, 134.49 GFLOPS = 7% FLOPs, in_features=4096, out_features=128256, bias=False)
)
---------------------------------------------------------------------------------------------------
Llama-3.1-8B-Ultra-Instruct FLOPs:1.92 TFLOPS   MACs:960.6 GMACs   Params:8.03 B
  

qwen2-7B

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  7.62 B  
fwd MACs:                                                               905 GMACs
fwd FLOPs:                                                              1.81 TFLOPS
fwd+bwd MACs:                                                           2.71 TMACs
fwd+bwd FLOPs:                                                          5.43 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Qwen2ForCausalLM(
  7.62 B = 100% Params, 905 GMACs = 100% MACs, 1.81 TFLOPS = 100% FLOPs
  (model): Qwen2Model(
    7.07 B = 92.84% Params, 835.24 GMACs = 92.29% MACs, 1.67 TFLOPS = 92.29% FLOPs
    (embed_tokens): Embedding(545 M = 7.16% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        233.06 M = 3.06% Params, 29.83 GMACs = 3.3% MACs, 59.66 GFLOPS = 3.3% FLOPs
        (self_attn): Qwen2SdpaAttention(
          29.36 M = 0.39% Params, 3.76 GMACs = 0.42% MACs, 7.52 GFLOPS = 0.42% FLOPs
          (q_proj): Linear(12.85 M = 0.17% Params, 1.64 GMACs = 0.18% MACs, 3.29 GFLOPS = 0.18% FLOPs, in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(1.84 M = 0.02% Params, 234.88 MMACs = 0.03% MACs, 469.76 MFLOPS = 0.03% FLOPs, in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(1.84 M = 0.02% Params, 234.88 MMACs = 0.03% MACs, 469.76 MFLOPS = 0.03% FLOPs, in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(12.85 M = 0.17% Params, 1.64 GMACs = 0.18% MACs, 3.29 GFLOPS = 0.18% FLOPs, in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Qwen2MLP(
          203.69 M = 2.67% Params, 26.07 GMACs = 2.88% MACs, 52.15 GFLOPS = 2.88% FLOPs
          (gate_proj): Linear(67.9 M = 0.89% Params, 8.69 GMACs = 0.96% MACs, 17.38 GFLOPS = 0.96% FLOPs, in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(67.9 M = 0.89% Params, 8.69 GMACs = 0.96% MACs, 17.38 GFLOPS = 0.96% FLOPs, in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(67.9 M = 0.89% Params, 8.69 GMACs = 0.96% MACs, 17.38 GFLOPS = 0.96% FLOPs, in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 2.42 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): Qwen2RMSNorm(3.58 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm(3.58 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm(3.58 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (3584,), eps=1e-06)
  )
  (lm_head): Linear(545 M = 7.16% Params, 69.76 GMACs = 7.71% MACs, 139.52 GFLOPS = 7.71% FLOPs, in_features=3584, out_features=152064, bias=False)
)
---------------------------------------------------------------------------------------------------
Qwen2-7B-Instruct FLOPs:1.81 TFLOPS   MACs:905 GMACs   Params:7.62 B
  

minicpm3-4B

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  4.07 B  
fwd MACs:                                                               521.41 GMACs
fwd FLOPs:                                                              1.04 TFLOPS
fwd+bwd MACs:                                                           1.56 TMACs
fwd+bwd FLOPs:                                                          3.13 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

MiniCPM3ForCausalLM(
  4.07 B = 100% Params, 521.41 GMACs = 100% MACs, 1.04 TFLOPS = 100% FLOPs
  (model): MiniCPM3Model(
    4.07 B = 100% Params, 497.34 GMACs = 95.38% MACs, 994.73 GFLOPS = 95.38% FLOPs
    (embed_tokens): Embedding(188.03 M = 4.62% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 73448, 2560)
    (layers): ModuleList(
      (0-61): 62 x MiniCPMDecoderLayer(
        62.67 M = 1.54% Params, 8.02 GMACs = 1.54% MACs, 16.04 GFLOPS = 1.54% FLOPs
        (self_attn): MiniCPMFlashAttention2(
          13.52 M = 0.33% Params, 1.73 GMACs = 0.33% MACs, 3.46 GFLOPS = 0.33% FLOPs
          (q_a_proj): Linear(1.97 M = 0.05% Params, 251.66 MMACs = 0.05% MACs, 503.32 MFLOPS = 0.05% FLOPs, in_features=2560, out_features=768, bias=False)
          (q_a_layernorm): MiniCPMRMSNorm(768 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (q_b_proj): Linear(2.95 M = 0.07% Params, 377.49 MMACs = 0.07% MACs, 754.97 MFLOPS = 0.07% FLOPs, in_features=768, out_features=3840, bias=False)
          (kv_a_proj_with_mqa): Linear(737.28 K = 0.02% Params, 94.37 MMACs = 0.02% MACs, 188.74 MFLOPS = 0.02% FLOPs, in_features=2560, out_features=288, bias=False)
          (kv_a_layernorm): MiniCPMRMSNorm(256 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (kv_b_proj): Linear(1.31 M = 0.03% Params, 167.77 MMACs = 0.03% MACs, 335.54 MFLOPS = 0.03% FLOPs, in_features=256, out_features=5120, bias=False)
          (o_proj): Linear(6.55 M = 0.16% Params, 838.86 MMACs = 0.16% MACs, 1.68 GFLOPS = 0.16% FLOPs, in_features=2560, out_features=2560, bias=False)
          (rotary_emb): MiniCPMLongRoPE(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): MiniCPMMLP(
          49.15 M = 1.21% Params, 6.29 GMACs = 1.21% MACs, 12.58 GFLOPS = 1.21% FLOPs
          (gate_proj): Linear(16.38 M = 0.4% Params, 2.1 GMACs = 0.4% MACs, 4.19 GFLOPS = 0.4% FLOPs, in_features=2560, out_features=6400, bias=False)
          (up_proj): Linear(16.38 M = 0.4% Params, 2.1 GMACs = 0.4% MACs, 4.19 GFLOPS = 0.4% FLOPs, in_features=2560, out_features=6400, bias=False)
          (down_proj): Linear(16.38 M = 0.4% Params, 2.1 GMACs = 0.4% MACs, 4.19 GFLOPS = 0.4% FLOPs, in_features=6400, out_features=2560, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 819.2 KFLOPS = 0% FLOPs)
        )
        (input_layernorm): MiniCPMRMSNorm(2.56 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (post_attention_layernorm): MiniCPMRMSNorm(2.56 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): MiniCPMRMSNorm(2.56 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(188.03 M = 4.62% Params, 24.07 GMACs = 4.62% MACs, 48.13 GFLOPS = 4.62% FLOPs, in_features=2560, out_features=73448, bias=False)
)
---------------------------------------------------------------------------------------------------
MiniCPM3-4B FLOPs:1.04 TFLOPS   MACs:521.41 GMACs   Params:4.07 B
  

qwen2-0.5B

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  494.03 M
fwd MACs:                                                               63.23 GMACs
fwd FLOPs:                                                              126.47 GFLOPS
fwd+bwd MACs:                                                           189.68 GMACs
fwd+bwd FLOPs:                                                          379.41 GFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Qwen2ForCausalLM(
  494.03 M = 100% Params, 63.23 GMACs = 100% MACs, 126.47 GFLOPS = 100% FLOPs
  (model): Qwen2Model(
    494.03 M = 100% Params, 45.8 GMACs = 72.44% MACs, 91.62 GFLOPS = 72.44% FLOPs
    (embed_tokens): Embedding(136.13 M = 27.56% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        14.91 M = 3.02% Params, 1.91 GMACs = 3.02% MACs, 3.82 GFLOPS = 3.02% FLOPs
        (self_attn): Qwen2SdpaAttention(
          1.84 M = 0.37% Params, 234.88 MMACs = 0.37% MACs, 469.76 MFLOPS = 0.37% FLOPs
          (q_proj): Linear(803.71 K = 0.16% Params, 102.76 MMACs = 0.16% MACs, 205.52 MFLOPS = 0.16% FLOPs, in_features=896, out_features=896, bias=True)
          (k_proj): Linear(114.82 K = 0.02% Params, 14.68 MMACs = 0.02% MACs, 29.36 MFLOPS = 0.02% FLOPs, in_features=896, out_features=128, bias=True)
          (v_proj): Linear(114.82 K = 0.02% Params, 14.68 MMACs = 0.02% MACs, 29.36 MFLOPS = 0.02% FLOPs, in_features=896, out_features=128, bias=True)
          (o_proj): Linear(802.82 K = 0.16% Params, 102.76 MMACs = 0.16% MACs, 205.52 MFLOPS = 0.16% FLOPs, in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Qwen2MLP(
          13.07 M = 2.65% Params, 1.67 GMACs = 2.65% MACs, 3.35 GFLOPS = 2.65% FLOPs
          (gate_proj): Linear(4.36 M = 0.88% Params, 557.84 MMACs = 0.88% MACs, 1.12 GFLOPS = 0.88% FLOPs, in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(4.36 M = 0.88% Params, 557.84 MMACs = 0.88% MACs, 1.12 GFLOPS = 0.88% FLOPs, in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(4.36 M = 0.88% Params, 557.84 MMACs = 0.88% MACs, 1.12 GFLOPS = 0.88% FLOPs, in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 622.59 KFLOPS = 0% FLOPs)
        )
        (input_layernorm): Qwen2RMSNorm(896 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (post_attention_layernorm): Qwen2RMSNorm(896 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): Qwen2RMSNorm(896 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(136.13 M = 27.56% Params, 17.43 GMACs = 27.56% MACs, 34.85 GFLOPS = 27.56% FLOPs, in_features=896, out_features=151936, bias=False)
)
---------------------------------------------------------------------------------------------------
Qwen2-0.5B-Instruct FLOPs:126.47 GFLOPS   MACs:63.23 GMACs   Params:494.03 M
  

Llama 3.2 1B

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  1.24 B  
fwd MACs:                                                               5.06 TMACs
fwd FLOPs:                                                              10.12 TFLOPS
fwd+bwd MACs:                                                           15.18 TMACs
fwd+bwd FLOPs:                                                          30.37 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

LlamaForCausalLM(
  1.24 B = 100% Params, 5.06 TMACs = 100% MACs, 10.12 TFLOPS = 100% FLOPs
  (model): LlamaModel(
    1.24 B = 100% Params, 3.99 TMACs = 78.74% MACs, 7.97 TFLOPS = 78.75% FLOPs
    (embed_tokens): Embedding(262.67 M = 21.25% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        60.82 M = 4.92% Params, 249.11 GMACs = 4.92% MACs, 498.25 GFLOPS = 4.92% FLOPs
        (self_attn): LlamaSdpaAttention(
          10.49 M = 0.85% Params, 42.95 GMACs = 0.85% MACs, 85.9 GFLOPS = 0.85% FLOPs
          (q_proj): Linear(4.19 M = 0.34% Params, 17.18 GMACs = 0.34% MACs, 34.36 GFLOPS = 0.34% FLOPs, in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(1.05 M = 0.08% Params, 4.29 GMACs = 0.08% MACs, 8.59 GFLOPS = 0.08% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0.08% Params, 4.29 GMACs = 0.08% MACs, 8.59 GFLOPS = 0.08% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(4.19 M = 0.34% Params, 17.18 GMACs = 0.34% MACs, 34.36 GFLOPS = 0.34% FLOPs, in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): LlamaMLP(
          50.33 M = 4.07% Params, 206.16 GMACs = 4.07% MACs, 412.35 GFLOPS = 4.07% FLOPs
          (gate_proj): Linear(16.78 M = 1.36% Params, 68.72 GMACs = 1.36% MACs, 137.44 GFLOPS = 1.36% FLOPs, in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(16.78 M = 1.36% Params, 68.72 GMACs = 1.36% MACs, 137.44 GFLOPS = 1.36% FLOPs, in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(16.78 M = 1.36% Params, 68.72 GMACs = 1.36% MACs, 137.44 GFLOPS = 1.36% FLOPs, in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 33.55 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): LlamaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(262.67 M = 21.25% Params, 1.08 TMACs = 21.26% MACs, 2.15 TFLOPS = 21.25% FLOPs, in_features=2048, out_features=128256, bias=False)
)
---------------------------------------------------------------------------------------------------
Llama-3.2-1B-Instruct FLOPs:10.12 TFLOPS   MACs:5.06 TMACs   Params:1.24 B
  

llama. 3.2 3B

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  3.21 B  
fwd MACs:                                                               13.16 TMACs
fwd FLOPs:                                                              26.32 TFLOPS
fwd+bwd MACs:                                                           39.48 TMACs
fwd+bwd FLOPs:                                                          78.96 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

LlamaForCausalLM(
  3.21 B = 100% Params, 13.16 TMACs = 100% MACs, 26.32 TFLOPS = 100% FLOPs
  (model): LlamaModel(
    3.21 B = 100% Params, 11.54 TMACs = 87.74% MACs, 23.09 TFLOPS = 87.74% FLOPs
    (embed_tokens): Embedding(394 M = 12.26% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        100.67 M = 3.13% Params, 412.32 GMACs = 3.13% MACs, 824.67 GFLOPS = 3.13% FLOPs
        (self_attn): LlamaSdpaAttention(
          25.17 M = 0.78% Params, 103.08 GMACs = 0.78% MACs, 206.16 GFLOPS = 0.78% FLOPs
          (q_proj): Linear(9.44 M = 0.29% Params, 38.65 GMACs = 0.29% MACs, 77.31 GFLOPS = 0.29% FLOPs, in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear(3.15 M = 0.1% Params, 12.88 GMACs = 0.1% MACs, 25.77 GFLOPS = 0.1% FLOPs, in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear(3.15 M = 0.1% Params, 12.88 GMACs = 0.1% MACs, 25.77 GFLOPS = 0.1% FLOPs, in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear(9.44 M = 0.29% Params, 38.65 GMACs = 0.29% MACs, 77.31 GFLOPS = 0.29% FLOPs, in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): LlamaMLP(
          75.5 M = 2.35% Params, 309.24 GMACs = 2.35% MACs, 618.51 GFLOPS = 2.35% FLOPs
          (gate_proj): Linear(25.17 M = 0.78% Params, 103.08 GMACs = 0.78% MACs, 206.16 GFLOPS = 0.78% FLOPs, in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear(25.17 M = 0.78% Params, 103.08 GMACs = 0.78% MACs, 206.16 GFLOPS = 0.78% FLOPs, in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear(25.17 M = 0.78% Params, 103.08 GMACs = 0.78% MACs, 206.16 GFLOPS = 0.78% FLOPs, in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 33.55 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): LlamaRMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (3072,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm(3.07 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (3072,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(394 M = 12.26% Params, 1.61 TMACs = 12.26% MACs, 3.23 TFLOPS = 12.26% FLOPs, in_features=3072, out_features=128256, bias=False)
)
---------------------------------------------------------------------------------------------------
Llama-3.2-3B-Instruct FLOPs:26.32 TFLOPS   MACs:13.16 TMACs   Params:3.21 B
  

Llama-3.2-11B-Vision-Instruct

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  9.78 B  
fwd MACs:                                                               30.74 TMACs
fwd FLOPs:                                                              61.48 TFLOPS
fwd+bwd MACs:                                                           92.22 TMACs
fwd+bwd FLOPs:                                                          184.44 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

MllamaForCausalLM(
  9.78 B = 100% Params, 30.74 TMACs = 100% MACs, 61.48 TFLOPS = 100% FLOPs
  (model): MllamaTextModel(
    9.25 B = 94.63% Params, 28.59 TMACs = 93% MACs, 57.18 TFLOPS = 93% FLOPs
    (embed_tokens): Embedding(525.37 M = 5.37% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 128264, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-2): 3 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (3): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (4-7): 4 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (8): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (9-12): 4 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (13): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (14-17): 4 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (18): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (19-22): 4 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (23): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (24-27): 4 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (28): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (29-32): 4 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (33): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (34-37): 4 x MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (38): MllamaCrossAttentionDecoderLayer(
        218.11 M = 2.23% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (cross_attn): MllamaTextCrossSdpaAttention(
          41.94 M = 0.43% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
      (39): MllamaSelfAttentionDecoderLayer(
        218.11 M = 2.23% Params, 893.35 GMACs = 2.91% MACs, 1.79 TFLOPS = 2.91% FLOPs
        (self_attn): MllamaTextSelfSdpaAttention(
          41.94 M = 0.43% Params, 171.8 GMACs = 0.56% MACs, 343.6 GFLOPS = 0.56% FLOPs
          (q_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(4.19 M = 0.04% Params, 17.18 GMACs = 0.06% MACs, 34.36 GFLOPS = 0.06% FLOPs, in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(16.78 M = 0.17% Params, 68.72 GMACs = 0.22% MACs, 137.44 GFLOPS = 0.22% FLOPs, in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          176.16 M = 1.8% Params, 721.55 GMACs = 2.35% MACs, 1.44 TFLOPS = 2.35% FLOPs
          (gate_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(58.72 M = 0.6% Params, 240.52 GMACs = 0.78% MACs, 481.04 GFLOPS = 0.78% FLOPs, in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 58.72 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
      )
    )
    (norm): MllamaTextRMSNorm(4.1 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (4096,), eps=1e-05)
    (rotary_emb): MllamaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(525.34 M = 5.37% Params, 2.15 TMACs = 7% MACs, 4.3 TFLOPS = 7% FLOPs, in_features=4096, out_features=128256, bias=False)
)
---------------------------------------------------------------------------------------------------
Llama-3.2-11B-Vision-Instruct FLOPs:61.48 TFLOPS   MACs:30.74 TMACs   Params:9.78 B
  

MllamaForConditionalGeneration(
  (vision_model): MllamaVisionModel(
    (patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), padding=valid, bias=False)
    (gated_positional_embedding): MllamaPrecomputedPositionEmbedding(
      (tile_embedding): Embedding(9, 8197120)
    )
    (pre_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
      (embedding): Embedding(9, 5120)
    )
    (post_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
      (embedding): Embedding(9, 5120)
    )
    (layernorm_pre): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
    (layernorm_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
    (transformer): MllamaVisionEncoder(
      (layers): ModuleList(
        (0-31): 32 x MllamaVisionEncoderLayer(
          (self_attn): MllamaVisionSdpaAttention(
            (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
          )
          (mlp): MllamaVisionMLP(
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
          )
          (input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (global_transformer): MllamaVisionEncoder(
      (layers): ModuleList(
        (0-7): 8 x MllamaVisionEncoderLayer(
          (self_attn): MllamaVisionSdpaAttention(
            (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
          )
          (mlp): MllamaVisionMLP(
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
          )
          (input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (language_model): MllamaForCausalLM(
    (model): MllamaTextModel(
      (embed_tokens): Embedding(128264, 4096, padding_idx=128004)
      (layers): ModuleList(
        (0-2): 3 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (3): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (4-7): 4 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (8): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (9-12): 4 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (13): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (14-17): 4 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (18): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (19-22): 4 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (23): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (24-27): 4 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (28): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (29-32): 4 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (33): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (34-37): 4 x MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (38): MllamaCrossAttentionDecoderLayer(
          (cross_attn): MllamaTextCrossSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
            (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
        (39): MllamaSelfAttentionDecoderLayer(
          (self_attn): MllamaTextSelfSdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          )
          (mlp): MllamaTextMLP(
            (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
            (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        )
      )
      (norm): MllamaTextRMSNorm((4096,), eps=1e-05)
      (rotary_emb): MllamaRotaryEmbedding()
    )
    (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
  )
  (multi_modal_projector): Linear(in_features=7680, out_features=4096, bias=True)
)
  

Qwen2.3-3B-Instruct

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  3.09 B  
fwd MACs:                                                               12.64 TMACs
fwd FLOPs:                                                              25.28 TFLOPS
fwd+bwd MACs:                                                           37.92 TMACs
fwd+bwd FLOPs:                                                          75.84 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Qwen2ForCausalLM(
  3.09 B = 100% Params, 12.64 TMACs = 100% MACs, 25.28 TFLOPS = 100% FLOPs
  (model): Qwen2Model(
    3.09 B = 100% Params, 11.36 TMACs = 89.92% MACs, 22.73 TFLOPS = 89.92% FLOPs
    (embed_tokens): Embedding(311.16 M = 10.08% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 151936, 2048)
    (layers): ModuleList(
      (0-35): 36 x Qwen2DecoderLayer(
        77.08 M = 2.5% Params, 315.68 GMACs = 2.5% MACs, 631.41 GFLOPS = 2.5% FLOPs
        (self_attn): Qwen2SdpaAttention(
          9.44 M = 0.31% Params, 38.65 GMACs = 0.31% MACs, 77.31 GFLOPS = 0.31% FLOPs
          (q_proj): Linear(4.2 M = 0.14% Params, 17.18 GMACs = 0.14% MACs, 34.36 GFLOPS = 0.14% FLOPs, in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(524.54 K = 0.02% Params, 2.15 GMACs = 0.02% MACs, 4.29 GFLOPS = 0.02% FLOPs, in_features=2048, out_features=256, bias=True)
          (v_proj): Linear(524.54 K = 0.02% Params, 2.15 GMACs = 0.02% MACs, 4.29 GFLOPS = 0.02% FLOPs, in_features=2048, out_features=256, bias=True)
          (o_proj): Linear(4.19 M = 0.14% Params, 17.18 GMACs = 0.14% MACs, 34.36 GFLOPS = 0.14% FLOPs, in_features=2048, out_features=2048, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (mlp): Qwen2MLP(
          67.63 M = 2.19% Params, 277.03 GMACs = 2.19% MACs, 554.1 GFLOPS = 2.19% FLOPs
          (gate_proj): Linear(22.54 M = 0.73% Params, 92.34 GMACs = 0.73% MACs, 184.68 GFLOPS = 0.73% FLOPs, in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear(22.54 M = 0.73% Params, 92.34 GMACs = 0.73% MACs, 184.68 GFLOPS = 0.73% FLOPs, in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear(22.54 M = 0.73% Params, 92.34 GMACs = 0.73% MACs, 184.68 GFLOPS = 0.73% FLOPs, in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 45.09 MFLOPS = 0% FLOPs)
        )
        (input_layernorm): Qwen2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (post_attention_layernorm): Qwen2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): Qwen2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(311.16 M = 10.08% Params, 1.27 TMACs = 10.08% MACs, 2.55 TFLOPS = 10.08% FLOPs, in_features=2048, out_features=151936, bias=False)
)
---------------------------------------------------------------------------------------------------
/mnt/bn/znzx-public/models/Qwen2.5-3B-Instruct FLOPs:25.28 TFLOPS   MACs:12.64 TMACs   Params:3.09 B
  

加了lora的qwen2-1.5B

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 2048)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): lora.Linear(
            (base_layer): Linear(in_features=2048, out_features=2048, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=2048, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=2048, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): lora.Linear(
            (base_layer): Linear(in_features=2048, out_features=2048, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=2048, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=2048, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=2048, out_features=5504, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5504, bias=False)
          (down_proj): Linear(in_features=5504, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((2048,), eps=1e-06)
  )
  (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)
  

clip model

tensor([[49406,   320,  1125,   539,   320,  2368, 49407],
        [49406,   320,  1125,   539,   320,  1929, 49407]])
CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (vision_model): CLIPVisionTransformer(
    (embeddings): CLIPVisionEmbeddings(
      (patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16), bias=False)
      (position_embedding): Embedding(197, 768)
    )
    (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (visual_projection): Linear(in_features=768, out_features=512, bias=False)
  (text_projection): Linear(in_features=512, out_features=512, bias=False)
)
outputs text: torch.Size([2, 512]), image:torch.Size([1, 512])
  

clip vision embeddings

print(inputs)
print(model.vision_model.embeddings)
print(f"input shape: {inputs['pixel_values'].shape}")
embedding_output = model.vision_model.embeddings(inputs['pixel_values'])
print(f"output shape: {embedding_output.shape}")
  

{'input_ids': tensor([[49406,   320,  1125,   539,   320,  2368, 49407],
        [49406,   320,  1125,   539,   320,  1929, 49407]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]]), 'pixel_values': tensor([[[[ 0.5873,  0.5873,  0.6165,  ...,  0.0617,  0.0471, -0.0259],
          [ 0.5727,  0.5727,  0.6603,  ...,  0.1201,  0.0763,  0.0909],
          [ 0.5873,  0.5435,  0.6165,  ...,  0.0325,  0.1201,  0.0617],
          ...,
          [ 1.8719,  1.8573,  1.8719,  ...,  1.3902,  1.4340,  1.4194],
          [ 1.8281,  1.8719,  1.8427,  ...,  1.4486,  1.4340,  1.5070],
          [ 1.8573,  1.9011,  1.8281,  ...,  1.3756,  1.3610,  1.4486]],

         [[-1.3169, -1.3019, -1.3169,  ..., -1.4970, -1.4369, -1.4820],
          [-1.2418, -1.2718, -1.2268,  ..., -1.4369, -1.4669, -1.4519],
          [-1.2568, -1.3169, -1.2268,  ..., -1.4669, -1.4069, -1.4519],
          ...,
          [ 0.1239,  0.1089,  0.1239,  ..., -0.7016, -0.6865, -0.6865],
          [ 0.0789,  0.0939,  0.0488,  ..., -0.6565, -0.6865, -0.6115],
          [ 0.0939,  0.1089,  0.0038,  ..., -0.7766, -0.7316, -0.6115]],

         [[-0.4848, -0.4137, -0.3853,  ..., -0.9541, -0.8545, -0.8545],
          [-0.4137, -0.4706, -0.3711,  ..., -0.8119, -0.8545, -0.7834],
          [-0.3284, -0.4422, -0.3853,  ..., -0.8688, -0.8119, -0.8830],
          ...,
          [ 1.5771,  1.6482,  1.6340,  ...,  0.9088,  0.9514,  0.8945],
          [ 1.6198,  1.6055,  1.6055,  ...,  0.8661,  0.8092,  0.7950],
          [ 1.6624,  1.6766,  1.5487,  ...,  0.7950,  0.8661,  0.8519]]]])}
CLIPVisionEmbeddings(
  (patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16), bias=False)
  (position_embedding): Embedding(197, 768)
)
input shape: torch.Size([1, 3, 224, 224])
output shape: torch.Size([1, 197, 768])
  

image embeddings flops

from thop import profile,clever_format
flops,params = profile(model.vision_model.embeddings, inputs=(inputs['pixel_values'],), verbose=True)
flops,params = clever_format([flops, params], "%.3f")
print("flops:", flops, "params:", params)
  

[INFO] Register count_convNd() for <class 'torch.nn.modules.conv.Conv2d'>.
flops: 115.606M params: 589.824K
  

vision model flops

from thop import profile,clever_format
flops,params = profile(model.vision_model, inputs=(inputs['pixel_values'],), verbose=True)
flops,params = clever_format([flops, params], "%.3f")
print("flops:", flops, "params:", params)
  

[INFO] Register count_convNd() for <class 'torch.nn.modules.conv.Conv2d'>.
[INFO] Register count_normalization() for <class 'torch.nn.modules.normalization.LayerNorm'>.
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.
flops: 33.726G params: 85.647M
  

85M * 196 * 2(flops/mac) = 33G 符合预期

text model flops

CLIP architecture(graph)

from thop import profile,clever_format

print(f"input shape: {inputs['input_ids'].shape}")

flops,params = profile(model.text_model, inputs=(inputs['input_ids'],), verbose=True)

flops,params = clever_format([flops, params], "%.3f")

print("flops:", flops, "params:", params)

input shape: torch.Size([2, 7])
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.
[INFO] Register count_normalization() for <class 'torch.nn.modules.normalization.LayerNorm'>.
flops: 1.058G params: 37.830M

  

38M * 2 * 7 * 2 flops/mac = 1.06G flops 符合预期

Qwen3-30B-A3B

Loading checkpoint shards:   0%|          | 0/16 [00:00<?, ?it/s]
Loading checkpoint shards:   6%|▋         | 1/16 [00:01<00:25,  1.68s/it]
Loading checkpoint shards:  12%|█▎        | 2/16 [00:03<00:25,  1.81s/it]
Loading checkpoint shards:  19%|█▉        | 3/16 [00:05<00:23,  1.83s/it]
Loading checkpoint shards:  25%|██▌       | 4/16 [00:07<00:22,  1.86s/it]
Loading checkpoint shards:  31%|███▏      | 5/16 [00:09<00:20,  1.91s/it]
Loading checkpoint shards:  38%|███▊      | 6/16 [00:11<00:19,  1.93s/it]
Loading checkpoint shards:  44%|████▍     | 7/16 [00:13<00:17,  1.94s/it]
Loading checkpoint shards:  50%|█████     | 8/16 [00:15<00:15,  1.96s/it]
Loading checkpoint shards:  56%|█████▋    | 9/16 [00:17<00:13,  1.98s/it]
Loading checkpoint shards:  62%|██████▎   | 10/16 [00:19<00:12,  2.03s/it]
Loading checkpoint shards:  69%|██████▉   | 11/16 [00:21<00:10,  2.06s/it]
Loading checkpoint shards:  75%|███████▌  | 12/16 [00:23<00:08,  2.15s/it]
Loading checkpoint shards:  81%|████████▏ | 13/16 [00:25<00:06,  2.10s/it]
Loading checkpoint shards:  88%|████████▊ | 14/16 [00:27<00:04,  2.08s/it]
Loading checkpoint shards:  94%|█████████▍| 15/16 [00:29<00:02,  2.07s/it]
Loading checkpoint shards: 100%|██████████| 16/16 [00:30<00:00,  1.56s/it]
Loading checkpoint shards: 100%|██████████| 16/16 [00:30<00:00,  1.90s/it]

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  30.53 B 
fwd MACs:                                                               389.33 GMACs
fwd FLOPs:                                                              778.7 GFLOPS
fwd+bwd MACs:                                                           1.17 TMACs
fwd+bwd FLOPs:                                                          2.34 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

Qwen3MoeForCausalLM(
  30.53 B = 100% Params, 389.33 GMACs = 100% MACs, 778.7 GFLOPS = 100% FLOPs
  (model): Qwen3MoeModel(
    30.22 B = 98.98% Params, 349.5 GMACs = 89.77% MACs, 699.04 GFLOPS = 89.77% FLOPs
    (embed_tokens): Embedding(311.16 M = 1.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 151936, 2048)
    (layers): ModuleList(
      (0): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (1): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (2-4): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (5): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (6-15): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (16): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (17-52): 36 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (53): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (54-67): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (68): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (69-83): 15 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (84): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (85-113): 29 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (114): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (115-118): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (119): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (120-127): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (1): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-3): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (4-5): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (6-54): 49 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (55): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (56-67): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (68): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (69-81): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (82): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (83-89): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (90): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (91-113): 23 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (114): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (115-118): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (119): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (120-127): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (2): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-33): 34 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (34): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (35): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (36): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (37-40): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-76): 35 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (77): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (78-82): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (83): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (84-91): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (92): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (93-112): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (113): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (114-123): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (124): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (125-127): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (3): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-11): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (12): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (13-42): 30 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (43): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (44-81): 38 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (82): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (83-84): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (85): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (86-106): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (107-108): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (109): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (110): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (111-117): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (118): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (119-127): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (4): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-14): 15 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (15-16): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (17-18): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (19): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (20-42): 23 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (43): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (44-63): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (64): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (65-70): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (71): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (72-83): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (84): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (85-125): 41 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (126): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (127): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (5): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-7): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (8): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (9-25): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-37): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-45): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (46): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (47-80): 34 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (81): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (82-87): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (88): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (89-97): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (98): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (99): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (100): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (101-127): 27 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (6): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-31): 32 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (32): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (33-49): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (50): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (51-68): 18 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (69): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (70): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (71): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (72-73): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (74): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (75-90): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (91): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (92-102): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (103): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (104-107): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (108): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (109-127): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (7): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (1-27): 27 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (28): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (29-37): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-62): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (63): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (64-92): 29 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (93-94): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (95-114): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (115): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (116-117): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (118): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (119-127): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (8): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-2): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (3): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (4-9): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11-37): 27 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-59): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (60): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (61-75): 15 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (76): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (77-82): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (83): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (84-87): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (88): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (89-95): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (96): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (97-127): 31 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (9): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-9): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11-38): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (39): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (40-63): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (64): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (65-68): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (69): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (70-74): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (75): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (76): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (77): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (78-84): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (85): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (86-122): 37 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (123): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (124-127): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (10): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (1-37): 37 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-50): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (51): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (52-68): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (69-70): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (71-83): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (84): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (85-96): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (97): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (98-103): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (104): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (105-127): 23 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (11): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (1-3): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (4): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (5-8): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (9): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (10-13): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (14): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (15-35): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (36): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (37-38): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (39): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (40-69): 30 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (70): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (71-95): 25 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (96): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (97-127): 31 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (12): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-9): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11-23): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (24): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (25-30): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (31): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (32-40): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-87): 46 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (88): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (89-100): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (101): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (102-107): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (108): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (109-112): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (113): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (114-127): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (13): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-6): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (7): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (8-12): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (13-14): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (15-19): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (20): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (21-34): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (35): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (36-61): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (62): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (63-103): 41 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (104): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (105-109): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (110): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (111-127): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (14): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (1): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (2-13): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (14): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (15-25): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-54): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (55): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (56-60): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (61): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (62-90): 29 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (91): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (92-102): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (103): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (104-116): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (117): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (118-127): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (15): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-4): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (5): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (6-23): 18 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (24): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (25-31): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (32): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (33-41): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (42): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (43-48): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (49): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (50-65): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (66): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (67-72): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (74-105): 32 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (106): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (107-127): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (16): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-20): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (21): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (22-26): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (27-28): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (29-40): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-52): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (53): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (54-72): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73-74): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (75-86): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (87): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (88-127): 40 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (17): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-21): 22 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (22-23): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (24-25): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-29): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (30): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (31-47): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (48): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (49-72): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (74-94): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (95): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (96-123): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (124): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (125-127): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (18): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-31): 32 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (32): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (33-47): 15 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (48): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (49-65): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (66): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (67): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (68): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (69-89): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (90): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (91-92): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (93): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (94-102): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (103): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (104-106): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (107): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (108-127): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (19): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-6): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (7): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (8): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (9): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (10-17): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (18): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (19-27): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (28): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (29-67): 39 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (68): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (69-104): 36 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (105): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (106-114): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (115): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (116-117): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (118): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (119-127): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (20): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-2): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (3): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (4-25): 22 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-33): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (34): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (35-37): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-57): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (58): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (59-87): 29 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (88): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (89-90): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (91): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (92-95): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (96): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (97-127): 31 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (21): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-11): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (12-13): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (14-25): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-31): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (32): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (33-60): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (61): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (62-68): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (69): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (70-74): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (75): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (76-84): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (85): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (86-127): 42 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (22): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (1-20): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (21): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (22-37): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-50): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (51): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (52-67): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (68-70): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (71-116): 46 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (117): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (118-127): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (23): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-2): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (3-4): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (5-6): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (7): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (8-27): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (28): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (29-35): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (36): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (37-72): 36 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (74-79): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (80): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (81-98): 18 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (99): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (100-127): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (24): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-3): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (4): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (5-7): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (8): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (9): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11-23): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (24): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (25-48): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (49): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (50-64): 15 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (65): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (66-77): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (78): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (79-104): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (105): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (106-127): 22 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (25): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (1-19): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (20): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (21-34): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (35): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (36-61): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (62): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (63-84): 22 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (85): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (86-91): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (92): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (93-103): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (104): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (105-109): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (110): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (111-127): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (26): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-25): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-44): 18 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (45): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (46-58): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (59): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (60-66): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (67): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (68-69): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (70): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (71-97): 27 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (98): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (99-102): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (103): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (104-116): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (117): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (118-127): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (27): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-4): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (5): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (6-18): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (19): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (20-31): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (32): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (33-40): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-58): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (59): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (60-72): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (74-109): 36 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (110): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (111-113): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (114): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (115-127): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (28): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-26): 27 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (27): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (28-33): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (34): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (35-40): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-52): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (53): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (54-69): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (70): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (71-72): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73-74): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (75-92): 18 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (93): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (94-127): 34 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (29): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-15): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (16): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (17): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (18): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (19-22): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (23): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (24-25): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-28): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (29): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (30-47): 18 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (48): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (49-72): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (74-104): 31 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (105): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (106-127): 22 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (30): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-25): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26-27): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (28-47): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (48): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (49-65): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (66): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (67-73): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (74): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (75-78): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (79): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (80-89): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (90): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (91-102): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (103): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (104-127): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (31): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-6): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (7): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (8-27): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (28): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (29-67): 39 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (68): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (69-70): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (71): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (72-87): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (88): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (89-114): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (115): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (116-117): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (118): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (119-124): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (125): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (126-127): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (32): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-16): 17 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (17): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (18-37): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-41): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (42): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (43-62): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (63): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (64-95): 32 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (96): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (97-110): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (111): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (112-118): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (119): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (120-125): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (126): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (127): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (33): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-9): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (12): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (13-31): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (32): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (33-60): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (61): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (62-63): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (64): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (65-68): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (69): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (70-74): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (75): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (76-86): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (87): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (88-127): 40 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (34): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (1-21): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (22): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (23-31): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (32): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (33-37): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-50): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (51): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (52-62): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (63): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (64-96): 33 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (97-98): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (99-127): 29 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (35): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-6): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (7): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (8-15): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (16): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (17-27): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (28): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (29-35): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (36): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (37-40): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-98): 57 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (99): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (100-107): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (108): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (109-112): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (113): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (114-127): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (36): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-9): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11-23): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (24): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (25-40): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-55): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (56): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (57-64): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (65): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (66-97): 32 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (98): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (99-100): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (101): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (102-104): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (105): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (106-127): 22 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (37): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-28): 29 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (29): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (30-34): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (35): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (36-69): 34 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (70): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (71-91): 21 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (92): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (93-99): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (100): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (101-103): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (104): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (105-109): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (110-111): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (112-127): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (38): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (1): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (2-11): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (12): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (13-44): 32 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (45): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (46-54): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (55): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (56-60): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (61): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (62-76): 15 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (77): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (78-91): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (92): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (93-116): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (117): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (118-127): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (39): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-40): 41 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-72): 31 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (74-85): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (86): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (87-91): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (92-93): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (94-112): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (113): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (114-121): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (122-123): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (124-127): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (40): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-15): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (16): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (17-38): 22 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (39): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (40-58): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (59): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (60-63): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (64): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (65-73): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (74): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (75): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (76): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (77-92): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (93): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (94-96): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (97): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (98-127): 30 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (41): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-24): 25 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (25): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (26-28): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (29): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (30-47): 18 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (48): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (49-72): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (73): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (74-78): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (79): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (80-92): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (93): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (94-104): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (105): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (106-107): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (108): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (109-127): 19 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (42): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-4): 5 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (5): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (6-16): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (17): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (18-40): 23 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (41): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (42-51): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (52): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (53-79): 27 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (80): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (81-109): 29 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (110): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (111): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (112-113): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (114-127): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (43): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-9): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11-16): 6 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (17): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (18-77): 60 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (78): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (79-82): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (83): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (84-95): 12 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (96): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (97): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (98): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (99): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (100): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (101-126): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (127): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (44): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-9): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (10): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (11-38): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (39-40): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (41-81): 41 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (82): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (83-91): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (92): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (93-95): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (96): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (97-103): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (104): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (105-118): 14 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (119): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (120-127): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (45): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-25): 26 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (26): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (27-37): 11 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-62): 24 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (63-64): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (65-68): 4 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (69): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (70-85): 16 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (86): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (87-88): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (89): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (90-117): 28 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (118): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (119-127): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (46): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-24): 25 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (25): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (26-33): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (34): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (35-37): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (38): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (39-73): 35 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (74-75): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (76-95): 20 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (96): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (97-103): 7 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (104): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (105-117): 13 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (118): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (119-127): 9 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
      (47): Qwen3MoeDecoderLayer(
        623.12 M = 2.04% Params, 7.28 GMACs = 1.87% MACs, 14.56 GFLOPS = 1.87% FLOPs
        (self_attn): Qwen3MoeAttention(
          18.87 M = 0.06% Params, 2.42 GMACs = 0.62% MACs, 4.83 GFLOPS = 0.62% FLOPs
          (q_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(1.05 M = 0% Params, 134.22 MMACs = 0.03% MACs, 268.44 MFLOPS = 0.03% FLOPs, in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(8.39 M = 0.03% Params, 1.07 GMACs = 0.28% MACs, 2.15 GFLOPS = 0.28% FLOPs, in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
          (k_norm): Qwen3MoeRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (128,), eps=1e-06)
        )
        (mlp): Qwen3MoeSparseMoeBlock(
          604.24 M = 1.98% Params, 4.87 GMACs = 1.25% MACs, 9.73 GFLOPS = 1.25% FLOPs
          (gate): Linear(262.14 K = 0% Params, 33.55 MMACs = 0.01% MACs, 67.11 MFLOPS = 0.01% FLOPs, in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-1): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (2): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (3-57): 55 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (58): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (59-60): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (61): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (62-76): 15 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (77): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (78-100): 23 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (101): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (102-111): 10 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (112): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (113-115): 3 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (116): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (117-124): 8 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
            (125): Qwen3MoeMLP(
              4.72 M = 0.02% Params, 603.98 MMACs = 0.16% MACs, 1.21 GFLOPS = 0.16% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 201.33 MMACs = 0.05% MACs, 402.65 MFLOPS = 0.05% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 98.3 KFLOPS = 0% FLOPs)
            )
            (126-127): 2 x Qwen3MoeMLP(
              4.72 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
              (gate_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(1.57 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            )
          )
        )
        (input_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
      )
    )
    (norm): Qwen3MoeRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (2048,), eps=1e-06)
    (rotary_emb): Qwen3MoeRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (lm_head): Linear(311.16 M = 1.02% Params, 39.83 GMACs = 10.23% MACs, 79.66 GFLOPS = 10.23% FLOPs, in_features=2048, out_features=151936, bias=False)
)
---------------------------------------------------------------------------------------------------
Qwen3-30B-A3B-Instruct-2507 FLOPs:778.7 GFLOPS   MACs:389.33 GMACs   Params:30.53 B 


  

deepseek-vl-v2(16B-A2B)

总体：

| 总权重 | vision | projector | language | | | —— | ——– | ——— | ——– | — | | 16.15B | 428.23 M | 13.64 M | 15.71 B | | language:

共27层（0~26），第0层没有多专家：

| total | embeddings | layer 0 | layer 1~26 | lm head | | ——- | ———- | ——- | ———- | ——– | | 15.71 B | 209.72 M | 81.01 M | 584.85 M | 209.72 M | layer 0:

total	self_attn	MLP
81.01 M	13.76 M	67.24 M
	Wq: 6.29M	Gate: 22.41 M
	kv_a: 1.18 M	UP: 22.41 M
	kv_b:2.1 M	Down:22.41 M
	Wo: 4.19M

layer 1~26:

self_attn与layer 0相同，MLP总参数量571.08 M， 64个专家，每个8.65 M，两个共享专家，每个也是8.65 M。即共66个专家。

activation:

每次激活6个专家。一层共51.9M参数，26层共1.349B。 layer 1~26的self_attn共0.358B. layer 0共0.081B。 lm head: 0.2B。加一起共1.987B，差不多2B。

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Python version is above 3.10, patching the collections module.
Add pad token = ['<｜▁pad▁｜>'] to the tokenizer
<｜▁pad▁｜>:100002
Add image token = ['<image>'] to the tokenizer
<image>:100003
Add grounding-related tokens = ['<|ref|>', '<|/ref|>', '<|det|>', '<|/det|>', '<|grounding|>'] to the tokenizer with input_ids
<|ref|>:100004
<|/ref|>:100005
<|det|>:100006
<|/det|>:100007
<|grounding|>:100008
Add chat tokens = ['<|User|>', '<|Assistant|>'] to the tokenizer with input_ids
<|User|>:100009
<|Assistant|>:100010


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:14<00:43, 14.58s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:30<00:30, 15.33s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:47<00:16, 16.30s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [01:01<00:00, 15.05s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [01:01<00:00, 15.26s/it]

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  16.15 B 
fwd MACs:                                                               317.84 GMACs
fwd FLOPs:                                                              642.98 GFLOPS
fwd+bwd MACs:                                                           953.53 GMACs
fwd+bwd FLOPs:                                                          1.93 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module caculated is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, FLOPS, percentage of total FLOPs

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). 
 They are not counted as submodules in calflops and not to be printed out. However they make up the difference between a parent's MACs and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

DeepseekVLV2ForCausalLM(
  16.15 B = 100% Params, 317.84 GMACs = 100% MACs, 642.98 GFLOPS = 100% FLOPs
  (vision): VisionTransformer(
    428.23 M = 2.65% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
    (patch_embed): PatchEmbed(
      678.53 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
      (proj): Conv2d(678.53 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 3, 1152, kernel_size=(14, 14), stride=(14, 14))
      (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
    )
    (pos_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
    (patch_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
    (norm_pre): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
    (blocks): Sequential(
      411.47 M = 2.55% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
      (0): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (1): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (2): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (3): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (4): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (5): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (6): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (7): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (8): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (9): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (10): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (11): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (12): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (13): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (14): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (15): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (16): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (17): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (18): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (19): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (20): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (21): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (22): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (23): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (24): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (25): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
      (26): Block(
        15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (norm1): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          5.31 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (qkv): Linear(3.98 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=3456, bias=True)
          (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (attn_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
          (proj_drop): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (ls1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path1): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (norm2): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
          (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
          (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='tanh')
          (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
          (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
          (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        )
        (ls2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (drop_path2): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      )
    )
    (norm): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
    (attn_pool): AttentionPoolLatent(
      15.24 M = 0.09% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
      (q): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
      (kv): Linear(2.66 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=2304, bias=True)
      (q_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      (k_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
      (proj): Linear(1.33 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=1152, bias=True)
      (proj_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
      (norm): LayerNorm(2.3 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, (1152,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        9.92 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
        (fc1): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1152, out_features=4304, bias=True)
        (act): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='none')
        (drop1): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
        (norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        (fc2): Linear(4.96 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4304, out_features=1152, bias=True)
        (drop2): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
      )
    )
    (fc_norm): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
    (head_drop): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, p=0.0, inplace=False)
    (head): Identity(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
  )
  (projector): MlpProjector(
    13.64 M = 0.08% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
    (layers): Sequential(
      13.64 M = 0.08% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
      (0): Linear(9.44 M = 0.06% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=4608, out_features=2048, bias=True)
      (1): GELU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, approximate='none')
      (2): Linear(4.2 M = 0.03% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=2048, bias=True)
    )
  )
  (language): DeepseekV2ForCausalLM(
    15.71 B = 97.26% Params, 317.84 GMACs = 100% MACs, 642.98 GFLOPS = 100% FLOPs
    (model): DeepseekV2Model(
      15.5 B = 95.97% Params, 291 GMACs = 91.55% MACs, 589.29 GFLOPS = 91.65% FLOPs
      (embed_tokens): Embedding(209.72 M = 1.3% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, 102400, 2048)
      (layers): ModuleList(
        (0): DeepseekV2DecoderLayer(
          81.01 M = 0.5% Params, 10.52 GMACs = 3.31% MACs, 21.31 GFLOPS = 3.31% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MLP(
            67.24 M = 0.42% Params, 8.61 GMACs = 2.71% MACs, 17.21 GFLOPS = 2.68% FLOPs
            (gate_proj): Linear(22.41 M = 0.14% Params, 2.87 GMACs = 0.9% MACs, 5.74 GFLOPS = 0.89% FLOPs, in_features=2048, out_features=10944, bias=False)
            (up_proj): Linear(22.41 M = 0.14% Params, 2.87 GMACs = 0.9% MACs, 5.74 GFLOPS = 0.89% FLOPs, in_features=2048, out_features=10944, bias=False)
            (down_proj): Linear(22.41 M = 0.14% Params, 2.87 GMACs = 0.9% MACs, 5.74 GFLOPS = 0.89% FLOPs, in_features=10944, out_features=2048, bias=False)
            (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.4 MFLOPS = 0% FLOPs)
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (1): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-6): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (7): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (8-12): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (13): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (14): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (15): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (16-25): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (26): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (27-30): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (31): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (32-42): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (43): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (44): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (45): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (46-50): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (51): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (52-56): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (57): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (58-59): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (60): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (61-63): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (2): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (1-2): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (3): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (4-7): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (8): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (9-13): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (14): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (15-26): 12 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (27): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (28-34): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (35): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (36-41): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (42): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (43-45): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (47-49): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (50): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (51-52): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (53): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (54-56): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (57): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (58-62): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (3): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-4): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (5): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (6): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (7): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (8-10): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (12-13): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (14-15): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (16-32): 17 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (33): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (34-46): 13 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (48): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (49-53): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (54): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (55-57): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (58): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (59-61): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (62): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (4): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-1): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (2): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (3-13): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (14): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (15-17): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (18): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (19-20): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (21): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (22-24): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (25-26): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (27-28): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (29): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (30-32): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (33): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (34): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (35-36): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (37): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (38-48): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (49): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (50-58): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (59): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (60-63): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (5): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-8): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (9): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (10-14): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (15): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (16-17): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (18): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (19): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (20): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (21-37): 17 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (38-39): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (40): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (41): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (42-46): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (48-51): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (52): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (53): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (54-55): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (56-62): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (6): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-12): 13 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (13): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (14-16): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (18-20): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (21-22): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (23-31): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (32): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (33-36): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (37): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (38): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (39): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (40): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (41): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (42-45): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (48-54): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (55-56): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (57-63): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (7): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-7): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (8): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (9-15): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (16): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (18): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (19-27): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (28): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (29-30): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (31): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (32-37): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (38): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (39-40): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (41): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (42): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (43-44): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (45): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (47-49): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (50): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (51-57): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (58): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (59-63): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (8): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-7): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (8): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (9-10): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (12): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (13): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (14-29): 16 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (30): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (31-32): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (33): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (34-38): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (39): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (40-46): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (48-49): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (50): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (51): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (52): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (53): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (54): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (55-62): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (9): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (1-9): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (12): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (13): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (14-16): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (18-21): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (22): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (23-26): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (27): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (28-31): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (32): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (33-44): 12 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (45): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (46-49): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (50): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (51-62): 12 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (10): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-8): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (9): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (11-12): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (13): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (14-18): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (19): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (20-21): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (22): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (23-24): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (25): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (26): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (27): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (28): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (29-40): 12 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (41): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (42-54): 13 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (55): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (56-58): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (59): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (60-63): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (11): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-2): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (3): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (4): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (5-10): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (12-16): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17-18): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (19-28): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (29-30): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (31-33): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (34): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (35-39): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (40): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (41-42): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (43): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (44-48): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (49): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (50-60): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (61): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (62-63): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (12): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-3): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (4-5): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (6-11): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (12): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (13-17): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (18): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (19-22): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (23): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (24-27): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (28-29): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (30-39): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (40): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (41-46): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (48-58): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (59): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (60): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (61): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (62): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (13): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-5): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (6): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (7-13): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (14): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (15): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (16): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (18-23): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (24): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (25-28): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (29): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (30-44): 15 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (45): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (48): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (49-53): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (54): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (55-57): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (58): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (59-63): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (14): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (1-6): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (7): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (8-12): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (13): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (14-18): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (19): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (20-21): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (22-23): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (24-27): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (28): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (29-30): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (31): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (32-45): 14 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (48): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (49-50): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (51): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (52): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (53): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (54-63): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (15): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (1-4): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (5): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (6-13): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (14): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (15-19): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (20): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (21-22): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (23-24): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (25): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (26): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (27): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (28): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (29): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (30): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (31-42): 12 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (43): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (44-46): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (48-54): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (55): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (56-63): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (16): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-8): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (9): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (11-15): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (16): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (17-26): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (27): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (28-29): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (30): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (31-32): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (33): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (34-36): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (37): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (38-39): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (40): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (41-46): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (48-50): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (51): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (52-62): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (17): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (1-9): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (12-25): 14 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (26): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (27): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (28-29): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (30-31): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (32): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (33-37): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (38): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (39-40): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (41): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (42-58): 17 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (59): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (60-61): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (62): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (18): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-1): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (2): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (3-9): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (11-16): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17-18): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (19-29): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (30): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (31-42): 12 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (43): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (44-45): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (47-48): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (49): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (50-52): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (53): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (54-55): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (56): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (57): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (58): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (59): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (60-63): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (19): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (1): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (2-5): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (6): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (7): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (8): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (9-16): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (18-21): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (22): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (23): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (24): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (25-41): 17 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (42): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (43): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (44): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (45-53): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (54): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (55-57): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (58): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (59-63): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (20): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (1): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (2-14): 13 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (15): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (16-19): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (20-21): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (22-24): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (25): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (26-35): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (36): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (37): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (38): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (39-42): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (43-44): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (45-54): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (55): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (56): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (57-58): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (59): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (60-63): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (21): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (1): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (2-9): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (12): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (13-14): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (15): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (16): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (18): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (19): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (20): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (21): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (22-27): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (28): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (29-41): 13 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (42): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (43-53): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (54): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (55-63): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (22): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-2): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (3): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (4-7): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (8): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (9-10): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (12-16): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (18-31): 14 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (32): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (33-49): 17 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (50): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (51): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (52-54): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (55): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (56-57): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (58): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (59): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (60): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (61-62): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (63): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (23): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (1-2): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (3): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (4-7): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (8): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (9): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (10-15): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (16): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (17-22): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (23): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (24-31): 8 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (32): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (33-44): 12 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (45): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (48-53): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (54-55): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (56): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (57): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (58-63): 6 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (24): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-4): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (5): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (6-7): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (8): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (9): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (11-23): 13 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (24): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (25-39): 15 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (40): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (41): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (42): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (43-46): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (48-49): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (50): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (51-60): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (61): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (62-63): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (25): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (1): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (2): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (3-9): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (10): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (11): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (12-20): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (21): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (22-35): 14 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (36): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (37): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (38-39): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (40-44): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (45): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (46): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (47): DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (48): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (49-51): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (52): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (53-63): 11 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
        (26): DeepseekV2DecoderLayer(
          584.85 M = 3.62% Params, 10.79 GMACs = 3.39% MACs, 21.85 GFLOPS = 3.4% FLOPs
          (self_attn): DeepseekV2Attention(
            13.76 M = 0.09% Params, 1.91 GMACs = 0.6% MACs, 4.09 GFLOPS = 0.64% FLOPs
            (q_proj): Linear(6.29 M = 0.04% Params, 805.31 MMACs = 0.25% MACs, 1.61 GFLOPS = 0.25% FLOPs, in_features=2048, out_features=3072, bias=False)
            (kv_a_proj_with_mqa): Linear(1.18 M = 0.01% Params, 150.99 MMACs = 0.05% MACs, 301.99 MFLOPS = 0.05% FLOPs, in_features=2048, out_features=576, bias=False)
            (kv_a_layernorm): DeepseekV2RMSNorm(512 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
            (kv_b_proj): Linear(2.1 M = 0.01% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=512, out_features=4096, bias=False)
            (o_proj): Linear(4.19 M = 0.03% Params, 536.87 MMACs = 0.17% MACs, 1.07 GFLOPS = 0.17% FLOPs, in_features=2048, out_features=2048, bias=False)
            (rotary_emb): DeepseekV2RotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          )
          (mlp): DeepseekV2MoE(
            571.08 M = 3.54% Params, 8.88 GMACs = 2.79% MACs, 17.75 GFLOPS = 2.76% FLOPs
            (experts): ModuleList(
              (0-2): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (3): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (4-8): 5 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (9): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (10-16): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (17): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (18-20): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (21): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (22-25): 4 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (26): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (27-29): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (30): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (31-33): 3 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (34): DeepseekV2MLP(
                8.65 M = 0.05% Params, 8.65 MMACs = 0% MACs, 17.3 MFLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 2.88 MMACs = 0% MACs, 5.77 MFLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.41 KFLOPS = 0% FLOPs)
              )
              (35-43): 9 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (44-45): 2 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.1 GMACs = 0.35% MACs, 2.2 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 366.22 MMACs = 0.12% MACs, 732.43 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 178.82 KFLOPS = 0% FLOPs)
              )
              (46-55): 10 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
              (56): DeepseekV2MLP(
                8.65 M = 0.05% Params, 1.11 GMACs = 0.35% MACs, 2.21 GFLOPS = 0.34% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 369.1 MMACs = 0.12% MACs, 738.2 MFLOPS = 0.11% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 180.22 KFLOPS = 0% FLOPs)
              )
              (57-63): 7 x DeepseekV2MLP(
                8.65 M = 0.05% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs
                (gate_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (up_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=2048, out_features=1408, bias=False)
                (down_proj): Linear(2.88 M = 0.02% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs, in_features=1408, out_features=2048, bias=False)
                (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
              )
            )
            (gate): MoEGate(131.07 K = 0% Params, 16.78 MMACs = 0.01% MACs, 33.55 MFLOPS = 0.01% FLOPs)
            (shared_experts): DeepseekV2MLP(
              17.3 M = 0.11% Params, 2.21 GMACs = 0.7% MACs, 4.43 GFLOPS = 0.69% FLOPs
              (gate_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (up_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2048, out_features=2816, bias=False)
              (down_proj): Linear(5.77 M = 0.04% Params, 738.2 MMACs = 0.23% MACs, 1.48 GFLOPS = 0.23% FLOPs, in_features=2816, out_features=2048, bias=False)
              (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 360.45 KFLOPS = 0% FLOPs)
            )
          )
          (input_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
          (post_attention_layernorm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
        )
      )
      (norm): DeepseekV2RMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 0 FLOPS = 0% FLOPs)
    )
    (lm_head): Linear(209.72 M = 1.3% Params, 26.84 GMACs = 8.45% MACs, 53.69 GFLOPS = 8.35% FLOPs, in_features=2048, out_features=102400, bias=False)
  )
)
---------------------------------------------------------------------------------------------------
deepseek-vl2-small FLOPs:642.98 GFLOPS   MACs:317.84 GMACs   Params:16.15 B 


  

3B模型/8 V100

### model
#model_name_or_path: /mnt/bn/znzx-public/models/llama32/Llama-3.2-1B-Instruct
model_name_or_path: /root/Code/PythonScripts/custom_model_code/qwen2-3b-l46-2

### method
stage: pt
do_train: true
#train_from_scratch: true
finetuning_type: full
#resume_from_checkpoint: true
deepspeed: examples/deepspeed/ds_z3_config.json
#use_badam: false

logging_steps: 10
save_steps: 1000
save_total_limit: 5
num_train_epochs: 100
### dataset
dataset: "wiki_zh"
streaming: true
max_steps: 3000000
ignore_data_skip: true
eval_dataset: "wiki_zh"
template: qwen
cutoff_len: 1024
#max_samples: 50000
#overwrite_cache: true
fp16: true
preprocessing_num_workers: 16

### output
output_dir: /mnt/bn/znzx-public/lora/saves/custom_qwen3b_l46/
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 1

  

{
"bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 2304,
  "initializer_range": 0.02,
  "intermediate_size": 5760,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 18,
  "num_hidden_layers": 46,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "float16",
  "transformers_version": "4.44.2",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 151936
}
  

推理速度：23988/119