With all of MiniMind's building blocks now implemented, we can assemble them into the full MiniMind model.
The architecture diagram of MiniMind (Dense/MoE) is shown here:


First, we build the basic module MiniMindBlock, which corresponds to "Transformer Layer k" in the MiniMind architecture diagram:
```python
class MiniMindBlock(nn.Module):
    def __init__(self, layer_id: int, config):
        super().__init__()
        self.num_attention_heads = config.num_attention_heads
        self.hidden_size = config.hidden_size
        self.head_dim = config.hidden_size // config.num_attention_heads

        self.self_attn = Attention(config)
        self.layer_id = layer_id
        self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        # Dense FFN or Mixture-of-Experts FFN, depending on the config
        self.mlp = FeedForward(config) if not config.use_moe else MOEFeedForward(config)

    def forward(self, hidden_states, position_embeddings,
                past_key_value=None, use_cache=False, attention_mask=None):
        # Pre-norm residual connection around self-attention
        residual = hidden_states
        hidden_states, present_key_value = self.self_attn(
            self.input_layernorm(hidden_states),
            position_embeddings,
            past_key_value,
            use_cache,
            attention_mask
        )
        hidden_states = hidden_states + residual
        # Pre-norm residual connection around the feed-forward layer
        normed_hidden = self.post_attention_layernorm(hidden_states)
        hidden_states = hidden_states + self.mlp(normed_hidden)
        return hidden_states, present_key_value
```
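The pre-norm residual pattern used in the block above can be sketched in isolation. The following toy module is illustrative only: `nn.LayerNorm` stands in for RMSNorm, and plain `nn.Linear` layers stand in for the attention and FFN sub-modules, so that the sketch is self-contained.

```python
import torch
import torch.nn as nn

class ToyPreNormBlock(nn.Module):
    """Illustrative pre-norm residual block; NOT the actual MiniMind code.
    LayerNorm stands in for RMSNorm; Linear stands in for attention/FFN."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mix = nn.Linear(dim, dim)   # placeholder for self-attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)   # placeholder for FeedForward / MoE

    def forward(self, x):
        x = x + self.mix(self.norm1(x))  # residual 1: x + Attn(Norm(x))
        x = x + self.mlp(self.norm2(x))  # residual 2: x + MLP(Norm(x))
        return x

block = ToyPreNormBlock(8)
out = block(torch.randn(2, 4, 8))
print(out.shape)  # torch.Size([2, 4, 8]) -- shape is preserved through the block
```

Note that normalization is applied *before* each sub-module while the residual bypasses it; this "pre-norm" arrangement (used by LLaMA-style models) tends to train more stably than post-norm.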
Now we assemble MiniMindModel:
```python
class MiniMindModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.vocab_size, self.num_hidden_layers = config.vocab_size, config.num_hidden_layers
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.dropout = nn.Dropout(config.dropout)
        self.layers = nn.ModuleList([
            MiniMindBlock(l, config) for l in range(self.num_hidden_layers)
        ])
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        # Precompute the RoPE frequency tables once; register them as
        # non-persistent buffers so they move with the model but are not saved
        freqs_cos, freqs_sin = precompute_freqs_cis(
            dim=config.hidden_size // config.num_attention_heads,
            end=config.max_position_embeddings,
            omiga=config.rope_theta
        )
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

    def forward(self,
                input_ids: Optional[torch.Tensor] = None,
                attention_mask: Optional[torch.Tensor] = None,
                past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
                use_cache: bool = False,
                **kwargs):
        batch_size, seq_length = input_ids.shape
        past_key_values = past_key_values or [None] * len(self.layers)
        # Number of tokens already in the KV cache; new positions start after them
        start_pos = past_key_values[0][0].shape[1] if past_key_values[0] is not None else 0

        hidden_states = self.dropout(self.embed_tokens(input_ids))

        # Slice the precomputed RoPE tables to the positions of the new tokens
        position_embeddings = (
            self.freqs_cos[start_pos:start_pos + seq_length],
            self.freqs_sin[start_pos:start_pos + seq_length]
        )

        presents = []
        for layer_idx, (layer, past_key_value) in enumerate(zip(self.layers, past_key_values)):
            hidden_states, present = layer(
                hidden_states,
                position_embeddings,
                past_key_value=past_key_value,
                use_cache=use_cache,
                attention_mask=attention_mask
            )
            presents.append(present)

        hidden_states = self.norm(hidden_states)

        # Sum the load-balancing auxiliary losses from any MoE layers
        aux_loss = sum(
            layer.mlp.aux_loss
            for layer in self.layers
            if isinstance(layer.mlp, MOEFeedForward)
        )

        return hidden_states, presents, aux_loss
```
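The `start_pos` logic above is what makes incremental decoding work with RoPE: during prefill the whole prompt uses positions `0..seq_len-1`, while during decoding each new token must be rotated with the position *after* the cached tokens. A self-contained sketch (the table sizes `max_pos` and `head_dim` here are made-up values, not MiniMind's real config):

```python
import torch

# Build a toy RoPE frequency table, (max_pos, head_dim // 2)
max_pos, head_dim = 32, 8
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2) / head_dim))
freqs = torch.outer(torch.arange(max_pos), inv_freq)
freqs_cos, freqs_sin = torch.cos(freqs), torch.sin(freqs)

# Prefill: no cache yet, a 5-token prompt uses positions 0..4
start_pos, seq_length = 0, 5
assert freqs_cos[start_pos:start_pos + seq_length].shape == (5, head_dim // 2)

# Decode step: the cache already holds 5 tokens, so the single new
# token uses position 5 (start_pos = past_key_values[0][0].shape[1])
start_pos, seq_length = 5, 1
assert freqs_cos[start_pos:start_pos + seq_length].shape == (1, head_dim // 2)
```

If `start_pos` were always 0, every decoded token would be rotated as if it sat at the beginning of the sequence, breaking the relative-position property of RoPE.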
Note that, looking back at the MiniMind architecture diagram, the network we have built so far ends at the final RMSNorm layer; the Linear and Softmax layers on top have not been added yet.
Next, we wrap the backbone in a MiniMindForCausalLM class. This adapts the model to the causal language modeling task and improves its flexibility and compatibility for inference, training, and generation.
Although MiniMindModel already implements the Transformer backbone (the embedding layer, attention modules, and other core components), it only encodes input token IDs into hidden states; it is a pure backbone module.
MiniMindForCausalLM, by contrast, is a task-level wrapper: on top of the backbone it adds the output layer (the language modeling head, lm_head) and a standardized output structure, enabling direct token-level prediction.
```python
from transformers import PreTrainedModel, GenerationMixin, PretrainedConfig
from transformers.modeling_outputs import CausalLMOutputWithPast


class MiniMindForCausalLM(PreTrainedModel, GenerationMixin):
    def __init__(self, config: MiniMindConfig = None):
        self.config = config or MiniMindConfig()
        super().__init__(self.config)
        self.model = MiniMindModel(self.config)
        self.lm_head = nn.Linear(self.config.hidden_size, self.config.vocab_size, bias=False)
        # Weight tying: the output projection shares its weight matrix
        # with the token embedding of the backbone
        self.model.embed_tokens.weight = self.lm_head.weight
        self.OUT = CausalLMOutputWithPast()

    def forward(self,
                input_ids: Optional[torch.Tensor] = None,
                attention_mask: Optional[torch.Tensor] = None,
                past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
                use_cache: bool = True,
                logits_to_keep: Union[int, torch.Tensor] = 0,
                **args):
        h, past_kvs, aux_loss = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            use_cache=use_cache,
            **args
        )
        # Only project the positions we need: slice(-k, None) keeps the last
        # k positions (k = 0 keeps all of them)
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(h[:, slice_indices, :])
        self.OUT.__setitem__('last_hidden_state', h)
        self.OUT.__setitem__('logits', logits)
        self.OUT.__setitem__('aux_loss', aux_loss)
        self.OUT.__setitem__('past_key_values', past_kvs)
        return self.OUT
```
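Two details of this wrapper are worth isolating: weight tying and the `logits_to_keep` slice. The sketch below is illustrative (the sizes `vocab = 100`, `hidden = 16` are made up), but the mechanics are exactly those used above.

```python
import torch
import torch.nn as nn

vocab, hidden = 100, 16
embed = nn.Embedding(vocab, hidden)
lm_head = nn.Linear(hidden, vocab, bias=False)

# Weight tying: after this assignment, the embedding and the output
# projection share ONE Parameter, roughly halving head/embedding params
embed.weight = lm_head.weight
assert embed.weight is lm_head.weight

# logits_to_keep: slice(-k, None) keeps the last k positions.
# Python treats -0 as 0, so slice(-0, None) == slice(0, None) keeps ALL.
h = torch.randn(1, 7, hidden)                    # (batch, seq, hidden)
assert h[:, slice(-0, None), :].shape[1] == 7    # k = 0 -> all positions
assert h[:, slice(-1, None), :].shape[1] == 1    # k = 1 -> last token only

# During generation only the last position's logits are needed
logits = lm_head(h[:, slice(-1, None), :])
assert logits.shape == (1, 1, vocab)
```

Projecting only the last position avoids a `(batch, seq, vocab)` matmul at every decode step, which matters because the vocabulary projection is one of the most expensive operations in a small model.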
To plug seamlessly into Hugging Face's training, inference, and generation frameworks, MiniMindForCausalLM inherits from PreTrainedModel and GenerationMixin and uses the standard output structure CausalLMOutputWithPast, which provides the following compatibility:
- when training with transformers.Trainer, the logits and loss are recognized automatically;
- the .generate() method is available for text generation (incremental decoding, KV caching, temperature sampling, and so on);
- the structure stays consistent with LLaMA, GPT, and similar models, making it easy to migrate pretrained weights or reuse fine-tuning scripts;
- through the past_key_values interface, MiniMindForCausalLM supports KV caching during incremental inference, significantly speeding up generation.
In other words, thanks to this compatibility, much of the later code does not need to be implemented by hand again; we can directly call the interfaces Hugging Face already provides, which is convenient and efficient.
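As a concrete illustration of what inheriting GenerationMixin buys us, the snippet below uses a tiny, randomly initialized GPT-2 as a stand-in (the MiniMind classes themselves are not importable here; no pretrained weights are downloaded). Any model that derives from PreTrainedModel and GenerationMixin, including MiniMindForCausalLM, exposes the same `.generate()` interface.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random config: stand-in for MiniMindForCausalLM, chosen because
# GPT2LMHeadModel also inherits PreTrainedModel + GenerationMixin
config = GPT2Config(vocab_size=50, n_positions=32, n_embd=8, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config).eval()

prompt = torch.tensor([[1, 2, 3]])
out = model.generate(
    prompt,
    max_new_tokens=5,    # append 5 tokens to the 3-token prompt
    do_sample=False,     # greedy decoding
    use_cache=True,      # incremental inference with KV cache
    pad_token_id=0,
)
print(out.shape)  # torch.Size([1, 8]) -- prompt tokens plus generated tokens
```

Sampling strategies (temperature, top-k, top-p), beam search, and stopping criteria all come for free through the same call, which is exactly the compatibility the wrapper class is designed to obtain.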