This is the home of the SmallDoge family: a series of high-quality, ultra-fast small language models built on dynamic algorithms. All training details and code are publicly available in the small-doge repository. We have released:
As shown in the figure below, the sequence transformation part of the Doge architecture uses Dynamic Mask Attention, which can be understood as using self-attention related to the value states during training and a state space without past state decay during inference, to solve the problem of existing Transformers or SSMs getting lost in long text.
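To make the idea concrete, below is a minimal PyTorch sketch of causal attention gated by a mask computed from the token states. It is only an illustration under stated assumptions, not the repository's implementation: the `dt_proj` projection, the way the scores are broadcast into an additive attention bias, and all layer names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskAttentionSketch(nn.Module):
    """Sketch of attention gated by a dynamic mask derived from the input states.
    Layer names and the gating rule are illustrative assumptions."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # Hypothetical projection that scores each position for masking.
        self.dt_proj = nn.Linear(hidden_size, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        # Dynamic score per head and key position, computed from the token states.
        dyn = self.dt_proj(x).transpose(1, 2)                         # (B, H, T)
        dyn_bias = dyn.unsqueeze(2).expand(B, self.num_heads, T, T)   # broadcast over queries

        # Combine with a causal mask; disallowed keys are set to -inf.
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        attn_bias = dyn_bias.masked_fill(~causal, float("-inf"))

        # Standard scaled dot-product attention with the dynamic additive bias.
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.o_proj(out)
```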
The state transformation part of Doge uses a Cross Domain Mixture of Experts, which consists of dense linear layers and sparse embedding layers. Sparse parameters can be added on top of a dense weight checkpoint and trained further without retraining the entire model, which reduces the cost of continuously iterating on the model.
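The sketch below shows one way a dense MLP path can be combined with sparsely retrieved embedding experts. The routing scheme, the zero initialization of the sparse experts (so they can be attached to a dense checkpoint without changing its behavior at first), and all names are assumptions for illustration, not the actual CDMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDMoESketch(nn.Module):
    """Sketch of a dense MLP path plus a sparse path of embedding-style experts
    selected per token. Routing and layer names are illustrative assumptions."""

    def __init__(self, hidden_size: int, inter_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Dense path: a standard gated MLP (can be loaded from a dense checkpoint).
        self.up_proj = nn.Linear(hidden_size, inter_size)
        self.gate_proj = nn.Linear(hidden_size, inter_size)
        self.down_proj = nn.Linear(inter_size, hidden_size)
        # Sparse path: per-token routing over embedding-style experts.
        self.router = nn.Linear(hidden_size, num_experts)
        self.expert_embed = nn.Embedding(num_experts, hidden_size)  # sparse parameters
        # Initialize sparse experts to zero so adding them to a dense checkpoint
        # leaves the layer's output unchanged at the start of continued training.
        nn.init.zeros_(self.expert_embed.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dense contribution from the gated MLP.
        dense = self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

        # Sparse contribution: pick top-k experts per token and mix their embeddings.
        scores = self.router(x)                                  # (B, T, E)
        weights, idx = scores.topk(self.top_k, dim=-1)           # (B, T, k)
        weights = F.softmax(weights, dim=-1)
        picked = self.expert_embed(idx)                          # (B, T, k, H)
        sparse = (weights.unsqueeze(-1) * picked).sum(dim=-2)    # (B, T, H)

        return dense + sparse
```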
In addition, Doge uses RMSNorm and residual connections with learnable parameters to adapt the gradient range of deep models.
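Below is a small sketch of an RMSNorm layer together with a pre-norm residual branch scaled by a learnable weight; the per-channel parameterization of the residual weight is an illustrative assumption, not the exact form used in Doge.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Standard RMSNorm: rescale features by their root-mean-square."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class LearnableResidualBlock(nn.Module):
    """Pre-norm block whose shortcut is scaled by a learnable parameter,
    letting the model tune how much residual signal flows through with depth."""

    def __init__(self, hidden_size: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(hidden_size)
        self.sublayer = sublayer
        self.residual_weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = alpha * x + f(norm(x)): alpha adapts the gradient range of the shortcut.
        return self.residual_weight * x + self.sublayer(self.norm(x))
```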