
"The easy SaaS wave is over;
The next phase of wealth creation belongs to those with fundamental, systems-level AI expertise."

Merging American and Chinese AI models into one,
we created Rubin:
strong performance with efficient inference and training,
outpacing an earlier vanilla Transformer model by 52%.
Rubin Squirrel 0.5B
A lightweight 500 million parameter model built for research and education. Ideal for learning the fundamentals of training, fine-tuning, and deployment, it provides fast iteration speed and serves as a perfect entry point for exploring transformer-based architectures without heavy compute requirements.
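For orientation, here is a minimal sketch of loading a small causal LM with the Hugging Face transformers API. The repository id used below is a placeholder, not a confirmed release name; check brainoidlabs' release page for the actual one.

```python
# Minimal sketch: loading and sampling from a small causal LM with
# Hugging Face transformers. The model id is a hypothetical placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "brainoidlabs/rubin-squirrel-0.5b"  # placeholder id, not confirmed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The transformer architecture", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```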
Rubin Omni 2B
A medium-scale foundation model trained with pretraining, supervised fine-tuning (SFT), and GRPO (Group Relative Policy Optimization). Designed for practical applications, it offers strong generalization while remaining efficient, making it a balanced choice for academic and applied research.
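To illustrate what GRPO changes relative to PPO-style methods, here is a minimal sketch of its group-relative advantage computation: each prompt gets a group of sampled completions, and each completion's reward is normalized against the group's mean and standard deviation, so no learned value network is needed. This shows the general technique, not Rubin's training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each completion in a group is scored
    against the group mean and std, removing the need for a critic.

    rewards: [num_prompts, group_size] scalar rewards per completion.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt with a group of 4 sampled completions.
r = torch.tensor([[0.1, 0.9, 0.4, 0.6]])
print(grpo_advantages(r))
```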
Rubin Omega 8B
A large-scale model featuring Latent Head Attention, enabling improved representation learning and context management. This scaled version is optimized for both reasoning and generation tasks, delivering robust performance in complex workloads while remaining manageable on modern GPU clusters.
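brainoidlabs has not published the exact Latent Head Attention formulation, so the sketch below shows only the general idea such designs usually share (as in DeepSeek's MLA): keys and values are compressed through a small shared latent, and the cache stores that latent rather than full per-head K/V. All module and dimension names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Hypothetical sketch of latent attention: K/V are compressed into a
    small shared latent, then expanded per head, shrinking the KV cache.
    The exact Rubin LHA formulation is not published."""
    def __init__(self, d_model: int, d_latent: int, n_heads: int, d_head: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, x: torch.Tensor):
        b, t, _ = x.shape
        latent = self.down(x)  # [b, t, d_latent] -- this is what gets cached
        k = self.up_k(latent).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(latent).view(b, t, self.n_heads, self.d_head)
        return k, v, latent
```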
Rubin Dragon 24B [8x 3B]
Our flagship model built with a Mixture of Experts (MoE) architecture combined with DeepSeek's MLA and DSA, leading to more coherent, efficient, and explainable outputs. It scales to enterprise-grade workloads, supporting advanced research, production deployments, and cutting-edge applications.
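For readers new to MoE, a minimal top-k routing layer looks roughly like the sketch below: a linear router scores the experts per token, the top k are run, and their outputs are mixed by the softmaxed router weights. The expert count mirrors the 8x3B shape above, but everything else (hidden sizes, k=2) is an illustrative assumption, not Rubin's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k mixture-of-experts layer: a router picks k experts
    per token and combines their outputs by the routing weights."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                     # [b, t, n_experts]
        weights, idx = logits.topk(self.k, dim=-1)  # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e          # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```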
| Specs | Rubin 500M | Rubin 2B | Rubin 8B | Rubin 24B |
|---|---|---|---|---|
| Parameters | 500M | 2B | 8B | 24B |
| Training tokens | 26B | 800B | 1.6T | 2.7T |
| Training code | ✔ | ✔ | ✔ | ✔ |
| Inference code | ✔ | ✔ | ✔ | ✔ |
| Inference mode | KV cache | KV cache + sink | KV cache | KV cache |
| Optimizer | Muon with AdamW | Muon with AdamW | Muon with AdamW | Custom Muon with AdamW |
| LR scheduler | Trapezoidal with cosine annealing | Trapezoidal | Warmup with cosine annealing | Cosine annealing |
| Attention | Grouped Query Attention | Grouped Query Attention | Grouped Query Attention with LHA | Sliding Window with MoE & LHA |
| Positional encoding | Standard RoPE | YaRN RoPE | YaRN RoPE | YaRN RoPE |
| Normalisation | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
| Model weights | Available | Available | To be released soon | To be released soon |
| GPUs used | 1x H100 SXM | 8x H200 SXM | 4x B200 | 32x B200 |
| End-to-end training | ✔ | ✔ | ✔ | ✔ |
| Purpose | Research | Production | Production | Production |
| Local inference | ✔ | ✔ | — | — |
| Model type | Base model | Base model + SFT + GRPO | Instruction model [SFT] | To be released soon |
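The table lists trapezoidal and cosine-annealing LR schedules; since the exact hyperparameters are not published, the sketch below shows one common way to combine them: linear warmup into a long plateau that ends in a cosine tail. The warmup and decay fractions are illustrative assumptions, not Rubin's actual settings.

```python
import math

def trapezoidal_lr(step: int, total: int, peak_lr: float,
                   warmup_frac: float = 0.1, decay_frac: float = 0.2) -> float:
    """Assumed shape of a trapezoidal schedule with a cosine tail:
    linear warmup, a flat plateau at peak_lr, then cosine annealing to zero."""
    warmup = int(total * warmup_frac)
    decay_start = total - int(total * decay_frac)
    if step < warmup:            # linear ramp up
        return peak_lr * step / max(warmup, 1)
    if step < decay_start:       # plateau
        return peak_lr
    # cosine anneal over the final segment
    progress = (step - decay_start) / max(total - decay_start, 1)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```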

Welcome to the age of intelligence;
your ability to create intelligent digital beings starts with brainoidlabs.