How to Train Multiple LLM Sizes Simultaneously with NVIDIA Star Elastic
Introduction
Training a family of large language models (LLMs) typically requires a separate training run for each variant, a costly and time-consuming process. NVIDIA's Star Elastic method offers a way around this: it embeds several smaller model sizes inside a single checkpoint during one training run. This guide walks through the steps to implement Star Elastic, enabling you to extract 23B and 12B parameter submodels from a 30B parent without additional fine-tuning. All you need is a compatible base model, training data, and an understanding of knowledge distillation.

What You Need
- Base model: A hybrid Mamba–Transformer–MoE model (e.g., Nemotron Nano v3 with 30B total, 3.6B active parameters)
- Training data: Approximately 160B tokens (or comparable volume)
- Hardware: High-end GPUs (e.g., A100 or H100) with sufficient memory for the largest variant
- Software: PyTorch or another deep learning framework with Gumbel-Softmax support
- Knowledge: Familiarity with knowledge distillation (KD), MoE layers, and importance estimation techniques
Step 1: Select and Prepare Your Base Model
Start with a parent model that uses a modular architecture—preferably one with separable components like embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN layers. The Star Elastic method exploits these axes. If using Nemotron Nano v3, ensure it is trained to convergence as a non-elastified model first (this serves as the teacher for distillation).
Step 2: Compute Component Importance Scores
Evaluate each model component's contribution to accuracy. Use importance estimation on:
- Embedding channels
- Attention heads
- Mamba SSM heads and head channels
- MoE expert count (per layer)
- FFN intermediate dimensions
For MoE layers, employ Router-Weighted Expert Activation Pruning (REAP): rank experts by a combined signal of routing gate values and expert output magnitudes. This is more principled than naive frequency-based pruning.
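To make this concrete, here is a minimal REAP-style scoring sketch in PyTorch. The tensor shapes and the exact combination rule (routing probability times output norm, averaged over a calibration batch) are illustrative assumptions rather than the method's exact definition:

```python
import torch

def reap_expert_scores(gate_probs: torch.Tensor,
                       expert_outputs: torch.Tensor) -> torch.Tensor:
    """Rank MoE experts by router-weighted output magnitude (REAP-style).

    gate_probs:     (num_tokens, num_experts) routing probabilities
    expert_outputs: (num_tokens, num_experts, d_model) per-expert outputs
                    on a calibration batch (zero where an expert is unrouted)
    Returns a (num_experts,) score; higher means more important to keep.
    """
    magnitudes = expert_outputs.norm(dim=-1)      # (num_tokens, num_experts)
    return (gate_probs * magnitudes).mean(dim=0)  # average over tokens
```

The same pattern applies to the other axes: collect activation statistics on a small calibration set, then reduce them to a single score per head, channel, or expert.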
Step 3: Rank Components and Define Nested Subsets
Sort all components by the importance scores from Step 2. The key property of Star Elastic is nested weight-sharing: every smaller submodel uses a prefix of this ranking, i.e., the top-ranked components, so each submodel is contained in every larger one. Decide your target submodel sizes (e.g., 23B with 2.8B active, 12B with 2.0B active); pruning then automatically keeps the top-ranked components for each budget.
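The nesting itself is mechanical once scores exist. A short sketch, where `nested_keep_masks` is a hypothetical helper:

```python
import torch

def nested_keep_masks(scores: torch.Tensor,
                      budgets: list[int]) -> dict[int, torch.Tensor]:
    """Build nested boolean keep-masks from importance scores.

    scores:  (num_components,) importance scores from Step 2
    budgets: how many components each submodel keeps, e.g. [64, 48, 32]
    Because every budget keeps a top-k prefix of the same ranking,
    smaller submodels are automatically contained in larger ones.
    """
    order = torch.argsort(scores, descending=True)  # best components first
    masks = {}
    for k in budgets:
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[order[:k]] = True
        masks[k] = mask
    return masks
```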
Step 4: Build a Learnable Router
Unlike fixed compression recipes, Star Elastic uses an end-to-end trainable router. The router accepts a one-hot encoded target budget (e.g., "2.8B active") and outputs differentiable masks that select which components to keep. Implement the router using Gumbel-Softmax to allow gradient flow through discrete architectural decisions. The masks are trained jointly with the model.
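Here is one way such a router could look. The class name, the per-budget logit table, and the straight-through estimator are illustrative assumptions, not necessarily the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticRouter(nn.Module):
    """Maps a one-hot budget ID to a differentiable keep/drop mask
    for one component axis (e.g., the experts of one MoE layer)."""

    def __init__(self, num_budgets: int, num_components: int):
        super().__init__()
        # One (keep, drop) logit pair per budget and component
        self.logits = nn.Parameter(torch.zeros(num_budgets, num_components, 2))

    def forward(self, budget_onehot: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Select the logit table for the requested budget
        logits = torch.einsum("b,bcd->cd", budget_onehot, self.logits)
        # Straight-through Gumbel-Softmax: hard 0/1 decisions in the
        # forward pass, soft gradients in the backward pass
        keep = F.gumbel_softmax(logits, tau=tau, hard=True)[:, 0]
        return keep  # (num_components,) mask of 0s and 1s

# Usage: mask = ElasticRouter(3, 64)(F.one_hot(torch.tensor(1), 3).float())
```

In practice you would also constrain each budget's mask to keep the right number of components and to nest inside the larger budgets, for instance by tying these logits to the ranking from Step 3.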
Step 5: Jointly Train the Model and Router with Knowledge Distillation
Take the non-elastified parent model (from Step 1) as the teacher. The loss function combines standard training loss with knowledge distillation (KD) loss—the teacher's soft targets guide the training of all nested submodels simultaneously. During training, the router learns to select optimal components for each budget. The total training uses roughly 160B tokens (as in the original paper). Ensure the optimizer handles both the model weights and the router parameters.
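The sketch below shows the joint objective. The `model(batch, masks=...)` interface, the per-step loop over budgets, and the loss weighting are assumptions for illustration; a real run would shard this across many GPUs:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence against the teacher's soft targets."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def train_step(model, teacher, router, optimizer, batch, labels,
               budget_onehots, kd_weight: float = 1.0):
    """One joint update covering all nested budgets."""
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_logits = teacher(batch)  # frozen non-elastified parent
    total = 0.0
    for onehot in budget_onehots:  # e.g. the 30B, 23B, 12B targets
        masks = router(onehot)     # keep-masks per elastic axis (assumed interface)
        logits = model(batch, masks=masks)
        s = logits.flatten(0, -2)                # (batch * seq, vocab)
        t_out = teacher_logits.flatten(0, -2)
        ce = F.cross_entropy(s, labels.flatten())
        total = total + ce + kd_weight * kd_loss(s, t_out)
    total.backward()  # gradients reach both model weights and router logits
    optimizer.step()
    return float(total)
```

Because the optimizer steps once over the summed per-budget losses, model weights and router parameters are updated together, which is what lets the router adapt component selection to the data.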

Step 6: Extract Submodels with Zero-Shot Slicing
After training, the checkpoint contains all nested submodels inside the parent. To extract a specific variant (e.g., 12B), simply apply the mask learned for that budget. No additional fine-tuning is needed—the submodels are ready for inference. The extraction is a one-time operation: slice the relevant weights from the parent checkpoint using the router's final masks.
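A minimal slicing sketch follows. It assumes every elastic axis is the leading dimension of its weight tensors and uses hypothetical parameter-name prefixes as mask keys; a real extractor needs per-axis dimension handling:

```python
import torch

def slice_checkpoint(parent_state: dict, masks: dict) -> dict:
    """Materialize a standalone submodel from the parent checkpoint.

    parent_state: parent weights, e.g. torch.load("parent.pt")
    masks: {parameter-name prefix -> boolean keep-mask} taken from the
           trained router's final decisions (hypothetical key scheme)
    """
    sub_state = {}
    for name, weight in parent_state.items():
        kept = weight
        for prefix, mask in masks.items():
            if name.startswith(prefix):
                kept = weight[mask]  # keep top-ranked slice along dim 0
                break
        sub_state[name] = kept.clone()
    return sub_state

# One-time extraction of, say, the 12B variant:
# masks_12b = {"layers.0.moe.experts": expert_mask, ...}  # illustrative keys
# torch.save(slice_checkpoint(torch.load("parent.pt"), masks_12b), "sub_12b.pt")
```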
Step 7: Validate and Deploy
Evaluate each extracted submodel on your target tasks. Because they share high-importance components, performance should be close to that of separately trained models. Deploy each submodel as a standalone checkpoint; they can be served independently or dynamically selected based on latency/accuracy trade-offs.
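To sanity-check the variants, comparing validation loss across budgets is often enough; `load_variant` below is a hypothetical loader for the extracted checkpoints:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_loss(model, val_batches) -> float:
    """Mean next-token cross-entropy over (input_ids, labels) pairs.
    Assumes model(input_ids) returns (batch, seq, vocab) logits."""
    model.eval()
    losses = []
    for input_ids, labels in val_batches:
        logits = model(input_ids)
        losses.append(F.cross_entropy(logits.flatten(0, -2),
                                      labels.flatten()).item())
    return sum(losses) / len(losses)

# for name in ["parent_30b", "sub_23b", "sub_12b"]:
#     print(name, validation_loss(load_variant(name), val_batches))
```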
Conclusion and Tips
Star Elastic removes the cost multiplier of training each LLM variant separately. Here are some tips for success:
- Multi-axis nesting: The method supports multiple axes—SSM, attention, MoE experts, FFN—so you can tune each dimension individually.
- Budget granularity: You are not limited to three sizes; the router can learn masks for many budgets simultaneously.
- Hardware considerations: The largest parent model requires the most memory, but the smaller variants can be extracted and served on cheaper hardware.
- Compare with alternatives: Unlike Minitron-style compression, the learnable router adapts to your specific data and task mix.
- Monitor router convergence: Use validation accuracy to ensure the router is not overfitting to one budget.
- Distillation strength: Tune the KD weight—too high may limit the smaller models' flexibility; too low may reduce quality.
For further details, refer to the original paper on arXiv (link placeholder).