How to Train Multiple LLM Sizes Simultaneously with NVIDIA Star Elastic
Introduction
Training a family of large language models (LLMs) typically requires a separate training run for each variant, a costly and time-consuming process. NVIDIA's Star Elastic method offers a way around this: it embeds several smaller model sizes inside a single checkpoint during one training run. This guide walks through the steps to implement Star Elastic, enabling you to extract 23B and 12B parameter submodels from a 30B parent without additional fine-tuning. All you need is a compatible base model, training data, and an understanding of knowledge distillation.

What You Need
- Base model: A hybrid Mamba–Transformer–MoE model (e.g., Nemotron Nano v3 with 30B total, 3.6B active parameters)
- Training data: Approximately 160B tokens (or comparable volume)
- Hardware: High-end GPUs (e.g., A100 or H100) with sufficient memory for the largest variant
- Software: PyTorch or another deep learning framework with Gumbel-Softmax support
- Knowledge: Familiarity with knowledge distillation (KD), MoE layers, and importance estimation techniques
Step 1: Select and Prepare Your Base Model
Start with a parent model that uses a modular architecture—preferably one with separable components like embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN layers. The Star Elastic method exploits these axes. If using Nemotron Nano v3, ensure it is trained to convergence as a non-elastified model first (this serves as the teacher for distillation).
Step 2: Compute Component Importance Scores
Evaluate each model component's contribution to accuracy. Use importance estimation on:
- Embedding channels
- Attention heads
- Mamba SSM heads and head channels
- MoE expert count (per layer)
- FFN intermediate dimensions
For MoE layers, employ Router-Weighted Expert Activation Pruning (REAP): rank experts by a combined signal of routing gate values and expert output magnitudes. This is more principled than naive frequency-based pruning.
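To make this concrete, here is a minimal REAP-style scoring sketch in PyTorch. The tensor shapes and the exact combination rule (routing probability times output norm, averaged over a calibration batch) are illustrative assumptions rather than the method's exact definition:

```python
import torch

def reap_expert_scores(gate_probs: torch.Tensor,
                       expert_outputs: torch.Tensor) -> torch.Tensor:
    """Rank MoE experts by router-weighted output magnitude (REAP-style).

    gate_probs:     (num_tokens, num_experts) routing probabilities
    expert_outputs: (num_tokens, num_experts, d_model) per-expert outputs
                    on a calibration batch (zero where an expert is unrouted)
    Returns a (num_experts,) score; higher means more important to keep.
    """
    magnitudes = expert_outputs.norm(dim=-1)      # (num_tokens, num_experts)
    return (gate_probs * magnitudes).mean(dim=0)  # average over tokens
```

The same pattern applies to the other axes: collect activation statistics on a small calibration set, then reduce them to a single score per head, channel, or expert.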
Step 3: Rank Components and Define Nested Subsets
Sort all components by the importance scores from Step 2. The key property of Star Elastic is nested weight-sharing: every smaller submodel uses a prefix of this ranking, i.e., the top-ranked components, so each submodel is contained in every larger one. Decide your target submodel sizes (e.g., 23B with 2.8B active, 12B with 2.0B active); pruning then automatically keeps the top-ranked components for each budget.
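The nesting itself is mechanical once scores exist. A short sketch, where `nested_keep_masks` is a hypothetical helper:

```python
import torch

def nested_keep_masks(scores: torch.Tensor,
                      budgets: list[int]) -> dict[int, torch.Tensor]:
    """Build nested boolean keep-masks from importance scores.

    scores:  (num_components,) importance scores from Step 2
    budgets: how many components each submodel keeps, e.g. [64, 48, 32]
    Because every budget keeps a top-k prefix of the same ranking,
    smaller submodels are automatically contained in larger ones.
    """
    order = torch.argsort(scores, descending=True)  # best components first
    masks = {}
    for k in budgets:
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[order[:k]] = True
        masks[k] = mask
    return masks
```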
Step 4: Build a Learnable Router
Unlike fixed compression recipes, Star Elastic uses an end-to-end trainable router. The router accepts a one-hot encoded target budget (e.g., "2.8B active") and outputs differentiable masks that select which components to keep. Implement the router using Gumbel-Softmax to allow gradient flow through discrete architectural decisions. The masks are trained jointly with the model.
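Here is one way such a router could look. The class name, the per-budget logit table, and the straight-through estimator are illustrative assumptions, not necessarily the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticRouter(nn.Module):
    """Maps a one-hot budget ID to a differentiable keep/drop mask
    for one component axis (e.g., the experts of one MoE layer)."""

    def __init__(self, num_budgets: int, num_components: int):
        super().__init__()
        # One (keep, drop) logit pair per budget and component
        self.logits = nn.Parameter(torch.zeros(num_budgets, num_components, 2))

    def forward(self, budget_onehot: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Select the logit table for the requested budget
        logits = torch.einsum("b,bcd->cd", budget_onehot, self.logits)
        # Straight-through Gumbel-Softmax: hard 0/1 decisions in the
        # forward pass, soft gradients in the backward pass
        keep = F.gumbel_softmax(logits, tau=tau, hard=True)[:, 0]
        return keep  # (num_components,) mask of 0s and 1s

# Usage: mask = ElasticRouter(3, 64)(F.one_hot(torch.tensor(1), 3).float())
```

In practice you would also constrain each budget's mask to keep the right number of components and to nest inside the larger budgets, for instance by tying these logits to the ranking from Step 3.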
Step 5: Jointly Train the Model and Router with Knowledge Distillation
Take the non-elastified parent model (from Step 1) as the teacher. The loss function combines standard training loss with knowledge distillation (KD) loss—the teacher's soft targets guide the training of all nested submodels simultaneously. During training, the router learns to select optimal components for each budget. The total training uses roughly 160B tokens (as in the original paper). Ensure the optimizer handles both the model weights and the router parameters.
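The sketch below shows the joint objective. The `model(batch, masks=...)` interface, the per-step loop over budgets, and the loss weighting are assumptions for illustration; a real run would shard this across many GPUs:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence against the teacher's soft targets."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def train_step(model, teacher, router, optimizer, batch, labels,
               budget_onehots, kd_weight: float = 1.0):
    """One joint update covering all nested budgets."""
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_logits = teacher(batch)  # frozen non-elastified parent
    total = 0.0
    for onehot in budget_onehots:  # e.g. the 30B, 23B, 12B targets
        masks = router(onehot)     # keep-masks per elastic axis (assumed interface)
        logits = model(batch, masks=masks)
        s = logits.flatten(0, -2)                # (batch * seq, vocab)
        t_out = teacher_logits.flatten(0, -2)
        ce = F.cross_entropy(s, labels.flatten())
        total = total + ce + kd_weight * kd_loss(s, t_out)
    total.backward()  # gradients reach both model weights and router logits
    optimizer.step()
    return float(total)
```

Because the optimizer steps once over the summed per-budget losses, model weights and router parameters are updated together, which is what lets the router adapt component selection to the data.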

Step 6: Extract Submodels with Zero-Shot Slicing
After training, the checkpoint contains all nested submodels inside the parent. To extract a specific variant (e.g., 12B), simply apply the mask learned for that budget. No additional fine-tuning is needed—the submodels are ready for inference. The extraction is a one-time operation: slice the relevant weights from the parent checkpoint using the router's final masks.
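A minimal slicing sketch follows. It assumes every elastic axis is the leading dimension of its weight tensors and uses hypothetical parameter-name prefixes as mask keys; a real extractor needs per-axis dimension handling:

```python
import torch

def slice_checkpoint(parent_state: dict, masks: dict) -> dict:
    """Materialize a standalone submodel from the parent checkpoint.

    parent_state: parent weights, e.g. torch.load("parent.pt")
    masks: {parameter-name prefix -> boolean keep-mask} taken from the
           trained router's final decisions (hypothetical key scheme)
    """
    sub_state = {}
    for name, weight in parent_state.items():
        kept = weight
        for prefix, mask in masks.items():
            if name.startswith(prefix):
                kept = weight[mask]  # keep top-ranked slice along dim 0
                break
        sub_state[name] = kept.clone()
    return sub_state

# One-time extraction of, say, the 12B variant:
# masks_12b = {"layers.0.moe.experts": expert_mask, ...}  # illustrative keys
# torch.save(slice_checkpoint(torch.load("parent.pt"), masks_12b), "sub_12b.pt")
```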
Step 7: Validate and Deploy
Evaluate each extracted submodel on your target tasks. Because they share high-importance components, performance should be close to that of separately trained models. Deploy each submodel as a standalone checkpoint; they can be served independently or dynamically selected based on latency/accuracy trade-offs.
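To sanity-check the variants, comparing validation loss across budgets is often enough; `load_variant` below is a hypothetical loader for the extracted checkpoints:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_loss(model, val_batches) -> float:
    """Mean next-token cross-entropy over (input_ids, labels) pairs.
    Assumes model(input_ids) returns (batch, seq, vocab) logits."""
    model.eval()
    losses = []
    for input_ids, labels in val_batches:
        logits = model(input_ids)
        losses.append(F.cross_entropy(logits.flatten(0, -2),
                                      labels.flatten()).item())
    return sum(losses) / len(losses)

# for name in ["parent_30b", "sub_23b", "sub_12b"]:
#     print(name, validation_loss(load_variant(name), val_batches))
```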
Conclusion and Tips
Star Elastic removes the cost multiplier of training each LLM variant separately. Here are some tips for success:
- Multi-axis nesting: The method supports multiple axes—SSM, attention, MoE experts, FFN—so you can tune each dimension individually.
- Budget granularity: You are not limited to three sizes; the router can learn masks for many budgets simultaneously.
- Hardware considerations: The largest parent model requires the most memory, but the smaller variants can be extracted and served on cheaper hardware.
- Compare with alternatives: Unlike Minitron-style compression, the learnable router adapts to your specific data and task mix.
- Monitor router convergence: Use validation accuracy to ensure the router is not overfitting to one budget.
- Distillation strength: Tune the KD weight—too high may limit the smaller models' flexibility; too low may reduce quality.
For further details, refer to the original paper on arXiv (link placeholder).