How to Train Multiple LLM Sizes Simultaneously with NVIDIA Star Elastic


Introduction

Training a family of large language models (LLMs) typically requires a separate training run for each variant, which is costly and time-consuming. NVIDIA's Star Elastic method offers a breakthrough: it embeds several smaller model sizes inside a single checkpoint produced by a single training run. This guide walks through the steps to implement Star Elastic so you can extract 23B and 12B parameter submodels from a 30B parent without any additional fine-tuning. All you need is a compatible base model, training data, and an understanding of knowledge distillation.


What You Need

- A parent model with a modular architecture (e.g., Nemotron Nano v3), trained to convergence so it can serve as the distillation teacher
- A training corpus for the joint distillation run (the original paper uses roughly 160B tokens)
- A working understanding of knowledge distillation

Step 1: Select and Prepare Your Base Model

Start with a parent model that uses a modular architecture—preferably one with separable components like embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN layers. The Star Elastic method exploits these axes. If using Nemotron Nano v3, ensure it is trained to convergence as a non-elastified model first (this serves as the teacher for distillation).
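As a rough sketch, the elastic search space can be written down as a simple config. The axis names follow the components listed above, but the field names and sizes below are purely illustrative assumptions, not the official Star Elastic configuration schema:

```python
# Hypothetical sketch: which axes of the parent model are elastic.
# Sizes and step granularities are made up for illustration.
elastic_axes = {
    "embedding_channels": {"full": 4096,  "min": 2048, "step": 256},
    "attention_heads":    {"full": 32,    "min": 16,   "step": 4},
    "mamba_ssm_heads":    {"full": 64,    "min": 32,   "step": 8},
    "moe_experts":        {"full": 64,    "min": 32,   "step": 8},
    "ffn_hidden":         {"full": 14336, "min": 7168, "step": 512},
}

def describe(axes: dict) -> None:
    """Print the search space each elastic axis exposes to the router."""
    for name, spec in axes.items():
        n_choices = (spec["full"] - spec["min"]) // spec["step"] + 1
        print(f"{name}: {spec['min']}..{spec['full']} ({n_choices} choices)")

if __name__ == "__main__":
    describe(elastic_axes)
```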

Step 2: Compute Component Importance Scores

Evaluate each model component's contribution to accuracy. Run importance estimation along each prunable axis identified in Step 1: embedding channels, attention heads, Mamba SSM heads, FFN layers, and MoE experts.

For MoE layers, employ Router-Weighted Expert Activation Pruning (REAP): rank experts by a combined signal of routing gate values and expert output magnitudes. This is more principled than naive frequency-based pruning.
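A minimal scoring sketch, assuming activation statistics are collected on a small calibration set. The function names and the simplified REAP signal shown here (routing gate probability times expert output norm) are illustrations, not the paper's exact implementation:

```python
import torch

def head_importance(activations: torch.Tensor) -> torch.Tensor:
    """Score attention heads by mean absolute output activation on a
    calibration set. activations: [batch, seq, n_heads, head_dim]."""
    return activations.abs().mean(dim=(0, 1, 3))

def reap_expert_scores(gate_probs: torch.Tensor,
                       expert_out_norms: torch.Tensor) -> torch.Tensor:
    """REAP-style signal: weight each expert's output magnitude by its routing
    gate value, then average over tokens. Low-scoring experts are pruning
    candidates. Shapes: [tokens, n_experts] for both inputs."""
    return (gate_probs * expert_out_norms).mean(dim=0)

if __name__ == "__main__":
    acts = torch.randn(2, 128, 32, 128)            # batch, seq, heads, head_dim
    gates = torch.rand(1024, 64).softmax(dim=-1)   # routing probabilities
    norms = torch.rand(1024, 64)                   # per-token expert output norms
    print(head_importance(acts).shape)             # torch.Size([32])
    print(reap_expert_scores(gates, norms).shape)  # torch.Size([64])
```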

Step 3: Rank Components and Define Nested Subsets

Sort all components by their importance scores from Step 2. The key property of Star Elastic is nested weight-sharing: any smaller submodel will use the highest-ranked contiguous subset of components. Decide your target submodel sizes (e.g., 23B with 2.8B active, 12B with 2.0B active). The pruning will automatically use the top-ranked components for each budget.
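A small sketch of the nesting property, assuming one importance score per component; the nested_subsets helper below is hypothetical:

```python
import torch

def nested_subsets(scores: torch.Tensor, budgets: list) -> dict:
    """Rank components by importance and return a keep-mask per budget.
    Every budget keeps a prefix of the same ranking, so smaller submodels
    are strict subsets of larger ones (nested weight-sharing)."""
    order = torch.argsort(scores, descending=True)  # best components first
    masks = {}
    for k in budgets:
        keep = torch.zeros_like(scores, dtype=torch.bool)
        keep[order[:k]] = True
        masks[k] = keep
    return masks

if __name__ == "__main__":
    scores = torch.rand(64)                    # e.g., 64 experts in one MoE layer
    masks = nested_subsets(scores, budgets=[48, 32])
    # Nesting check: every component kept at 32 is also kept at 48.
    assert not (masks[32] & ~masks[48]).any()
    print({k: int(m.sum()) for k, m in masks.items()})  # {48: 48, 32: 32}
```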

Step 4: Build a Learnable Router

Unlike fixed compression recipes, Star Elastic uses an end-to-end trainable router. The router accepts a one-hot encoded target budget (e.g., "2.8B active") and outputs differentiable masks that select which components to keep. Implement the router using Gumbel-Softmax to allow gradient flow through discrete architectural decisions. The masks are trained jointly with the model.
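A toy Gumbel-Softmax router for a single layer, assuming one keep/drop decision per component. The BudgetRouter class and its shapes are assumptions for illustration; the real router must also hit an exact parameter budget, which this sketch does not enforce:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    """Toy router: maps a one-hot budget ID to a differentiable keep/drop
    mask over n_components (e.g., the experts of one MoE layer)."""

    def __init__(self, n_budgets: int, n_components: int):
        super().__init__()
        # One keep/drop logit pair per (budget, component).
        self.logits = nn.Parameter(torch.zeros(n_budgets, n_components, 2))

    def forward(self, budget_onehot: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Select the logits for the requested budget.
        logits = torch.einsum("b,bcd->cd", budget_onehot, self.logits)
        # Gumbel-Softmax with hard=True gives a discrete mask in the forward
        # pass while keeping gradients via the straight-through estimator.
        sample = F.gumbel_softmax(logits, tau=tau, hard=True)
        return sample[..., 0]                  # indicator of "keep" per component

if __name__ == "__main__":
    router = BudgetRouter(n_budgets=3, n_components=64)
    budget = torch.tensor([0.0, 1.0, 0.0])     # e.g., the "2.8B active" budget
    mask = router(budget)
    # With untrained logits roughly half the components are kept at random;
    # training shapes these logits toward the optimal subset per budget.
    print(mask.shape, int(mask.sum()))
```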

Step 5: Jointly Train the Model and Router with Knowledge Distillation

Take the non-elastified parent model (from Step 1) as the teacher. The loss function combines standard training loss with knowledge distillation (KD) loss—the teacher's soft targets guide the training of all nested submodels simultaneously. During training, the router learns to select optimal components for each budget. The total training uses roughly 160B tokens (as in the original paper). Ensure the optimizer handles both the model weights and the router parameters.
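A condensed sketch of the joint objective and one training step, using vanilla Hinton-style distillation. The elastic forward signature model(input_ids, masks=...) is an assumption, and the paper's exact loss weighting may differ:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Standard cross-entropy plus KL divergence against the teacher's
    softened distribution (vanilla Hinton-style KD)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kl

def train_step(model, router, teacher, optimizer, batch, budget_onehots):
    """One joint step: distill every nested budget against the frozen teacher.
    model, router, teacher, and optimizer are assumed to exist."""
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"])
    total = 0.0
    for budget in budget_onehots:              # full model plus each nested budget
        masks = router(budget)                 # differentiable component masks
        logits = model(batch["input_ids"], masks=masks)  # hypothetical signature
        total = total + kd_loss(logits, teacher_logits, batch["labels"])
    total.backward()                           # gradients reach weights AND router
    optimizer.step()
    return float(total)

if __name__ == "__main__":
    s, t = torch.randn(8, 32000), torch.randn(8, 32000)
    y = torch.randint(0, 32000, (8,))
    print(kd_loss(s, t, y).item())             # sanity-check the loss itself
```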


Step 6: Extract Submodels with Zero-Shot Slicing

After training, the checkpoint contains all nested submodels inside the parent. To extract a specific variant (e.g., 12B), simply apply the mask learned for that budget. No additional fine-tuning is needed—the submodels are ready for inference. The extraction is a one-time operation: slice the relevant weights from the parent checkpoint using the router's final masks.
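A minimal slicing sketch, assuming the router's final decisions have been converted into boolean keep-masks keyed by parameter name; the extract_submodel helper and mask layout are hypothetical:

```python
import torch

def slice_linear(weight: torch.Tensor, out_mask: torch.Tensor,
                 in_mask: torch.Tensor) -> torch.Tensor:
    """Keep only the rows/columns of a linear weight selected by the masks."""
    return weight[out_mask][:, in_mask]

def extract_submodel(state_dict: dict, masks: dict) -> dict:
    """Apply the final boolean masks to every maskable weight; everything
    else is copied through untouched."""
    sliced = {}
    for name, w in state_dict.items():
        if name in masks:                      # e.g., "ffn.w1.weight" -> (out, in)
            out_mask, in_mask = masks[name]
            sliced[name] = slice_linear(w, out_mask, in_mask)
        else:
            sliced[name] = w
    return sliced

if __name__ == "__main__":
    sd = {"ffn.w1.weight": torch.randn(14336, 4096)}
    keep_out = torch.zeros(14336, dtype=torch.bool)
    keep_out[:7168] = True                     # keep the top-ranked half
    keep_in = torch.ones(4096, dtype=torch.bool)
    small = extract_submodel(sd, {"ffn.w1.weight": (keep_out, keep_in)})
    print(small["ffn.w1.weight"].shape)        # torch.Size([7168, 4096])
```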

Step 7: Validate and Deploy

Evaluate each extracted submodel on your target tasks. Because they share high-importance components, performance should be close to that of separately trained models. Deploy each submodel as a standalone checkpoint; they can be served independently or dynamically selected based on latency/accuracy trade-offs.
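A toy selection sketch for the latency/accuracy trade-off mentioned above; the checkpoint names and latency numbers are made up for illustration:

```python
# Pick the largest extracted submodel whose measured latency fits the request.
SUBMODELS = [
    {"name": "parent-30b", "latency_ms": 240},
    {"name": "nested-23b", "latency_ms": 180},
    {"name": "nested-12b", "latency_ms": 95},
]

def pick_submodel(latency_budget_ms: float) -> str:
    eligible = [m for m in SUBMODELS if m["latency_ms"] <= latency_budget_ms]
    # Fall back to the smallest variant if nothing fits the budget.
    chosen = max(eligible, key=lambda m: m["latency_ms"]) if eligible else SUBMODELS[-1]
    return chosen["name"]

print(pick_submodel(120))   # -> "nested-12b"
```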

Conclusion and Tips

Star Elastic eliminates the painful cost multiplier of training multiple LLM variants separately. A few tips for success:

- Train the parent to convergence before elastification; it doubles as the distillation teacher.
- Make sure the optimizer updates both the model weights and the router parameters.
- Budget roughly 160B tokens for the joint distillation run, in line with the original paper.
- Validate every extracted submodel on your target tasks before deploying it.

For further details, refer to the original paper on arXiv (link placeholder).
