AI Models
Frontier LLMs for On-Premises Deployments
A curated selection of the most capable open-source AI models available today, optimized for deployment on NVIDIA Blackwell-powered hardware. These production-ready models deliver state-of-the-art performance while running entirely on your premises — ensuring complete data sovereignty and eliminating cloud dependencies.
| Release Date | Model | Modality | Vendor | Params Total (B) | Params Active (B) | Size (GB) | Status |
|---|---|---|---|---|---|---|---|
| 2025-04-05 | Llama 4 Maverick | — | Meta | — | — | — | Stable |
| 2024-07-23 | Llama 3.1 | — | Meta | — | — | — | Stable |
| 2026-03-10 | Nemotron 3 Super | — | NVIDIA | — | — | — | Stable |
| 2025-08-05 | gpt-oss-120b | — | OpenAI | — | — | — | Stable |
| 2024-07-24 | Mistral Large Instruct | — | Mistral AI | — | — | — | Stable / Experimental |
| 2024-04-10 | Mixtral 8x22B | — | Mistral AI | — | — | — | Stable |
| 2025-12-08 | Devstral 2 | — | Mistral AI | — | — | — | Experimental |
| 2025-12-01 | Mistral Large 3 | — | Mistral AI | — | — | — | Stable |
| 2025-09-01 | Apertus | — | Swiss AI Initiative | — | — | — | Stable |
| 2026-02-16 | Qwen 3.5 | — | Alibaba | — | — | — | Stable / Experimental |
| 2025-04-28 | Qwen 3 | — | Alibaba | — | — | — | Stable |
| 2025-09-23 | Qwen 3 VL | — | Alibaba | — | — | — | Stable |
| 2025-09-10 | Qwen 3 Next Thinking | — | Alibaba | — | — | — | Stable |
| 2025-09-10 | Qwen 3 Next Instruct | — | Alibaba | — | — | — | Stable |
| 2025-07-22 | Qwen 3 Coder | — | Alibaba | — | — | — | Stable |
| 2026-02-03 | Qwen 3 Coder Next | — | Alibaba | — | — | — | Stable |
| 2026-02-10 | GLM 5 | — | Zhipu AI | — | — | — | Stable |
| 2025-12-22 | GLM 4.7 | — | Zhipu AI | — | — | — | Stable |
| 2025-09-30 | GLM 4.6 | — | Zhipu AI | — | — | — | Stable |
| 2025-11-06 | Kimi K2 Thinking | — | Moonshot AI | — | — | — | Stable |
| 2026-01-26 | Kimi K2.5 | — | Moonshot AI | — | — | — | Stable |
| 2026-02-12 | MiniMax M2.5 | — | MiniMax | — | — | — | Stable / Experimental |
| 2025-01-20 | DeepSeek R1 | — | DeepSeek | — | — | — | Stable |
| 2025-11-30 | DeepSeek V3.2 | — | DeepSeek | — | — | — | Stable / Experimental |
| 2026-01-27 | Trinity Large | — | — | — | — | — | Experimental |
| 2026-02-11 | Step 3.5 Flash | — | StepFun | — | — | — | Stable / Experimental |
| 2026-03-25 | Cohere Transcribe | — | Cohere | — | — | — | Stable |
| 2026-03-03 | LTX-2.3 | — | Lightricks | — | — | — | Stable |
| 2025-11-25 | FLUX.2 Dev | — | Black Forest Labs | — | — | — | Stable |
Frequently Asked Questions
- **What do the Params and Size columns mean?** Parameters are the number of learnable weights in the model, measured in billions (e.g., 70B). Size is the storage space the model files require, measured in GB. Quantized models use fewer bits per parameter, which shrinks the file size while preserving most of the model's capabilities.
- **How much VRAM does a model need?** In production you need more VRAM than the model size alone, because the KV cache used for context handling also lives in GPU memory. With an FP8-quantized KV cache (standard in production), plan for roughly 1.4–1.5× the model size; for example, a 550 GB model runs comfortably in 768 GB of VRAM. FP8 is virtually lossless, and NVIDIA GPUs from Hopper (H100/H200) onward, including Blackwell, have native FP8 tensor-core support, so it is essentially free performance-wise. With the default BF16 KV cache, plan for 1.7–2× the model size instead. A back-of-the-envelope calculator is sketched after this list.
- **What is the difference between Stable and Experimental status?** Experimental models are newer quantizations or configurations that are still being validated; they may offer better performance or efficiency but have not been thoroughly tested in production environments. Stable models have been verified for reliable operation. A model listed with both statuses is available in both a stable and an experimental configuration.
- **Which server tier do I need?** Choose based on the models you want to run; larger models require more VRAM. The S tier (96 GB) handles most 70B models, the M tier (384 GB) supports multiple large models simultaneously, and the L (768 GB) and XL (1440 GB) tiers enable the largest frontier models such as Llama 4 Maverick and DeepSeek V3.2.
- **Can I run several models at once?** Yes, if you have sufficient VRAM: the total size of the loaded models must fit within your server's available memory. Larger tiers allow running several models concurrently for different use cases (see the fit-check sketch after this list).
- **What do modalities mean?** Modalities indicate what types of data a model can process (input) and generate (output). Text models handle written content, image models analyze or generate visuals, code models are optimized for programming tasks, and multimodal models combine several of these capabilities.
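The two rules of thumb above (file size ≈ parameters × bits per parameter, VRAM ≈ file size × a KV-cache headroom factor) are easy to turn into a back-of-the-envelope calculator. The sketch below is illustrative only: the function names are made up for this example, and the headroom ranges are the FAQ's rough guidance, not vendor-published figures.

```python
# Back-of-the-envelope sizing sketch based on the FAQ's rules of thumb.
# Function names are hypothetical; headroom factors follow the FAQ ranges.

def model_file_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weights size in GB: parameters x bits per parameter / 8.
    The factor of 1e9 for params and 1e9 bytes-per-GB cancel out."""
    return params_billion * bits_per_param / 8

# KV-cache headroom on top of the weights, per the FAQ.
KV_CACHE_HEADROOM = {"fp8": (1.4, 1.5), "bf16": (1.7, 2.0)}

def vram_needed_gb(model_size_gb: float, kv_cache: str = "fp8") -> tuple[float, float]:
    """Return the (low, high) VRAM estimate for serving one model."""
    low, high = KV_CACHE_HEADROOM[kv_cache]
    return model_size_gb * low, model_size_gb * high

# Example: a 123B-parameter model quantized to 8 bits per parameter.
weights = model_file_size_gb(123, 8)     # ~123 GB of weights on disk
print(weights, vram_needed_gb(weights))  # 123.0 (172.2, 184.5)
```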
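Likewise, the concurrency rule (the combined size of all loaded models, with headroom, must fit in the tier's VRAM) can be checked the same way. The tier capacities below come from the FAQ; `fits_tier` and `smallest_tier` are hypothetical helpers, not part of any real deployment tooling.

```python
# Sketch of the FAQ's concurrency rule: total loaded model size, with
# KV-cache headroom, must fit within the server tier's VRAM.

TIER_VRAM_GB = {"S": 96, "M": 384, "L": 768, "XL": 1440}

def fits_tier(model_sizes_gb: list[float], tier: str, headroom: float = 1.5) -> bool:
    """True if all models, with FP8 KV-cache headroom, fit concurrently."""
    return sum(size * headroom for size in model_sizes_gb) <= TIER_VRAM_GB[tier]

def smallest_tier(model_sizes_gb: list[float], headroom: float = 1.5):
    """Smallest tier that can host all models at once, or None if none fits."""
    for tier in TIER_VRAM_GB:  # dict preserves S -> XL insertion order
        if fits_tier(model_sizes_gb, tier, headroom):
            return tier
    return None

# Example: an M-tier server (384 GB) hosting a 123 GB and a 60 GB model.
print(fits_tier([123, 60], "M"))   # True: (123 + 60) * 1.5 = 274.5 GB <= 384 GB
print(smallest_tier([123, 60]))    # "M"
```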