AI Models
Frontier LLMs for On-Premises Deployments
A curated selection of the most capable open-source AI models available today, optimized for deployment on NVIDIA Blackwell-powered hardware. These production-ready models deliver state-of-the-art performance while running entirely on your premises — ensuring complete data sovereignty and eliminating cloud dependencies.
| Release Date | Model | Modality | Vendor | Params Total (B) | Params Active (B) | Size (GB) | Status |
|---|---|---|---|---|---|---|---|
| 2025-04-05 | Llama 4 Maverick | — | Meta | — | — | — | Stable |
| 2024-07-23 | Llama 3.1 | — | Meta | — | — | — | Stable |
| 2026-03-10 | Nemotron 3 Super | — | NVIDIA | — | — | — | Stable |
| 2025-08-05 | gpt-oss-120b | — | OpenAI | — | — | — | Stable |
| 2024-07-24 | Mistral Large Instruct | — | Mistral AI | — | — | — | Stable / Experimental |
| 2024-04-10 | Mixtral 8x22B | — | Mistral AI | — | — | — | Stable |
| 2025-12-08 | Devstral 2 | — | Mistral AI | — | — | — | Experimental |
| 2025-12-01 | Mistral Large 3 | — | Mistral AI | — | — | — | Stable |
| 2025-09-01 | Apertus | — | Swiss AI Initiative | — | — | — | Stable |
| 2026-02-16 | Qwen 3.5 | — | Alibaba | — | — | — | Stable / Experimental |
| 2025-04-28 | Qwen 3 | — | Alibaba | — | — | — | Stable |
| 2025-09-23 | Qwen 3 VL | — | Alibaba | — | — | — | Stable |
| 2025-09-10 | Qwen 3 Next Thinking | — | Alibaba | — | — | — | Stable |
| 2025-09-10 | Qwen 3 Next Instruct | — | Alibaba | — | — | — | Stable |
| 2025-07-22 | Qwen 3 Coder | — | Alibaba | — | — | — | Stable |
| 2026-02-03 | Qwen 3 Coder Next | — | Alibaba | — | — | — | Stable |
| 2026-02-10 | GLM 5 | — | Zhipu AI | — | — | — | Stable |
| 2025-12-22 | GLM 4.7 | — | Zhipu AI | — | — | — | Stable |
| 2025-09-30 | GLM 4.6 | — | Zhipu AI | — | — | — | Stable |
| 2025-11-06 | Kimi K2 Thinking | — | Moonshot AI | — | — | — | Stable |
| 2026-01-26 | Kimi K2.5 | — | Moonshot AI | — | — | — | Stable |
| 2026-02-12 | MiniMax M2.5 | — | MiniMax | — | — | — | Stable / Experimental |
| 2025-01-20 | DeepSeek R1 | — | DeepSeek | — | — | — | Stable |
| 2025-11-30 | DeepSeek V3.2 | — | DeepSeek | — | — | — | Stable / Experimental |
| 2026-01-27 | Trinity Large | — | — | — | — | — | Experimental |
| 2026-02-11 | Step 3.5 Flash | — | StepFun | — | — | — | Stable / Experimental |
| 2026-03-25 | Cohere Transcribe | — | Cohere | — | — | — | Stable |
| 2026-03-03 | LTX-2.3 | — | Lightricks | — | — | — | Stable |
| 2025-11-25 | FLUX.2 Dev | — | Black Forest Labs | — | — | — | Stable |
Frequently Asked Questions
- **What do the Params and Size columns mean?** Parameters are the number of learnable weights in the model, measured in billions (e.g., 70B). Size is the storage space the model files require, measured in GB. Quantized models use fewer bits per parameter, which shrinks the file size while preserving most of the model's capabilities.
- **How much VRAM does a model need?** In production you need more VRAM than the model size alone, because the KV cache used for context handling also lives in GPU memory. With an FP8-quantized KV cache (standard in production), plan for roughly 1.4–1.5× the model size; for example, a 550 GB model runs comfortably in 768 GB of VRAM. FP8 is virtually lossless, and NVIDIA GPUs from Hopper (H100/H200) onward, including Blackwell, have native FP8 tensor-core support, so it is essentially free performance-wise. With the default BF16 KV cache, plan for 1.7–2× the model size instead. A back-of-the-envelope calculator is sketched after this list.
- **What is the difference between Stable and Experimental status?** Experimental models are newer quantizations or configurations that are still being validated; they may offer better performance or efficiency but have not been thoroughly tested in production environments. Stable models have been verified for reliable operation. A model listed with both statuses is available in both a stable and an experimental configuration.
- **Which server tier do I need?** Choose based on the models you want to run; larger models require more VRAM. The S tier (96 GB) handles most 70B models, the M tier (384 GB) supports multiple large models simultaneously, and the L (768 GB) and XL (1440 GB) tiers enable the largest frontier models such as Llama 4 Maverick and DeepSeek V3.2.
- **Can I run several models at once?** Yes, if you have sufficient VRAM: the total size of the loaded models must fit within your server's available memory. Larger tiers allow running several models concurrently for different use cases (see the fit-check sketch after this list).
- **What do modalities mean?** Modalities indicate what types of data a model can process (input) and generate (output). Text models handle written content, image models analyze or generate visuals, code models are optimized for programming tasks, and multimodal models combine several of these capabilities.
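The two rules of thumb above (file size ≈ parameters × bits per parameter, VRAM ≈ file size × a KV-cache headroom factor) are easy to turn into a back-of-the-envelope calculator. The sketch below is illustrative only: the function names are made up for this example, and the headroom ranges are the FAQ's rough guidance, not vendor-published figures.

```python
# Back-of-the-envelope sizing sketch based on the FAQ's rules of thumb.
# Function names are hypothetical; headroom factors follow the FAQ ranges.

def model_file_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weights size in GB: parameters x bits per parameter / 8.
    The factor of 1e9 for params and 1e9 bytes-per-GB cancel out."""
    return params_billion * bits_per_param / 8

# KV-cache headroom on top of the weights, per the FAQ.
KV_CACHE_HEADROOM = {"fp8": (1.4, 1.5), "bf16": (1.7, 2.0)}

def vram_needed_gb(model_size_gb: float, kv_cache: str = "fp8") -> tuple[float, float]:
    """Return the (low, high) VRAM estimate for serving one model."""
    low, high = KV_CACHE_HEADROOM[kv_cache]
    return model_size_gb * low, model_size_gb * high

# Example: a 123B-parameter model quantized to 8 bits per parameter.
weights = model_file_size_gb(123, 8)     # ~123 GB of weights on disk
print(weights, vram_needed_gb(weights))  # 123.0 (172.2, 184.5)
```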
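Likewise, the concurrency rule (the combined size of all loaded models, with headroom, must fit in the tier's VRAM) can be checked the same way. The tier capacities below come from the FAQ; `fits_tier` and `smallest_tier` are hypothetical helpers, not part of any real deployment tooling.

```python
# Sketch of the FAQ's concurrency rule: total loaded model size, with
# KV-cache headroom, must fit within the server tier's VRAM.

TIER_VRAM_GB = {"S": 96, "M": 384, "L": 768, "XL": 1440}

def fits_tier(model_sizes_gb: list[float], tier: str, headroom: float = 1.5) -> bool:
    """True if all models, with FP8 KV-cache headroom, fit concurrently."""
    return sum(size * headroom for size in model_sizes_gb) <= TIER_VRAM_GB[tier]

def smallest_tier(model_sizes_gb: list[float], headroom: float = 1.5):
    """Smallest tier that can host all models at once, or None if none fits."""
    for tier in TIER_VRAM_GB:  # dict preserves S -> XL insertion order
        if fits_tier(model_sizes_gb, tier, headroom):
            return tier
    return None

# Example: an M-tier server (384 GB) hosting a 123 GB and a 60 GB model.
print(fits_tier([123, 60], "M"))   # True: (123 + 60) * 1.5 = 274.5 GB <= 384 GB
print(smallest_tier([123, 60]))    # "M"
```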