Explaining NVIDIA NVFP4, the DGX Spark's Secret Weapon
Developed by NVIDIA, NVFP4 helps larger AI models run faster on a wider range of hardware.

The NVIDIA DGX Spark is among the most interesting pieces of computer hardware released in the past few years. It’s effectively a desktop PC, but one built entirely around NVIDIA hardware and targeted at AI workloads instead of gaming or general use. It serves up an NVIDIA GPU and CPU alongside 128GB of memory and 4TB of storage.
Such beefy specifications tell you that the DGX Spark is a performer, and it is, but focusing only on the hardware leaves out one important part of the equation: NVIDIA’s NVFP4. Introduced in June 2025, NVFP4 is a new 4-bit floating point format designed to help larger models run on desktop hardware with a smaller penalty to accuracy and intelligence.
Understanding and using NVFP4 takes some effort and intention on your part, however.
Quantization: A must-have for AI on the PC
NVFP4 is a form of quantization. You’ll need to know about quantization to understand NVFP4, so here’s a crash course.
An LLM contains billions, and sometimes trillions, of numbers (called parameters) that represent patterns learned from training data. A single number doesn’t require much computer memory but when you have billions of them, well, they add up. Most LLMs weigh in at a dozen to several dozen gigabytes, and larger models run into the hundreds, all of which must be loaded into memory for the model to function.
Most modern computer programs use 32-bit math as the default, which literally means a number is represented by a string of 32 binary digits (each a 1 or 0). The full-fat versions of LLMs often use 16-bit or sometimes 8-bit math. However, it’s always possible to convert a model to represent numbers with fewer bits.
That’s quantization.
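To put rough numbers on that, here is a minimal sketch of the arithmetic. The 70-billion-parameter model is a hypothetical example and the function name `model_size_gb` is made up for illustration. It counts only the weights and ignores everything else a running model needs, so treat the results as a floor.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate size of a model's weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical 70-billion-parameter model, stored at different precisions.
PARAMS = 70e9

for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {model_size_gb(PARAMS, bits):.1f} GB")
```

By this estimate the same model shrinks from 140GB at 16 bits to 35GB at 4 bits, which is the difference between not fitting and fitting in 128GB of memory.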
NVFP4 is a new way to quantize
Quantization makes it possible to run an LLM containing a given number of parameters with less memory. But that reduction in precision can also reduce the model’s intelligence. Representing the numbers in a model with fewer bits means some of those numbers will no longer match the original values exactly.
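A toy example shows where the error comes from. The sketch below snaps a handful of made-up weights to evenly spaced levels, which is a deliberately simple scheme and not how NVFP4 or any production quantizer works. With 4 bits there are only 16 levels to choose from, so each number lands on the nearest one and picks up a small error.

```python
def quantize_uniform(values, bits):
    """Snap each value to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1)  # distance between neighboring levels
    return [lo + round((v - lo) / step) * step for v in values]


def max_error(values, bits):
    """Largest absolute gap between a value and its quantized version."""
    quantized = quantize_uniform(values, bits)
    return max(abs(v - q) for v, q in zip(values, quantized))


# Made-up weights, for illustration only.
weights = [-0.82, -0.31, -0.05, 0.0, 0.07, 0.26, 0.44, 0.91]

for bits in (8, 4):
    print(f"{bits}-bit max error: {max_error(weights, bits):.4f}")
```

The 4-bit version is noticeably less faithful than the 8-bit one. Smarter quantization methods exist to spend those 16 levels where they matter most.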
But it turns out there are many ways to quantize. For example, a paper released in 2023 tested a range of methods used to quantize Meta’s Llama-7B and Llama-13B. It found the best 4-bit quantization methods reduced benchmark performance by less than 10%, while the worst reduced benchmark performance by half.
However, independent tests seem to indicate that speed is the real story of NVFP4 (for desktop and laptop PC users, at least). AI/ML research scientist Benjamin Marie found that models using NVFP4 were two to three times quicker, as measured by tokens output per second, than models quantized with other 4-bit formats. That could make larger models quantized to 4 bits feel a lot more usable on desktop or laptop PC hardware.
That’s important for devices with relatively modest hardware, including NVIDIA’s DGX Spark. Powerful though it may be, it’s still a small device using LPDDR5x RAM (which has limited memory bandwidth compared to VRAM) and a 240-watt internal power supply. You’re going to want NVFP4’s improved inference performance for best results.
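Memory bandwidth matters because generating each token means reading roughly all of a model's weights from memory. A common rule of thumb, sketched below, divides bandwidth by the size of the weights to get a ceiling on tokens per second. The 250 GB/s figure and the 70-billion-parameter model are placeholders, not DGX Spark measurements, and the function name is invented. This estimate only captures the benefit of storing fewer bits per weight. It does not explain why NVFP4 outpaced other 4-bit formats in independent tests, which also depends on hardware and software support.

```python
def max_tokens_per_second(num_params, bits_per_param, bandwidth_gb_s):
    """Rough ceiling on decode speed if each token reads every weight once."""
    weight_bytes = num_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


PARAMS = 70e9           # hypothetical model size
BANDWIDTH_GB_S = 250.0  # placeholder bandwidth, for illustration only

for bits in (16, 8, 4):
    ceiling = max_tokens_per_second(PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{bits}-bit weights: at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions, dropping from 16 bits to 4 bits raises the ceiling fourfold on the same hardware.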
A quick deep dive into NVFP4
So, NVFP4 can be used to quantize a model so that it uses less memory and delivers better inference performance on given hardware. But it’s not the only 4-bit quantization method. So, what does it do differently?
First, NVFP4 uses groups of smaller "micro-blocks" of 16 values that share a scaling factor, compared to the 32-value blocks used by MXFP4 (another 4-bit format used by OpenAI’s GPT-OSS). According to research from NVIDIA and Intel, this makes it easier to account for local variations in values.
Source: NVIDIA
NVFP4 also has more precise scaling factors. MXFP4 can only use a power-of-two scale (meaning numbers like 1, 2, 4, 8, etc.). NVFP4 uses 8-bit floating point (FP8) scales, which can take values between powers of two and so allow for more accurate quantization.
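Why does the scale's precision matter? A 4-bit floating point value can only reach a magnitude of 6, so the scale's job is to stretch that small range to fit each block. The sketch below is a simplification with hypothetical function names: it rounds the power-of-two scale up so nothing clips, and it treats the finer scale as exact instead of rounding it to FP8 as real hardware would.

```python
import math

FP4_MAX = 6.0  # largest magnitude a 4-bit (E2M1) float can represent


def power_of_two_scale(block):
    """Smallest power-of-two scale that fits the block without clipping."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / FP4_MAX))


def fractional_scale(block):
    """Scale that maps the block's largest value exactly onto FP4_MAX."""
    amax = max(abs(v) for v in block)
    return amax / FP4_MAX if amax else 1.0


def range_used(block, scale):
    """Fraction of the 4-bit range the block's largest value reaches."""
    amax = max(abs(v) for v in block)
    return amax / scale / FP4_MAX


# Made-up block of values, for illustration only.
block = [0.12, -0.9, 0.33, 0.05]

for name, fn in (("power-of-two", power_of_two_scale), ("fractional", fractional_scale)):
    scale = fn(block)
    print(f"{name}: scale={scale:.4f}, range used={range_used(block, scale):.0%}")
```

For this block, whose largest value is 0.9, the power-of-two scale comes out to 0.25, so the largest value only reaches 3.6 of the 6 available and the top of the 4-bit range goes unused. The finer scale (0.15) uses the whole range, which leaves more of the 16 levels available for the values that are actually present.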
And NVFP4 uses two levels of scaling. The FP8 scale is for each 16-value block, but a broader FP32 scale is available across an entire tensor. This helps NVFP4 handle variations in values both within and across blocks.
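Putting the three ideas together, a rough sketch of the scheme might look like the following. It is a plain-Python illustration under several assumptions, not NVIDIA's implementation. The way the two scales are combined (block scales stored relative to the tensor scale so they fit within FP8's range, which tops out at 448) is one reading of NVIDIA's description, the block scale is kept as an ordinary float instead of being rounded to FP8, and all function names are invented.

```python
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes a 4-bit float can hold
FP4_MAX = 6.0
FP8_MAX = 448.0  # largest finite FP8 (E4M3) value
BLOCK = 16       # values per micro-block


def round_to_fp4(x):
    """Nearest representable 4-bit value (ties go to the smaller magnitude)."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def quantize(tensor):
    """Return (tensor_scale, block_scales, codes) for a flat list of floats."""
    amax = max(abs(v) for v in tensor) or 1.0
    tensor_scale = amax / (FP4_MAX * FP8_MAX)  # one FP32 scale for the whole tensor
    block_scales, codes = [], []
    for start in range(0, len(tensor), BLOCK):
        block = tensor[start:start + BLOCK]
        block_max = max(abs(v) for v in block)
        # Per-block scale; real hardware would round this to FP8 (assumption: skipped here).
        block_scale = (block_max / FP4_MAX / tensor_scale) or 1.0
        block_scales.append(block_scale)
        step = block_scale * tensor_scale
        codes.extend(round_to_fp4(v / step) for v in block)
    return tensor_scale, block_scales, codes


def dequantize(tensor_scale, block_scales, codes):
    """Rebuild approximate values from the 4-bit codes and both scales."""
    return [c * block_scales[i // BLOCK] * tensor_scale for i, c in enumerate(codes)]


# Toy tensor: one block of tiny values followed by one block of large values.
small = [(-1) ** k * 0.001 * k for k in range(1, 17)]
large = [(-1) ** k * 10.0 * k for k in range(1, 17)]
tensor = small + large

tensor_scale, block_scales, codes = quantize(tensor)
restored = dequantize(tensor_scale, block_scales, codes)
print("block scales:", [round(s, 4) for s in block_scales])
print("first block restored:", [round(v, 4) for v in restored[:4]])
```

In this toy tensor the first block holds values around 0.01 and the second holds values around 100. With a single scale for the whole tensor, every value in the first block would round to zero. With per-block scales, both blocks keep their shape. The cost is one extra 8-bit scale per 16 values, or about 4.5 bits per value in total instead of 4.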
If that’s confusing, you’re in good company. It makes my head spin, too. If you want to dive a bit deeper, I recommend this video from Julia Turc, an ex-Google AI researcher. It explains 4-bit quantization generally and includes details on various 4-bit quantization methods including NVFP4.
| Feature | NVIDIA NVFP4 | MXFP4 (OCP Standard) |
|---|---|---|
| Micro-block Size | 16 values per block | 32 values per block |
| Scaling Precision | 8-bit (FP8) high-precision | Power-of-two only |
| Hardware Support | Blackwell Architecture (Native) | Broad (Hopper/Blackwell/FP8-capable) |
| Target Environment | Local AI / Edge Workstations | Large-scale Data Center |
| Performance Benefit | ~2x throughput vs. 4-bit baseline | Standard Efficiency |
| Accuracy Loss | < 1% on 70B+ parameter models | Variable based on block size |
What you need to run NVFP4
Models that use NVFP4 look like a great option for local AI inference. You can get a nice performance improvement with little, if any, noticeable reduction in model quality. However, a lot needs to align to actually use NVFP4.
First up, you need a GPU that uses a high-end variant of NVIDIA’s Blackwell architecture (or newer, if you’re reading this article in the future). That’s because Blackwell was designed at an architecture level with features that accelerate NVFP4. For desktop and laptop PCs, that means you’ll ideally want a DGX Spark or DGX Station, or a GPU from the NVIDIA RTX PRO 6000 line.
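If you aren't sure which architecture your GPU uses, one way to check is its CUDA compute capability. The sketch below assumes PyTorch is installed with CUDA support, and it rests on the assumption that Blackwell parts report a major version of 10 or higher while the previous Hopper generation reports 9. The helper names are invented. Passing this check only tells you about the silicon. It says nothing about whether your inference software can actually run NVFP4 models.

```python
def looks_like_blackwell_or_newer(capability):
    """True if a (major, minor) compute capability is 10.x or higher (assumed cutoff)."""
    major, _minor = capability
    return major >= 10


def detect_capability():
    """Return the first GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch  # optional dependency; assumed to be installed with CUDA support
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
elif looks_like_blackwell_or_newer(capability):
    print(f"Compute capability {capability}: Blackwell-class or newer.")
else:
    print(f"Compute capability {capability}: older than Blackwell.")
```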
What about the RTX 50-series? It seems it technically should work, since RTX 50-series GPUs use the Blackwell architecture, but there’s not much documentation on it at this point. Rather, most developer chat about NVFP4 on the RTX 50-series is about bugs discovered in the attempt. I also looked into running NVFP4 through LM Studio and Ollama on my own RTX 50-series laptop, but I haven't been able to find a way to do it yet.
You’ll also need a model that was trained or quantized for NVFP4. While there are quite a few now available, the selection is still a bit limited compared to the tens of thousands of other models that exist.
The future of NVFP4
The current state of NVFP4 is very much a “building the rails in front of the train” situation. NVIDIA only announced it a few months ago and papers describing its capabilities for training and quantization are even more recent.
Still, NVFP4 is important to NVIDIA. Just look at the last graph of NVIDIA’s NVFP4 announcement.
Source: NVIDIA
This graph promises that the Blackwell architecture can provide a 50x power efficiency improvement over Hopper when NVFP4 is in use. That’s a big deal for leading-edge AI labs and home users alike.
While it’s not a perfect analogy, I’m reminded of the early days of RTX ray tracing. It was promising from the start and solved a real problem developers faced, but it took a few years before it became widespread. I think the same could prove true for NVFP4.
Matthew S. Smith is a prolific tech journalist, critic, product reviewer, and influencer from Portland, Oregon. Over 16 years covering tech he has reviewed thousands of PC laptops, desktops, monitors, and other consumer gadgets. Matthew also hosts Computer Gaming Yesterday, a YouTube channel dedicated to retro PC gaming, and covers the latest artificial intelligence research for IEEE Spectrum.
