Hidden underflow in BF16 divider in mixed-precision FP designs



Floating-point computations dominate the landscape of all AI/ML compute but also in automotive, avionics and healthcare. While performance and compute errors dominated the landscape of floating-point design and verification, power optimizations forced designers to use non-standard precisions such as FP4, FP8, BF16, and so on. So, mixed precision computation has become prominent in modern-day design.

Figure 1 Here isa a sneak peek at floating point compute in AI designs. Source: Axiomise

However, a new paradigm of transprecision compute is also emerging. Tagliavini et.al captured this beautifully in their paper coining the term transprecision floating-point computing. According to the paper, transprecision computing is an area where “rather than tolerating errors implied by imprecise HW or SW computations, systems are explicitly designed to deliver just the required precision for intermediate computations”.

Mixed precision vs transprecision

Mixed precision and transprecision are closely related, but they are not the same thing. Mixed precision is mainly an algorithmic/use pattern; transprecision is a broader system and architecture paradigm. Mixed precision means using two or more fixed numeric formats within one algorithm or kernel. Transprecision goes further: it’s about designing the hardware, software, and algorithms so that the precision itself is a tunable resource.

In short, mixed precision is combining a few existing formats in one computation for performance/energy, with accuracy recovered by algorithmic tricks. And transprecision is an end-to-end paradigm where precision is a first-class, tunable knob, and the system supports many possible precisions (not just, for example, 16-bit/32-bit) to meet accuracy/energy goals.

The benefits are clear. Exploit lower precision where it’s safe, allowing better balance of performance and throughput, and lower energy consumption, thereby cooler chips, but maintain full-precision (higher) accuracy where needed. Specifically, formats such as FP16 and BF16 allow hardware to execute more FLOPs per cycle, often giving 1.5-3X speedups in deep learning workloads. Transprecision architectures can achieve multi‑x speedups by vectorizing and simplifying datapaths for small formats (for instance, 8-bit or 16‑bit “minifloats”).

Verification challenges

Mixed-precision and transprecision computing introduce substantial verification challenges because correctness is no longer tied to a single, well-understood format, but to a tapestry of interacting precisions, formats, and rounding behaviors across the pipeline. Mixed-precision and transprecision computing create a significantly harder verification problem than conventional floating-point design because correctness must be established not only within each individual format, but also across the boundaries between them.

In these systems, values may be computed, accumulated, converted, rounded, and stored at different precisions, so verification must account for interactions between multiple exponent ranges, mantissa widths, rounding modes, exception rules, and format-conversion paths rather than checking a single uniform arithmetic model. This increases the risk of subtle failures such as incorrect narrowing or widening behavior, loss of precision at format boundaries, inconsistent NaN and infinity propagation, mismatched exception flags, non-equivalent fused and non-fused results, and corner-case errors involving subnormals, saturation, or directed rounding.

The challenge is amplified further in configurable or transprecision FPUs, where the same hardware datapath may serve several formats and custom numerical types, making it easy for implementation shortcuts or shared logic to satisfy one format while violating the architectural intent of another.

As a result, effective verification of mixed-precision and transprecision designs requires more than numerical result checking: it demands format-aware reference models, carefully targeted boundary-case stimulus, cross-format property checking, and systematic validation of rounding and exception behavior under every supported precision configuration. The challenges of microarchitectural implementation are too many to capture here so we will defer to these for a separate blog, but pipelined implementations offer an interesting cross-dimensional challenge for verification.

Traditional verification methods

Simulation-based verification, whether directed or constrained-random, is inherently inadequate for mixed-precision and transprecision floating-point designs because it can exercise only a vanishingly small fraction of the enormous input and state space. Moreover, it’s biased toward scenarios that verification engineers think to test.

Even with sophisticated UVM environments and large regression farms, most tests focus on typical operating ranges and a limited set of corner cases, leaving huge gaps around precisely those conditions where multiple formats interact: operands that sit exactly on format boundaries, rare combinations of subnormals with different exponents, intricate sequences of narrowing and widening conversions, or subtle interactions between fused operations, rounding modes, and exception flags.

Many of the most damaging floating-point bugs historically have come from such corner cases that were numerically benign for most inputs but catastrophically wrong for particular patterns that simulation never hit. In a design that supports several precisions and custom formats, the number of distinct cross-format behaviors quickly explodes, making it practically impossible to gain real confidence from coverage metrics alone. That’s because hitting each branch or bin does not guarantee that all critical numerical combinations and flag behaviors have been validated.

Formal methods-based solutions

Formal verification, in contrast, can reason symbolically about entire classes of inputs and states at once, proving that key arithmetic, rounding, and exception properties hold for all operand values and all supported format combinations within a given block, and thus is uniquely positioned to close the coverage gap that simulation-based methods cannot realistically bridge. However, while C-to-RTL equivalence checking has been in use for many years to establish formal equivalence through proofs, deploying formal methods on direct pipelined implementations of RTL is not easy with C-to-RTL based tools.

What we need is a homogeneous architecture whereby we can reason about correctness of RTL micro-architectural implementation directly using a golden reference implementation in SVA, exploiting the abstraction-based techniques in property checking. This is why we built floatrix.

It’s an app offered as part of the axiomiser platform that can be automatically configured at runtime through a GUI to verify a range of FP precision formats through proofs obtained by exercising custom SVA properties on actual designs implemented in Verilog or VHDL requiring minimal human interaction; and most importantly requiring zero model minimization that is the norm in any C-to-RTL based formal equivalence checking solutions.

SVA models used in floatrix have been goldenized against the Berkeley Hardfloat models for IEEE-754 compliance. For non-standard precisions, we follow the reference guides of the implementation to adapt the models.

We have deployed floatrix on several designs since its launch in September 2025. We continue to find interesting bugs. Recently, we identified a tinniness issue on the FPU used by the OPENHW group. The Github ticket has more details.

In this article, we describe an interesting bug we caught in floating-point dividers found in the fpnew design (again part of the OPENHW group), using our floatrix app. In the next section we describe the bug itself and then we elaborate on its significance, and whether bugs like these are likely to happen in other designs. Our analysis covers the broader scope of what happens with designs in mixed-precision format, and gaps in verification causing these bugs to be missed in the first place.

Details of the bug

Before we describe the bug, let us capture some of the basic definitions.

Inexact flag: Raised when the rounded floating-point result is not equal to the mathematically exact result, meaning some precision was lost in rounding (this can accompany normal, overflowed, or underflowed results).

Underflow flag: Raised when a non-zero mathematical result is so small in magnitude that, after rounding to the target format, it becomes tiny (typically subnormal or zero); under IEEE‑754’s default rules, this is signalled only when that tiny result is also inexact.

Scenario

The inexact and underflow flags should be set if the unbound exact result (out of format) is 1×2-140 and the bounded result should be 16’h0000. Moreover, in the case of rounding up, the expected result should be 16’h0001. However, in FPnew, the result is always 16’h0000 with no exception flags raised.

Root cause

Root cause might be from performing the operations using single-precision arithmetic and then converting back to BF16.

Detection methods

The bug was caught using the floatrix underflow flag checks.

Figure 2 Waveform shows the special underflow case in division. Source: Axiomise

Why this class of bug is realistic?

Bugs like this can happen in real designs, especially in floating-point units that mix internal higher-precision arithmetic with lower-precision output formats such as BF16.

A bug like this reflects a common implementation pattern where the datapath computes in a wider internal format, such as FP32, and then converts or packs the result into a narrower format. If exception logic is tied only to the internal format and not to the final target format, the design can silently produce a numerically plausible output while missing required flags.

This is especially plausible in trans-precision designs. BF16 has the exponent range of FP32 but much less precision, which makes conversion-stage edge cases more common. A value can be representable or exact internally yet still underflow or become inexact when rounded to BF16. If the final conversion step is treated as a formatting step instead of a full IEEE-aware operation, underflow and inexact can be lost.

Reuse of a FP32 datapath for BF16

Modern fpu designs may implement a single “main” datapath (often FP32) and derive lower-precision results (BF16) by:

  1. Computing in FP32
  2. Rounding/packing down to the architectural format

For this bug, it means:

  1. The internal FP32 computation is perfectly normal and may produce a small, exact, subnormal value.
  2. Because that internal value is exact, the FP32 underflow and inexact logic quite reasonably decides “no underflow, no inexact”.
  3. If the design then blindly packs to BF16, the BF16 representation of that exact value can be 0x0000 or 0x0001, which is architecturally tiny and inexact from the BF16 point of view, but no new flags are generated.

So, the specific condition “exact subnormal in FP32, non-representable in BF16” is not rare; it’s exactly what you get whenever BF16 sees the far tail of the FP32 subnormal range.

Flag logic attached to the wrong format

In a typical implementation:

  • The main datapath correctly implements all rounding modes in FP32.
  • The BF16 path is implemented as a simple truncation, or as a hard-coded “round-to-nearest-even” micro-operation, ignoring the global rounding-mode control.

For this bug, that architectural decision has a precise consequence:

  • Underflow and inexact are defined with respect to the architectural result format and rounding (here BF16, with “tininess after rounding” for RISC‑V BF16).
  • If flags are tied to the internal FP32 representation, any case where FP32 is fine but BF16 underflows will be mis‑flagged.

The bug is therefore a very plausible pattern and can occur in other kinds of FP implementations because it reduces duplication of flag logic but is architecturally wrong for BF16 arithmetic.

RISC‑V BF16 underflow definition vs implementation shortcut

RISC‑V BF16 explicitly says:

  • Tininess is detected after rounding.
  • Underflow is signalled only when the result is both tiny and inexact.
  • The tininess detection itself conceptually uses rounding as if the exponent were unbounded in the target format.

A shortcut implementation for BF16 in a FP32-based unit often does:

“Let the FP32 unit compute and round; then chop its bits down to BF16.”

For this bug, perhaps that shortcut misses exactly this:

  • “Tiny and inexact” must be interpreted in BF16, not in FP32.
  • A result can be “non-tiny and exact” in FP32, yet “tiny and inexact” in BF16.

So, the focused risk factor is not generic underflow subtlety; it is the temptation to reuse the FP32 tininess logic instead of implementing BF16-aware tininess-after-rounding. That shortcut directly creates the buggy behavior.

Directed rounding applied only at the FP32 level

This case also involves a mode where “round to max” (or analogous directed rounding) should drive the BF16 result from 0x0000 to 0x0001. We suspect, this leads to:

  • FP32 sees the tiny result as exact, so the rounding mode does nothing interesting at that level.
  • BF16 should use the rounding mode to decide between 0x0000 and 0x0001, but the conversion block ignores it.
  • Consequently, the result is always 0x0000, and the flags never see the inexact/tiny condition.

So, the specific vulnerability of rounding-mode control is not propagated into the BF16 block, yet the ISA treats the BF16 operation as architecturally rounded in that mode.

Verification gaps specific to this pattern

For bugs like these to escape into silicon, we imagine that two very concrete verification gaps typically exist:

  • Format-mismatch in checkers: The reference model or scoreboard checks only the numeric value in FP32, or it narrows in the same way as the RTL (for example, using a float32‑>BF16 helper that also ignores flags), so it cannot see that BF16 flags differ from the spec.
  • Lack of subnormal+narrowing directed tests: Random and ISA-level tests hit plenty of BF16 arithmetic, but almost nothing in the region where:
  1. The true value is representable as a tiny FP32 subnormal.
  2. That value is below the BF16 subnormal range.
  3. The rounding mode is a directed one that should change 0x0000 to 0x0001.

These are not generic underflow issues; they are exactly the missing cases needed to expose flags generated in a wider format, and then silently narrowed.

Other similar bug patterns

This bug fits into a broader family of real floating-point design bugs:

  1. Flag-silent narrowing conversions: FP32 to BF16 or FP32 to FP16 loses information, but the design fails to raise inexact or underflow.
  2. Wrong rounding-mode application: The internal result is correct, but the final conversion stage ignores directed rounding such as round-toward-positive or round-to-max.
  3. Fused/non-fused mismatch: FMADD produces different flags than a mathematically equivalent MUL followed by ADD because the implementation handles flags at the wrong stage.
  4. Flush-to-zero leakage: A supposedly IEEE-compliant path accidentally behaves like flush-to-zero in one stage, especially for subnormal intermediates.
  5. Tininess detection mismatch: The design effectively uses one tininess rule internally and another assumption in verification or architectural expectations.

These are all realistic in designs that support multiple formats, configurability, or internal reuse of a higher precision datapath.

Practical impact

The practical impact of these kinds of bugs can be profound, even if the affected values are tiny. In many applications, the numerical difference may seem small, but the bug still matters because:

  1. Exception flags may drive diagnostics, fallback logic, or compliance tests.
  2. Image, DSP, and ML pipelines can be sensitive to repeated bias near zero.
  3. Safety or standards-driven environments care about architectural correctness, not just approximate numeric usefulness.

Why formal models for floating-point designs

Mixed-precision and transprecision floating-point designs offer compelling benefits in performance, power, and area, but they also amplify the risk of subtle correctness issues that are extremely hard to detect with traditional simulation-based verification alone.

The bug analyzed in this article illustrates how easy it is for architectural intent to be violated when arithmetic is performed in a wider internal format, flags are generated with respect to that format, and the final narrowing step is treated as a “mere” formatting operation rather than a first-class floating-point transformation with its own rounding and exception semantics.

This pattern is not unique to a single core or vendor; it’s a natural by-product of reusing FP32 datapaths to implement BF16, separating execution, rounding, conversion, and flag generation, and relying on checkers that mirror the same implementation shortcuts. More generally, the same structural causes can lead to a family of related failures: flag-silent narrowing conversions, incorrect application of directed rounding modes, inconsistencies between fused and non-fused operations, flush-to-zero leakage in ostensibly IEEE-compliant paths, and mismatched tininess rules between specification, design, and verification.

Addressing these challenges requires a shift from “best-effort” simulation to exhaustive, property-driven reasoning. By using format-aware formal models, such as those provided by floatrix, it becomes possible to prove that rounding, underflow, overflow, and flag behaviour are correct for all operands and all supported precisions, and to expose bugs that would otherwise hide indefinitely in rarely exercised corners of the state space.

Nicky Khodadad is senior solutions engineer at Axiomise.

Nguyen Vu is formal verification engineer at Axiomise.

Ashish Darbari is Founder and CEO of Axiomise.

Related Content

The post Hidden underflow in BF16 divider in mixed-precision FP designs appeared first on EDN.



Source link