Probabilistic Verification Theory¶

Motivation¶

For large neural networks – particularly semantic segmentation networks (SSNs) with millions of parameters and high-dimensional, pixel-level outputs – computing exact or even approximate reachable sets can be computationally intractable. The exponential complexity of ReLU branching and the sheer output dimensionality (e.g., \(256 \times 512 \times 19\) logits for a Cityscapes model) render deterministic methods impractical.

NNV3 integrates a probabilistic, data-driven verification approach based on conformal inference that provides an alternative: instead of verifying properties for all possible perturbations, it verifies them with high probability under a given distribution. This approach is architecture-agnostic (scales with inference cost, not network structure) and provides formal probabilistic guarantees.

Conformal Inference Framework¶

Core Idea¶

Given a collection of i.i.d. scalar random variables \(\mathbf{M} = \{R_1, R_2, \ldots, R_m\}\) sampled from \(R \sim \mathcal{D}\) (called nonconformity scores), conformal inference constructs a prediction interval for a new unseen draw \(R^{\text{unseen}}\) from the same distribution.

The key result: if we sort the calibration scores \(R_1 < R_2 < \ldots < R_m\) and select rank \(\ell := \lceil(m+1)(1-\epsilon)\rceil\), then:

\[\Pr[R^{\text{unseen}} \leq R_\ell] \geq 1 - \epsilon\]

where the probability is over the joint randomness of both calibration and test samples.

Double-Step Probabilistic Guarantee¶

The coverage level \(\delta = \Pr[R^{\text{unseen}} \leq R_\ell]\) is itself a random variable following a \(\text{Beta}(\ell, m+1-\ell)\) distribution. By appropriately tuning \(m\) and \(\ell\), the variance of \(\delta\) can be made extremely small (e.g., \(3.123 \times 10^{-8}\) for \(m = 8000\), \(\ell = 7999\)).

NNV3 uses the double-step (PAC) guarantee formulation using the regularized incomplete beta function:

\[\Pr\left[\Pr[R^{\text{unseen}} \leq R_\ell] > 1 - \epsilon\right] > 1 - \text{betacdf}_{1-\epsilon}(\ell, m+1-\ell)\]

This is denoted as the \(\langle \epsilon, \ell, m \rangle\) guarantee, where:

\(\epsilon\) is the miscoverage level (e.g., 0.0001 for 99.99% coverage)
\(\delta_1 = 1 - \epsilon\) is the coverage level
\(\delta_2 = 1 - \text{betacdf}_{1-\epsilon}(\ell, m+1-\ell)\) is the confidence of the guarantee

A key advantage of the double-step formulation is its flexibility: the strict relationship among \(m\), \(\ell\), and \(\epsilon\) can be relaxed, allowing practitioners to tune these parameters to achieve desired coverage-confidence tradeoffs.

Example¶

With \(m = 100{,}000\) calibration samples and rank \(\ell = 99{,}999\):

\[\Pr\left[\Pr[R^{\text{unseen}} < R_{99{,}999}] > 0.9999\right] > 0.9995\]

This guarantee holds for any calibration dataset sampled from distribution \(\mathcal{D}\), regardless of the specific distribution shape.

Probabilistic Reachability for Neural Networks¶

Problem Formulation¶

Given a neural network \(f : \mathbb{R}^{n_0} \to \mathbb{R}^n\), an input set \(\mathbf{I} \subset \mathbb{R}^{n_0}\), and a sampling distribution \(x \overset{\mathcal{W}}{\sim} \mathbf{I}\), probabilistic reachability constructs a set \(\mathbf{R}_f^\epsilon(\mathcal{W}; \ell, m)\) such that:

\[x \overset{\mathcal{W}}{\sim} \mathbf{I} \;\Rightarrow\; \Pr\left[\Pr[f(x) \in \mathbf{R}_f^\epsilon(\mathcal{W}; \ell, m)] \geq 1 - \epsilon\right] \geq 1 - \text{betacdf}_{1-\epsilon}(\ell, m+1-\ell)\]

Nonconformity Score Design¶

For an output \(\text{vec}(y) = [y(1), \ldots, y(n)] \in \mathbb{R}^n\), the nonconformity score is defined as:

\[R_i^{\text{calib}} = \max\left(\frac{|y_i(1) - c(1)|}{\tau_1}, \ldots, \frac{|y_i(n) - c(n)|}{\tau_n}\right)\]

where \(c \in \mathbb{R}^n\) is the center (average of training outputs) and \(\tau_k\) are normalization factors computed from the training dataset. This score design yields a hyper-rectangular reachable set centered at \(c\) with half-widths \(\tau_k R_\ell^{\text{calib}}\).

Scaling to Semantic Segmentation Networks¶

The High-Dimensionality Challenge¶

The naive hyper-rectangular approach becomes overly conservative in high-dimensional output spaces because:

Shape limitation: Hyper-rectangles are poor approximations of the true output distribution shape in high dimensions
Distribution space: The space of distributions \(\mathcal{Y}\) that can represent the calibration data grows dramatically with dimension, making conformal inference inherently more conservative

Surrogate-Based Approach¶

To address these challenges, the NeurIPS 2025 paper (Hashemi et al.) introduces a surrogate-based reachability technique:

Train a small ReLU surrogate \(g : \mathbb{R}^{n_0} \to \mathbb{R}^n\) to approximate the original network \(f\) on the input set \(\mathbf{I}\)
Compute deterministic reachable set of \(g\) using NNV’s Star-set reachability: \(\mathbf{R}_g(\mathbf{I})\)
Compute prediction error \(q(x) = f(x) - g(x)\) and apply the naive CI technique to bound this error with an \(\langle \epsilon, \ell, m \rangle\) guaranteed hyper-rectangle \(\mathbf{R}_q^\epsilon\)
Combine via Minkowski sum: \(\mathbf{R}_f^\epsilon = \mathbf{R}_g(\mathbf{I}) \oplus \mathbf{R}_q^\epsilon\)

This yields two key improvements:

The reachable set is no longer constrained to a hyper-rectangle (the surrogate’s Star set captures the output shape)
The CI calibration is performed on prediction errors rather than raw outputs, which are much smaller in magnitude

Dimensionality Reduction via Deflative PCA¶

For SSNs with extremely high-dimensional outputs (e.g., \(720 \times 960 \times 12\) for CamVid), even training the surrogate is challenging. The approach uses a two-stage training:

Stage 1 (PCA via deflation): Learn the top \(N \ll n\) principal directions of the output point cloud using a learning-based deflation algorithm. This maps high-dimensional logits to a low-dimensional representation \(v \in \mathbb{R}^N\) via \(v = A^\top \text{vec}(y)\)
Stage 2 (ReLU network): Train a small ReLU network \(g_2 : \mathbb{R}^r \to \mathbb{R}^N\) mapping perturbation coefficients to the low-dimensional space

The final surrogate is \(g(x^{\text{adv}}) = A \cdot g_2(\lambda)\), which is scalable to SSNs.

Pixel-Level Robustness Classification¶

Once the reachable set over SSN logits is constructed as a Star set, each pixel \((i,j)\) is classified by projecting onto its logit components:

Robust: The correct class logit’s lower bound exceeds all other classes’ upper bounds
Non-robust: A different class dominates for all perturbations
Unknown: Logit intervals overlap between classes (due to over-approximation conservatism)

The Robustness Value (RV) = percentage of pixels classified as robust.

Comparison with Deterministic Methods¶

Aspect	Deterministic (Star)	Probabilistic (CP)
Guarantee type	Worst-case (all inputs)	Statistical (\(\langle \epsilon, \ell, m \rangle\) guarantee)
Completeness	Sound (and complete for exact)	PAC coverage guarantee
Scalability	Limited by network size/depth	Scales with inference cost
Requirements	LP solver, symbolic analysis	Sampling + surrogate training
Output dimensionality	Struggles with high-dim outputs	Handles SSNs via PCA + CI
Distribution assumption	None (worst-case)	Requires sampling distribution \(\mathcal{W}\)
Best for	Small-medium networks	Large, intractable networks (SSNs, object detection)

Comparison with Randomized Smoothing¶

An advantage of conformal inference over randomized smoothing (Fischer et al., 2021; Anani et al., 2024) is that CI guarantees are defined directly on the base model \(\text{SSN}(x')\), while randomized smoothing operates on a noise-injected approximation \(\text{SSN}(x' + \nu)\). To make randomized smoothing apply to the base model, one would need \(\nu = 0\) (i.e., \(\sigma = 0\)), which collapses the perturbation ball to a singleton.

On the other hand, randomized smoothing provides guarantees over the worst-case input \(x' \in \mathbf{B}_r(x)\), whereas conformal inference assumes a prior distribution \(x' \overset{\mathcal{W}}{\sim} \mathbf{B}_r(x)\).

Limitations¶

Requires a prior sampling distribution \(\mathcal{W}\) (unlike worst-case deterministic methods)
Calibration set must be recomputed per test point (main computational cost)
Surrogate-based technique increases runtime (PCA + training + deterministic reachability)
For very high perturbation dimensions, falls back to the naive technique for scalability