Crypto Agility in Hardware: Preparing for Post-Quantum Migration

How to design silicon that supports algorithm migration — and why Kyber-768 in hardware is only half the story.

December 9, 2025 — Bastionchip Engineering

NIST finalized its first set of post-quantum cryptography standards in August 2024: FIPS 203 (ML-KEM, derived from CRYSTALS-Kyber), FIPS 204 (ML-DSA, derived from CRYSTALS-Dilithium), and FIPS 205 (SLH-DSA, derived from SPHINCS+). The finalization date matters for hardware teams because silicon development cycles are 18–36 months from tape-out to production. A device that will be deployed in 2027 or later must be designed today to support these algorithms — and more importantly, to support the algorithms that will follow as the cryptographic landscape continues to evolve. Kyber-768 in hardware is only half the story. The other half is building silicon that can migrate without a hardware respin.

CRYSTALS-Kyber and Dilithium: What Hardware Acceleration Actually Means

ML-KEM-768 (Kyber-768, NIST security category 3) is a module-lattice-based key encapsulation mechanism. Its computational bottleneck is polynomial multiplication in the ring Z_q[x]/(x^256 + 1), where q = 3329. The standard optimization is the Number Theoretic Transform (NTT), the finite-field analogue of the FFT, which turns the convolution into pointwise multiplication. An NTT over Z_3329 with n = 256 can be implemented in hardware as a butterfly network with precomputed twiddle factors, consuming roughly 10K–30K gate equivalents depending on pipeline depth and parallelism.
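To make the butterfly structure concrete, here is a minimal software sketch of the transform, following the NTT, inverse NTT, and NTT-domain multiplication algorithms in FIPS 203. The function names are illustrative and not taken from any library, and the code ignores everything a hardware block would add (pipelining, Montgomery or Barrett reduction, constant-time behavior). A schoolbook multiplier is included only as a reference to check against.

```python
Q = 3329      # ML-KEM modulus
ZETA = 17     # primitive 256th root of unity mod Q (FIPS 203)
N_INV = 3303  # 128^-1 mod Q, scaling applied after the inverse NTT


def bitrev7(i):
    """Reverse the 7-bit representation of i (twiddle-factor ordering)."""
    return int(format(i, "07b")[::-1], 2)


def ntt(f):
    """Forward NTT: 7 layers of Cooley-Tukey butterflies over 256 coefficients."""
    f = list(f)
    i, length = 1, 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = pow(ZETA, bitrev7(i), Q)
            i += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def intt(f):
    """Inverse NTT: 7 layers of Gentleman-Sande butterflies, then scaling."""
    f = list(f)
    i, length = 127, 2
    while length <= 128:
        for start in range(0, 256, 2 * length):
            zeta = pow(ZETA, bitrev7(i), Q)
            i -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = zeta * (f[j + length] - t) % Q
        length *= 2
    return [x * N_INV % Q for x in f]


def multiply_ntts(f, g):
    """NTT-domain product: 128 independent degree-1 base multiplications."""
    h = [0] * 256
    for i in range(128):
        gamma = pow(ZETA, 2 * bitrev7(i) + 1, Q)
        a0, a1, b0, b1 = f[2 * i], f[2 * i + 1], g[2 * i], g[2 * i + 1]
        h[2 * i] = (a0 * b0 + a1 * b1 * gamma) % Q
        h[2 * i + 1] = (a0 * b1 + a1 * b0) % Q
    return h


def schoolbook(f, g):
    """Reference O(n^2) multiplication modulo x^256 + 1 (negacyclic wrap)."""
    h = [0] * 256
    for i in range(256):
        for j in range(256):
            k = i + j
            if k < 256:
                h[k] = (h[k] + f[i] * g[j]) % Q
            else:
                h[k - 256] = (h[k - 256] - f[i] * g[j]) % Q
    return h
```

Because q - 1 = 3328 is divisible by 256 but not by 512, the transform stops after seven layers and leaves 128 degree-1 residues, which is why the pointwise step is a small base multiplication rather than a single modular multiply. The inner loops of `ntt` and `intt` are what a hardware butterfly unit would replace.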

For a hardware block supporting ML-KEM-768, the dominant operations are: NTT/INTT (with module rank k = 3, one encapsulation runs 3 forward and 4 inverse polynomial transforms), polynomial multiplication in the NTT domain, sampling from the centered binomial distribution (CBD) using a SHAKE256-based PRF, and SHA3-256, SHA3-512, SHAKE128, and SHAKE256 for matrix expansion, key derivation, and message hashing. The SHA3 engine is shared with HMAC-SHA3 and other hash operations, so it is already present in many modern security ASICs; the incremental area for Kyber acceleration is primarily the NTT butterfly unit and the CBD sampler.
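The CBD sampler is small enough to show in full. The sketch below follows the FIPS 203 definitions of the PRF and of centered binomial sampling for eta = 2, the value ML-KEM-768 uses for both noise parameters. The function names are illustrative, and Python's `hashlib.shake_256` stands in for the hardware SHA3 engine.

```python
import hashlib

Q = 3329
ETA = 2  # ML-KEM-768 uses eta1 = eta2 = 2


def prf(seed, nonce, eta=ETA):
    """PRF_eta(s, b): SHAKE256 over s || b, squeezed to 64 * eta bytes."""
    return hashlib.shake_256(seed + bytes([nonce])).digest(64 * eta)


def bytes_to_bits(data):
    """Expand bytes to bits, least significant bit of each byte first."""
    return [(byte >> k) & 1 for byte in data for k in range(8)]


def sample_poly_cbd(data, eta=ETA):
    """Map 64 * eta pseudorandom bytes to 256 coefficients in [-eta, eta] mod Q."""
    bits = bytes_to_bits(data)
    coeffs = []
    for i in range(256):
        x = sum(bits[2 * i * eta + j] for j in range(eta))
        y = sum(bits[2 * i * eta + eta + j] for j in range(eta))
        coeffs.append((x - y) % Q)
    return coeffs


# Example: one noise polynomial from a 32-byte seed and a one-byte nonce.
noise = sample_poly_cbd(prf(bytes(32), 0))
```

Each coefficient consumes only 2 * eta bits, so in hardware the sampler reduces to two small popcounts and a subtraction per coefficient, fed directly from the SHAKE256 squeeze output.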

ML-DSA (Dilithium) follows a structurally similar module-lattice construction (module-LWE and module-SIS) and uses the same ring dimension n = 256, but a different modulus: q = 8380417 (23 bits) against Kyber's q = 3329 (12 bits). An HSM hardware block can reuse one NTT butterfly datapath for both Kyber-768 and Dilithium-2 (ML-DSA-44, NIST security category 2) only if the datapath is sized for the wider modulus and the modular reduction and twiddle tables are configurable. Designed that way, NTT datapath reuse is a meaningful silicon efficiency argument for co-implementing Kyber and Dilithium rather than treating them as independent accelerators.
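What a shared NTT datapath has to accommodate can be read directly off the two moduli. The short check below is a sketch for illustration: it computes the operand width and the largest power-of-two root of unity each modulus supports, which determines how many butterfly layers the transform has. The helper name is hypothetical.

```python
KYBER_Q = 3329          # ML-KEM modulus
DILITHIUM_Q = 8380417   # ML-DSA modulus, 2^23 - 2^13 + 1


def max_power_of_two_order(q):
    """Largest power of two dividing q - 1 (order of the best 2^k-th root of unity)."""
    order = 1
    while (q - 1) % (2 * order) == 0:
        order *= 2
    return order


for name, q in (("Kyber", KYBER_Q), ("Dilithium", DILITHIUM_Q)):
    print(name, "bits:", q.bit_length(), "max 2^k root order:", max_power_of_two_order(q))
```

Kyber's modulus supports 256th but not 512th roots of unity, so its NTT over 256 coefficients has seven layers and ends in degree-1 residues. Dilithium's modulus supports 512th roots, so its NTT runs a full eight layers on 23-bit operands. A shared butterfly unit therefore needs a configurable layer count and reduction step in addition to the wider multiplier.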

Key Size Impact and Memory Constraints

The jump from classical to post-quantum key sizes is real and has direct implications for hardware design. A Kyber-768 public key is 1,184 bytes; an ML-DSA-65 public key is 1,952 bytes. Compare this to ECC P-256: 64 bytes uncompressed. An RSA-3072 public key is 384 bytes. For a hardware key storage block that has been sized for classical keys, PQC support may require an order-of-magnitude increase in key store capacity.

For an HSM die with on-chip key storage, the relevant question is whether the non-volatile key store capacity supports PQC key sizes alongside existing ECC and symmetric key contexts. A key store sized for 256 × 32-byte ECC key contexts (8 KB) holds exactly one ML-DSA-65 key pair in expanded form (1,952-byte public key plus 4,032-byte private key, 5,984 bytes in total), with nothing useful left over for the classical keys it was built for. Key store capacity planning for post-quantum migration is an RTL architecture decision, not a firmware configuration setting; it must be resolved before tape-out.
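The capacity arithmetic is simple enough to script during architecture planning. The sketch below uses the encoded key sizes from FIPS 203 and FIPS 204 and assumes keys are stored in expanded form with no per-slot metadata; a design that stores only seeds for private keys would need far less. The dictionary and function names are hypothetical.

```python
# Bytes per stored record, assuming expanded keys and no slot metadata.
RECORD_BYTES = {
    "P-256 private scalar": 32,
    "ML-KEM-768 key pair": 1184 + 2400,  # encapsulation key + decapsulation key
    "ML-DSA-65 key pair": 1952 + 4032,   # public key + private key
}

KEY_STORE_BYTES = 256 * 32  # the 8 KB store sized for classical ECC contexts


def slots(capacity_bytes, record_bytes):
    """Number of whole records that fit in the key store."""
    return capacity_bytes // record_bytes


for label, size in RECORD_BYTES.items():
    print(f"{label}: {size} bytes, {slots(KEY_STORE_BYTES, size)} slot(s) in 8 KB")
```

Run against the 8 KB example, the store drops from 256 ECC contexts to two ML-KEM-768 key pairs or one ML-DSA-65 key pair.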

TLS handshakes with hybrid key exchange (classical + PQC) also grow significantly. A handshake using X25519Kyber768Draft00 (an IETF draft, code point 0x6399 in the supported_groups extension, since superseded by X25519MLKEM768 at code point 0x11EC) sends both the X25519 and KEM public keys in the ClientHello key share. The combined key material is roughly 1.2 KB (32 + 1,184 bytes) where previously it was 32 bytes. For an HSM handling TLS termination at high throughput, the additional bytes per handshake affect the interface bandwidth budget and the key agreement latency budget. At 10 Gbps line rate with 1 KB TLS records, the handshake overhead increases from negligible to measurable, though still well under most latency budgets.
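A quick budget calculation shows where the extra bytes come from. The sizes below are the X25519 public key and the ML-KEM-768 encapsulation key and ciphertext; the handshake rate passed to the helper is a placeholder to be replaced with a measured figure, not a number from any deployment.

```python
X25519_PUBLIC = 32
MLKEM768_ENCAPS_KEY = 1184
MLKEM768_CIPHERTEXT = 1088

# Hybrid key shares: the client sends a public key, the server a ciphertext.
client_share = X25519_PUBLIC + MLKEM768_ENCAPS_KEY  # 1216 bytes
server_share = X25519_PUBLIC + MLKEM768_CIPHERTEXT  # 1120 bytes

# Extra bytes per handshake compared with a classical X25519-only exchange.
extra_per_handshake = (client_share - X25519_PUBLIC) + (server_share - X25519_PUBLIC)


def extra_bits_per_second(handshakes_per_second):
    """Added key-share traffic, both directions, for a given handshake rate."""
    return extra_per_handshake * 8 * handshakes_per_second
```

The per-handshake cost is fixed at 2,272 extra bytes, so the bandwidth impact scales linearly with handshake rate and is independent of record size.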

Crypto Agility Architecture in Silicon

Crypto agility, correctly defined for hardware, means the system can adopt a new algorithm without requiring a die respin. This is harder to achieve than it sounds. A hardware accelerator implemented as fixed RTL for algorithm X cannot support algorithm Y without new silicon. True crypto agility in an HSM requires a programmable crypto engine — one where algorithm-specific operations can be expressed as firmware executed on a secure processor, with hardware acceleration available for the computationally expensive primitives.

The architectural pattern is a layered design with three parts. A fixed hardware layer implements primitive operations (AES round functions, SHA3 permutations, polynomial NTT, field arithmetic). A secure programmable processor runs algorithm composition firmware on top of these primitives. The algorithm firmware is updatable via signed firmware update, subject to the HSM's firmware update policy. TPM 2.0 takes a comparable approach to algorithm agility: commands reference algorithms by identifier rather than hard-coding them, so new algorithms can be added without redefining the command set or the underlying crypto hardware interface.
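The layering can be modeled in a few lines. The toy below is not an HSM firmware interface; `CryptoEngine`, `install_algorithm`, and the primitive names are all hypothetical, and `hashlib` stands in for fixed hardware blocks. It shows the property that matters: algorithm firmware can be added after deployment, but only if every primitive it depends on already exists in the fixed layer.

```python
import hashlib


class CryptoEngine:
    """Toy model: fixed hardware primitives plus updatable algorithm firmware."""

    def __init__(self, primitives):
        # Fixed at tape-out: this table never changes after construction.
        self._primitives = dict(primitives)
        # Updatable: algorithm compositions installed as firmware.
        self._algorithms = {}

    def install_algorithm(self, name, required, impl):
        """Accept new algorithm firmware only if silicon has what it needs."""
        missing = sorted(set(required) - set(self._primitives))
        if missing:
            raise ValueError(f"{name} needs primitives not in silicon: {missing}")
        self._algorithms[name] = impl

    def run(self, name, *args):
        return self._algorithms[name](self._primitives, *args)


HARDWARE_PRIMITIVES = {
    "sha3_256": lambda data: hashlib.sha3_256(data).digest(),
    "shake256": lambda data, n: hashlib.shake_256(data).digest(n),
}


def kdf_v1(prims, secret, label):
    """Example firmware: a labeled key derivation built on the SHAKE256 primitive."""
    return prims["shake256"](label + secret, 32)


engine = CryptoEngine(HARDWARE_PRIMITIVES)
engine.install_algorithm("kdf-v1", ["shake256"], kdf_v1)
```

Installing firmware that declares a dependency on a primitive the die lacks fails at install time, which is the software-visible form of the hardware constraint.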

The secure firmware update policy is critical for this architecture to provide security guarantees. If the algorithm firmware can be updated by any host command, an attacker who compromises the host can substitute a malicious algorithm implementation. Correct practice: algorithm firmware updates require a signed package from the silicon vendor, with signature verification against a public key fused into the die at manufacturing. The vendor key must be protected against compromise (HSM in the vendor's infrastructure; key ceremony; dual-party controls). This is exactly the key ceremony that FIPS 140-3 Level 3 evaluations examine for security module vendors — it is not merely a documentation exercise.
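The acceptance check itself is short. The sketch below is an illustration under stated assumptions: `FirmwarePackage`, `accept_update`, and the message encoding are hypothetical, and the signature verifier is passed in as a callable because the Python standard library has no asymmetric signature primitive. In a real device the verifier would be the hardware signature block and the key would be read from fuses.

```python
from dataclasses import dataclass
from typing import Callable

# (public_key, message, signature) -> True if the signature is valid.
Verifier = Callable[[bytes, bytes, bytes], bool]


@dataclass(frozen=True)
class FirmwarePackage:
    algorithm_name: str
    image: bytes
    signature: bytes


def signed_message(pkg: FirmwarePackage) -> bytes:
    """Bytes covered by the vendor signature (hypothetical encoding)."""
    return pkg.algorithm_name.encode() + b"\x00" + pkg.image


def accept_update(pkg: FirmwarePackage, fused_vendor_key: bytes, verify: Verifier) -> bytes:
    """Return the firmware image only if it verifies against the fused vendor key.

    The host supplies the package but never the key: the key comes from the die.
    """
    if not verify(fused_vendor_key, signed_message(pkg), pkg.signature):
        raise PermissionError("signature check failed; update rejected")
    return pkg.image
```

The point of the structure is that nothing the host sends can influence which key is trusted. A compromised host can submit packages, but cannot make an unsigned one verify.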

Hybrid TLS and Migration Path

The recommended migration path for TLS is hybrid key exchange, not an immediate cutover to PQC-only. A hybrid scheme runs both a classical key exchange (X25519 or ECDH over P-256) and a PQC KEM (Kyber-768), then feeds both shared secrets into the TLS 1.3 HKDF key schedule. The resulting session keys are secure if either component is secure. That protects against classical attackers (who might break Kyber if a weakness is found) and against quantum attackers (who can break X25519 with Shor's algorithm on a large enough quantum computer).

IETF standardization of hybrid key exchange for TLS 1.3 was in progress through 2024–2025 in the TLS WG. For an HSM supporting TLS offload or key agreement for TLS infrastructure, the firmware must implement the combined KDF that mixes the X25519 output and the Kyber ciphertext decapsulation output per the hybrid draft specification. Hardware that accelerates X25519 (a common feature in modern security ASICs via a Montgomery ladder ECC unit) and separately accelerates Kyber NTT can pipeline both operations in parallel, minimizing the latency overhead of the hybrid exchange relative to classical-only.
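A minimal sketch of the combining step follows. It implements HKDF (RFC 5869) with the standard library and derives a key from the concatenation of the two shared secrets. Two things here are assumptions rather than the draft's exact behavior: the concatenation order (X25519 first, as in the X25519Kyber768Draft00 draft; later ML-KEM hybrids define their own order) and the single extract-and-expand call, which simplifies the full TLS 1.3 key schedule. The function names and the context label are illustrative.

```python
import hashlib
import hmac


def hkdf_extract(salt, ikm):
    """HKDF-Extract with SHA-256 (RFC 5869)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk, info, length):
    """HKDF-Expand with SHA-256 (RFC 5869)."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(x25519_ss, kem_ss):
    """Concatenate the two 32-byte shared secrets (order is an assumption)."""
    if len(x25519_ss) != 32 or len(kem_ss) != 32:
        raise ValueError("both shared secrets must be 32 bytes")
    return x25519_ss + kem_ss


def derive_session_key(x25519_ss, kem_ss, context, length=32):
    """Simplified stand-in for the TLS 1.3 key schedule."""
    prk = hkdf_extract(bytes(32), hybrid_secret(x25519_ss, kem_ss))
    return hkdf_expand(prk, context, length)
```

Because both secrets enter the extract step together, an attacker must recover both to reproduce the output. The two inputs are independent, which is what lets an X25519 unit and an NTT unit compute them in parallel before this step runs on the secure processor.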

Consider a hardware security infrastructure team at a financial services platform evaluating HSM upgrades for their key management tier. They need to migrate 12 PKI roots and 150+ TLS intermediates to hybrid certificates before their regulator's PQC deadline. Their constraint is that certificate issuance and TLS handshakes must continue working for clients that do not yet support PQC, a classic migration compatibility problem. The hybrid approach solves this: classical clients negotiate classical key exchange and validate the existing certificate chain, while PQC-capable clients negotiate hybrid. The HSM supports both simultaneously, selected by firmware configuration. No hardware respin is required if the silicon has both ECC and Kyber acceleration and the hybrid KDF composition can be updated via signed firmware.

The Agility Trap: What "Firmware Updatable" Cannot Fix

Crypto agility via firmware has limits that are important to state clearly. An HSM with a 128-bit AES-GCM hardware accelerator cannot be firmware-patched to support an algorithm that requires a 512-bit block cipher, because the datapath width is fixed in RTL. Similarly, an NTT butterfly unit designed for n = 256 with q = 3329 (Kyber) or q = 8380417 (Dilithium) may not efficiently support future lattice schemes with different moduli or ring dimensions without additional hardware.

The NIST post-quantum standardization process is ongoing. HQC was selected in March 2025 as an additional code-based KEM, BIKE was not selected, and FALCON is being standardized as FN-DSA (planned FIPS 206; SLH-DSA is the separate SPHINCS+-derived FIPS 205). FALCON uses a different mathematical structure from Dilithium, built on NTRU lattices, and relies on floating-point arithmetic in key generation and signing. A hardware accelerator designed specifically for the polynomial arithmetic in Kyber and Dilithium would need new RTL to efficiently support FALCON.

We're not saying crypto-agile firmware-updatable silicon cannot handle future algorithm transitions. We're saying that the hardware primitive layer constrains what firmware can do efficiently, and "crypto agility" claims should be scrutinized for which layer the agility operates at. A device that can run any algorithm in software on its secure processor is agile but may have unacceptable latency for high-throughput PQC operations without hardware acceleration. The correct architecture choice depends on the target throughput, algorithm roadmap assumptions, and the timeline between silicon respins in the product family. There is no universally correct answer, but teams that ask these questions before tape-out rather than after deployment have significantly better options.