NeuKompression™ IP
NeuKompression™: Breaking the Inference Bottleneck and the 4-bit AI Revolution
U.S. Patent No. 11,615,286 B2
With the rapid evolution of LLMs and generative AI, the memory subsystem has emerged as the primary bottleneck for inference performance and power efficiency. NeuKompression™ addresses this challenge with compression technology that converts pre-trained high-precision models (FP32, FP16, or FP8) into 4-bit representations. Its core advantage is the ability to shrink model size significantly without sacrificing accuracy, reducing the memory footprint by 2x to 8x. By easing the bandwidth demands of weight transfer, NeuKompression™ cuts power requirements while boosting the throughput and cost-effectiveness of AI accelerators, making high-performance inference more competitive in the market.
- Offline software compression: 2x-8x compression into an efficient bitstream
- On-the-fly hardware decompression: restores FP8 weights for computation
- Broad applicability: supports both language models (Llama series proven) and vision models (Stable Diffusion family, Flux proven)
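To make the offline-compress / on-the-fly-decompress flow concrete, the sketch below models a generic group-wise symmetric 4-bit quantization scheme in Python. This is purely illustrative: the actual NeuKompression™ bitstream format, grouping strategy, and hardware decompression path are proprietary, and the function names here (`compress_int4`, `decompress`) are hypothetical.

```python
def compress_int4(weights, group_size=64):
    # Offline software step (illustrative): map FP32 weights to signed
    # 4-bit codes in [-7, 7], storing one floating-point scale per group.
    # The real NeuKompression bitstream format is proprietary; this only
    # shows the general idea of group-wise low-bit quantization.
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # avoid div-by-zero
        codes.append([max(-7, min(7, round(w / scale))) for w in group])
        scales.append(scale)
    return codes, scales

def decompress(codes, scales):
    # On-the-fly hardware step (modelled in software): expand the 4-bit
    # codes back into low-precision weights usable for computation.
    out = []
    for group, scale in zip(codes, scales):
        out.extend(q * scale for q in group)
    return out

# Example: quantize a small weight vector and reconstruct it.
w = [0.5, -1.2, 3.1, 0.0, -0.7] * 20
codes, scales = compress_int4(w, group_size=25)
w_hat = decompress(codes, scales)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Storing 4 bits per weight plus one scale per group yields roughly a 7x-8x reduction versus FP32 (less versus FP16/FP8 inputs), consistent with the 2x-8x range quoted above; the per-group scale bounds the reconstruction error at half a quantization step.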