# LLM Inference Speed of Light

Exploring the Speed Limits of Language Model Inference

Language model inference has a theoretical speed limit set by the ratio of floating-point operations to bytes of memory traffic: matrix-vector multiplication performs roughly two FLOPs per weight read, far below the FLOP-per-byte ratio modern hardware can sustain, so token generation is memory-bandwidth bound. In Mistral 7B, matrix-vector multiplications and the attention computation over the KV cache dominate per-token inference time.
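
To make this concrete, here is a minimal sketch of the bandwidth-bound estimate: every decode step must stream all weights, plus the KV cache, through memory. The parameter count, bandwidth figure, and cache dimensions below are illustrative assumptions, not figures taken from the original post.

```python
# "Speed of light" for token generation: memory bandwidth divided by
# the bytes that must be read per decode step (all weights + KV cache).
# All figures below are illustrative assumptions.

N_PARAMS = 7.2e9          # ~Mistral-7B-scale model (assumed)
BYTES_PER_PARAM = 2       # fp16 weights
BANDWIDTH = 1.0e12        # ~1 TB/s GPU memory bandwidth (assumed)

# Assumed GQA cache layout: 32 layers, 8 KV heads, head dim 128
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
KV_BYTES_PER_TOKEN = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_PARAM

def max_tokens_per_sec(context_len: int) -> float:
    """Upper bound on decode speed at a given context length."""
    bytes_per_step = N_PARAMS * BYTES_PER_PARAM + context_len * KV_BYTES_PER_TOKEN
    return BANDWIDTH / bytes_per_step

print(f"empty context: {max_tokens_per_sec(0):.0f} tok/s")
print(f"8k context:    {max_tokens_per_sec(8192):.0f} tok/s")
```

Under these assumptions the ceiling is roughly 70 tokens/second, and it falls as the context grows, since the KV cache adds to the bytes read per step.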

Approaching this limit has clear prerequisites: carefully written software running on hardware with ample memory bandwidth. Optimizations such as group-query attention help by sharing each key/value head across several query heads, shrinking the KV cache and with it the bandwidth needed per generated token.
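
As a rough illustration of why group-query attention lowers bandwidth requirements, the sketch below compares KV-cache sizes for full multi-head attention against a grouped configuration; the head counts and dimensions are assumed for illustration.

```python
# Multi-head attention stores K and V per query head; group-query
# attention shares each K/V head across a group of query heads,
# cutting cache size (and the bandwidth to re-read it) proportionally.
# Dimensions below are assumed for illustration.

N_LAYERS, HEAD_DIM, BYTES = 32, 128, 2   # fp16

def kv_cache_bytes(n_kv_heads: int, context_len: int) -> int:
    # 2x for keys and values
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES * context_len

ctx = 8192
mha = kv_cache_bytes(32, ctx)  # one KV head per query head (MHA)
gqa = kv_cache_bytes(8, ctx)   # 8 KV heads shared by 32 query heads (GQA)
print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB "
      f"({mha // gqa}x reduction)")
```

With four times fewer KV heads, both the cache footprint and the bytes attention must stream on each decode step drop by the same factor.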

Ultimately, this kind of theoretical modeling matters, both for validating that current implementations are close to optimal and for forecasting the impact of architectural changes. As designs evolve, adopting group-query attention in transformer-based models stands out as a promising optimization, though its practicality and efficiency warrant evaluation across a range of models.

Read more: [zeux.io](https://zeux.io/2024/03/15/llm-inference-sol/)