Techniques for reducing the numerical precision of model weights and activations to optimize inference speed and memory usage.
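As a minimal illustration of the idea, the sketch below applies symmetric per-tensor int8 quantization to a weight array using NumPy. It stores one float scale per tensor, maps values to integers in [-127, 127], and reconstructs approximate floats on demand. The function names and the choice of a symmetric int8 scheme are assumptions for illustration, not the method of any particular repository.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of a float array to int8 (illustrative sketch)."""
    max_abs = float(np.max(np.abs(x)))
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 array from int8 codes and the shared scale."""
    return q.astype(np.float32) * scale

weights = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(weights.nbytes, q.nbytes)  # 4096 1024: a 4x reduction in storage
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the precision traded for the smaller footprint.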
Explore 1 awesome GitHub repository matching Artificial Intelligence & ML · Quantization Strategies.
Llama is a framework and runtime for executing transformer-based neural networks locally. It serves as a generative AI inference engine, running input sequences through pretrained model weights to produce text completions and structured data outputs directly on local devices.
Reduces numerical precision in model weights to lower memory footprint and accelerate inference on local devices.
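To make the memory savings concrete, the following sketch quantizes weights to signed 4-bit codes in blocks, with one float16 scale per block, and packs two codes per byte. It does not reproduce Llama's actual storage format; the block size of 32, the fp16 scale, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

BLOCK = 32  # hypothetical block size, chosen for illustration

def quantize_4bit_blocks(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Quantize a 1-D float32 array to signed 4-bit codes with one fp16 scale per block."""
    blocks = w.reshape(-1, BLOCK)
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    # Map each block's largest magnitude to 7, the top of the signed 4-bit range [-8, 7].
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float16)
    codes = np.clip(np.round(blocks / scales.astype(np.float32)), -8, 7).astype(np.int8)
    return codes, scales

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack two 4-bit codes per byte, low nibble first."""
    nibbles = (codes.astype(np.int16) & 0x0F).astype(np.uint8)  # two's-complement low 4 bits
    return nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_nibbles: recover signed 4-bit codes as int8."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    codes[:, 0::2], codes[:, 1::2] = lo, hi
    # Sign-extend: nibble values 8..15 represent -8..-1.
    return np.where(codes > 7, codes - 16, codes).astype(np.int8)

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
codes, scales = quantize_4bit_blocks(w)
packed = pack_nibbles(codes)
fp16_bytes = w.astype(np.float16).nbytes   # 4096 weights * 2 bytes
q4_bytes = packed.nbytes + scales.nbytes   # 2048 bytes of codes + 256 bytes of scales
print(fp16_bytes, q4_bytes, q4_bytes * 8 / w.size)  # 8192 2304 4.5
```

Under these assumptions, each weight costs 4 bits for its code plus a 16-bit scale shared by 32 weights, or 4.5 bits per weight, roughly 3.6x smaller than float16. Smaller blocks track local value ranges more closely but spend more bytes on scales.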