2 repos

Awesome GitHub RepositoriesHardware Acceleration Kernels

Optimized computational kernels designed to offload intensive mathematical operations to specialized hardware.

Distinguishing note: Focuses on low-level kernel optimization for tensor operations rather than high-level machine learning model management.

Explore 2 awesome GitHub repositories matching artificial intelligence & ml · Hardware Acceleration Kernels. Refine with filters or upvote what's useful.

Find the best repos with AI.We'll search the best matching repositories with AI.

ggml-org/whisper.cpp
ggml-org/whisper.cpp
46,843View on GitHub
Whisper.cpp is a high-performance, local-first speech recognition engine designed to run large-scale machine learning models on consumer hardware. It functions as a portable library that converts audio into text, supporting both static file transcription and real-time stream processing. By utilizing a lightweight inference engine and weight quantization, the project minimizes memory and compute overhead, allowing for efficient execution without reliance on external cloud APIs or internet connectivity. The project distinguishes itself through a hardware-agnostic compute abstraction that offloa
A collection of optimized kernels that offload intensive tensor operations to specialized graphics and neural processing units for maximum throughput.
C++inferenceopenaispeech-recognition
46,843View on GitHub
deepspeedai/DeepSpeed
deepspeedai/DeepSpeed
41,638View on GitHub
DeepSpeed is a high-performance library designed to scale deep learning model training and inference across massive clusters of GPUs and compute nodes. It provides a comprehensive suite of tools for distributed training, enabling the execution of models that exceed the memory capacity of single devices through advanced parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading. The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization
Custom-compiled kernels optimize mathematical operations for specific hardware architectures to maximize throughput and reduce computational latency.
Pythonbillion-parameterscompressiondata-parallelism
41,638View on GitHub