# sesameailabs/csm

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/sesameailabs-csm).**

14,669 stars · 1,483 forks · Python · Apache-2.0

## Links

- GitHub: https://github.com/SesameAILabs/csm
- awesome-repositories: https://awesome-repositories.com/repository/sesameailabs-csm.md

## Description

CSM is a conversational speech generation model and text-to-speech engine that converts text and audio inputs into synthetic speech. It uses a large language model architecture to predict audio tokens, which are then decoded into a voice waveform.
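To make the predict-then-decode idea concrete, here is a minimal toy sketch of an autoregressive token loop followed by a decoding step. Everything in it is an assumption for illustration: `generate_audio_tokens`, `decode_to_waveform`, `toy_next_token`, `EOS`, and `CODEBOOK_SIZE` are hypothetical names, the "model" is a simple arithmetic rule, and none of it is taken from the CSM codebase.

```python
from typing import Callable, List

EOS = 0             # hypothetical end-of-audio token id
CODEBOOK_SIZE = 16  # hypothetical number of audio token ids (0 is reserved for EOS)


def generate_audio_tokens(
    text_tokens: List[int],
    next_token: Callable[[List[int]], int],
    max_frames: int = 50,
) -> List[int]:
    """Predict audio tokens one at a time, feeding each back into the context."""
    sequence = list(text_tokens)
    audio_tokens: List[int] = []
    for _ in range(max_frames):
        token = next_token(sequence)
        if token == EOS:
            break
        audio_tokens.append(token)
        sequence.append(token)  # the new token becomes part of the context
    return audio_tokens


def decode_to_waveform(audio_tokens: List[int], samples_per_frame: int = 4) -> List[float]:
    """Toy decoder: expand each token into a short run of samples in [-1, 1]."""
    waveform: List[float] = []
    for token in audio_tokens:
        level = 2.0 * token / (CODEBOOK_SIZE - 1) - 1.0
        waveform.extend([level] * samples_per_frame)
    return waveform


def toy_next_token(sequence: List[int]) -> int:
    """Stand-in for a trained model: a fixed rule that stops at 8 context tokens."""
    if len(sequence) >= 8:
        return EOS
    return (sequence[-1] * 3 + 1) % (CODEBOOK_SIZE - 1) + 1


tokens = generate_audio_tokens([5, 9, 2], toy_next_token)
waveform = decode_to_waveform(tokens)
```

In a real system the rule in `toy_next_token` would be a trained network and the decoder would be a neural waveform decoder; the loop structure is the only part this sketch is meant to convey.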

The system also functions as a zero-shot voice cloner: given short audio samples, it replicates a specific speaker's identity without additional training. This gives control over which speaker the synthetic speech mimics.
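One way to picture zero-shot conditioning is as prompt assembly: reference turns that carry audio are placed ahead of a new text-only turn for the same speaker, and the model is asked to continue. The sketch below shows only that bookkeeping. The `Segment` dataclass and `build_prompt` helper are hypothetical names introduced here, not the project's API, and no synthesis happens in this code.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Segment:
    """One conversational turn (hypothetical structure for illustration)."""
    speaker: int
    text: str
    audio: Optional[Sequence[float]] = None  # reference samples; None for the turn to synthesize


def build_prompt(reference: List[Segment], speaker: int, text: str) -> List[Segment]:
    """Place reference turns first, then the new text-only turn to be voiced."""
    has_sample = any(
        seg.speaker == speaker and seg.audio is not None and len(seg.audio) > 0
        for seg in reference
    )
    if not has_sample:
        # Without a sample of the target speaker there is no identity to copy.
        raise ValueError(f"no reference audio for speaker {speaker}")
    return [*reference, Segment(speaker=speaker, text=text)]


reference = [Segment(speaker=0, text="Hi there.", audio=[0.0, 0.1, -0.1])]
prompt = build_prompt(reference, speaker=0, text="Nice to meet you.")
```

The design point is that no weights change: the speaker's identity enters only through the reference segments in the prompt, which is what "without additional training" means here.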

The model covers both conversational speech synthesis and text-to-speech generation, producing spoken audio that keeps a natural flow and cadence.

## Tags

### Artificial Intelligence & ML

- [Synthetic Speech Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/multimodal-processing-tools/synthetic-speech-generation.md) — Generates natural-sounding synthetic speech that replicates human conversational cadence and vocal characteristics. ([source](https://github.com/sesameailabs/csm#readme))
- [Audio Generation Models](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-generation-models.md) — Utilizes a large language model architecture to predict and decode audio tokens for voice synthesis.
- [Waveform Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-tokenization/waveform-decoders.md) — Uses specialized neural waveform decoders to transform internal latent representations into audible speech.
- [Voice Conditioning Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/diffusion-conditioning-architectures/voice-conditioning-encoders.md) — Utilizes voice conditioning encoders to extract vocal identity from audio samples for speech synthesis.
- [Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis.md) — Synthesizes realistic spoken dialogue that maintains natural conversational flow and cadence.
- [Zero-Shot Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/zero-shot-voice-cloning.md) — Implements zero-shot voice cloning to replicate speaker identities from short audio samples without additional training.
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Transforms written text into high-quality spoken audio using neural processing and decoding.
- [Text-to-Speech Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech-engines.md) — Provides a processing pipeline that transforms written text into high-fidelity spoken audio.
- [Speaker Identity Control](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning/voice-identity-conversions/speaker-identity-control.md) — Provides precise control over speaker identity using audio samples to mimic specific people. ([source](https://github.com/sesameailabs/csm#readme))
- [Multi-Modal Input Processors](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/multimodal-processing-tools/multi-modal-input-processors.md) — Integrates diverse inputs, including text prompts and reference audio segments, into a shared representation.
- [Sequence-to-Sequence Tasks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/sequence-to-sequence-tasks.md) — Employs sequence-to-sequence neural processing to map text and audio inputs to speech outputs.
- [Vocal Characteristic Adjustments](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/voice-synthesis/vocal-characteristic-adjustments.md) — Allows adjustment of vocal characteristics and tone to match a desired persona or individual.
- [Latent Acoustic Mapping](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/latent-acoustic-mapping.md) — Implements latent acoustic mapping to translate textual and acoustic cues into speech representations.