# rare-technologies/gensim

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/rare-technologies-gensim).**

16,442 stars · 4,407 forks · Python · LGPL-2.1

## Links

- GitHub: https://github.com/RaRe-Technologies/gensim
- Homepage: https://radimrehurek.com/gensim
- awesome-repositories: https://awesome-repositories.com/repository/rare-technologies-gensim.md

## Description

Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data.

The toolkit handles corpora larger than available memory by streaming documents from disk through iterators rather than loading the whole dataset at once. It also supports distributed model training, so that large modeling tasks can run across a cluster of machines.
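The streaming idea can be sketched with the standard library alone. The `StreamedCorpus` class and `write_demo_file` helper below are hypothetical names introduced for illustration and are not part of Gensim. The sketch assumes a plain-text file with one document per line and whitespace tokenization.

```python
import os
import tempfile
from typing import Iterator, List


class StreamedCorpus:
    """Yield one tokenized document per line of a text file.

    Hypothetical helper for illustration: only the current line is held
    in memory, and the object can be iterated more than once.
    """

    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[List[str]]:
        # Re-open the file on every pass so each iteration starts from the top.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


def write_demo_file() -> str:
    """Write a tiny three-document corpus (made-up text) to a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("Human machine interface\n\nGraph of trees\nGraph minors survey\n")
    return path


demo_path = write_demo_file()
corpus = StreamedCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second full pass works, unlike a one-shot generator
os.remove(demo_path)
```

Defining `__iter__` on a class, instead of using a bare generator, keeps the corpus restartable, which matters for any training routine that makes several passes over the data.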

The library covers a broad range of analysis capabilities, including semantic similarity queries between documents and the training of dense vector representations of words. Trained models can be saved to disk and reloaded in later sessions.

## Tags

### Artificial Intelligence & ML

- [Topic Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/latent-space-generative-models/topic-models.md) — Implements Latent Dirichlet Allocation to discover hidden themes and semantic structures in text.
- [Word Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/word-embeddings.md) — Implements a framework for training dense vector representations of words to capture semantic relationships.
- [NLP Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/nlp-toolkits.md) — Offers a set of unsupervised algorithms for processing natural language to discover patterns without labeled data.
- [Topic Modeling Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/topic-modeling-libraries.md) — Provides a comprehensive collection of unsupervised statistical tools for identifying latent themes in text.
- [Distributed Model Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-model-execution.md) — Supports spreading large model training workloads across multiple compute devices to accelerate processing.
- [Distributed Training](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-frameworks/distributed-training.md) — Allows executing complex modeling algorithms across clusters to process massive datasets. ([source](https://github.com/rare-technologies/gensim#readme))
- [Semantic Similarity Calculation](https://awesome-repositories.com/f/artificial-intelligence-ml/semantic-analysis-tools/semantic-similarity-calculation.md) — Computes the semantic similarity between documents as the cosine similarity of their vector representations.
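
The cosine measure named in the similarity tag above can be written out directly. The sketch below is a library-independent illustration, not Gensim's implementation. It assumes documents are already sparse vectors stored as `{feature_id: weight}` dictionaries, and the function names are hypothetical. Cosine distance is one minus cosine similarity.

```python
import math
from typing import Dict

SparseVector = Dict[int, float]


def cosine_similarity(a: SparseVector, b: SparseVector) -> float:
    """Cosine of the angle between two sparse vectors."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


def cosine_distance(a: SparseVector, b: SparseVector) -> float:
    """Distance counterpart: 0 for identical directions, 1 for no overlap."""
    return 1.0 - cosine_similarity(a, b)


doc_a = {0: 1.0, 1: 2.0}
doc_b = {0: 2.0, 1: 4.0}  # same direction as doc_a, twice the length
doc_c = {2: 3.0}          # shares no features with doc_a
```

Because the measure depends only on direction, `doc_a` and `doc_b` count as maximally similar even though one is twice as long, which is why it is a common choice for comparing documents of different lengths.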

### Data & Databases

- [Distributed Computing Engines](https://awesome-repositories.com/f/data-databases/data-engineering/distributed-compute-frameworks/distributed-computing-engines.md) — Can distribute the processing and transformation of massive text corpora across a cluster.
- [Data Iterators](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/batch-processing-systems/data-iterators.md) — Implements data iterators to stream large text collections from disk, avoiding memory exhaustion.
- [Large-Scale Data Computation](https://awesome-repositories.com/f/data-databases/large-scale-data-computation.md) — Processes massive text datasets that exceed system memory through distributed computation and streaming.

### Part of an Awesome List

- [General Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/general-machine-learning.md) — Toolkit for topic modeling and document similarity.