# piskvorky/gensim

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/piskvorky-gensim).**

16,361 stars · 4,410 forks · Python · LGPL-2.1

## Links

- GitHub: https://github.com/piskvorky/gensim
- Homepage: https://radimrehurek.com/gensim
- awesome-repositories: https://awesome-repositories.com/repository/piskvorky-gensim.md

## Topics

`data-mining` `data-science` `document-similarity` `fasttext` `gensim` `information-retrieval` `machine-learning` `natural-language-processing` `neural-network` `nlp` `python` `topic-modeling` `word-embeddings` `word-similarity` `word2vec`

## Description

Gensim is a natural language processing toolkit designed for large-scale text analysis and the training of semantic vector embeddings. It provides a framework for identifying latent thematic structures within document collections and calculating semantic similarity between text segments using unsupervised statistical algorithms.

Gensim can process datasets larger than available system memory by streaming the corpus incrementally, reading documents one at a time from disk. It keeps memory use low with sparse vector representations and a dictionary that maps each token to an integer id, and it supports multiprocessing and distributed computation to speed up training and inference across multiple processor cores.
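The streaming idea can be illustrated without the library. The following is a minimal sketch using only the Python standard library, not Gensim's own implementation: `build_token_ids`, `StreamedCorpus`, and the three-line sample corpus are hypothetical names and data introduced here for illustration. It assumes one whitespace-tokenized document per line.

```python
import os
import tempfile
from collections import Counter


def build_token_ids(path):
    """First pass: give each distinct token an integer id, reading one line at a time."""
    token_to_id = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            for token in line.lower().split():
                # len() is evaluated before insertion, so ids are assigned 0, 1, 2, ...
                token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


class StreamedCorpus:
    """Yields one sparse bag-of-words vector per line of the file.

    Only the current document is held in memory while iterating.
    """

    def __init__(self, path, token_to_id):
        self.path = path
        self.token_to_id = token_to_id

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token_to_id[token]
                    for token in line.lower().split()
                    if token in self.token_to_id
                )
                # Sparse form: only non-zero entries, as sorted (token_id, count) pairs.
                yield sorted(counts.items())


# Hypothetical sample corpus written to a temporary file, one document per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(
        "human computer interaction\n"
        "computer system response time\n"
        "human system system\n"
    )
    corpus_path = tmp.name

token_to_id = build_token_ids(corpus_path)
# list() materializes the vectors only because this demo corpus is tiny;
# a real pipeline would consume the iterator lazily.
vectors = list(StreamedCorpus(corpus_path, token_to_id))
os.remove(corpus_path)
```

Because `StreamedCorpus` reopens the file on each iteration, it can be passed over several times (for example, once per training epoch) without ever loading the full corpus.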

The library also transforms document representations with term weighting that accounts for term rarity (for example, TF-IDF) and indexes high-dimensional vectors for fast similarity retrieval. Pre-trained models can be loaded to bootstrap analysis tasks without training from scratch.
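Term weighting and similarity ranking can be sketched in a few lines. The example below is a standard-library illustration of one common TF-IDF variant followed by cosine similarity. It is not Gensim's code, and the exact weighting formula, the helper names `tfidf_weights` and `cosine`, and the sample vectors are all assumptions made for this sketch.

```python
import math


def tfidf_weights(corpus):
    """Turn sparse (token_id, count) documents into unit-length {token_id: weight} dicts."""
    num_docs = len(corpus)

    # Document frequency: in how many documents does each token id occur?
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in corpus:
        # Rarer terms get a larger weight; a term found in every document gets 0.
        vec = {
            tid: count * math.log2(num_docs / doc_freq[tid])
            for tid, count in doc
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({tid: w / norm for tid, w in vec.items()} if norm else {})
    return weighted


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(tid, 0.0) for tid, w in a.items())


# Hypothetical bag-of-words corpus: each document is a list of (token_id, count) pairs.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(1, 1), (3, 1), (4, 1), (5, 1)],
    [(0, 1), (3, 2)],
]

index = tfidf_weights(corpus)
query = index[2]  # reuse the third document as the query

# Document positions ordered from most to least similar to the query.
ranking = sorted(
    range(len(index)),
    key=lambda i: cosine(query, index[i]),
    reverse=True,
)
```

Normalizing every vector to unit length up front means each similarity query reduces to a sparse dot product, which is why the weighting and normalization steps are done once at indexing time.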

## Tags

### Artificial Intelligence & ML

- [Natural Language Processing Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing-libraries.md) — Offers a comprehensive toolkit for processing large text corpora, calculating similarity, and performing semantic analysis.
- [Word Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/word-embeddings.md) — Trains semantic word embeddings to capture relationships and context from large text collections. ([source](https://radimrehurek.com/gensim/auto_examples/))
- [Text Analysis Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/text-analysis-tools.md) — Processes massive text collections incrementally to build machine learning models without exceeding system memory.
- [Topic Modeling Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/topic-modeling-libraries.md) — Provides unsupervised statistical algorithms to identify and categorize latent thematic structures within large document collections. ([source](https://cdn.jsdelivr.net/gh/piskvorky/gensim@develop/README.md))
- [Topic Modeling Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/topic-modeling-toolkits.md) — Provides a specialized toolkit for identifying latent thematic structures in large text collections.
- [Vector Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/vector-embeddings.md) — Computes high-performance semantic vector representations of text using optimized and parallelized routines. ([source](https://radimrehurek.com/gensim/))
- [Document Summarization](https://awesome-repositories.com/f/artificial-intelligence-ml/document-summarization.md) — Identifies latent thematic structures within document collections to categorize and summarize content. ([source](https://radimrehurek.com/gensim/auto_examples/))
- [Large-Scale Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training.md) — Processes massive text corpora by streaming data from disk to train models without exceeding system memory. ([source](https://radimrehurek.com/gensim/))
- [Vector Similarity Search](https://awesome-repositories.com/f/artificial-intelligence-ml/vector-similarity-search.md) — Indexes high-dimensional vector representations to enable rapid retrieval of similar items from large datasets using approximate nearest neighbor techniques. ([source](https://radimrehurek.com/gensim/auto_examples/))
- [Pretrained Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-model-training/pretrained-model-integrations.md) — Facilitates the integration of external pre-trained models to bootstrap analysis tasks. ([source](https://radimrehurek.com/gensim/auto_examples/))
- [Vocabulary Mappers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-interfaces/tokenizer-base-interfaces/vocabulary-mappers.md) — Maps vocabulary terms to unique integer identifiers to create a consistent numerical index for model training.
- [Similarity Query Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/vector-similarity-search/similarity-query-engines.md) — Calculates and ranks the semantic closeness of query documents against indexed collections using vector space models. ([source](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html))

### Data & Databases

- [Vector Search Frameworks](https://awesome-repositories.com/f/data-databases/database-management-systems/database-engines/vector-databases/vector-search-frameworks.md) — Provides a framework for training and managing high-dimensional semantic vector representations using optimized machine learning routines.
- [Large Data Streamers](https://awesome-repositories.com/f/data-databases/large-data-streamers.md) — Processes documents one at a time from a collection to enable analysis of datasets that exceed available system memory. ([source](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html))
- [Latent Semantic Models](https://awesome-repositories.com/f/data-databases/semantic-data-models/latent-semantic-models.md) — Identifies latent thematic structures within document collections using unsupervised statistical algorithms.
- [Incremental Data Streaming](https://awesome-repositories.com/f/data-databases/incremental-data-streaming.md) — Enables processing of massive datasets that exceed system memory by streaming documents incrementally from disk.
- [Semantic Information Retrieval](https://awesome-repositories.com/f/data-databases/semantic-information-retrieval.md) — Calculates mathematical distance between text segments to enable accurate semantic information retrieval.
- [Approximate Nearest Neighbor Search](https://awesome-repositories.com/f/data-databases/approximate-nearest-neighbor-search.md) — Provides efficient approximate nearest neighbor search algorithms for high-dimensional vector spaces.
- [Document Relationship Resolvers](https://awesome-repositories.com/f/data-databases/relational-association-apis/document-relationship-resolvers.md) — Calculates semantic relationships between documents to enable efficient information retrieval. ([source](https://cdn.jsdelivr.net/gh/piskvorky/gensim@develop/README.md))
- [Category Identifier Mappings](https://awesome-repositories.com/f/data-databases/enum-definitions/enum-label-mappings/category-identifier-mappings.md) — Maps vocabulary terms to unique integer identifiers to create a consistent dictionary for vectorization. ([source](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html))
- [Text Vectorizers](https://awesome-repositories.com/f/data-databases/vector-storage/text-vectorizers.md) — Converts text into sparse numerical representations based on word frequency counts for semantic analysis. ([source](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html))
- [Distributed Computing](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/distributed-processing-frameworks/distributed-computing.md) — Distributes heavy computational tasks across multiple processor cores or clusters to accelerate data operations. ([source](https://cdn.jsdelivr.net/gh/piskvorky/gensim@develop/README.md))
- [Term Weighting Algorithms](https://awesome-repositories.com/f/data-databases/search-indexing-technologies/search-indexing/search-information-retrieval/query-interfaces-dsls/multi-term-search-processors/term-weighting-algorithms.md) — Transforms raw document counts into normalized numerical representations by adjusting for term rarity.

### Education & Learning Resources

- [Technical Topics](https://awesome-repositories.com/f/education-learning-resources/technical-topics.md) — Identifies latent thematic structures within large document collections using unsupervised algorithms.

### Scientific & Mathematical Computing

- [Distance Metrics](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/mathematical-libraries-and-utilities/core-mathematical-concepts/distance-metrics.md) — Computes semantic distance between text segments using mathematical metrics for document similarity. ([source](https://radimrehurek.com/gensim/auto_examples/))

### Part of an Awesome List

- [Natural Language Processing](https://awesome-repositories.com/f/awesome-lists/ai/natural-language-processing.md) — Listed in the “Natural Language Processing” section of the Awesome Python list.

### Programming Languages & Runtimes

- [Sparse Data Structures](https://awesome-repositories.com/f/programming-languages-runtimes/programming-utilities/data-structure-type-helpers/data-structures/specialized-memory-formats/sparse-data-structures.md) — Utilizes memory-efficient sparse vector representations to handle high-dimensional data.

### Software Engineering & Architecture

- [Distributed Task Processors](https://awesome-repositories.com/f/software-engineering-architecture/distributed-task-processors.md) — Distributes heavy computational tasks across multiple processor cores to accelerate training and inference.

### Content Management & Publishing

- [Weighting Transformers](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-transformation-pipelines/weighting-transformers.md) — Transforms document representations by weighting terms based on relative rarity to improve search and classification accuracy. ([source](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html))