BERTopic

BERTopic is a topic modeling library used to extract interpretable themes from collections of text documents and images. It functions as a document clustering framework that transforms unstructured data into numerical vectors to group semantically similar content.

The project distinguishes itself through a multimodal embedding tool that allows for joint clustering of text and images in a shared vector space. It also features a class-based TF-IDF representation engine to identify representative words for clusters and an integrated system for using large language models to generate natural language labels and summaries for discovered topics.
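
The class-based TF-IDF idea can be illustrated with a minimal standard-library sketch. It assumes the commonly documented weighting, term frequency within a cluster multiplied by log(1 + A / f_t), where A is the average number of words per cluster and f_t is the term's frequency across all clusters. The function name `class_tfidf`, the whitespace tokenizer, and the toy documents are hypothetical, and raw counts are used instead of normalized frequencies, so this is not the library's implementation.

```python
import math
from collections import Counter


def class_tfidf(docs, labels, top_n=3):
    """Return the top_n highest-scoring words per cluster (illustrative only)."""
    # Merge all documents that share a cluster label into one bag of tokens.
    class_tokens = {}
    for doc, label in zip(docs, labels):
        class_tokens.setdefault(label, []).extend(doc.lower().split())

    # Term frequency per cluster, and term frequency across all clusters.
    tf = {label: Counter(tokens) for label, tokens in class_tokens.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(tf)

    top_words = {}
    for label, counts in tf.items():
        weights = {
            term: count * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        # Sort by descending weight; break ties alphabetically for stable output.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        top_words[label] = ranked[:top_n]
    return top_words


# Hypothetical toy corpus with two pre-assigned clusters.
docs = [
    "cats purr and cats nap",
    "dogs bark and dogs fetch",
    "cats chase mice",
    "dogs chase balls",
]
labels = [0, 1, 0, 1]
keywords = class_tfidf(docs, labels)
```

Words that appear in both clusters, such as "and" and "chase", receive a lower weight than words concentrated in a single cluster, which is why they drop out of the top terms.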

The library covers a broad range of capabilities, including dynamic topic analysis to track themes over time, guided discovery for steering extraction with seed words, and online incremental learning for processing data streams. Its analytical surface includes the creation of topic hierarchies, outlier reduction, and a variety of visualization tools such as 2D document mapping and temporal evolution graphs.
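
As a rough illustration of the dynamic topic analysis mentioned above, the sketch below counts how many documents each topic receives per time bin. It covers only the prevalence part (not how keywords change), and the function name `topic_frequency_over_time` and the sample data are hypothetical, not the library's API.

```python
from collections import Counter, defaultdict


def topic_frequency_over_time(topics, timestamps):
    """Count documents per topic within each timestamp bin."""
    counts = defaultdict(Counter)
    for topic, timestamp in zip(topics, timestamps):
        counts[timestamp][topic] += 1
    # Return bins in chronological order as plain dictionaries.
    return {timestamp: dict(counts[timestamp]) for timestamp in sorted(counts)}


# Hypothetical topic assignments and publication years for six documents.
topics = [0, 0, 1, 1, 1, 0]
timestamps = [2021, 2021, 2021, 2022, 2022, 2022]
trend = topic_frequency_over_time(topics, timestamps)
```

Plotting these per-bin counts for each topic would give a simple version of a temporal evolution graph.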

The framework provides modular pipeline customization and supports GPU acceleration for dimensionality reduction and clustering.
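
The modular design can be pictured as a chain of swappable stages. The sketch below is a toy stand-in under stated assumptions: `TopicPipeline`, `length_embedder`, `keep_first_dim`, and `threshold_cluster` are hypothetical names invented for illustration, and the stages are deliberately trivial rather than real embedding, dimensionality reduction, or clustering models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class TopicPipeline:
    """Chains three interchangeable stages: embed, reduce, cluster."""
    embed: Callable[[Sequence[str]], List[Vector]]
    reduce: Callable[[List[Vector]], List[Vector]]
    cluster: Callable[[List[Vector]], List[int]]

    def fit_transform(self, docs: Sequence[str]) -> List[int]:
        return self.cluster(self.reduce(self.embed(docs)))


def length_embedder(docs: Sequence[str]) -> List[Vector]:
    # Toy "embedding": character count and word count.
    return [[float(len(d)), float(len(d.split()))] for d in docs]


def keep_first_dim(vectors: List[Vector]) -> List[Vector]:
    # Toy "dimensionality reduction": keep only the first coordinate.
    return [v[:1] for v in vectors]


def threshold_cluster(vectors: List[Vector], cutoff: float = 20.0) -> List[int]:
    # Toy "clustering": split on a fixed cutoff of the first coordinate.
    return [0 if v[0] < cutoff else 1 for v in vectors]


docs = ["short text", "a considerably longer piece of text"]
pipeline = TopicPipeline(length_embedder, keep_first_dim, threshold_cluster)
labels = pipeline.fit_transform(docs)

# Swapping one stage leaves the rest of the pipeline untouched.
no_reduction = TopicPipeline(length_embedder, lambda vectors: vectors, threshold_cluster)
labels_without_reduction = no_reduction.fit_transform(docs)
```

Because each stage is just a callable with a fixed input and output shape, any one of them can be replaced or made a no-op without changing the others.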

Features

Document Clustering Frameworks - Provides a comprehensive pipeline for transforming unstructured text into vectors to group semantically similar documents.
Topic Modeling Libraries - Provides a comprehensive toolkit for identifying latent thematic structures in large text collections using unsupervised algorithms.
Class-Based TF-IDF - Implements a class-based TF-IDF engine to identify the most representative words for discovered topic clusters.
Clustering Algorithms - Groups similar documents into clusters using various algorithms to identify underlying themes.
Density-Based Clustering - Groups documents into clusters by identifying high-density regions in vector space while isolating noise as outliers.
Document Analysis - Groups semantically similar documents into dense clusters while identifying and excluding noise as outliers.
Dynamic Topic Analysis - Tracks how the prevalence and keywords of specific themes evolve over time within a dataset.
Text Embedding Extraction - Converts raw text into numerical embeddings using semantic models to facilitate document clustering.
Cluster-Based Keyword Scoring - Calculates importance scores for words within a cluster using a modified TF-IDF algorithm to describe the topic.
Multimodal Document Clustering - Groups mixed-media data by creating shared vector representations for both text and images in a single space.
Multimodal Embedding Models - Creates shared vector spaces for text and images to perform joint multimodal clustering and analysis.
Semantic Document Categorization - Groups unstructured text into predefined or discovered categories based on semantic similarity.
Predefined Topic Identification - Matches documents to user-defined labels using cosine similarity while clustering remaining documents.
Temporal Topic Analysis - Calculates how topic representations and frequencies evolve across a series of timestamps.
Text Embedding Generators - Transforms text documents into numerical vector representations to enable semantic grouping and clustering.
Theme Extraction - Provides the core capability of identifying recurring themes within document collections to describe core subjects.
Topic Label Management - Enables the creation and updating of user-defined labels for topics to improve result interpretability.
Generative Labeling - Produces natural language topic summaries by passing representative keywords and documents to a generative text model.
Generative Labeling Pipelines - Utilizes text-generation pipelines to produce descriptive natural language labels for discovered topic clusters.
Label-Based Discovery Guidance - Steers the modeling process using pre-defined categories while still discovering new unknown topics.
Guided Topic Discovery - Steers the topic extraction process using seed words or predefined categories to prioritize specific known themes.
Multimodal Topic Modeling - Processes both text and images simultaneously to identify topics across different media types.
Supervised Topic Modeling - Integrates external labels during the fitting process to guide topic creation or analyze classes.
Topic Convergence Guidance - Nudges the model toward specific themes by using seed keywords to influence clustering.
Multimodal Embedding Generation - Generates shared vector representations for both text and images to enable joint multimodal clustering.
Multimodal Embeddings - Projects text and images into a shared vector space to enable joint clustering of different media types.
Multimodal Clustering - Identifies topics by clustering a combination of text and images using multimodal embeddings.
Semantic - Organizes documents into groups based on semantic similarity using embedding models and dimensionality reduction.
Dimensionality Reduction - Compresses high-dimensional embeddings into a lower-dimensional space to optimize clustering performance.
Document Topic Prediction - Predicts the most likely topic and probability for unseen documents using a previously fitted model.
Manual Topic Modeling - Processes documents using pre-generated labels instead of discovering new clusters automatically.
Hierarchical Topic Aggregation - Builds a tree structure of topics by iteratively merging the most similar clusters based on embedding distances.
Incremental Model Updating - Learns from mini-batches of data to update topics without retraining on the entire dataset.
Inference Execution - Assigns topics to new documents via cosine similarity between embeddings to bypass full clustering pipelines.
Interactive Topic Hierarchies - Generates interactive diagrams that show how topics relate and merge at different granularities.
Keyword Diversification Strategies - Limits duplicate words within a topic by comparing word embeddings to ensure a diverse set of representative keywords.
Outlier Reassignment - Assigns documents labeled as noise to the nearest existing topic based on probability or embeddings.
Topic Cluster Management - Provides tools to merge similar topics or assign outlier documents to existing clusters.
Topic Outlier Reduction - Maps documents labeled as noise to the most similar existing topics using embedding strategies.
Pipeline Component Customization - Allows swapping or removing specific components for embedding and clustering to tailor the modeling process.
Term Frequency Analyzers - Plots term scores by rank for each topic to show how many terms are needed to represent a topic.
Online Learning - Updates document vocabularies and cluster centroids using mini-batches to process data streams without full retraining.
Incremental Vectorization - Provides the ability to update document vocabularies and representations incrementally as new data arrives.
Semantic Similarity Calculation - Computes the similarity between vector representations to assign documents to topics or merge clusters.
Semantic Cluster Relationship Mapping - Maps topics and their relative sizes in a 2D space to reveal relationships between semantic clusters.
Embedding Similarity Analysis - Calculates and analyzes the cosine similarity between topic embeddings to identify related themes.
Keyword Extraction - Extracts the most similar words to a topic using cosine similarity between word embeddings.
Temporal Topic Evolution Graphing - Graphs the change in prevalence of different topics over time to identify thematic trends.
Topic Count Optimization - Implements logic to merge similar topics and optimize the final number of extracted themes.
Topic Distribution Analysis - Calculates the relative contribution of multiple topics to a document by analyzing sliding windows of tokens.
Topic Evolution Tracking - Tracks how topic representations and key terms evolve across different time intervals.
Topic Granularity Control - Adjusts the minimum cluster size and the target topic count to control how many themes are generated and how large they are.
Topic Hierarchy Creation - Builds a hierarchical tree of topics based on cosine distance to visualize merged topic levels.
Topic Hierarchy Dendrograms - Generates dendrograms that visualize how topics merge, assisting in the determination of optimal topic counts.
Label-Based Topic Modeling - Transforms predefined document clusters into interpretable topics using class-based TF-IDF representations.
LLM Pipeline Integration - Integrates custom chains to create topic labels using various LLM providers and prompt pipelines.
Zero-Shot Labeling - Maps discovered topics to a predefined list of candidate labels without requiring labeled training data.
Cluster Merging - Combines multiple specific topics into a single consolidated topic and updates the resulting representations.
Dynamic Topic Analysis - Tracks how topic representations and frequencies evolve over time using temporal analysis.
LLM Topic Labeling - Uses large language models to generate descriptive natural language labels for discovered topic clusters.
Representation Refinement - Updates the words that describe a topic by adjusting parameters without requiring a full model retrain.
Representation Updating - Recalculates term importance using updated parameters such as stop words without re-fitting the model.
Topic Projection Visualizations - Generates interactive 2D projections to visualize the global relationships and relative sizes of extracted topics.
Topic Similarity Heatmaps - Generates heatmaps based on cosine similarity matrices to visually show relationships between topics.
Hybrid Word and Document Embeddings - Merges word-level and document-level embedding models into a single workflow for comprehensive text representations.
Incremental - Updates tokenization dynamically and uses filters to prevent vocabulary bloat in online settings.
Embedding Model Integrations - Connects third-party libraries or custom cloud APIs to transform text into vector representations for topic modeling.
Topic Statistics Extraction - Retrieves detailed statistics for topics, including frequency, top words, and representative documents.
Cluster Visualizations - Plots documents and their assigned topics in 2D space to show spatial distribution.
Thematic - Creates topic representations specific to different user groups to compare how various cohorts discuss themes.
Scatter Charts - Creates interactive 2D scatter plots of topics to visualize the relationships between extracted themes.
Document Embedding Projections - Plots individual documents in a 2D plane to verify topic assignments and cluster distributions.
Document Topic Mapping - Maps documents and associated topics into a 2D space to inspect cluster relationships.
Seed Word Prioritization - Increases the weight of seed words to ensure domain-specific terminology appears in topic representations.
Diversity-Aware - Selects keywords that balance semantic similarity with diversity to avoid redundant terms in topic descriptions.
Visual Topic Summaries - Assigns representative images or captions to identified topics to provide a visual summary of the content.
Modular Modeling Pipelines - Provides a modular pipeline architecture allowing the swapping of embedding, dimensionality reduction, and clustering components.
Trend Analysis - Monitors how the prevalence and nature of specific themes evolve over time across a dataset.
Language Model Development - Implements a topic modeling technique built on transformer embeddings and density-based clustering.
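
Several of the features above (outlier reassignment, inference on new documents, topic merging) rely on cosine similarity between embeddings. The sketch below shows one embedding-based way to reassign outliers: compare each outlier to the mean embedding of every topic and pick the closest. It assumes outliers carry the label -1, as is conventional for density-based clustering; the function names, the `threshold` parameter, and the two-dimensional vectors are hypothetical and not the library's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reassign_outliers(embeddings, topics, threshold=0.0):
    """Move documents labeled -1 to the most similar topic centroid."""
    # Group embeddings by topic, skipping outliers.
    grouped = {}
    for vector, topic in zip(embeddings, topics):
        if topic != -1:
            grouped.setdefault(topic, []).append(vector)

    # A topic's centroid is the mean of its document embeddings.
    centroids = {
        topic: [sum(column) / len(vectors) for column in zip(*vectors)]
        for topic, vectors in grouped.items()
    }

    updated = []
    for vector, topic in zip(embeddings, topics):
        if topic == -1 and centroids:
            best = max(centroids, key=lambda t: cosine_similarity(vector, centroids[t]))
            # Only reassign when the match is at least as similar as the threshold.
            if cosine_similarity(vector, centroids[best]) >= threshold:
                topic = best
        updated.append(topic)
    return updated


# Hypothetical 2D embeddings: two clear topics and one outlier near topic 0.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
topics = [0, 0, 1, 1, -1]
updated = reassign_outliers(embeddings, topics)
strict = reassign_outliers(embeddings, topics, threshold=0.999)
```

Raising the threshold keeps weakly matching documents as outliers instead of forcing them into a topic.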

lancedb/lancedb

9,031 GitHub stars

LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters

MilaNLProc/contextualized-topic-models

1,271 GitHub stars

Contextualized topic modeling is a framework that integrates deep learning architectures with statistical word frequency distributions to extract coherent themes from large document collections. By combining pre-trained transformer-based embeddings with variational inference, the system identifies hidden patterns in text while maintaining the interpretability of traditional generative models. The library distinguishes itself by mapping diverse languages into a shared semantic space, enabling topic discovery and classification across multilingual datasets without requiring language-specific tr

OFA-Sys/Chinese-CLIP

5,942 GitHub stars

Chinese-CLIP is a multimodal framework and vision-language model designed for cross-modal retrieval and representation generation using Chinese text and images. It employs a contrastive learning architecture to map visual and textual data into a shared vector space for similarity calculations. The system enables bidirectional search, allowing for text-to-image and image-to-text retrieval. It also provides zero-shot image classification, which identifies objects within images without requiring task-specific training. The project includes tools for fine-tuning pre-trained models on specialized

huggingface/sentence-transformers

18,817 GitHub stars

This project is a transformer-based framework for generating dense and sparse vector embeddings of text and multimodal data. It serves as a library for fine-tuning models to perform semantic similarity tasks, retrieval, and reranking. The system is distinguished by its support for diverse architectural patterns, including bi-encoders for fast similarity search and cross-encoders for high-precision reranking. It provides dedicated pipelines for multimodal embeddings, mapping text and images into a shared vector space, and implements knowledge distillation to compress large models into smaller,

MaartenGr/BERTopic
