BERTopic is a topic modeling library used to extract interpretable themes from collections of text documents and images. It functions as a document clustering framework that transforms unstructured data into numerical vectors to group semantically similar content.
The project distinguishes itself through multimodal embeddings, which allow text and images to be clustered jointly in a shared vector space. It also features a class-based TF-IDF (c-TF-IDF) representation engine that identifies the most representative words for each cluster, and it integrates large language models to generate natural language labels and summaries for discovered topics.
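To make the class-based TF-IDF idea concrete, here is a minimal plain-Python sketch. It is not BERTopic's actual implementation. It assumes whitespace tokenization and the weighting tf * log(1 + A / f), where A is the average word count per cluster and f is a term's frequency across all clusters. The function names `class_tfidf` and `top_terms` and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_tfidf(docs, labels):
    """Score each term per cluster with a simplified class-based TF-IDF."""
    # Treat all documents of one cluster as a single bag of words.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.lower().split())

    # Frequency of every term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per cluster (the "A" in the formula).
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            # Normalized term frequency times the class-level IDF term.
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores


def top_terms(scores, label, n=3):
    """Return the n highest-scoring terms of a cluster (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical toy corpus with precomputed cluster labels.
docs = [
    "cat chased mouse today",
    "cat slept sofa",
    "stock market fell today",
    "investors sold stock",
]
labels = [0, 0, 1, 1]
scores = class_tfidf(docs, labels)
print(top_terms(scores, 0))
print(top_terms(scores, 1))
```

In this toy corpus, a word that appears in both clusters ("today") scores lower than an equally frequent word that is unique to one cluster, which is the property that makes the top-ranked terms useful as cluster descriptions.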
The library covers a broad range of capabilities, including dynamic topic analysis to track themes over time, guided discovery for steering extraction with seed words, and online incremental learning for processing data streams. Its analytical surface includes the creation of topic hierarchies, outlier reduction, and a variety of visualization tools such as 2D document mapping and temporal evolution graphs.
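As a rough illustration of the bookkeeping behind dynamic topic analysis, the sketch below counts how many documents each topic receives per time bin. It covers only the frequency side and does not recompute topic words per period. The function name `topic_frequency_over_time`, the topic labels, and the year values are hypothetical and are not part of BERTopic's API.

```python
from collections import Counter, defaultdict


def topic_frequency_over_time(topics, timestamps):
    """Count how many documents each topic has in each time bin."""
    counts = defaultdict(Counter)
    for topic, ts in zip(topics, timestamps):
        counts[ts][topic] += 1
    # Return a plain dict ordered by time bin for easy inspection.
    return {ts: dict(counts[ts]) for ts in sorted(counts)}


# Hypothetical topic assignments, one per document, with a year per document.
topics = [0, 0, 1, 1, 1, 0]
years = [2021, 2021, 2021, 2022, 2022, 2022]

timeline = topic_frequency_over_time(topics, years)
for year, freq in timeline.items():
    print(year, freq)
```

A table like this is the kind of input a temporal evolution graph would plot, with one line per topic across the time bins.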
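The framework is modular: each stage of the pipeline (embedding, dimensionality reduction, clustering, and topic representation) can be customized or replaced. It also supports GPU acceleration for dimensionality reduction and clustering.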
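The idea of a modular pipeline can be sketched with plain Python callables. This is a toy model of the design, not BERTopic's API: the `TopicPipeline` class, the keyword-count "embedding", and the argmax "clustering" are all hypothetical stand-ins chosen so the example runs without external dependencies.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class TopicPipeline:
    """Three swappable stages: embed, reduce dimensionality, then cluster."""
    embed: Callable[[Sequence[str]], List[Vector]]
    reduce: Callable[[List[Vector]], List[Vector]]
    cluster: Callable[[List[Vector]], List[int]]

    def fit_transform(self, docs: Sequence[str]) -> List[int]:
        return self.cluster(self.reduce(self.embed(docs)))


KEYWORDS = ("cat", "stock")  # hypothetical toy vocabulary


def keyword_embed(docs):
    """Toy embedding: one count per keyword, plus the document length."""
    vectors = []
    for doc in docs:
        tokens = doc.lower().split()
        counts = tuple(float(tokens.count(word)) for word in KEYWORDS)
        vectors.append(counts + (float(len(tokens)),))
    return vectors


def drop_length_dim(vectors):
    """Toy reduction: discard the last (document length) dimension."""
    return [v[:-1] for v in vectors]


def argmax_cluster(vectors):
    """Toy clustering: assign each vector to its largest dimension."""
    return [max(range(len(v)), key=lambda i: v[i]) for v in vectors]


docs = ["cat chased mouse", "stock market fell", "cat slept sofa"]

pipeline = TopicPipeline(keyword_embed, drop_length_dim, argmax_cluster)
labels = pipeline.fit_transform(docs)

# Swap a single stage (here: no reduction) without touching the others.
variant = replace(pipeline, reduce=lambda vectors: vectors)
variant_labels = variant.fit_transform(docs)

print(labels)
print(variant_labels)
```

Because each stage only depends on the output shape of the previous one, a stage can be replaced independently. In this toy setup, removing the reduction step lets the document-length dimension dominate, so every document falls into the same group.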