# minishlab/semhash

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/minishlab-semhash).**

936 stars · 57 forks · Python · MIT

## Links

- GitHub: https://github.com/MinishLab/semhash
- Homepage: https://minish.ai/packages/semhash/introduction
- awesome-repositories: https://awesome-repositories.com/repository/minishlab-semhash.md

## Topics

`datasets` `deduplication` `image-dataset-cleaning` `model2vec` `preprocessing` `semantic-deduplication` `text-dataset-cleaning` `vicinity`

## Description

Fast Multimodal Semantic Deduplication & Filtering
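
To make the idea of similarity-based deduplication concrete, here is a minimal sketch. It is not SemHash's API or implementation. The `embed`, `cosine`, and `deduplicate` names are hypothetical, character n-gram counts stand in for a real embedding model, and a linear scan stands in for the nearest-neighbor index a fast library would presumably use.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for an embedding model: character n-gram counts.

    Assumption: a real pipeline would use a learned embedding model here.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(records: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a record only if it is below `threshold` similarity to every kept record."""
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for record in records:
        vector = embed(record)
        # Linear scan over kept records; fine for a sketch, too slow at scale.
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(record)
            kept_vectors.append(vector)
    return kept


records = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",
    "Semantic deduplication removes near-duplicate records.",
]
unique_records = deduplicate(records)
```

In this sketch the second record differs from the first only in its final character, so it falls above the threshold and is dropped, while the third record is kept. The threshold value of 0.8 is an arbitrary choice for illustration.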

## Tags

### Part of an Awesome List

- [Data Curation and Filtering](https://awesome-repositories.com/f/awesome-lists/ai/data-curation-and-filtering.md) — Fuzzy deduplication tool using fast embedding generation.
- [LLM Development Tools](https://awesome-repositories.com/f/awesome-lists/ai/llm-development-tools.md) — Library for near-deduplication and decontamination of text datasets.
- [Training Datasets](https://awesome-repositories.com/f/awesome-lists/ai/training-datasets.md) — Listed in the “Training Datasets” section of the LLM Course awesome list.
