# alex000kim/nsfw_data_scraper

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/alex000kim-nsfw-data-scraper).**

12,541 stars · 2,866 forks · Shell · MIT

## Links

- GitHub: https://github.com/alex000kim/nsfw_data_scraper
- awesome-repositories: https://awesome-repositories.com/repository/alex000kim-nsfw-data-scraper.md

## Topics

`content-moderation` `deep-learning` `machine-learning` `nsfw` `nsfw-classifier` `pornography`

## Description

This project is a machine learning data pipeline that automates the collection, curation, and preparation of large-scale image datasets. It scrapes categorized image files from web sources and organizes them into structured directories for model development.

The pipeline runs as a series of batch steps that combine data acquisition with automated integrity validation. It scans downloaded files to remove corrupted or invalid images, then applies deterministic partitioning to split each collection into training and validation subsets, so that datasets remain consistent and ready for machine learning workflows.
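
The integrity-validation step described above could be approximated as follows. This is a minimal Python sketch, not the repository's own code (the project itself is written in Shell). The names `SIGNATURES`, `sniff_image_type`, and `remove_invalid_images` are hypothetical, and the check only inspects file headers, so a truncated file with a valid header would still pass.

```python
from pathlib import Path
from typing import List, Optional

# Magic-byte prefixes for a few common image formats.
# Assumption: illustrative and not exhaustive.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def sniff_image_type(header: bytes) -> Optional[str]:
    """Return the detected format name, or None if no signature matches."""
    for name, magic in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def remove_invalid_images(root: Path) -> List[Path]:
    """Delete every file under root whose header is not a known image signature.

    Returns the list of removed paths in sorted order.
    """
    removed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            header = fh.read(16)
        if sniff_image_type(header) is None:
            path.unlink()
            removed.append(path)
    return removed
```

Calling `remove_invalid_images(Path("raw_data"))` on a hypothetical download directory would prune files such as HTML error pages saved with an image extension, which is a common failure mode when scraping.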

Beyond data management, the project includes tooling for training convolutional neural networks, which lets users develop and refine image classification models for automated content moderation and pattern recognition tasks. Together, the repository's scripts cover the entire lifecycle of the image data, from initial web traversal to the final preparation of training sets.
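
The final preparation step, splitting a collection into training and validation subsets deterministically, might be sketched like this. It is an illustration under stated assumptions rather than the repository's method: the hash-based assignment, the function names `assign_split` and `split_dataset`, and the default `val_fraction` of 0.2 are all hypothetical choices.

```python
import hashlib
from typing import Dict, Iterable, List


def assign_split(filename: str, val_fraction: float = 0.2) -> str:
    """Map a filename to 'train' or 'val' using a stable hash of its name."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "val" if bucket < val_fraction else "train"


def split_dataset(
    filenames: Iterable[str], val_fraction: float = 0.2
) -> Dict[str, List[str]]:
    """Partition filenames into train and val lists, independent of input order."""
    splits: Dict[str, List[str]] = {"train": [], "val": []}
    for name in sorted(filenames):
        splits[assign_split(name, val_fraction)].append(name)
    return splits
```

Because each file's assignment depends only on its own name, rerunning the split after adding newly scraped images leaves every existing file in the subset it was already in, unlike a seeded shuffle over the whole list.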

## Tags

### Artificial Intelligence & ML

- [Convolutional Neural Networks](https://awesome-repositories.com/f/artificial-intelligence-ml/convolutional-neural-networks.md) — Provides a framework for training image classification models to automate content moderation and pattern recognition.
- [Dataset Curators](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/frameworks/computer-vision/dataset-curators.md) — Manages large-scale image collections through automated integrity validation, directory structuring, and deterministic partitioning.
- [Image Dataset Scrapers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/machine-learning-datasets/image-classification-datasets/image-dataset-scrapers.md) — Automates the collection and organization of categorized image files from web sources to build training sets.
- [Computer Vision Training](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-training.md) — Provides standardized training routines for preparing and validating image-based neural network models.
- [Moderation Classifiers](https://awesome-repositories.com/f/artificial-intelligence-ml/image-classification/transformer-based-image-classifiers/convolutional-classifiers/moderation-classifiers.md) — Trains convolutional neural networks to identify specific content patterns for automated moderation tasks. ([source](https://github.com/alex000kim/nsfw_data_scraper#readme))
- [Model Training Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/training-frameworks/model-training-pipelines.md) — Provides scripts for preparing, cleaning, and splitting image datasets to facilitate model training and validation.
- [Machine Learning Datasets](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/machine-learning-datasets.md) — Collects and organizes large sets of images from the web to build high-quality training data.
- [Data Preparation Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/data-ingestion-preparation/data-preparation-tools.md) — Cleans and structures downloaded images into directories suitable for machine learning ingestion. ([source](https://github.com/alex000kim/nsfw_data_scraper#readme))
- [Deterministic Partitioners](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-preparation-tools/dataset-sampling-utilities/deterministic-partitioners.md) — Splits image collections into training and validation subsets using deterministic sampling to ensure unbiased evaluation.
- [Training and Evaluation Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/training-frameworks/training-and-evaluation-pipelines.md) — Partitions image collections into training and testing sets to facilitate model evaluation and performance monitoring. ([source](https://github.com/alex000kim/nsfw_data_scraper#readme))

### Content Management & Publishing

- [Content Moderation Tools](https://awesome-repositories.com/f/content-management-publishing/content-moderation-tools.md) — Trains neural networks to automatically detect and filter specific types of visual content within digital platforms.

### Data & Databases

- [Automated Dataset Aggregators](https://awesome-repositories.com/f/data-databases/data-collections-datasets/classification-datasets/automated-dataset-aggregators.md) — Aggregates categorized image files from web sources to build comprehensive training sets. ([source](https://github.com/alex000kim/nsfw_data_scraper#readme))
- [Web Crawlers](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-collection-tools/web-crawlers.md) — Automates the traversal of web sources to fetch and store raw image files for dataset construction.

### Web Development

- [Web Scraping](https://awesome-repositories.com/f/web-development/web-scraping.md) — Automates the collection and cleaning of large-scale image files from web sources for research and development.

### Development Tools & Productivity

- [Batch Processing Pipelines](https://awesome-repositories.com/f/development-tools-productivity/batch-processing-pipelines.md) — Orchestrates sequential data collection, cleaning, and partitioning tasks into efficient processing workflows.
