# codelucas/newspaper

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/codelucas-newspaper).**

14,982 stars · 2,134 forks · HTML · MIT

## Links

- GitHub: https://github.com/codelucas/newspaper
- Homepage: https://goo.gl/VX41yK
- awesome-repositories: https://awesome-repositories.com/repository/codelucas-newspaper.md

## Topics

`crawler` `crawling` `news` `news-aggregator` `python` `scraper`

## Description

Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources.
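
The download-then-structure flow described above can be sketched as follows. `Article`, `download()`, `parse()`, and the attributes read in `to_record` are the library's documented entry points; `fetch_article`, `to_record`, and the `sample` stand-in are hypothetical names introduced here for illustration, and the library is imported lazily so the offline part runs without it.

```python
from types import SimpleNamespace


def fetch_article(url):
    """Download and parse one article (needs network access and the library)."""
    # Imported lazily so the rest of this sketch runs without the dependency.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # isolate title, authors, body text, and media
    return article


def to_record(article):
    """Flatten a parsed article into a plain dict for storage or display."""
    return {
        "title": article.title,
        "authors": list(article.authors),
        "published": article.publish_date,
        "text": article.text,
        "top_image": article.top_image,
    }


# Offline stand-in with the same attribute names, used only for illustration.
sample = SimpleNamespace(
    title="Example headline",
    authors=["A. Writer"],
    publish_date=None,
    text="Body text.",
    top_image="",
)
record = to_record(sample)
print(record["title"])
```

With network access, `to_record(fetch_article(url))` would produce the same shape of record for a real page.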

The project combines heuristic content extraction with natural language processing in a single pipeline. It isolates the article body from page boilerplate and also performs language detection, keyword extraction, and text summarization. Multilingual processing is supported by letting users supply custom stopword lists and tokenization rules for each language.
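
To make the role of stopword lists and tokenization rules concrete, here is a minimal frequency-based sketch. It is not the library's own algorithm: `extract_keywords`, `DEFAULT_STOPWORDS`, and the `tokenize` parameter are hypothetical names, and plain word counting stands in for whatever scoring the library applies.

```python
import re
from collections import Counter

# Tiny illustrative English list; real stopword lists are much longer.
DEFAULT_STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}
)


def default_tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def extract_keywords(text, stopwords=DEFAULT_STOPWORDS, tokenize=default_tokenize, top_n=3):
    """Return the most frequent tokens that are not stopwords."""
    counts = Counter(
        token for token in tokenize(text)
        if token not in stopwords and len(token) > 1
    )
    return [word for word, _ in counts.most_common(top_n)]


english = extract_keywords(
    "The parser cleans the page and the parser extracts text for the parser."
)

# Another language only needs its own stopword list (and tokenizer if required).
german = extract_keywords(
    "Der Hund und die Katze und der Hund",
    stopwords={"der", "die", "und"},
    top_n=2,
)
print(english[0], german)
```

Passing the stopword set and tokenizer as arguments keeps the counting logic language-neutral, which is the design idea the paragraph above describes.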

To speed up large-scale runs, the framework downloads pages concurrently through a thread pool and keeps a persistent key-value cache so that content already fetched is not requested again. It also preserves the original semantic HTML structure of extracted content, so the markup remains available for downstream processing or display.
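
The combination of a thread pool and an on-disk key-value cache can be illustrated with the standard library alone. This is a sketch of the general technique, not the library's implementation: `fetch_all`, `fake_fetch`, and the cache file name are hypothetical, and the downloader is injected so the example runs without network access.

```python
import shelve
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_all(urls, fetch, cache_path, max_workers=4):
    """Return {url: html}, downloading only URLs missing from the on-disk cache."""
    with shelve.open(str(cache_path)) as cache:
        # Deduplicate while keeping order, then skip anything already cached.
        missing = [u for u in dict.fromkeys(urls) if u not in cache]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Worker threads only download; the main thread writes to the cache.
            for url, html in zip(missing, pool.map(fetch, missing)):
                cache[url] = html
        return {u: cache[u] for u in urls}


calls = []  # records which URLs were actually "downloaded"


def fake_fetch(url):
    """Stand-in downloader so the sketch needs no network."""
    calls.append(url)
    return f"<html>{url}</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "pages"
    urls = ["https://example.com/a", "https://example.com/b"]
    first = fetch_all(urls, fake_fetch, cache_file)
    # The second run reuses the cache and only fetches the new URL.
    second = fetch_all(urls + ["https://example.com/c"], fake_fetch, cache_file)

print(len(calls))
```

Keeping cache writes on the main thread avoids sharing the `shelve` handle across workers, which is one simple way to combine the two mechanisms safely.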

The toolkit includes command-line utilities for executing concurrent retrieval tasks and fetching trending search terms. Users can customize extraction parameters, such as timeouts and content filtering, through global or instance-specific configuration settings.
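
The split between global and instance-specific settings might look like the sketch below. Every name here (`ExtractionConfig`, `GLOBAL_CONFIG`, `Extractor`) and every default value is a hypothetical chosen for illustration; the library ships its own configuration object, whose exact fields should be taken from its documentation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExtractionConfig:
    """Illustrative settings; field names and defaults are assumptions."""
    request_timeout: float = 7.0     # seconds to wait for a response
    min_word_count: int = 300        # filter out very short pages
    keep_article_html: bool = False  # retain cleaned article markup


# Global defaults shared by every extractor unless overridden.
GLOBAL_CONFIG = ExtractionConfig()


class Extractor:
    """Holds a URL plus the effective settings for this one instance."""

    def __init__(self, url, config=None, **overrides):
        base = config if config is not None else GLOBAL_CONFIG
        self.url = url
        # replace() copies the base config, so the global one is never mutated.
        self.config = replace(base, **overrides)


default_run = Extractor("https://example.com/a")
fast_run = Extractor("https://example.com/b", request_timeout=2.0)
print(default_run.config.request_timeout, fast_run.config.request_timeout)
```

Using an immutable config with per-instance copies means an override on one extractor cannot leak into others.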

## Tags

### Data & Databases

- [Content Extraction](https://awesome-repositories.com/f/data-databases/content-extraction.md) — Downloads and parses web pages to isolate text, authors, publication dates, and media from articles. ([source](https://cdn.jsdelivr.net/gh/codelucas/newspaper@master/README.md))
- [News Aggregators](https://awesome-repositories.com/f/data-databases/full-text-search-engines/news-aggregators.md) — Automates the discovery and collection of news feeds and article sources from diverse online publications. ([source](https://newspaper.readthedocs.io/en/latest/user_guide/advanced.html))
- [Web Data Extraction](https://awesome-repositories.com/f/data-databases/web-data-extraction.md) — Provides automated tools for programmatically scraping, parsing, and extracting structured content from diverse web-based news sources.
- [Key-Value Persistence Stores](https://awesome-repositories.com/f/data-databases/key-value-persistence-stores.md) — Caches processed article metadata and content on disk to minimize redundant network requests.
- [Source Discovery Mechanisms](https://awesome-repositories.com/f/data-databases/data-ingestion-sources/source-metadata-capture/source-discovery-mechanisms.md) — Parses site feeds and structural markers to automatically identify and catalog new content categories and article sources.
- [Extraction Configurations](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction/extraction-configurations.md) — Provides global and instance-specific settings for customizing extraction parameters like timeouts and content filtering. ([source](https://newspaper.readthedocs.io/en/latest/user_guide/advanced.html))

### Web Development

- [Web Scraping](https://awesome-repositories.com/f/web-development/web-scraping.md) — Collects high volumes of web data efficiently using multi-threaded processes and caching.
- [Web Scraping and Automation](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation.md) — Automates the retrieval and parsing of news articles from the web using a multi-threaded crawling and extraction pipeline.

### Artificial Intelligence & ML

- [Natural Language Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing.md) — Analyzes extracted text to generate summaries, identify keywords, and detect languages.
- [Natural Language Processing Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing-libraries.md) — Integrates natural language processing capabilities for automated keyword extraction, language detection, and text summarization of web content.
- [Text Summarization](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/nlp-applications/text-summarization.md) — Generates concise summaries and extracts relevant keywords from article text using NLP techniques. ([source](https://newspaper.readthedocs.io/en/latest/_sources/index.rst.txt))
- [Text Analysis Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/text-analysis-tools.md) — Distills extracted content into concise formats using natural language processing for summarization and keyword identification. ([source](https://newspaper.readthedocs.io/en/latest/))
- [Content Processing Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/content-processing-pipelines.md) — Processes news content across diverse international languages using automated detection and parsing workflows. ([source](https://newspaper.readthedocs.io/en/latest/))
- [Language Detection Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/language-detection-tools.md) — Automatically identifies the language of web pages to ensure accurate parsing of international content. ([source](https://cdn.jsdelivr.net/gh/codelucas/newspaper@master/README.md))
- [Stopword-Driven Tokenizers](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/stopword-driven-tokenizers.md) — Utilizes language-specific dictionaries and tokenization rules to perform accurate keyword extraction across international languages.

### Part of an Awesome List

- [Content Extraction](https://awesome-repositories.com/f/awesome-lists/devtools/content-extraction.md) — Extraction and curation of news articles.

### Content Management & Publishing

- [Content Extraction Engines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/content-extraction-engines.md) — Parses and normalizes raw web content into structured data, including text, metadata, and media, from various online publications.

### Operating Systems & Systems Programming

- [Concurrent Downloaders](https://awesome-repositories.com/f/operating-systems-systems-programming/system-administration-maintenance/system-administration-utilities/system-utilities/download-managers/concurrent-downloaders.md) — Optimizes collection speed by executing multi-threaded downloads while respecting server rate limits. ([source](https://newspaper.readthedocs.io/en/latest/user_guide/advanced.html))

### Programming Languages & Runtimes

- [Thread Pools](https://awesome-repositories.com/f/programming-languages-runtimes/language-features-paradigms/concurrency-models/concurrency/task-orchestration-frameworks/thread-pools.md) — Distributes network requests across worker threads to accelerate simultaneous retrieval of web content.

### Software Engineering & Architecture

- [Data Processing Pipelines](https://awesome-repositories.com/f/software-engineering-architecture/data-processing-pipelines.md) — Implements modular pipelines for language detection, text cleaning, and summarization of web content.

### User Interface & Experience

- [HTML Content Processing](https://awesome-repositories.com/f/user-interface-experience/html-content-processing.md) — Preserves original semantic HTML structures of extracted content for downstream processing or display. ([source](https://newspaper.readthedocs.io/en/latest/user_guide/advanced.html))
