# kepano/defuddle

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/kepano-defuddle).**

3,189 stars · 109 forks · TypeScript · MIT

## Links

- GitHub: https://github.com/kepano/defuddle
- Homepage: https://kepano.github.io/defuddle/
- awesome-repositories: https://awesome-repositories.com/repository/kepano-defuddle.md

## Description

Defuddle is a command-line web parser and content extractor that isolates the primary article body of a web page and converts it to standardized Markdown. It works as a content cleaner, removing layout clutter such as sidebars and headers so that only the main text and its associated metadata remain.

The tool provides a terminal interface that processes content from remote URLs, local files, or piped HTML streams. It supports custom content targeting, allowing users to specify CSS selectors to manually define the main content area when automatic detection is insufficient.
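The three input paths described above can be pictured as a small dispatch step. The following is a minimal sketch, not Defuddle's actual code: the `InputSource` type, the `resolveInput` function, and the convention that a missing argument or `-` means piped input are all assumptions made for illustration.

```typescript
// Hypothetical helper, not part of Defuddle: decide how a CLI argument
// should be read before any parsing happens.
type InputSource =
  | { kind: "url"; url: string }
  | { kind: "file"; path: string }
  | { kind: "stdin" };

function resolveInput(arg: string | undefined): InputSource {
  // Assumption: no argument (or the conventional "-") means HTML is piped in.
  if (arg === undefined || arg === "-") {
    return { kind: "stdin" };
  }
  // Only http(s) schemes are treated as remote pages in this sketch.
  if (/^https?:\/\//i.test(arg)) {
    return { kind: "url", url: arg };
  }
  // Everything else is assumed to be a path to a local HTML file.
  return { kind: "file", path: arg };
}
```

A caller would then fetch, read, or consume standard input depending on `kind`, and hand the resulting HTML string to the extractor.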

The system employs heuristic-based extraction and DOM-tree sanitization to identify core content and standardize page elements. It also includes metadata schema parsing to extract structured information including titles, authors, and publication dates.
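To make "heuristic-based extraction" concrete, here is a minimal sketch of one common family of heuristics: scoring candidate blocks by text length and link density. It is an illustration only and does not claim to reproduce Defuddle's scoring. The `BlockStats` shape, the function names, and the weight of 25 per paragraph are hypothetical choices.

```typescript
// Hypothetical summary of a candidate block; a real extractor would
// compute these numbers from DOM nodes.
interface BlockStats {
  selector: string;       // where the block lives, for reporting only
  textLength: number;     // characters of visible text
  linkTextLength: number; // characters of text inside <a> elements
  paragraphCount: number; // number of <p> descendants
}

// Favour long, paragraph-rich text and penalise link-heavy blocks.
// The weight 25 is an arbitrary illustrative constant.
function scoreBlock(block: BlockStats): number {
  if (block.textLength === 0) return 0;
  const linkDensity = block.linkTextLength / block.textLength;
  return block.textLength * (1 - linkDensity) + block.paragraphCount * 25;
}

// Return the highest-scoring block, or undefined if nothing scores above 0.
function pickMainBlock(blocks: BlockStats[]): BlockStats | undefined {
  let best: BlockStats | undefined;
  let bestScore = 0;
  for (const block of blocks) {
    const score = scoreBlock(block);
    if (score > bestScore) {
      best = block;
      bestScore = score;
    }
  }
  return best;
}

const candidates: BlockStats[] = [
  { selector: "nav", textLength: 300, linkTextLength: 290, paragraphCount: 0 },
  { selector: "article", textLength: 4000, linkTextLength: 200, paragraphCount: 12 },
  { selector: "aside", textLength: 600, linkTextLength: 400, paragraphCount: 1 },
];

console.log(pickMainBlock(candidates)?.selector);
```

With these sample numbers the navigation and sidebar blocks score low because most of their text sits inside links, so the `article` block is selected. A user-supplied CSS selector, as described earlier, would simply bypass this step.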

## Tags

### Development Tools & Productivity

- [Web Content Parsing CLI](https://awesome-repositories.com/f/development-tools-productivity/web-content-parsing-cli.md) — Provides a command-line interface utility to parse web pages using URLs, local files, or piped HTML input. ([source](https://cdn.jsdelivr.net/gh/kepano/defuddle@main/README.md))
- [CLI Input Resolvers](https://awesome-repositories.com/f/development-tools-productivity/cli-input-resolvers.md) — Implements a command-line interface that resolves input content from remote URLs, local files, or piped HTML streams.
- [Web Content Parser CLI](https://awesome-repositories.com/f/development-tools-productivity/web-content-parser-cli.md) — Provides a terminal interface for retrieving and parsing web content using URLs, local files, or piped HTML.

### Web Development

- [Web Page Content Cleaning](https://awesome-repositories.com/f/web-development/web-page-content-cleaning.md) — Provides a system for isolating primary article text by removing layout clutter, sidebars, and headers from web pages.
- [Web Scraping](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-scraping.md) — Processes URLs or local HTML files through a terminal to extract structured information without a browser.

### Content Management & Publishing

- [Content Extractors](https://awesome-repositories.com/f/content-management-publishing/content-extractors.md) — Extracts the main text content from webpages by removing boilerplate and navigation elements.
- [Heuristic Extraction Methods](https://awesome-repositories.com/f/content-management-publishing/full-text-content-extraction/heuristic-extraction-methods.md) — Uses algorithms analyzing tag frequency and structural patterns to identify and extract the primary article body.
- [HTML to Markdown Converters](https://awesome-repositories.com/f/content-management-publishing/html-to-markdown-converters.md) — Transforms cleaned DOM structures into a standardized markdown format for improved readability and storage.
- [Web Article Extraction](https://awesome-repositories.com/f/content-management-publishing/web-article-extraction.md) — Isolates main body text and core content from web pages to extract titles, authors, and publication dates.
- [Automatic Page Metadata Extraction](https://awesome-repositories.com/f/content-management-publishing/automatic-page-metadata-extraction.md) — Retrieves structured page information including authors, titles, publication dates, and languages. ([source](https://cdn.jsdelivr.net/gh/kepano/defuddle@main/README.md))
- [Selector-Based Content Targeting](https://awesome-repositories.com/f/content-management-publishing/custom-content-sources/selector-based-content-targeting.md) — Defines specific CSS selectors to extract content from websites where automatic detection fails.
- [Web Metadata Extraction](https://awesome-repositories.com/f/content-management-publishing/web-metadata-extraction.md) — Extracts structured page information by mapping OpenGraph, JSON-LD, and meta tags to a unified data model.
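
Mapping several metadata sources onto one data model usually comes down to a precedence rule per field. The sketch below assumes the three sources have already been parsed into flat key/value records. The `RawSources` and `PageMetadata` types, the function names, and the precedence order (JSON-LD, then OpenGraph, then plain meta tags) are assumptions for illustration, not Defuddle's documented behaviour.

```typescript
// Unified model, limited to the fields named in this listing.
interface PageMetadata {
  title?: string;
  author?: string;
  published?: string;
}

// Hypothetical pre-parsed inputs. Real JSON-LD is nested; it is
// flattened to strings here to keep the sketch short.
interface RawSources {
  jsonLd: Record<string, string>;
  openGraph: Record<string, string>;
  meta: Record<string, string>;
}

// Return the first value that is present and not blank, trimmed.
function firstNonEmpty(...values: (string | undefined)[]): string | undefined {
  for (const value of values) {
    if (value !== undefined && value.trim() !== "") {
      return value.trim();
    }
  }
  return undefined;
}

// Assumed precedence: JSON-LD, then OpenGraph, then plain <meta> tags.
function unifyMetadata(src: RawSources): PageMetadata {
  return {
    title: firstNonEmpty(src.jsonLd["headline"], src.openGraph["og:title"], src.meta["title"]),
    author: firstNonEmpty(src.jsonLd["author"], src.meta["author"]),
    published: firstNonEmpty(
      src.jsonLd["datePublished"],
      src.openGraph["article:published_time"],
      src.meta["date"],
    ),
  };
}
```

Blank values fall through to the next source, so a page with an empty JSON-LD `headline` still gets its title from `og:title`.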

### Data & Databases

- [Web Content Scrapers](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/web-extraction-engines/web-content-scrapers.md) — Extracts information from web pages and converts the retrieved content into structured markdown.
- [Extraction Logic Overrides](https://awesome-repositories.com/f/data-databases/content-type-detection/detection-pipeline-configurations/extraction-logic-overrides.md) — Allows users to specify custom CSS selectors to manually define the main content area. ([source](https://cdn.jsdelivr.net/gh/kepano/defuddle@main/README.md))

### Security & Cryptography

- [Clutter Removal](https://awesome-repositories.com/f/security-cryptography/dom-based-xss-protections/dom-tree-sanitizers/clutter-removal.md) — Strips non-essential HTML elements like navigation and ads to isolate core article content.

### User Interface & Experience

- [Body Content Extractors](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/pdf-and-html-content-extraction/body-content-extractors.md) — Parses HTML files and retrieves the main body text by stripping markup and navigation elements. ([source](https://kepano.github.io/defuddle/))

### Software Engineering & Architecture

- [Content Extraction Overrides](https://awesome-repositories.com/f/software-engineering-architecture/syntax-query-definitions/css-selector-engines/content-extraction-overrides.md) — Allows manual specification of CSS selectors to define main content areas when automatic detection fails.
