# digininja/cewl

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/digininja-cewl).**

2,575 stars · 313 forks · Ruby

## Links

- GitHub: https://github.com/digininja/CeWL
- awesome-repositories: https://awesome-repositories.com/repository/digininja-cewl.md

## Description

CeWL is a custom wordlist generator that spiders a website and extracts the unique words and metadata it finds. It also serves as an OSINT metadata extractor: by analyzing HTML and JavaScript content, it surfaces terms that may work as candidate passwords and usernames.
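The word-extraction idea can be illustrated with a minimal sketch. It is written in Python rather than CeWL's Ruby and is not CeWL's actual implementation; the names `TextCollector` and `unique_words`, the sample markup, and the letters-only definition of a word are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, including inline script bodies.
        self.chunks.append(data)


def unique_words(html, min_length=3):
    """Return the sorted unique alphabetic words of at least min_length characters."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    # Assumption: a "word" is a run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    return sorted({w for w in words if len(w) >= min_length})


sample = "<html><body><h1>Acme Widgets</h1><p>Acme builds widgets in Leeds.</p></body></html>"
print(unique_words(sample, min_length=5))
```

The set comprehension keeps each word once, and the sketch is case-sensitive, so `Widgets` and `widgets` are kept as separate entries.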

The tool differentiates itself by combining recursive spidering with metadata extraction, allowing it to collect email addresses, author names, and creator metadata from web pages and linked files. It also captures domains, subdomains, and path components to include in generated lists.
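As a rough illustration of the two extraction steps described here, the following Python sketch pulls addresses out of `mailto:` links and splits a URL into host labels and path segments. It is an assumption-laden example, not CeWL's code: the regular expression, the function names `extract_emails` and `url_components`, and the sample inputs are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumption: addresses appear in quoted href="mailto:..." attributes.
MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)


def extract_emails(html):
    """Return sorted, de-duplicated, lowercased addresses found in mailto links."""
    return sorted({match.lower() for match in MAILTO_RE.findall(html)})


def url_components(url):
    """Split a URL into its host labels followed by its non-empty path segments."""
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".") if parts.hostname else []
    path_segments = [segment for segment in parts.path.split("/") if segment]
    return host_labels + path_segments


page = (
    '<a href="mailto:Jane.Doe@example.com?subject=Hi">Jane</a> '
    "<a href='mailto:info@example.com'>Info</a>"
)
print(extract_emails(page))
print(url_components("https://shop.example.com/products/widgets/index.html"))
```

The pattern stops at `?` so that query parameters such as a subject line are not captured as part of the address.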

Other capabilities include spidering with depth control and regular-expression filtering, and request handling with custom headers and proxy authentication. CeWL can access restricted sites using Basic or Digest authentication, and it provides word frequency analysis and list formatting for the generated output.
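Depth control, regular-expression filtering, and word frequency counting can be sketched together in a few lines of Python. This is a simplified model under stated assumptions, not CeWL's implementation: the `SITE` dictionary is a hypothetical in-memory stand-in for real HTTP requests, and `crawl` with its `max_depth` and `exclude` parameters is an illustrative name, not a CeWL option.

```python
import re
from collections import Counter, deque

# Hypothetical in-memory site: path -> (page text, outgoing links).
SITE = {
    "/": ("acme widgets home", ["/about", "/admin/login"]),
    "/about": ("acme team leeds", ["/team"]),
    "/admin/login": ("secret admin", []),
    "/team": ("widgets experts", []),
}


def crawl(start, max_depth, exclude=None):
    """Breadth-first crawl from start, following links up to max_depth hops.

    Links matching the exclude regex are never visited.
    Returns a Counter of word occurrences across all visited pages.
    """
    pattern = re.compile(exclude) if exclude else None
    seen = {start}
    queue = deque([(start, 0)])
    counts = Counter()
    while queue:
        url, depth = queue.popleft()
        text, links = SITE.get(url, ("", []))
        counts.update(text.split())
        if depth >= max_depth:
            continue  # depth limit reached: count words but do not follow links
        for link in links:
            if link in seen or (pattern and pattern.search(link)):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return counts


counts = crawl("/", max_depth=1, exclude=r"^/admin")
print(counts.most_common(3))
```

With `max_depth=1` the crawl reads `/` and `/about` but never reaches `/team`, and the exclusion pattern keeps `/admin/login` out of the queue entirely.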

The project is also distributed as a portable Docker image, which removes the need to set up the environment manually.

## Tags

### Part of an Awesome List

- [Web Spiders](https://awesome-repositories.com/f/awesome-lists/devtools/web-and-html-processing/web-spiders.md) — Implements a recursive web spider that traverses links to a specified depth to harvest content.
- [Information Gathering](https://awesome-repositories.com/f/awesome-lists/security/information-gathering.md) — Performs reconnaissance by collecting emails and author names from website metadata.

### Security & Cryptography

- [Custom Wordlist Generation](https://awesome-repositories.com/f/security-cryptography/custom-wordlist-generation.md) — Spiders websites to extract unique words of a minimum length from HTML and JavaScript for use in password attacks. ([source](https://github.com/digininja/CeWL/blob/master/changelog.md))
- [Brute Force Attack Preparation](https://awesome-repositories.com/f/security-cryptography/brute-force-attack-preparation.md) — Prepares tailored wordlists from target domains to increase the effectiveness of dictionary attacks.
- [Containerized Scanners](https://awesome-repositories.com/f/security-cryptography/security-scanners/containerized-scanners.md) — Provides a portable Docker image for performing website security analysis.
- [Security Testing and Auditing](https://awesome-repositories.com/f/security-cryptography/vulnerability-assessment-testing/security-testing-auditing.md) — Supports security auditing by creating potential credential lists based on a company's public web presence.

### Data & Databases

- [Entity Extraction](https://awesome-repositories.com/f/data-databases/link-metadata-extraction/entity-extraction.md) — Extracts email addresses and author names from mailto links and file properties for reconnaissance.
- [HTML Parsing and Extraction](https://awesome-repositories.com/f/data-databases/url-crawl-queues/url-filtering-strategies/content-and-language-filtering/html-parsing-and-extraction.md) — Parses HTML and JavaScript to extract unique strings and words based on defined length criteria.
- [Email and Identity Extraction](https://awesome-repositories.com/f/data-databases/web-data-extraction/metadata-extraction/email-and-identity-extraction.md) — Extracts email addresses and author names from web pages and linked files to build username lists. ([source](https://github.com/digininja/CeWL/blob/master/changelog.md))
- [Domain Structure Analyzers](https://awesome-repositories.com/f/data-databases/search-indexing-technologies/search-domains/domain-structure-analyzers.md) — Extracts domains, subdomains, and path components to analyze and catalog the hierarchical structure of a target website. ([source](https://github.com/digininja/CeWL/blob/master/README.md))

### Software Engineering & Architecture

- [Web Metadata Extractors](https://awesome-repositories.com/f/software-engineering-architecture/metadata-extraction-tools/array-metadata-extraction/technical-concept-extraction/technical-data-extraction/web-metadata-extractors.md) — Extracts email addresses and author names from web responses and markup for OSINT purposes.
- [Word Frequency Counters](https://awesome-repositories.com/f/software-engineering-architecture/frequency-counting-algorithms/word-frequency-counters.md) — Tallies word occurrences during crawling to prioritize the most frequent terms in generated lists.

### Web Development

- [Security Crawlers](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-crawling/security-crawlers.md) — Analyzes HTML and JavaScript content to discover potential usernames and passwords.

### DevOps & Infrastructure

- [Crawl Boundary Controls](https://awesome-repositories.com/f/devops-infrastructure/dependency-management/environment-scoping-controls/crawl-boundary-controls.md) — Uses regular expressions to define inclusion and exclusion rules to restrict automated traversal to specific domains or paths. ([source](https://github.com/digininja/CeWL/blob/master/changelog.md))

### Networking & Communication

- [Proxy-Aware Network Clients](https://awesome-repositories.com/f/networking-communication/network-infrastructure-routing/network-utilities/proxy-aware-network-clients.md) — Supports routing network traffic through proxies with custom authentication to bypass restrictions.
