# bee-san/pywhat

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/bee-san-pywhat).**

7,150 stars · 386 forks · Python · MIT

## Links

- GitHub: https://github.com/bee-san/pyWhat
- awesome-repositories: https://awesome-repositories.com/repository/bee-san-pywhat.md

## Topics

`cyber` `cybersecurity` `hacking` `hacktoberfest` `malware` `malware-analysis` `malware-research` `pcap` `python` `re` `security` `tryhackme`

## Description

pyWhat is a Python-based data extraction tool designed to scan files and text for sensitive identifiers, credentials, and network artifacts using regular expressions. It functions as a pattern matching engine and PII scanner capable of identifying personal identifiers and sensitive data patterns across directories and binary files.

The project specializes in identifying unknown data formats through file signatures and in extracting high-value identifiers, such as URLs, IP addresses, and phone numbers, from network capture files. It uses a rarity-based filtering system and specific tags to reduce false positives during discovery.
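To make the rarity and tag filtering concrete, here is a minimal sketch of the general technique. It is not pyWhat's own code or API: the `Pattern` class, the `PATTERNS` table, the `identify` function, and every regex, rarity value, and tag name below are assumptions made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    regex: str
    rarity: float  # assumed scale: near 0 matches almost anything, near 1 is very specific
    tags: frozenset = frozenset()


# Hypothetical pattern table for illustration only.
PATTERNS = [
    Pattern("IPv4 address", r"\b(?:\d{1,3}\.){3}\d{1,3}\b", 0.7, frozenset({"Networking"})),
    Pattern("URL", r"\bhttps?://[^\s\"'<>]+", 0.5, frozenset({"Networking"})),
    Pattern("Email address", r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", 0.5, frozenset({"Identifiers"})),
    Pattern("Number", r"\b\d+\b", 0.1, frozenset({"Generic"})),
]


def identify(text, min_rarity=0.0, include_tags=None):
    """Return matches, skipping patterns below min_rarity or outside include_tags."""
    results = []
    for pattern in PATTERNS:
        if pattern.rarity < min_rarity:
            continue  # too generic: likely to produce false positives
        if include_tags is not None and not (pattern.tags & set(include_tags)):
            continue  # pattern carries none of the requested tags
        for match in re.finditer(pattern.regex, text):
            results.append(
                {"name": pattern.name, "match": match.group(0), "rarity": pattern.rarity}
            )
    # Most specific patterns first; ties keep their table order.
    results.sort(key=lambda r: r["rarity"], reverse=True)
    return results


if __name__ == "__main__":
    sample = "Contact admin@example.com from 10.0.0.1 via https://example.com/login port 8080"
    for hit in identify(sample, min_rarity=0.4):
        print(hit)
```

Raising `min_rarity` drops the low-specificity `Number` pattern, which would otherwise match every run of digits in the sample, while `include_tags` narrows results to a single category.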

The tool supports recursive directory scanning, data type identification, and digital forensic processing. Findings can be sorted in a custom order and exported as structured JSON or plain text for integration into external security pipelines.
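The scan-then-export flow described here can be sketched with the standard library alone. This is an illustrative outline rather than pyWhat's implementation: `scan_tree`, `to_json`, the single `IPV4` pattern, and the output field names are all hypothetical. Files are read as bytes so that binary content can be searched too.

```python
import json
import os
import re

# Hypothetical byte-level pattern; a real scanner would apply many patterns.
IPV4 = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scan_tree(root):
    """Walk root recursively and collect IPv4-like matches from every readable file."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip it and keep scanning
            for match in IPV4.finditer(data):
                findings.append(
                    {
                        "file": path,
                        "type": "IPv4 address",
                        # The pattern only matches ASCII digits and dots.
                        "match": match.group(0).decode("ascii"),
                    }
                )
    # Example of a custom sort order: by file path, then by matched text.
    findings.sort(key=lambda f: (f["file"], f["match"]))
    return findings


def to_json(findings):
    """Serialize findings for consumption by another tool."""
    return json.dumps(findings, indent=2)


if __name__ == "__main__":
    print(to_json(scan_tree(".")))
```

Reading whole files into memory keeps the sketch short; a scanner meant for large captures would more likely process files in chunks.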

## Tags

### Part of an Awesome List

- [Sensitive Data Identification](https://awesome-repositories.com/f/awesome-lists/devtools/information-extraction/sensitive-data-identification.md) — Scans data for personal identifiers, credentials, and API keys using a comprehensive set of identifiable markers. ([source](https://github.com/bee-san/pyWhat/blob/main/README.md))
- [Identifier Extractions](https://awesome-repositories.com/f/awesome-lists/data/metadata-and-file-analysis/technical-file-attribute-extraction/capture-file-metadata-extraction/identifier-extractions.md) — Extracts high-value identifiers such as URLs and phone numbers from network capture files to accelerate traffic analysis. ([source](https://github.com/bee-san/pyWhat#readme))

### Security & Cryptography

- [Sensitive Data Scanners](https://awesome-repositories.com/f/security-cryptography/vulnerability-scanning/sensitive-data-scanners.md) — Scans files, directories, and text to locate personal identifiers, API keys, and credentials for security auditing.
- [Pattern Matching Engines](https://awesome-repositories.com/f/security-cryptography/pattern-matching-engines.md) — Uses a pattern matching engine with regular expressions and rarity scores to identify specific data formats.
- [PII Detection and Screening](https://awesome-repositories.com/f/security-cryptography/pii-detection-and-screening.md) — Identifies personal identifiers and sensitive data patterns across directories and binary files.
- [Sensitive Data Extraction Tools](https://awesome-repositories.com/f/security-cryptography/sensitive-data-extraction-tools.md) — Provides a command line tool that scans files and text for sensitive identifiers, credentials, and network artifacts.
- [Digital Forensics](https://awesome-repositories.com/f/security-cryptography/vulnerability-assessment-testing/digital-forensics.md) — Filters and sorts identified data patterns to isolate relevant evidence and reduce false positives during investigations.
- [Network Capture Parsers](https://awesome-repositories.com/f/security-cryptography/forensic-parsers/network-capture-parsers.md) — Extracts URLs, IP addresses, and phone numbers from network traffic files to accelerate forensic analysis.

### Data & Databases

- [Text Pattern Matching](https://awesome-repositories.com/f/data-databases/text-pattern-matching.md) — Uses predefined regular expressions to scan text and binary data for specific identifiers and sensitive information.
- [Automated Data Extraction](https://awesome-repositories.com/f/data-databases/automated-data-extraction.md) — Converts unstructured data analysis findings into structured JSON formats for use in security pipelines.
- [Search Result Filtering](https://awesome-repositories.com/f/data-databases/search-result-filtering.md) — Processes raw matches through rarity scores and category tags to refine results and reduce noise.

### Development Tools & Productivity

- [Signature-Based Identification](https://awesome-repositories.com/f/development-tools-productivity/signature-based-identification.md) — Matches file headers and content against known signatures to determine the format of unknown data.

### Software Engineering & Architecture

- [Pattern-Based Data Identification](https://awesome-repositories.com/f/software-engineering-architecture/pattern-based-data-identification.md) — Uses regular expressions and priority logic to identify emails, IP addresses, and system credentials. ([source](https://github.com/bee-san/pyWhat#readme))
- [Recursive Directory Traversers](https://awesome-repositories.com/f/software-engineering-architecture/recursive-validation-engines/recursive-tree-traversers/file-system-traversers/recursive-directory-traversers.md) — Recursively walks through nested folder structures to process all files within a path for content analysis.
- [Partial String Matching](https://awesome-repositories.com/f/software-engineering-architecture/string-matching-algorithms/partial-string-matching.md) — Identifies patterns embedded in larger data blobs by relaxing word boundary constraints for more flexible discovery.
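
The effect of relaxing word boundaries, as described in the last entry above, can be shown with a small regex sketch. The `CORE` pattern, the `find_ips` helper, and its `boundaryless` parameter are assumptions for illustration and are not taken from pyWhat's source.

```python
import re

# Hypothetical IPv4-like core pattern without any boundary anchors.
CORE = r"(?:\d{1,3}\.){3}\d{1,3}"


def find_ips(text, boundaryless=False):
    """Match CORE either as a standalone token or anywhere inside a larger string."""
    # \b requires a transition between word and non-word characters, so a match
    # glued to letters, digits, or underscores is rejected in strict mode.
    pattern = CORE if boundaryless else rf"\b{CORE}\b"
    return re.findall(pattern, text)


if __name__ == "__main__":
    blob = "id_10.0.0.5_end"
    print(find_ips(blob))                     # strict mode
    print(find_ips(blob, boundaryless=True))  # relaxed mode
```

Strict mode finds nothing in `id_10.0.0.5_end` because the underscores are word characters, while the relaxed mode recovers the embedded address at the cost of more false positives.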

### User Interface & Experience

- [Rarity-Based Filtering](https://awesome-repositories.com/f/user-interface-experience/search-filters/rarity-based-filtering.md) — Restricts searches to specific data categories using rarity scores and tags to significantly reduce false positives. ([source](https://github.com/bee-san/pyWhat/wiki/API))

### System Administration & Monitoring

- [Network Forensic Extractions](https://awesome-repositories.com/f/system-administration-monitoring/network-traffic-analysis/network-forensic-extractions.md) — Extracts high-value identifiers like URLs and phone numbers from network capture files to accelerate forensic investigations.
