# first20hours/google-10000-english

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/first20hours-google-10000-english).**

4,319 stars · 1,938 forks · license: other

## Links

- GitHub: https://github.com/first20hours/google-10000-english
- awesome-repositories: https://awesome-repositories.com/repository/first20hours-google-10000-english.md

## Description

This project provides a static dataset of the 10,000 most common English words, ranked by frequency of occurrence in a trillion-word web corpus. The frequency-ordered list is a base resource for language analysis, vocabulary study, and text processing.

The dataset is delivered in multiple curated variants to support different use cases. Words are grouped into short, medium, and long categories based on character count while preserving their original frequency ranking. A USA-specific variant focuses on American English usage patterns and spelling conventions. Additionally, a profanity-filtered version is produced by applying a multi-library scrubbing pipeline and a curated blacklist of offensive terms, providing a cleaner output suitable for educational or analytical contexts.
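The length grouping and blacklist filtering described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the length thresholds (`SHORT_MAX`, `MEDIUM_MAX`), the function names, and the sample words are all assumptions made for the example, and the multi-library scrubbing step is not reproduced here.

```python
# Hypothetical thresholds: the description does not state the real cut-offs.
SHORT_MAX = 4
MEDIUM_MAX = 8


def group_by_length(words, short_max=SHORT_MAX, medium_max=MEDIUM_MAX):
    """Split a frequency-ordered list into short/medium/long groups.

    Words are appended in input order, so each group keeps the
    original frequency ranking.
    """
    groups = {"short": [], "medium": [], "long": []}
    for word in words:
        if len(word) <= short_max:
            groups["short"].append(word)
        elif len(word) <= medium_max:
            groups["medium"].append(word)
        else:
            groups["long"].append(word)
    return groups


def filter_blacklist(words, blacklist):
    """Drop any word that appears in the blacklist (case-insensitive)."""
    blocked = {w.lower() for w in blacklist}
    return [w for w in words if w.lower() not in blocked]


# Illustrative sample only, ordered from most to least frequent.
ranked = ["the", "of", "and", "information", "business", "darn", "about"]

groups = group_by_length(ranked)
clean = filter_blacklist(ranked, ["darn"])
```

Because both functions iterate in input order and never sort, the relative frequency ranking survives in every derived list.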

All word lists are provided as static plain-text assets in multiple formats, making the data straightforward to access and integrate into various applications without requiring any runtime dependencies.
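Since the lists are plain text, reading one needs only the standard library. The sketch below assumes one word per line with the most frequent word first; that layout, the function names, and the sample text are assumptions for illustration rather than details confirmed by the description.

```python
def parse_word_list(text):
    """Return words in file order, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rank_index(words):
    """Map each word to its 1-based rank; the first occurrence wins."""
    ranks = {}
    for rank, word in enumerate(words, start=1):
        ranks.setdefault(word, rank)
    return ranks


# Stand-in for the contents of a word-list file.
sample = "the\nof\nand\n\nto\n"

words = parse_word_list(sample)
ranks = rank_index(words)
```

With a downloaded file, the same functions could be fed from `pathlib.Path(path).read_text(encoding="utf-8")`, where `path` points at whichever variant is needed.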

## Tags

### Artificial Intelligence & ML

- [English Language Corpora](https://awesome-repositories.com/f/artificial-intelligence-ml/english-language-corpora.md) — Provides frequency-ordered lists of the most common English words for language learning or text analysis.
- [Frequency-Ordered Word Lists](https://awesome-repositories.com/f/artificial-intelligence-ml/english-text-parsers/frequency-ordered-word-lists.md) — Provides a plain-text list of the 10,000 most frequent English words ordered by occurrence. ([source](https://github.com/first20hours/google-10000-english/blob/master/LICENSE.md))
- [Frequency-Ordered Word Lists](https://awesome-repositories.com/f/artificial-intelligence-ml/model-pretraining-frameworks/corpus-compilation/frequency-ordered-word-lists.md) — Provides a frequency-ordered list of the 10,000 most common English words from a trillion-word corpus.
- [Frequency-Based Vocabularies](https://awesome-repositories.com/f/artificial-intelligence-ml/stop-word-filters/frequency-based-vocabularies.md) — Provides word frequency rankings derived from a trillion-word corpus for studying language patterns.

### Part of an Awesome List

- [Word List Assets](https://awesome-repositories.com/f/awesome-lists/productivity/document-and-text-tools/plain-text-markup/word-list-assets.md) — Provides static word list files in multiple formats for language processing and vocabulary analysis.

### Data & Databases

- [Blacklist Filtering](https://awesome-repositories.com/f/data-databases/data-access-querying/blacklist-filtering.md) — Uses a curated blacklist of offensive terms to filter unwanted words from the dataset.

### DevOps & Infrastructure

- [Multi-Variant Word Lists](https://awesome-repositories.com/f/devops-infrastructure/static-asset-delivery/multi-variant-word-lists.md) — Ships word lists in plain text, frequency-ordered, length-categorized, and profanity-filtered variants.

### Education & Learning Resources

- [Word Dictionaries](https://awesome-repositories.com/f/education-learning-resources/word-dictionaries.md) — Provides a frequency-ordered list of the 10,000 most common English words from a trillion-word corpus. ([source](https://github.com/first20hours/google-10000-english#readme))
- [Regional English Word Lists](https://awesome-repositories.com/f/education-learning-resources/regional-english-word-lists.md) — Provides a USA-specific variant of the frequency list focused on American English usage patterns. ([source](https://github.com/first20hours/google-10000-english#readme))

### Security & Cryptography

- [Word Lists](https://awesome-repositories.com/f/security-cryptography/data-scrubbing/profanity-scrubbing/word-lists.md) — Ships a profanity-filtered word list derived from a frequency-ranked English corpus for clean vocabulary analysis.
- [Profanity Scrubbing](https://awesome-repositories.com/f/security-cryptography/data-scrubbing/profanity-scrubbing.md) — Removes offensive terms using multiple profanity-detection libraries and a blacklist filter.

### User Interface & Experience

- [Length-Based Word Groupings](https://awesome-repositories.com/f/user-interface-experience/item-lists/categorized-grouping/length-based-word-groupings.md) — Groups words into short, medium, and long categories based on character count.
