# github-linguist/linguist

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/github-linguist-linguist).**

13,327 stars · 5,020 forks · Ruby · MIT

## Links

- GitHub: https://github.com/github-linguist/linguist
- awesome-repositories: https://awesome-repositories.com/repository/github-linguist-linguist.md

## Topics

`language-grammars` `language-statistics` `linguistic` `syntax-highlighting`

## Description

Linguist is a programming language detection library that identifies the languages used in source code files and software repositories. It classifies each file in a repository and aggregates the results into the language statistics and insights that version control platforms rely on.

The tool runs a strategy-based detection pipeline that combines several identification methods to improve accuracy. It matches file extensions and filenames against known patterns, then falls back to regex-driven content analysis and Bayesian statistical classification when those matches are ambiguous. Users can refine the results through configuration-driven overrides that manually map specific paths or patterns to designated languages.
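The pipeline described above can be sketched roughly as follows. This is an illustrative Python sketch, not Linguist's actual (Ruby) implementation: the lookup tables, the regex rules, and the names `detect`, `by_filename`, `by_extension`, and `by_content` are all hypothetical. The Bayesian classifier is not implemented; where it would run, the sketch simply falls back to the first remaining candidate.

```python
import fnmatch
import posixpath
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup tables; a real language database is far larger.
FILENAMES: Dict[str, List[str]] = {"Makefile": ["Makefile"], "Rakefile": ["Ruby"]}
EXTENSIONS: Dict[str, List[str]] = {
    ".rb": ["Ruby"],
    ".py": ["Python"],
    ".h": ["C", "C++", "Objective-C"],  # ambiguous extension
}
# Hypothetical content rules per extension, tried in order; first match wins.
CONTENT_RULES: Dict[str, List[Tuple[re.Pattern, str]]] = {
    ".h": [
        (re.compile(r"^\s*@(interface|protocol|end)\b", re.M), "Objective-C"),
        (re.compile(r"^\s*(template\s*<|namespace\s+\w+|class\s+\w+)", re.M), "C++"),
    ],
}


def by_filename(path: str, content: str, candidates: List[str]) -> List[str]:
    """Look up the exact filename, ignoring its directory."""
    return FILENAMES.get(posixpath.basename(path), [])


def by_extension(path: str, content: str, candidates: List[str]) -> List[str]:
    """Look up the file extension; may return several candidate languages."""
    return EXTENSIONS.get(posixpath.splitext(path)[1].lower(), [])


def by_content(path: str, content: str, candidates: List[str]) -> List[str]:
    """Narrow the current candidates with regex rules on the file content."""
    ext = posixpath.splitext(path)[1].lower()
    for pattern, language in CONTENT_RULES.get(ext, []):
        if language in candidates and pattern.search(content):
            return [language]
    return []


def detect(path: str, content: str,
           overrides: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return a language name, or None when nothing matches."""
    # User overrides (glob pattern -> language) take precedence over detection.
    for pattern, language in (overrides or {}).items():
        if fnmatch.fnmatchcase(path, pattern):
            return language
    candidates: List[str] = []
    for strategy in (by_filename, by_extension, by_content):
        found = strategy(path, content, candidates)
        if len(found) == 1:
            return found[0]      # unambiguous result: stop early
        if found:
            candidates = found   # still ambiguous: pass on to the next strategy
    # A statistical classifier would choose among the candidates here;
    # this sketch just takes the first one.
    return candidates[0] if candidates else None
```

Under these assumptions, `detect("include/widget.h", "namespace ui {}")` resolves the ambiguous `.h` extension through the content rules, while passing `overrides={"include/*.h": "C"}` bypasses detection for that path entirely.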

Beyond identifying individual files, the library supports project analytics by reporting the language composition of a codebase. Developers can correct or customize detection for specific files or directories to meet project requirements.
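A composition report of this kind might be computed as in the sketch below. It is a minimal Python illustration under stated assumptions: each language's share is measured in bytes, override rules are glob patterns where the last matching rule wins, and a rule that maps to `None` excludes the path from the statistics. The function name `language_breakdown` and the rule format are hypothetical and are not part of Linguist's API.

```python
from collections import Counter
from fnmatch import fnmatchcase
from typing import Dict, Iterable, Optional, Sequence, Tuple

# Hypothetical rule format: (glob pattern, language), where None excludes the path.
Override = Tuple[str, Optional[str]]
# One entry per file: (path, detected language, size in bytes).
FileEntry = Tuple[str, Optional[str], int]


def language_breakdown(files: Iterable[FileEntry],
                       overrides: Sequence[Override] = ()) -> Dict[str, float]:
    """Return each language's share of the codebase as a percentage of bytes."""
    totals: Counter = Counter()
    for path, language, size in files:
        # Apply every matching rule in order so that the last match wins.
        for pattern, forced in overrides:
            if fnmatchcase(path, pattern):
                language = forced
        if language is not None:
            totals[language] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {
        language: round(100 * size / grand_total, 1)
        for language, size in totals.most_common()
    }
```

For example, a rule such as `("vendor/*", None)` removes third-party code from the totals, and `("docs/*.h", "C++")` reassigns headers that were detected as C.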

## Tags

### Artificial Intelligence & ML

- [Source Code Language Detectors](https://awesome-repositories.com/f/artificial-intelligence-ml/language-detection-tools/source-code-language-detectors.md) — Identifies the language of source code files by analyzing extensions, filenames, and content patterns.

### Programming Languages & Runtimes

- [Language Identification Utilities](https://awesome-repositories.com/f/programming-languages-runtimes/programming-language-varieties/programming-languages/language-identification-utilities.md) — Examines source code files to determine the underlying programming language by analyzing extensions, filenames, and content patterns. ([source](https://github.com/github-linguist/linguist/tree/main/docs/))

### Development Tools & Productivity

- [Code Analysis Tools](https://awesome-repositories.com/f/development-tools-productivity/code-quality-analysis/static-analysis-engines/static-analysis-tools/code-analysis-tools.md) — Categorizes and audits codebases by automatically detecting the primary programming languages used in a repository.
- [Language Classifiers](https://awesome-repositories.com/f/development-tools-productivity/documentation-discovery-metadata/metadata-processing-analysis/metadata-analysis-tools/language-classifiers.md) — Determines language statistics for version control platforms by scanning file contents and applying custom override rules.
- [Language Identification Services](https://awesome-repositories.com/f/development-tools-productivity/source-code-repositories/language-identification-services.md) — Automatically detects programming languages used in a repository to provide accurate statistics and syntax highlighting.
- [Codebase Composition Analyzers](https://awesome-repositories.com/f/development-tools-productivity/project-analytics/codebase-composition-analyzers.md) — Analyzes the composition of codebases to understand the distribution of languages and technologies used across software projects.
- [Version Control and Repository Tools](https://awesome-repositories.com/f/development-tools-productivity/version-control-repository-tools.md) — Processes file contents within version control systems to generate language breakdowns and insights for hosted repositories.
- [File Pattern Matching](https://awesome-repositories.com/f/development-tools-productivity/file-pattern-matching.md) — Analyzes file extensions and filenames against a predefined database to determine the primary language of a source file.

### Security & Cryptography

- [Detection Overrides](https://awesome-repositories.com/f/security-cryptography/detection-overrides.md) — Allows users to specify or ignore programming languages for particular files or directories to correct automatic identification inaccuracies. ([source](https://github.com/github-linguist/linguist/tree/main/docs/))

### DevOps & Infrastructure

- [Configuration Overrides](https://awesome-repositories.com/f/devops-infrastructure/configuration-management/configuration-resolution-engines/configuration-overrides.md) — Provides configuration-driven override logic to manually map specific file paths or patterns to designated programming languages.

### Data & Databases

- [Regex-Driven Parsers](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-processing-tools/regex-driven-parsers.md) — Applies regular expressions to file contents to identify language-specific syntax patterns when metadata is ambiguous.

### Scientific & Mathematical Computing

- [Language Detection Classifiers](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/statistics-probability/probability-distributions/recursive-bayesian-updates/language-detection-classifiers.md) — Uses Bayesian statistical classification to calculate the probability of a language match based on source code token frequencies.
