Linguist is a programming language detection library that identifies the languages used in source code files and software repositories. It classifies each file in a repository by language, which lets version control platforms generate language statistics automatically.
The tool runs a strategy-based detection pipeline in which several identification methods are tried in sequence, each narrowing the set of candidate languages. It matches filenames and file extensions first, then applies regex-driven analysis of the file content, and falls back to Bayesian statistical classification to resolve any remaining ambiguity. Users can override the result through configuration that maps specific paths or patterns to designated languages.
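The pipeline idea can be illustrated with a minimal Python sketch. It is not Linguist's code or API: the lookup tables, the content rules, and the names `by_filename`, `by_extension`, `by_content`, and `detect` are all hypothetical, and the Bayesian fallback is left out, so an unresolved file simply yields `None`.

```python
import posixpath
import re

# Hypothetical lookup tables, for illustration only.
FILENAMES = {"Makefile": ["Makefile"], "Dockerfile": ["Dockerfile"]}
EXTENSIONS = {
    ".py": ["Python"],
    ".rb": ["Ruby"],
    ".h": ["C", "C++", "Objective-C"],  # an ambiguous extension
}

# Hypothetical content rules: (pattern, language it points to).
CONTENT_RULES = [
    (re.compile(r"^\s*@(interface|implementation|end)\b", re.M), "Objective-C"),
    (re.compile(r"^\s*(template\s*<|namespace\s+\w+)", re.M), "C++"),
]


def by_filename(path, content, candidates):
    """Look up the exact file name."""
    return FILENAMES.get(posixpath.basename(path), [])


def by_extension(path, content, candidates):
    """Look up the file extension (empty for names like 'Makefile')."""
    return EXTENSIONS.get(posixpath.splitext(path)[1], [])


def by_content(path, content, candidates):
    """Apply regex rules, keeping only languages still under consideration."""
    hits = [lang for pattern, lang in CONTENT_RULES if pattern.search(content)]
    return [lang for lang in hits if not candidates or lang in candidates]


STRATEGIES = [by_filename, by_extension, by_content]


def detect(path, content, strategies=STRATEGIES):
    """Run strategies in order and stop as soon as one language remains."""
    candidates = []
    for strategy in strategies:
        result = strategy(path, content, candidates)
        if len(result) == 1:
            return result[0]
        if len(result) > 1:
            candidates = result  # narrowed, but still ambiguous
    return None  # a statistical classifier would be consulted here


print(detect("src/app.py", "print('hi')"))
print(detect("include/foo.h", "@interface Foo\n@end\n"))
```

In this sketch a strategy that returns nothing leaves the candidate list untouched, so later strategies only ever narrow what earlier ones found.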
Beyond identifying individual files, the library reports the language composition of an entire codebase. Developers can correct or customize the detection settings for specific files or directories when the default results do not fit a project's requirements.
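As a rough illustration of how per-file results and path overrides could combine into a composition report, the following Python sketch aggregates file sizes by language. The function `language_breakdown`, the tuple layout, the glob-style override format, and the choice to weight languages by byte size are assumptions made for this example, not Linguist's actual interface.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def language_breakdown(files, overrides=()):
    """Return {language: percentage of total bytes}.

    files     -- iterable of (path, size_in_bytes, detected_language or None)
    overrides -- iterable of (glob_pattern, forced_language); first match wins
    """
    totals = defaultdict(int)
    for path, size, detected in files:
        language = detected
        for pattern, forced in overrides:
            if fnmatchcase(path, pattern):
                language = forced  # the override replaces the detected language
                break
        if language is not None:  # undetected files are left out of the report
            totals[language] += size

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: round(100 * size / grand_total, 1) for lang, size in totals.items()}


files = [
    ("src/main.py", 600, "Python"),
    ("vendor/lib.js", 300, "JavaScript"),
    ("docs/api.h", 100, None),  # detection was inconclusive
]
print(language_breakdown(files, overrides=[("docs/*.h", "C")]))
```

Here the override assigns the otherwise unclassified header to C, so it is counted in the totals instead of being dropped.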