Stanza is a Python natural language processing library for tokenization, lemmatization, and dependency parsing across many human languages. Its neural pipeline converts raw text into structured document objects (sentences, tokens, and words, each carrying its annotations), and it also provides specialized models for analyzing clinical and biomedical text.
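A minimal sketch of the raw-text-to-structure flow, assuming Stanza is installed (`pip install stanza`) and the English models can be downloaded. The helper names `build_pipeline` and `dependency_rows` are illustrative and not part of the library; the `Pipeline`, `sentences`, `words`, `lemma`, `head`, and `deprel` names come from Stanza itself.

```python
def build_pipeline(lang="en"):
    """Load a Stanza pipeline with the default processors for `lang`."""
    import stanza  # imported lazily so dependency_rows works without Stanza installed

    stanza.download(lang)  # fetches the default models if they are not cached yet
    return stanza.Pipeline(lang)


def dependency_rows(doc):
    """Flatten a parsed document into (word, lemma, head word, relation) tuples."""
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            # word.head is a 1-based index into the sentence; 0 marks the root.
            head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            rows.append((word.text, word.lemma, head, word.deprel))
    return rows
```

With the models available, `dependency_rows(build_pipeline("en")("Dogs bark."))` would return one tuple per word. Loading only some processors is possible through the `processors` argument of `stanza.Pipeline`, at the cost of having to list their prerequisites.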
The project also includes a client wrapper that connects Python scripts to the Java-based Stanford CoreNLP toolkit, either by launching a local server or by calling a remote annotation server. Python code can then request linguistic annotations from the Java software and read the results directly.
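The bridge to the Java tools can be sketched as follows, assuming a CoreNLP installation is available (for a local server, the `CORENLP_HOME` environment variable must point at it) or that a server is already running at the given endpoint. The function names and the chosen annotator list are illustrative; the timeout value is arbitrary.

```python
def token_triples(annotation):
    """Extract (word, POS tag, NER tag) triples from a CoreNLP protobuf document."""
    return [
        (token.word, token.pos, token.ner)
        for sentence in annotation.sentence
        for token in sentence.token
    ]


def annotate_with_corenlp(text, endpoint=None):
    """Annotate `text` through CoreNLP and return its token triples.

    endpoint=None launches a local Java server; otherwise the client
    connects to an already-running server at `endpoint`.
    """
    from stanza.server import CoreNLPClient, StartServer  # lazy import

    options = {
        "annotators": ["tokenize", "ssplit", "pos", "lemma", "ner"],
        "timeout": 30000,  # milliseconds
    }
    if endpoint is not None:
        options.update(start_server=StartServer.DONT_START, endpoint=endpoint)

    # The context manager shuts down any server that the client started.
    with CoreNLPClient(**options) as client:
        annotation = client.annotate(text)
    return token_triples(annotation)
```

Passing something like `endpoint="http://localhost:9000"` covers the remote-server case; leaving it unset lets the client manage the Java process itself.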
The library covers a broad range of linguistic analysis, including named entity recognition, coreference resolution, and syntactic dependency parsing. Annotation pipelines can be assembled from individual processors to extract features such as parts of speech and morphological properties from text in many languages.
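As a hedged illustration of feature extraction, the helpers below read named entities, parts of speech, and morphological features from an already-processed Stanza document. The helper names are hypothetical; they rely on the `ents`, `upos`, and `feats` attributes, where `feats` is a `|`-separated string such as `Number=Plur|Tense=Pres` or `None` when a word has no features.

```python
def parse_feats(feats):
    """Turn a 'Key=Value|Key=Value' feature string into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def entity_spans(doc):
    """List (entity text, entity type) pairs found by the NER processor."""
    return [(ent.text, ent.type) for ent in doc.ents]


def morphology_rows(doc):
    """List (word, universal POS tag, feature dict) for every word."""
    return [
        (word.text, word.upos, parse_feats(word.feats))
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

These functions expect a document produced by a pipeline that includes the `pos` and `ner` processors; entity labels depend on the model that was loaded.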
Users can also retrain the neural modules on project-specific data to improve the accuracy of components such as the tokenizer and the parser.
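The training commands themselves are not described here, so the sketch below covers only data preparation. It assumes the project-specific annotations are to be stored in CoNLL-U, the ten-column Universal Dependencies format that Stanza's tokenizer and parser training works with. The function name and the input tuple layout are hypothetical, and real training data usually carries more detail (sentence text, spacing information, XPOS tags, morphological features) than this minimal version emits.

```python
def to_conllu(sentences):
    """Serialize sentences to CoNLL-U text.

    `sentences` is a list of sentences; each sentence is a list of
    (form, lemma, upos, head, deprel) tuples, with head as a 1-based
    index and 0 for the root.
    """
    blocks = []
    for sentence in sentences:
        lines = []
        for index, (form, lemma, upos, head, deprel) in enumerate(sentence, start=1):
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            columns = [str(index), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"]
            lines.append("\t".join(columns))
        blocks.append("\n".join(lines))
    # Each sentence, including the last, is followed by a blank line.
    return "\n\n".join(blocks) + "\n\n"
```

For example, `to_conllu([[("Dogs", "dog", "NOUN", 2, "nsubj"), ("bark", "bark", "VERB", 0, "root")]])` yields two tab-separated token lines followed by a blank line.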