Presidio

Presidio is a framework for detecting and anonymizing personally identifiable information (PII) in text. It pairs a PII recognition pipeline with a data masking engine, and it locates sensitive entities using a combination of machine learning, regular expressions, and rule-based logic.
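
The detect-then-mask flow described above can be sketched with the Python standard library alone. This is an illustration of the idea, not Presidio's API: the `Detection` class, the `PATTERNS` table, the two regular expressions, and the scores are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected span (hypothetical structure, not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


# Illustrative patterns and confidence scores; real recognizers are stricter.
PATTERNS = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), 0.9),
    "PHONE_NUMBER": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.6),
}


def detect(text):
    """Run every pattern over the text and return detections in text order."""
    found = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(Detection(entity_type, match.start(), match.end(), score))
    return sorted(found, key=lambda d: d.start)


def mask(text, detections):
    """Replace each detected span with a placeholder such as <EMAIL_ADDRESS>."""
    # Work from right to left so that earlier offsets stay valid.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text
```

Calling `mask(text, detect(text))` chains the two stages. Keeping detection and masking separate means the same detections can feed different masking strategies.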

The system also acts as a named entity recognition (NER) model orchestrator: external NER models and PII detectors can be integrated to support multi-language privacy scrubbing. Its plugin-based recognizer architecture can be extended with custom recognizers, deny-lists, and specialized detection logic defined in configuration files.
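
A plugin-based recognizer architecture of this kind might look roughly like the sketch below. It uses only the standard library and is not Presidio's actual class hierarchy: `Span`, `Recognizer`, `RegexRecognizer`, `DenyListRecognizer`, and `RecognizerRegistry` are hypothetical names chosen for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Protocol


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


class Recognizer(Protocol):
    """Anything with an analyze() method can be plugged into the registry."""
    def analyze(self, text: str) -> List[Span]: ...


class RegexRecognizer:
    """Reports every match of one regular expression with a fixed score."""
    def __init__(self, entity_type: str, pattern: str, score: float, flags: int = 0):
        self.entity_type = entity_type
        self.pattern = re.compile(pattern, flags)
        self.score = score

    def analyze(self, text: str) -> List[Span]:
        return [Span(self.entity_type, m.start(), m.end(), self.score)
                for m in self.pattern.finditer(text)]


class DenyListRecognizer(RegexRecognizer):
    """Matches whole-word tokens from a fixed list, ignoring case."""
    def __init__(self, entity_type: str, deny_list: Iterable[str], score: float = 1.0):
        # Longest tokens first so that "Mrs" is tried before "Mr".
        ordered = sorted(deny_list, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
        super().__init__(entity_type, pattern, score, re.IGNORECASE)


class RecognizerRegistry:
    """Holds the registered recognizers and merges their results."""
    def __init__(self) -> None:
        self._recognizers: List[Recognizer] = []

    def add(self, recognizer: Recognizer) -> None:
        self._recognizers.append(recognizer)

    def analyze(self, text: str) -> List[Span]:
        spans = [s for r in self._recognizers for s in r.analyze(text)]
        return sorted(spans, key=lambda s: s.start)
```

New detection logic is added by registering another object that implements `analyze`, so the registry itself never needs to change.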

The framework covers a broad range of data protection capabilities, including automated redaction, hashing, and encryption. Context-aware confidence scoring helps reduce false positives, and a standardized entity mapping keeps entity types consistent across different processing engines.
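
Two of the ideas in the paragraph above, context-aware scoring and entity mapping, can be illustrated with a small standard-library sketch. The function names, the `LABEL_MAP` contents, the window size, and the boost value are assumptions picked for this example and should not be read as Presidio's defaults or API.

```python
import re

# Hypothetical mapping from model-specific labels to unified entity types.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON": "PERSON",
    "LOC": "LOCATION",
    "GPE": "LOCATION",
}


def unify_label(raw_label, default="UNKNOWN"):
    """Translate a raw model label into the unified entity type."""
    return LABEL_MAP.get(raw_label.upper(), default)


def boost_with_context(text, start, end, score, context_words, window=5, boost=0.35):
    """Raise the score when a context word occurs near the span [start, end).

    Looks at up to `window` words on each side of the match and caps the
    boosted score at 1.0. Without a nearby context word the score is unchanged.
    """
    before = re.findall(r"\w+", text[:start].lower())[-window:]
    after = re.findall(r"\w+", text[end:].lower())[:window]
    wanted = {w.lower() for w in context_words}
    if wanted.intersection(before + after):
        return min(1.0, score + boost)
    return score
```

A weak pattern match, such as a bare digit sequence, can start with a low score and only cross a reporting threshold when words like "phone" or "call" appear nearby, which is how context helps reduce false positives.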

Features

Data Anonymization - Provides a comprehensive engine for removing or masking personally identifiable information and sensitive data.

Data Masking Tools - Provides a comprehensive engine for obscuring, hashing, or encrypting sensitive information to preserve data utility.

NER - Integrates and manages multiple Named Entity Recognition models to detect sensitive data across different languages.

Custom Entity Recognizers - Allows the definition of custom detection rules using regular expressions and deny-lists within configuration files.

Language Detection Tools - Supports multi-language PII analysis by utilizing language-specific models and codes.

Sensitive Data Identification - Identifies personal identifiers and credentials within unstructured text to facilitate privacy protection.

Anonymization Operators - Provides a set of interchangeable operators to redact, mask, hash, or encrypt identified PII.

Custom PII Recognition - Enables the creation of specialized detectors and regex rules to identify industry-specific sensitive data.

Data Redaction Tools - Automatically identifies and removes sensitive information from documents and datasets to ensure privacy compliance.

Multi-Language Privacy Scrubbing - Identifies and masks sensitive information across various languages using internal and external processing engines.

PII Detection and Screening - Provides an automated framework for identifying and screening personally identifiable information to ensure data privacy.

PII Recognition Pipelines - Implements a structured workflow for scanning text with regular expressions, deny-lists, and context rules.

Plugin-Based Architectures - Uses a modular architecture to extend PII recognition capabilities via pluggable regex, rule-based, and ML recognizers.

Custom PII Recognizers - Allows the implementation of custom detection classes to identify industry-specific or unique PII patterns.

Contextual Confidence Boosting - Improves detection accuracy by adjusting confidence scores based on the presence of nearby contextual keywords.

Entity Label Mapping - Translates varying labels from different machine learning models into a consistent set of unified entity types.

Sequential Text Processing Pipelines - Employs a sequential pipeline of detection logic and post-processing layers to analyze text for sensitive information.

Contextual Detection - Uses surrounding keywords and metadata to refine the probability of PII detection and reduce false positives.

Entity Mappings - Maps raw entity labels from various processing engines to standardized types for consistent classification.

Regular Expression Libraries - Provides a library of predefined regular expression patterns to identify sensitive data and assign confidence scores.

Configuration File Loading - Implements runtime loading of PII recognizer settings and language configurations from external structured files.

Custom Anonymization Logic - Enables the application of custom functions to entities for specialized tasks such as pseudonymization.

Reversible Anonymization - Restores original values from encrypted entities using a cryptographic key to recover information.

Field Level Encryption - Replaces identified information with encrypted values using a cryptographic key to protect data while maintaining recovery options.

Deny-List Detection - Identifies sensitive information by matching text against predefined lists of specific tokens or keywords.

External Service Integrations - Connects to remote services and specialized libraries to leverage external ML models for PII detection.

Detection Coverage Extensions - Supports the addition of new entity types and languages to broaden the scope of identifiable sensitive information.

Third-Party API Integrations - Integrates external named entity recognition services to support PII detection in additional languages.

Data Anonymization - Service for context-aware PII anonymization and data protection.

Security And Privacy - Framework for detecting and anonymizing sensitive PII data.
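
The interchangeable operators and custom anonymization functions listed above can be sketched as plain callables keyed by entity type. This is a standard-library illustration under stated assumptions, not Presidio's operator API: `redact`, `mask_value`, `sha256_hash`, and `anonymize` are hypothetical names, and encryption is left out because it would need a cryptography library.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Operator = Callable[[str], str]


def redact(value: str) -> str:
    """Remove the value entirely."""
    return ""


def mask_value(value: str, char: str = "*", keep_last: int = 4) -> str:
    """Hide all but the last `keep_last` characters."""
    visible = value[-keep_last:] if keep_last else ""
    return char * (len(value) - len(visible)) + visible


def sha256_hash(value: str) -> str:
    """Replace the value with its SHA-256 hex digest (not reversible)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def anonymize(
    text: str,
    spans: List[Tuple[str, int, int]],
    operators: Dict[str, Operator],
    default: Operator = redact,
) -> str:
    """Apply the operator chosen for each (entity_type, start, end) span."""
    # Process spans from right to left so that earlier offsets stay valid.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        operator = operators.get(entity_type, default)
        text = text[:start] + operator(text[start:end]) + text[end:]
    return text
```

Because every operator is just a function from string to string, a custom step such as pseudonymization can be supplied as a lambda or any other callable without changing `anonymize`.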

microsoft/presidio
