# microsoft/unilm

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/microsoft-unilm).**

22,030 stars · 2,694 forks · Python · MIT

## Links

- GitHub: https://github.com/microsoft/unilm
- Homepage: https://aka.ms/GeneralAI
- awesome-repositories: https://awesome-repositories.com/repository/microsoft-unilm.md

## Topics

`beit` `beit-3` `bitnet` `deepnet` `document-ai` `foundation-models` `kosmos` `kosmos-1` `layoutlm` `layoutxlm` `llm` `minilm` `mllm` `multimodal` `nlp` `pre-trained-model` `textdiffuser` `trocr` `unilm` `xlm-e`

## Description

This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout data through a shared set of weights, enabling researchers and developers to build foundational models that align cross-modal representations.

The platform distinguishes itself through advanced training and inference strategies designed for large-scale deep learning. It incorporates specialized mechanisms such as retentive state processing for efficient sequence generation, differential attention for improved focus, and distributed weight partitioning to handle memory-intensive computations. These capabilities are complemented by techniques for sparse decoding and model compression, which maintain performance while reducing the computational footprint of large-scale architectures.
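
As a rough illustration of the differential attention idea mentioned above, the sketch below computes two softmax attention maps and subtracts one from the other before applying the result to the values. It is a minimal single-head sketch in plain Python, not the repository's implementation: the function names are hypothetical, and `lam` is treated here as a fixed scalar chosen by the caller, which is an assumption made for simplicity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Scaled dot-product attention weights, one softmax row per query."""
    d = len(Q[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]


def differential_attention(Q1, K1, Q2, K2, V, lam):
    """Subtract a second attention map (scaled by lam) from the first.

    Hypothetical helper: lam is a plain scalar here (an assumption).
    """
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    diff = [[a1 - lam * a2 for a1, a2 in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    # Apply the differential weights to the value vectors.
    return [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]
```

With `lam = 0` this reduces to ordinary softmax attention. With identical query and key pairs and `lam = 1` the two maps cancel exactly, which shows how attention weight shared by both maps is removed.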

The project covers a broad capability surface, including end-to-end pipelines for data curation, synthetic data generation, and tokenization across diverse modalities. It supports extensive workflows for pre-training, instruction tuning, and fine-tuning, with specific focus areas in document understanding, speech synthesis, and cross-lingual transfer. Diagnostic tools for attention analysis and benchmarking further assist in evaluating model performance on complex reasoning and retrieval tasks.

## Tags

### Artificial Intelligence & ML

- [Intelligent Document Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/intelligent-document-processing.md) — Provides a comprehensive framework for extracting information from visually-rich documents by integrating text, layout, and image analysis.
- [Language Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-fine-tuning.md) — Supports large language model fine-tuning to adapt pre-trained models to specific domains and downstream tasks.
- [Large Language Models](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-orchestration/large-language-models.md) — Offers a complete toolkit for pretraining, instruction tuning, and optimizing transformer-based models for diverse natural language tasks.
- [Language Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training.md) — Provides distributed pre-training pipelines for building large-scale language models from scratch. ([source](https://github.com/microsoft/unilm/tree/master/retnet))
- [Multimodal AI Systems](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-ai-systems.md) — Provides a comprehensive framework for multimodal AI development, integrating text, vision, audio, and document layout data.
- [Structured Document Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/structured-document-extraction.md) — Extracts information from structured documents like forms and receipts by analyzing both textual content and visual layout features. ([source](https://github.com/microsoft/unilm/tree/master/layoutlmv3))
- [Document Question Answering Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/question-answering/document-question-answering-pipelines.md) — Analyzes the structure and content of web pages to provide accurate answers to natural language queries about the document. ([source](https://github.com/microsoft/unilm/tree/master/markuplm))
- [Unified Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/multimodal-fine-tuning/unified-frameworks.md) — Provides a research platform for training and fine-tuning unified transformer models across text, vision, audio, and document modalities.
- [Training Efficiency](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/training-efficiency.md) — Provides efficient model training workflows through distributed training, memory optimization, and hardware-aware kernels.
- [Modular Backbone Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/modular-backbone-architectures.md) — Provides a unified transformer backbone that processes text, vision, and audio inputs through a shared set of weights.
- [Multimodal Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-models.md) — Enables the development of foundational models that align and process cross-modal data including speech, images, and text.
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Identifies and segments structural elements within document images such as text blocks, figures, and tables. ([source](https://github.com/microsoft/unilm/tree/master/dit))
- [Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis.md) — Generates natural-sounding human speech from short text prompts using neural codec language models. ([source](https://github.com/microsoft/unilm/tree/master/valle))
- [Information Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/information-extraction.md) — Processes text and markup from visually-rich documents to identify and pull specific data points. ([source](https://github.com/microsoft/unilm/tree/master/markuplm))
- [Long Context Training Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/long-context-training-optimizations.md) — Implements dilated attention mechanisms to handle context windows of up to one billion tokens efficiently. ([source](https://github.com/microsoft/unilm/tree/master/longnet))
- [Model-Driven Text Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/model-driven-text-extraction.md) — Identifies text within images and assigns precise spatial coordinates to enable document-level text recognition. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-2.5))
- [Multimodal Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/multimodal-layout-analysis.md) — Integrates text, spatial layout, and visual image data into a unified model to extract information and understand visually-rich documents. ([source](https://github.com/microsoft/unilm/tree/master/layoutxlm))
- [Vision Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/fine-tuning-and-alignment/fine-tuning-frameworks/vision-model-fine-tuning.md) — Adapts pre-trained transformer models to specific downstream vision tasks like image classification and semantic segmentation. ([source](https://github.com/microsoft/unilm/tree/master/beit))
- [Inference Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/inference-optimization.md) — Reduces memory usage and improves computational efficiency during sequence generation using gated retention mechanisms. ([source](https://github.com/microsoft/unilm/tree/master/retnet))
- [Language Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/language-model-fine-tuning.md) — Adapts pre-trained document understanding models to specific downstream tasks like question answering or form extraction using task-specific datasets. ([source](https://github.com/microsoft/unilm/tree/master/xdoc))
- [Mixture of Experts](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-customization/mixture-of-experts.md) — Scales deep learning models using specialized architectures like mixture-of-experts and retentive networks. ([source](https://github.com/microsoft/unilm#readme))
- [Unified Understanding and Generation Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training/unified-understanding-and-generation-training.md) — Trains neural networks on combined datasets to perform both natural language understanding and text generation within a single architecture. ([source](https://github.com/microsoft/unilm/tree/master/unilm))
- [Speech Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing.md) — Supports speech and audio processing for automatic speech recognition, voice synthesis, and acoustic representation learning.
- [Memory Optimization Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-optimization-techniques.md) — Reduces GPU memory consumption during training using distributed strategies and activation checkpointing. ([source](https://github.com/microsoft/unilm/tree/master/vlmo))
- [Model Optimization Suites](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization-suites.md) — Implements a suite of techniques for accelerating inference and reducing memory usage in large-scale deep learning architectures.
- [Model Compression Suites](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/compression-techniques/model-pruning/model-compression-suites.md) — Reduces model size and computational requirements through self-attention distillation. ([source](https://github.com/microsoft/unilm/tree/master/minilm))
- [Attention Kernel Fusion](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/attention-backends/attention-kernel-fusion.md) — Executes differential attention operations efficiently using hardware-aware kernels to accelerate training and inference. ([source](https://github.com/microsoft/unilm/tree/master/Diff-Transformer))
- [Weight Distribution](https://awesome-repositories.com/f/artificial-intelligence-ml/model-weight-management/weight-distribution.md) — Provides distributed weight partitioning strategies to handle memory-intensive computations across multiple processors during large-scale model development.
- [Multimodal Large Language Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-large-language-models.md) — Integrates visual and textual data into a unified model to enable multimodal understanding and generation tasks across different input modalities. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-1))
- [Multimodal Training](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-training.md) — Trains unified models capable of processing and generating across text, vision, speech, and document modalities. ([source](https://github.com/microsoft/unilm#readme))
- [Multilingual Text Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/multilingual-text-recognition.md) — Processes visual input using transformer architectures to generate text output for handwritten and printed documents. ([source](https://github.com/microsoft/unilm/tree/master/trocr))
- [Synthetic Data Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/synthetic-data-generation.md) — Provides synthetic data generation pipelines to create pseudo-test inputs for evaluating and filtering training data. ([source](https://github.com/microsoft/unilm/tree/master/PFPO))
- [Training Data Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/training-data-generation.md) — Supports synthetic training data generation to create large-scale instruction-tuning datasets for model improvement. ([source](https://github.com/microsoft/unilm/tree/master/glan))
- [Image Segmentation](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation.md) — Adapts pre-trained vision models to perform pixel-level semantic segmentation for identifying and labeling distinct objects within an image. ([source](https://github.com/microsoft/unilm/tree/master/beit2))
- [Dataset Preparation Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-preparation-tools.md) — Provides comprehensive training dataset curation tools for organizing general knowledge, code, and mathematical datasets. ([source](https://github.com/microsoft/unilm#readme))
- [Distributed Training](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training.md) — Distributes large model training across multiple processors by partitioning model weights to handle memory-intensive computations efficiently. ([source](https://github.com/microsoft/unilm/tree/master/PFPO))
- [Image Classification](https://awesome-repositories.com/f/artificial-intelligence-ml/image-classification.md) — Fine-tunes pre-trained vision models to categorize images into predefined classes with high accuracy. ([source](https://github.com/microsoft/unilm/tree/master/beit2))
- [Document Image Model Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/document-image-model-pre-training.md) — Learns visual representations from large-scale unlabeled document images using self-supervised techniques. ([source](https://github.com/microsoft/unilm/tree/master/dit))
- [Vision Transformer Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/vision-transformer-pre-training.md) — Trains image transformer models using masked image modeling to learn visual representations from large-scale datasets. ([source](https://github.com/microsoft/unilm/tree/master/beit))
- [Speech Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/fine-tuning-and-alignment/fine-tuning-frameworks/speech-model-fine-tuning.md) — Adapts pre-trained audio representations to specific downstream tasks like speaker verification and speech separation. ([source](https://github.com/microsoft/unilm/tree/master/speechlm))
- [Model Performance Benchmarking](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-evaluation-analysis/model-analysis/model-performance-benchmarking.md) — Runs standardized benchmarks on mathematical reasoning datasets to measure model accuracy and output quality. ([source](https://github.com/microsoft/unilm/tree/master/e5))
- [Model Training Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/training-frameworks/model-training-pipelines.md) — Executes instruction-tuning pipelines on large-scale grounded image-text datasets for unified vision-language systems. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-2))
- [Model Fine-Tuning and Adaptation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation.md) — Offers comprehensive tools for refining pre-trained models across various domains and task requirements. ([source](https://github.com/microsoft/unilm#readme))
- [Synthetic Speech Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/multimodal-processing-tools/synthetic-speech-generation.md) — Converts text input into natural-sounding audio using pre-trained models and vocoders. ([source](https://github.com/microsoft/unilm/tree/master/speecht5))
- [Self-Supervised Speech Representations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/self-supervised-speech-representations.md) — Trains large-scale self-supervised models on extensive audio datasets to generate robust representations for speech processing. ([source](https://github.com/microsoft/unilm/tree/master/speechlm))
- [Model Deployment Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits.md) — Provides toolkits for efficient sequence-to-sequence decoding and model compression for production environments. ([source](https://github.com/microsoft/unilm#readme))
- [Multimodal Token Interleaving](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-models/multimodal-token-interleaving.md) — Implements multimodal data tokenization to align audio, text, and visual inputs into unified sequences for model training. ([source](https://github.com/microsoft/unilm/tree/master/speechlm))
- [Self-Supervised Embedding Trainers](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/word-embeddings/self-supervised-embedding-trainers.md) — Trains vision models using self-supervised learning to create reusable feature representations. ([source](https://github.com/microsoft/unilm/tree/master/beit2))
- [Multimodal Document Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/pre-training-pipelines/multimodal-document-pre-training.md) — Learns joint representations of text, spatial layout, and visual image features for document understanding tasks. ([source](https://github.com/microsoft/unilm/tree/master/layoutlm))
- [Preference Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/preference-optimization.md) — Refines model outputs using direct preference optimization by comparing responses against feedback. ([source](https://github.com/microsoft/unilm/tree/master/PFPO))
- [Sequence Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-decoding-models/sequence-decoders.md) — Predicts multiple tokens simultaneously during sequence generation to reduce decoding steps. ([source](https://github.com/microsoft/unilm/tree/master/decoding))
- [Subword Tokenization](https://awesome-repositories.com/f/artificial-intelligence-ml/subword-tokenization.md) — Implements subword text tokenization to convert raw text into numerical sequences for transformer architectures. ([source](https://github.com/microsoft/unilm/tree/master/beit3))
- [Vision-Language Grounding Models](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-grounding-models.md) — Links text spans such as noun phrases and referring expressions to specific image regions to enable phrase grounding and comprehension. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-2))
- [Attention Mechanisms](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms.md) — Calculates attention scores using a differential mechanism that subtracts two separate attention maps to improve model focus and performance. ([source](https://github.com/microsoft/unilm/tree/master/Diff-Transformer))
- [Audio Tokenization](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-tokenization.md) — Learns acoustic representations from raw audio data using iterative tokenization for downstream classification tasks. ([source](https://github.com/microsoft/unilm/tree/master/beats))
- [Data Preparation](https://awesome-repositories.com/f/artificial-intelligence-ml/data-preparation.md) — Provides pipelines for multilingual training data preparation, converting raw text and parallel pairs into memory-mapped binary formats. ([source](https://github.com/microsoft/unilm/tree/master/infoxlm))
- [Multilingual Extractors](https://awesome-repositories.com/f/artificial-intelligence-ml/document-analysis/multilingual-extractors.md) — Extends document understanding capabilities to multiple languages by training on cross-lingual datasets to extract key-value pairs from international document formats. ([source](https://github.com/microsoft/unilm/tree/master/layoutlm))
- [Reading Order Benchmarks](https://awesome-repositories.com/f/artificial-intelligence-ml/document-analysis/reading-order-benchmarks.md) — Provides large-scale datasets of document images paired with ground-truth reading order information to evaluate and train document analysis models. ([source](https://github.com/microsoft/unilm/tree/master/layoutreader))
- [Text-to-Image Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-pipelines/text-to-image-generators.md) — Creates images containing coherent text by using text prompts and layout guidance. ([source](https://github.com/microsoft/unilm/tree/master/textdiffuser))
- [Visual Text Renderers](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-pipelines/text-to-image-generators/visual-text-renderers.md) — Generates visual output from text inputs using pre-trained models or fine-tuned adapters to render specific text styles and layouts. ([source](https://github.com/microsoft/unilm/tree/master/textdiffuser-2))
- [Knowledge Distillation](https://awesome-repositories.com/f/artificial-intelligence-ml/knowledge-distillation.md) — Transfers knowledge from teacher models to student retrievers to improve performance and efficiency. ([source](https://github.com/microsoft/unilm/tree/master/simlm))
- [Mathematical Reasoning Training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-training/mathematical-reasoning-training.md) — Trains models on large-scale synthetic instruction datasets to enhance mathematical problem-solving capabilities. ([source](https://github.com/microsoft/unilm/tree/master/mathscale))
- [Long Context Retrieval Testing](https://awesome-repositories.com/f/artificial-intelligence-ml/long-context-training-optimizations/long-context-retrieval-testing.md) — Assesses model recall capabilities within long sequences using needle-in-a-haystack and multi-needle retrieval experiments. ([source](https://github.com/microsoft/unilm/tree/master/YOCO))
- [Custom Vision Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/computer-vision-and-recognition/custom-vision-training.md) — Supports modifying pre-trained vision and language weights to master downstream tasks like visual question answering. ([source](https://github.com/microsoft/unilm/tree/master/unilm-v1))
- [Biencoder Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/fine-tuning-pipelines/biencoder-pipelines.md) — Executes a multi-stage supervised fine-tuning pipeline to develop high-performance biencoder models for information retrieval tasks. ([source](https://github.com/microsoft/unilm/tree/master/simlm))
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning.md) — Adapts pre-trained models to specific document understanding objectives like semantic entity recognition and relation extraction using labeled datasets. ([source](https://github.com/microsoft/unilm/tree/master/layoutxlm))
- [Cross-Lingual Objectives](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training/cross-lingual-objectives.md) — Trains language models using masked language modeling, translation language modeling, and contrastive learning objectives to improve cross-lingual representation. ([source](https://github.com/microsoft/unilm/tree/master/infoxlm))
- [Retrieval Model Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training/retrieval-model-pre-training.md) — Compresses input information into a representation bottleneck to create specialized models for dense passage retrieval. ([source](https://github.com/microsoft/unilm/tree/master/simlm))
- [Ternary Weight Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training/ternary-weight-optimizations.md) — Optimizes large language model architectures by using ternary weights to reduce memory footprint and computational requirements. ([source](https://github.com/microsoft/unilm/tree/master/bitnet))
- [Sequence-to-Sequence Tasks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/sequence-to-sequence-tasks.md) — Trains parameter-efficient transformer models to perform tasks like grammatical error correction and abstractive summarization on resource-constrained devices. ([source](https://github.com/microsoft/unilm/tree/master/edgelm))
- [Speech Translation Systems](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/speech-translation-systems.md) — Translates spoken language by processing audio input and generating text output through sequence-to-sequence models. ([source](https://github.com/microsoft/unilm/tree/master/speecht5))
- [Multilingual Text Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-text-processing.md) — Facilitates multilingual text processing by tokenizing and converting raw text into binary formats for large-scale training. ([source](https://github.com/microsoft/unilm/tree/master/deltalm))
- [Multimodal Prompt Adapters](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-training/multimodal-prompt-adapters.md) — Customizes pre-trained multimodal models for text-intensive image understanding tasks by applying supervised training with task-specific prompts. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-2.5))
- [Speech-to-Text Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-engines.md) — Converts model outputs into text using language models and lexicons to improve transcription accuracy. ([source](https://github.com/microsoft/unilm/tree/master/speechlm))
- [Cross-Lingual Translation Training](https://awesome-repositories.com/f/artificial-intelligence-ml/text-translation-tools/cross-lingual-translation-training.md) — Provides cross-lingual transformer encoders to improve the accuracy and scalability of automated translation workflows. ([source](https://github.com/microsoft/unilm/tree/master/xlmt))
- [Referring Expression Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-inference/referring-expression-generators.md) — Produces descriptive text for specific image regions based on provided visual context using zero-shot or few-shot learning techniques. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-2))
- [Audio Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-processing.md) — Applies fine-tuned acoustic models to categorize audio inputs into specific classes based on learned patterns. ([source](https://github.com/microsoft/unilm/tree/master/beats))
- [Cross-Modal Representations](https://awesome-repositories.com/f/artificial-intelligence-ml/cross-modal-representations.md) — Applies iterative word alignment and contrastive loss functions during training to synchronize semantic representations across different languages. ([source](https://github.com/microsoft/unilm/tree/master/infoxlm))
- [Custom Diffusion Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-diffusion-model-training.md) — Trains two-stage diffusion models on large-scale image-text datasets annotated with character-level segmentation masks and optical character recognition data. ([source](https://github.com/microsoft/unilm/tree/master/textdiffuser))
- [Vocabulary Builders](https://awesome-repositories.com/f/artificial-intelligence-ml/embedding-adaptation-utilities/vocabulary-embedding-adapters/vocabulary-builders.md) — Supports incremental vocabulary generation to expand token sets for domain-specific terminology. ([source](https://github.com/microsoft/unilm/tree/master/adalm))
- [Embedding Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/embedding-generators.md) — Transforms text inputs into high-dimensional vector representations using pre-trained language models to support semantic search and information retrieval tasks. ([source](https://github.com/microsoft/unilm/tree/master/e5))
- [Layout Planners](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/multimodal-layout-analysis/layout-planners.md) — Optimizes models to predict spatial arrangements for text elements within generated images to ensure coherent composition. ([source](https://github.com/microsoft/unilm/tree/master/textdiffuser-2))
- [Pre-made Models](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/model-hubs-and-pre-made-models/pre-made-models.md) — Leverages pre-trained model weights to accelerate development of systems for complex document layout analysis. ([source](https://github.com/microsoft/unilm/tree/master/markuplm))
- [Backbone Model Integration](https://awesome-repositories.com/f/artificial-intelligence-ml/pre-training-pipelines/backbone-model-integration.md) — Utilizes established transformer architectures as backbones to initialize and accelerate training of document understanding systems. ([source](https://github.com/microsoft/unilm/tree/master/layoutlmft))
- [Result Reranking](https://awesome-repositories.com/f/artificial-intelligence-ml/result-reranking.md) — Optimizes re-ranking models to refine retrieval results by evaluating the relevance between queries and passages more precisely. ([source](https://github.com/microsoft/unilm/tree/master/simlm))
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Modifies speaker identity or characteristics of audio input while preserving linguistic content. ([source](https://github.com/microsoft/unilm/tree/master/speecht5))
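
To make the ternary weight idea from the list above more concrete, here is a minimal sketch of one way to map full-precision weights to the values -1, 0, and 1. It assumes an absolute-mean scaling rule followed by rounding and clipping. The function names and the `eps` parameter are hypothetical, and the source does not confirm that the repository quantizes weights in exactly this way.

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantize a 2-D list of floats to {-1, 0, 1} with one shared scale.

    Assumption: the scale is the mean absolute weight (absmean scaling).
    """
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes) + eps
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row]  # round, then clip
        for row in weights
    ]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [[q * scale for q in row] for row in quantized]
```

Each weight then needs only one of three states plus a single shared scale per matrix, which is where the memory saving comes from. The approximation error depends on how the original weights are distributed.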

### Operating Systems & Systems Programming

- [Retentive State Mechanisms](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/retentive-state-mechanisms.md) — Implements retentive state processing to enable efficient sequence generation and handle long-context data during inference.
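
The retentive state idea can be sketched as a recurrence that keeps one fixed-size state matrix and decays it at every step, so generating each new token costs the same regardless of sequence length. The code below is a simplified single-head sketch under stated assumptions: a scalar decay `gamma`, no gating, no normalization, and no positional rotation. The function names are hypothetical. The second function computes the same outputs in a form that looks at every earlier position, for comparison.

```python
def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(Q, K, V):
        # Decay the previous state and add the outer product of k and v.
        state = [
            [gamma * state[i][j] + k[i] * v[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def retention_parallel(Q, K, V, gamma):
    """Equivalent form: o_n = sum over m <= n of gamma^(n-m) (q_n . k_m) v_m."""
    outputs = []
    for n, q in enumerate(Q):
        out = [0.0] * len(V[0])
        for m in range(n + 1):
            score = sum(a * b for a, b in zip(q, K[m])) * gamma ** (n - m)
            for j, v in enumerate(V[m]):
                out[j] += score * v
        outputs.append(out)
    return outputs
```

The recurrent form stores only `state`, whose size does not grow with the number of tokens processed, whereas the second form revisits all earlier keys and values for each position.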

### Content Management & Publishing

- [Reading Order Predictors](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/document-layout-analyzers/reading-order-predictors.md) — Analyzes text and spatial layout information within document images to determine the logical sequence in which text lines should be read. ([source](https://github.com/microsoft/unilm/tree/master/layoutreader))
- [AI-Generated Captions](https://awesome-repositories.com/f/content-management-publishing/documentation-knowledge-management/captioned-figure-managers/ai-generated-captions.md) — Produces descriptive text summaries for images by interpreting visual content and generating corresponding natural language captions. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-2))
- [Markdown Converters](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/format-specific-parsers/markdown-converters.md) — Transforms visual document layouts into structured markdown format by capturing both the text content and its original styling. ([source](https://github.com/microsoft/unilm/tree/master/kosmos-2.5))

### Data & Databases

- [Visual Tokenizers](https://awesome-repositories.com/f/data-databases/data-compression-algorithms/visual-token-compression/visual-tokenizers.md) — Implements visual data tokenization to convert raw images into discrete tokens using encoder-decoder architectures. ([source](https://github.com/microsoft/unilm/tree/master/beit2))
- [Sparse Caching Strategies](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/caching-performance/caching-strategies/cache-key-generators/sparse-caching-strategies.md) — Implements sparse key-value caching to maintain accuracy while accelerating the processing of long text sequences.
- [Document Classification](https://awesome-repositories.com/f/data-databases/document-classification.md) — Categorizes documents based on their visual structure and content to automate sorting and organization workflows. ([source](https://github.com/microsoft/unilm/tree/master/dit))
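
For the visual tokenization entry in the list above, one common way to turn continuous image features into discrete tokens is a nearest-neighbor lookup in a learned codebook (vector quantization). The sketch below assumes that approach and uses tiny hand-written vectors. The function name and the codebook values are hypothetical, and the source does not state which quantization scheme the repository uses.

```python
def tokenize_patches(features, codebook):
    """Map each patch feature vector to the index of its nearest codebook entry.

    Assumption: squared Euclidean distance decides which entry is nearest.
    """
    tokens = []
    for feature in features:
        distances = [
            sum((a - b) ** 2 for a, b in zip(feature, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical two-entry codebook and two patch features.
codebook = [[0.0, 0.0], [1.0, 1.0]]
patches = [[0.1, 0.0], [0.9, 1.1]]
token_ids = tokenize_patches(patches, codebook)
```

The resulting integer ids can then play the same role for images that subword ids play for text, for example as prediction targets in masked image modeling.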

### Development Tools & Productivity

- [Data Input Interfaces](https://awesome-repositories.com/f/development-tools-productivity/data-input-interfaces.md) — Provides automated input data processing to detect and handle raw text or pre-tokenized dataset structures. ([source](https://github.com/microsoft/unilm/tree/master/s2s-ft))

### Graphics & Multimedia

- [Audio Feature Extraction](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/audio-analysis-synthesis/audio-feature-extraction.md) — Processes audio input through pre-trained models to generate numerical representations for downstream speech analysis. ([source](https://github.com/microsoft/unilm/tree/master/speechlm))

### Software Engineering & Architecture

- [Training Cycles](https://awesome-repositories.com/f/software-engineering-architecture/document-models/training-cycles.md) — Executes training cycles and performance testing on document datasets to optimize model accuracy for specialized document understanding tasks. ([source](https://github.com/microsoft/unilm/tree/master/xdoc))
