# azure/mmlspark

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/azure-mmlspark).**

5,228 stars · 861 forks · Scala · MIT

## Links

- GitHub: https://github.com/Azure/mmlspark
- Homepage: http://aka.ms/spark
- awesome-repositories: https://awesome-repositories.com/repository/azure-mmlspark.md

## Description

MMLSpark (now SynapseML) is a distributed framework for running machine learning models, data transformations, and AI service integrations on Apache Spark clusters. It serves as both a distributed machine learning library and a pipeline orchestrator, letting users integrate pre-trained cognitive services and custom models into large-scale batch and streaming workflows.

The project stands out for its ability to call external AI services and web APIs directly from big data pipelines for text and vision analysis. It also provides a scalable training framework that coordinates gradient boosting and classification tasks across elastically resizable compute clusters, and it uses GPU and CPU acceleration for distributed model inference.

The toolset covers multimodal content analysis for image, speech, and text, as well as anomaly detection for time-series and multivariate data. It also includes utilities for data featurization, ONNX model execution, and responsible AI tools for auditing model fairness and interpreting predictions using additive contribution values.
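To make "additive contribution values" concrete, here is a minimal sketch in plain Python. It is a brute-force Shapley computation on a toy model, not the library's implementation; `shapley_values`, `toy_model`, and the background rows are hypothetical names and data chosen for illustration. The point it shows is additivity: the base value plus the per-feature contributions adds up to the model's prediction for the explained row.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x.

    Features outside a coalition are filled in from each background row,
    and the predictions are averaged. Cost grows exponentially with the
    number of features, so this is only suitable for tiny examples.
    """
    n = len(x)

    def value(subset):
        # Mean prediction when only the features in `subset` come from x.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = frozenset(combo)
                phi[i] += weight * (value(s | {i}) - value(s))

    base = value(frozenset())  # mean prediction over the background rows
    return base, phi


def toy_model(features):
    # Hypothetical scoring function with one additive and one interaction term.
    a, b, c = features
    return 2.0 * a + b * c - 1.0


background = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
x = [3.0, 2.0, 4.0]
base, phi = shapley_values(toy_model, x, background)

# Additivity: base value plus contributions reconstructs the prediction.
print(abs(base + sum(phi) - toy_model(x)) < 1e-9)
```

Distributed explainers typically approximate these values by sampling coalitions instead of enumerating them, since exact enumeration is infeasible beyond a handful of features.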

The framework also provides a unified data access interface for reading and writing across various databases and cloud storage systems.

## Tags

### Artificial Intelligence & ML

- [Spark Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-deep-learning/spark-integrations.md) — Wraps machine learning workflows within a Spark execution engine to distribute processing across a compute cluster.
- [AI Service Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-service-integrations.md) — Provides connectors and interfaces for integrating external artificial intelligence services into distributed data pipelines. ([source](https://microsoft.github.io/SynapseML/docs/Get%20Started/Quickstart%20-%20Your%20First%20Models/))
- [Distributed Inference Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-inference-scaling.md) — Executes trained models across clusters to generate predictions on large-scale data with hardware acceleration.
- [Distributed ML Pipeline Managers](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-ml-pipeline-managers.md) — Orchestrates the integration of pre-trained AI services and custom models into large-scale batch and streaming workflows.
- [Distributed Training](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-frameworks/distributed-training.md) — Coordinates the training and evaluation of machine learning models across elastically resizable compute clusters. ([source](https://microsoft.github.io/SynapseML/docs/Overview/))
- [Gradient Boosting](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-boosting.md) — Trains gradient boosting models across a distributed cluster to handle large datasets. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))
- [Hardware-Accelerated Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-accelerated-inference.md) — Uses GPU and CPU acceleration to execute trained deep learning models across distributed datasets for higher throughput.
- [Machine Learning Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-pipelines.md) — Incorporates machine learning models into batch and streaming workflows for consistent data processing. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/LightGBM/Overview/))
- [Machine Learning Workflow Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-workflow-libraries.md) — Integrates machine learning models into distributed computing pipelines using a shared API. ([source](https://microsoft.github.io/SynapseML/docs/1.0.11/Overview/))
- [Distributed Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-training.md) — Scales the training and evaluation of machine learning models across multiple compute nodes.
- [Spark MLlib Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/distributed-and-scaling-strategies/distributed-learning/distributed-machine-learning-integrators/spark-mllib-integrations.md) — Provides a distributed framework for executing machine learning models and transformations across Apache Spark clusters.
- [Model Training Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-frameworks.md) — Offers a toolkit for training and evaluating gradient boosting and classification models across resizable clusters.
- [Multivariate Anomaly Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/multivariate-anomaly-detection.md) — Identifies outliers across multiple variables by analyzing inter-correlations and dependencies between data dimensions. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/Anomaly%20Detection/Quickstart%20-%20Isolation%20Forests/))
- [Scalable Anomaly Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/scalable-anomaly-detection.md) — Identifies outliers and multivariate anomalies in big data using distributed forest implementations and time series analysis.
- [AI Workflow Serving](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-workflow-serving.md) — Exposes distributed computations as low-latency web services for real-time inference. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))
- [Face Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/face-analysis.md) — Detects human faces, groups individuals by facial similarity, and verifies identity. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Distributed Computer Vision](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-computer-vision.md) — Analyzes image content and detects facial features across large datasets using distributed computing clusters.
- [Distributed Data Featurization](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-data-featurization.md) — Provides utilities for normalizing numeric features and mapping categorical labels across distributed datasets.
- [Distributed Training Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-optimizers.md) — Optimizes distributed training through specialized synchronization to improve execution speed and memory usage. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/LightGBM/Overview/))
- [Forest-Based](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-boosting/outlier-detection/forest-based.md) — Identifies anomalies in large datasets using distributed forest implementations. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))
- [Image Content Analyzers](https://awesome-repositories.com/f/artificial-intelligence-ml/image-content-analyzers.md) — Identifies visual features, generates descriptions, and extracts text or tags from images. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Inference Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-accelerators.md) — Runs distributed model inference using hardware acceleration and standardized formats to reduce processing time. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))
- [ONNX Runtime Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/inference-engines/onnx-runtime-inference.md) — Executes trained deep learning and machine learning models across distributed data using the ONNX runtime. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/Deep%20Learning/ONNX/))
- [Model Persistence](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/data-and-checkpointing/model-loading/model-persistence.md) — Serializes machine learning models and metadata to a persistent store for retrieval and deployment. ([source](https://microsoft.github.io/SynapseML/docs/Use%20with%20MLFlow/Overview/))
- [Model Auditing](https://awesome-repositories.com/f/artificial-intelligence-ml/model-auditing.md) — Analyzes opaque models to understand decision-making and measure biases within datasets. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))
- [Model Explainability](https://awesome-repositories.com/f/artificial-intelligence-ml/model-predictions/model-explainability.md) — Calculates feature importance and contribution scores to explain why specific data points were flagged. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/Anomaly%20Detection/Quickstart%20-%20Isolation%20Forests/))
- [Model Serializers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-serializers.md) — Persists model bytes and metadata to a storage layer for retrieval and deployment across environments.
- [SHAP Value Computations](https://awesome-repositories.com/f/artificial-intelligence-ml/shap-value-computations.md) — Calculates an additive contribution value for each feature to quantify its effect on the final model prediction. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/Responsible%20AI/Interpreting%20Model%20Predictions/))
- [Sentiment & Topic Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/text-feature-extraction/sentiment-topic-analysis.md) — Detects languages, extracts key phrases, and calculates sentiment scores from unstructured text. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Time Series Anomaly Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/time-series-anomaly-detection.md) — Analyzes a series of data points to identify irregularities across the series or to check whether the latest point is anomalous. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))

### Data & Databases

- [Apache Spark Pipelines](https://awesome-repositories.com/f/data-databases/apache-spark-pipelines.md) — Integrates machine learning models and data featurization into distributed computing workflows using Apache Spark.
- [Data Access & Abstraction](https://awesome-repositories.com/f/data-databases/data-access-querying/data-access-abstraction.md) — Simplifies data experiments by abstracting access to various databases, file systems, and cloud data stores. ([source](https://microsoft.github.io/SynapseML/docs/1.0.1/Overview/))
- [Distributed Text Analytics](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-processing-tools/unstructured-text-processing/distributed-text-analytics.md) — Processes massive amounts of unstructured text to extract sentiment, key phrases, and language identification. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))
- [Unified Data Access Interfaces](https://awesome-repositories.com/f/data-databases/unified-data-access-interfaces.md) — Offers a standardized interface for reading and writing data across diverse databases, file systems, and cloud stores. ([source](https://microsoft.github.io/SynapseML/docs/0.11.4/Overview/))
- [Unified Data Connector Interfaces](https://awesome-repositories.com/f/data-databases/unified-storage-interfaces/unified-data-connector-interfaces.md) — Provides a common interface for reading and writing across diverse databases and cloud storage systems.
- [Document and Unstructured Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction.md) — Transforms unstructured documents and receipts into structured data using OCR and machine learning. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [K-Nearest Neighbor Retrieval](https://awesome-repositories.com/f/data-databases/k-nearest-neighbor-retrieval.md) — Executes scalable k-nearest neighbor queries with conditional filtering on distributed data. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))
- [Surrogate Model Explanations](https://awesome-repositories.com/f/data-databases/tabular-data-frameworks/tabular-predictive-models/tabular-explanations/surrogate-model-explanations.md) — Generates local interpretations for data by fitting a surrogate model around a specific observation. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/Responsible%20AI/Interpreting%20Model%20Predictions/))

### Development Tools & Productivity

- [ML Lifecycle Pipelines](https://awesome-repositories.com/f/development-tools-productivity/build-lifecycle-pipelines/ml-lifecycle-pipelines.md) — Combines data featurization and model training into a single workflow to derive predictions at scale. ([source](https://cdn.jsdelivr.net/gh/azure/mmlspark@main/README.md))

### Scientific & Mathematical Computing

- [Sparse Data Processing](https://awesome-repositories.com/f/scientific-mathematical-computing/sparse-data-processing.md) — Implements fast and sparse data structures to handle high-dimensional text analytics at a distributed scale.

### Software Engineering & Architecture

- [External API Ingestion Pipelines](https://awesome-repositories.com/f/software-engineering-architecture/unified-data-modeling/external-api-ingestion-pipelines.md) — Ingests data directly into big data processing pipelines by calling arbitrary external HTTP web services. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))

### Web Development

- [External API Integrations](https://awesome-repositories.com/f/web-development/external-api-integrations.md) — Connects distributed data pipelines to external pre-trained AI services via HTTP requests for content analysis.

### Part of an Awesome List

- [General Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/general-machine-learning.md) — Distributed machine learning framework for Apache Spark.
- [Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/machine-learning.md) — Distributed ML library with broad support.
- [Machine Learning Frameworks](https://awesome-repositories.com/f/awesome-lists/ai/machine-learning-frameworks.md) — Distributed machine learning framework for Apache Spark.
