# microsoft/synapseml

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/microsoft-synapseml).**

5,230 stars · 860 forks · Scala · MIT

## Links

- GitHub: https://github.com/microsoft/SynapseML
- Homepage: http://aka.ms/spark
- awesome-repositories: https://awesome-repositories.com/repository/microsoft-synapseml.md

## Topics

`ai` `apache-spark` `azure` `big-data` `cognitive-services` `data-science` `databricks` `deep-learning` `http` `lightgbm` `machine-learning` `microsoft` `ml` `model-deployment` `onnx` `opencv` `pyspark` `scala` `spark` `synapse`

## Description

SynapseML is an Apache Spark machine learning library designed for building and scaling machine learning workflows and data pipelines across distributed clusters. It serves as a distributed machine learning pipeline framework and a distributed inference engine for executing hardware-accelerated predictions and deep learning tasks on large-scale datasets.

The project functions as a cloud AI integration layer, allowing users to apply pretrained artificial intelligence services for text, vision, and speech within distributed pipelines. It also includes a dedicated suite of tools for distributed anomaly detection to identify multivariate and time-series outliers across high-dimensional data.

The library covers a broad range of capabilities, including distributed computer vision for face and image analysis, scalable natural language processing for text analytics and translation, and the training of gradient boosted decision trees. It provides tools for similarity search via k-nearest neighbor modeling, model explainability through feature attribution, and the orchestration of reinforcement learning workflows.

The system uses a composable pipeline architecture and supports ONNX-based model inference for cross-platform compatibility.
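
The composable pipeline idea can be illustrated with a minimal sketch in plain Python. This is not SynapseML's API: the class names `MeanCenter`, `Threshold`, and `Pipeline` are hypothetical, and the data is an in-memory list rather than a Spark DataFrame. It only shows the general convention of Spark ML style pipelines, where stages that need fitting learn from the data and every stage then transforms it in order.

```python
from typing import Dict, List

Row = Dict[str, float]


class MeanCenterModel:
    """Fitted stage: subtracts a learned mean from one column."""

    def __init__(self, input_col: str, output_col: str, mean: float) -> None:
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


class MeanCenter:
    """Unfitted stage (hypothetical): learns the column mean in fit()."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows: List[Row]) -> MeanCenterModel:
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanCenterModel(self.input_col, self.output_col, mean)


class Threshold:
    """Stateless stage: flags rows whose value exceeds a cutoff."""

    def __init__(self, input_col: str, output_col: str, cutoff: float) -> None:
        self.input_col = input_col
        self.output_col = output_col
        self.cutoff = cutoff

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, self.output_col: 1.0 if r[self.input_col] > self.cutoff else 0.0}
            for r in rows
        ]


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in sequence."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Sequences stages; fits those that define fit(), passes others through."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted = []
        current = rows
        for stage in self.stages:
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            current = model.transform(current)  # feed output to the next stage
            fitted.append(model)
        return PipelineModel(fitted)


training = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
pipeline = Pipeline([MeanCenter("x", "centered"), Threshold("centered", "flag", cutoff=0.0)])
model = pipeline.fit(training)
scored = model.transform([{"x": 0.0}, {"x": 5.0}])
```

Because each stage only reads and writes named columns, stages can be reordered or swapped without changing the others, which is the property the word "composable" refers to here.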

## Tags

### Artificial Intelligence & ML

- [Distributed ML Pipeline Managers](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-ml-pipeline-managers.md) — Provides a composable framework for sequencing data featurization and model training across distributed compute clusters.
- [Distributed Machine Learning Integrators](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/distributed-and-scaling-strategies/distributed-learning/distributed-machine-learning-integrators.md) — Provides an interface for training and evaluating machine learning models on large-scale datasets using parallelized Spark structures.
- [AI Service Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-service-integrations.md) — Provides a layer to integrate and apply cloud-based AI services for sentiment and language analysis within distributed pipelines. ([source](https://cdn.jsdelivr.net/gh/microsoft/synapseml@master/README.md))
- [Anomaly Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/anomaly-detection.md) — Identifies multivariate outliers and unusual patterns in high-dimensional data and time series.
- [Cloud AI Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/cloud-ai-integrations.md) — Wraps external cloud AI services as pipeline steps by communicating over HTTP to process distributed data.
- [Computer Vision](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision.md) — Executes image recognition, face analysis, and object detection tasks across multi-node clusters.
- [Deep Learning Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/deep-learning-inference-engines.md) — Executes pre-trained deep learning models on CPU or GPU hardware to generate predictions for large datasets. ([source](https://microsoft.github.io/SynapseML/))
- [Spark Cluster Connectivity](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-deep-learning/spark-integrations/spark-cluster-connectivity.md) — Establishes network connections to distributed Spark clusters to execute machine learning workflows across multiple nodes. ([source](https://microsoft.github.io/SynapseML/docs/Reference/R%20Setup/))
- [Distributed Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-inference-engines.md) — Implements a distributed inference engine that splits and executes machine learning workloads across multiple cluster nodes.
- [Distributed Text Analytics](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-text-analytics.md) — Performs sentiment analysis, entity extraction, and language translation on massive textual datasets using distributed processing. ([source](https://cdn.jsdelivr.net/gh/microsoft/synapseml@master/README.md))
- [Machine Learning Workflow Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-workflow-libraries.md) — Provides a standardized framework for building and scaling machine learning pipelines across distributed Spark clusters.
- [Distributed Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-training.md) — Scales the training and evaluation of machine learning models across multiple compute nodes. ([source](https://microsoft.github.io/SynapseML/docs/1.0.10/Overview/))
- [Model Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/model-integration-pipelines/model-inference.md) — Runs pretrained deep learning models across clusters to generate large-scale predictions using hardware acceleration. ([source](https://cdn.jsdelivr.net/gh/microsoft/synapseml@master/README.md))
- [Natural Language Processing Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing-analysis.md) — Executes linguistic analysis tasks including sentiment analysis, language detection, and entity extraction. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Scalable Anomaly Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/scalable-anomaly-detection.md) — Provides distributed implementations of anomaly detection to identify outliers in large-scale, high-dimensional datasets.
- [Face Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/face-analysis.md) — Detects human faces in images to perform verification, identification, grouping, and similarity matching. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Cybersecurity Machine Learning](https://awesome-repositories.com/f/artificial-intelligence-ml/cybersecurity-machine-learning.md) — Applies specialized machine learning models to detect and analyze cybersecurity threats. ([source](https://cdn.jsdelivr.net/gh/microsoft/synapseml@master/README.md))
- [Gradient Boosting](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-boosting.md) — Implements distributed training for gradient boosted decision trees to process large-scale datasets. ([source](https://cdn.jsdelivr.net/gh/microsoft/synapseml@master/README.md))
- [Nonlinear Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-boosting/outlier-detection/nonlinear-detection.md) — Identifies anomalies in high-dimensional data using a distributed forest of isolation trees. ([source](https://cdn.jsdelivr.net/gh/microsoft/synapseml@master/README.md))
- [Image Content Analyzers](https://awesome-repositories.com/f/artificial-intelligence-ml/image-content-analyzers.md) — Provides tools for detecting objects and text within images to automate metadata extraction. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Machine Learning Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-pipelines.md) — Embeds distributed machine learning models into pipelines to support batch, streaming, and serving workloads. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/LightGBM/Overview/))
- [Distributed Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/frameworks/computer-vision/distributed-execution.md) — Runs vision-based machine learning models across distributed clusters to process large-scale image data. ([source](https://microsoft.github.io/SynapseML/docs/1.0.2/Overview/))
- [ONNX Runtime Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/inference-engines/onnx-runtime-inference.md) — Executes pre-trained models on CPU or GPU hardware using the cross-platform ONNX runtime for distributed inference.
- [Medical Relationship Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/medical-relationship-extraction.md) — Identifies medical entities and maps relationships within unstructured clinical documents and health records. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Multilingual Document Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-document-translation.md) — Translates text and full documents across multiple languages while preserving original structural formatting. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Speech and Text Conversion](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-and-text-conversion.md) — Provides integrated pipelines for transcribing audio to text and synthesizing text into neural audio. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Time Series Anomaly Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/time-series-anomaly-detection.md) — Generates models to identify irregularities and anomalous data points within time series data. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
- [Elastic Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/training-engines/elastic-scaling.md) — Executes model training and evaluation across compute environments that dynamically resize based on the workload.
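
Several tags above describe inference that is split across cluster nodes. A rough sketch of one common pattern for this is shown below in plain Python: the data is divided into partitions, each worker loads the model once, and rows are scored in small batches. Everything here is an illustrative assumption rather than SynapseML's implementation: threads stand in for cluster nodes, and `load_model` is a hypothetical stand-in that returns a fixed linear scorer instead of a real ONNX model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = Sequence[float]


def split_partitions(rows: List[Vector], num_partitions: int) -> List[List[Vector]]:
    """Split rows into at most num_partitions contiguous chunks."""
    if not rows:
        return []
    size = -(-len(rows) // num_partitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def load_model() -> Callable[[List[Vector]], List[float]]:
    """Hypothetical model loader: returns a fixed linear scorer."""
    weights = [0.5, -1.0]

    def predict(batch: List[Vector]) -> List[float]:
        return [sum(w * x for w, x in zip(weights, row)) for row in batch]

    return predict


def infer_partition(rows: List[Vector], batch_size: int = 2) -> List[float]:
    """Score one partition, loading the model once and batching the rows."""
    model = load_model()
    outputs: List[float] = []
    for start in range(0, len(rows), batch_size):
        outputs.extend(model(rows[start:start + batch_size]))
    return outputs


def distributed_infer(rows: List[Vector], num_partitions: int, batch_size: int = 2) -> List[float]:
    """Score all rows, one worker thread per partition, preserving row order."""
    partitions = split_partitions(rows, num_partitions)
    if not partitions:
        return []
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        per_partition = list(pool.map(lambda p: infer_partition(p, batch_size), partitions))
    return [y for part in per_partition for y in part]


rows = [[2.0, 1.0], [4.0, 0.0], [0.0, 3.0], [1.0, 1.0], [6.0, 2.0]]
predictions = distributed_infer(rows, num_partitions=2)
```

Loading the model once per partition rather than once per row is the design choice that matters in this sketch, since model loading usually costs far more than scoring a single row.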

### Data & Databases

- [Distributed Data Processing](https://awesome-repositories.com/f/data-databases/distributed-data-processing.md) — Distributes machine learning workloads across a cluster of nodes using the Spark distributed data processing engine.
- [Text Analytics](https://awesome-repositories.com/f/data-databases/distributed-analytical-runtimes/text-analytics.md) — Processes textual data at scale using distributed interfaces to extract insights and patterns. ([source](https://microsoft.github.io/SynapseML/docs/1.0.2/Overview/))
- [Anomaly Detection Algorithms](https://awesome-repositories.com/f/data-databases/anomaly-detection-algorithms.md) — Implements algorithms to identify outliers across multiple data streams by analyzing inter-correlations and dependencies. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/Anomaly%20Detection/Quickstart%20-%20Isolation%20Forests/))
- [Computation Serving](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/distributed-processing-frameworks/distributed-computing/computation-serving.md) — Exposes cluster-based computations as web services to deliver results with sub-millisecond response times. ([source](https://cdn.jsdelivr.net/gh/microsoft/synapseml@master/README.md))
- [Direct Memory Data Transfer](https://awesome-repositories.com/f/data-databases/shared-memory-data-exchange/direct-memory-data-transfer.md) — Optimizes data movement and memory usage between distributed partitions and native datasets using direct memory transfer. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/LightGBM/Overview/))
- [Similarity Search](https://awesome-repositories.com/f/data-databases/similarity-search.md) — Identifies the nearest neighbors for a query across large datasets based on feature similarity. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/Other%20Algorithms/Quickstart%20-%20Exploring%20Art%20Across%20Cultures/))
- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Extracts key-value pairs and tables from business forms and IDs into structured formats. ([source](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/))
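
The similarity search tag above refers to finding the nearest neighbors of a query by feature similarity. As a minimal illustration of that idea, the sketch below does a brute-force search with Euclidean distance in plain Python. It is an assumption-laden toy, not the library's method: the function name `k_nearest` and the sample data are hypothetical, and a distributed implementation would typically use an index structure instead of scanning every item.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Item = Tuple[str, Sequence[float]]  # (label, feature vector)


def k_nearest(query: Sequence[float], items: List[Item], k: int) -> List[Item]:
    """Return the k items closest to the query, nearest first (brute force)."""
    return heapq.nsmallest(k, items, key=lambda item: math.dist(query, item[1]))


catalog: List[Item] = [
    ("a", [0.0, 0.0]),
    ("b", [1.0, 1.0]),
    ("c", [5.0, 5.0]),
    ("d", [0.5, 0.0]),
]
neighbors = k_nearest([0.0, 0.1], catalog, k=2)
labels = [label for label, _ in neighbors]
```

The scan costs one distance computation per item for every query, so this approach is only practical for small collections.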

### Development Tools & Productivity

- [Machine Learning Pipelines](https://awesome-repositories.com/f/development-tools-productivity/task-pipeline-managers/machine-learning-pipelines.md) — Orchestrates composable workflows that integrate data featurization, model training, and external AI services.

### Software Engineering & Architecture

- [Composable Architectures](https://awesome-repositories.com/f/software-engineering-architecture/composable-architectures.md) — Sequences data featurization and model training into unified workflows via a modular, composable interface.
