Mmlspark

Mmlspark is a distributed framework for executing machine learning models, data transformations, and AI service integrations across Apache Spark clusters. It functions as a distributed machine learning library and pipeline orchestrator, allowing users to integrate pre-trained cognitive services and custom models into large-scale batch and streaming workflows.

The project is distinguished by its ability to incorporate external AI services and web APIs directly into big data pipelines for text and vision analysis. It also provides a scalable model training framework that coordinates gradient boosting and classification tasks across elastically resizable compute clusters, and it uses hardware acceleration for distributed model inference.

The toolset covers a broad range of capabilities including multimodal content analysis for image, speech, and text, as well as advanced anomaly detection for time-series and multivariate data. It includes utilities for data featurization, the execution of ONNX models, and responsible AI tools for model fairness auditing and prediction interpretation using additive contribution values.
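To make "additive contribution values" concrete, here is a minimal sketch in plain Python. It is not the library's implementation: it computes exact Shapley values by brute force over every feature coalition, which is only practical for a handful of features, whereas production tools typically rely on sampling-based approximations. The names shapley_values and toy_model, the toy model itself, and the all-zero baseline are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(row):
    # Hypothetical model: one linear term plus one interaction term
    return 2.0 * row[0] + row[1] * row[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Additivity: the contributions sum to prediction minus baseline prediction
total = sum(phi)
gap = toy_model(x) - toy_model(baseline)
```

The additivity property is what makes these values usable as an explanation: each feature receives a share of the gap between the prediction and the baseline prediction, and the shares add up to that gap exactly. In the toy model the interaction term is split evenly between the two features that produce it.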

The framework also provides a unified data access interface for reading and writing across various databases and cloud storage systems.

Features

Spark Integrations - Wraps machine learning workflows within a Spark execution engine to distribute processing across a compute cluster.
Apache Spark Pipelines - Integrates machine learning models and data featurization into distributed computing workflows using Apache Spark.
AI Service Integrations - Provides connectors and interfaces for integrating external artificial intelligence services into distributed data pipelines.
Distributed Inference Scaling - Executes trained models across clusters to generate predictions on large-scale data with hardware acceleration.
Distributed ML Pipeline Managers - Orchestrates the integration of pre-trained AI services and custom models into large-scale batch and streaming workflows.
Distributed Training - Coordinates the training and evaluation of machine learning models across elastically resizable compute clusters.
Gradient Boosting - Trains gradient boosting models across a distributed cluster to handle large datasets.
Hardware-Accelerated Inference - Uses GPU and CPU acceleration to execute trained deep learning models across distributed datasets for higher throughput.
Machine Learning Pipelines - Incorporates machine learning models into batch and streaming workflows for consistent data processing.
Machine Learning Workflow Libraries - Integrates machine learning models into distributed computing pipelines using a shared API.
Distributed Training - Scales the training and evaluation of machine learning models across multiple compute nodes.
Spark MLlib Integrations - Provides a distributed framework for executing machine learning models and transformations across Apache Spark clusters.
Model Training Frameworks - Offers a toolkit for training and evaluating gradient boosting and classification models across resizable clusters.
Multivariate Anomaly Detection - Identifies outliers across multiple variables by analyzing inter-correlations and dependencies between data dimensions.
Scalable Anomaly Detection - Identifies outliers and multivariate anomalies in big data using distributed forest implementations and time series analysis.
Data Access & Abstraction - Simplifies data experiments by abstracting access to various databases, file systems, and cloud data stores.
Distributed Text Analytics - Processes massive amounts of unstructured text to extract sentiment, key phrases, and language identification.
Unified Data Access Interfaces - Offers a standardized interface for reading and writing data across diverse databases, file systems, and cloud stores.
Unified Data Connector Interfaces - Provides a common interface for reading and writing across diverse databases and cloud storage systems.
ML Lifecycle Pipelines - Combines data featurization and model training into a single workflow to derive predictions at scale.
AI Workflow Serving - Exposes distributed computations as low-latency web services for real-time inference.
Face Analysis - Detects human faces and groups individuals based on facial similarity and identity verification.
Distributed Computer Vision - Analyzes image content and detects facial features across large datasets using distributed computing clusters.
Distributed Data Featurization - Provides utilities for normalizing numeric features and mapping categorical labels across distributed datasets.
Distributed Training Optimizers - Optimizes distributed training through specialized synchronization to improve execution speed and memory usage.
Forest-Based - Identifies anomalies in large datasets using distributed forest implementations.
Image Content Analyzers - Identifies visual features, generates descriptions, and extracts text or tags from images.
Inference Accelerators - Runs distributed model inference using hardware acceleration and standardized formats to reduce processing time.
ONNX Runtime Inference - Executes trained deep learning and machine learning models across distributed data using the ONNX runtime.
Model Persistence - Serializes machine learning models and metadata to a persistent store for retrieval and deployment.
Model Auditing - Analyzes opaque models to understand decision-making and measure biases within datasets.
Model Explainability - Calculates feature importance and contribution scores to explain why specific data points were flagged.
Model Serializers - Persists model bytes and metadata to a storage layer for retrieval and deployment across environments.
SHAP Value Computations - Calculates an additive contribution value for each feature to quantify how much it contributes to the final model prediction.
Sentiment & Topic Analysis - Detects languages, extracts key phrases, and calculates sentiment scores from unstructured text.
Time Series Anomaly Detection - Analyzes a series of data points to identify irregularities, including whether the latest point is anomalous.
Document and Unstructured Extraction - Transforms unstructured documents and receipts into structured data using OCR and machine learning.
K-Nearest Neighbor Retrieval - Executes scalable k-nearest neighbor queries with conditional filtering on distributed data.
Surrogate Model Explanations - Generates local interpretations for data by fitting a surrogate model around a specific observation.
Sparse Data Processing - Implements fast and sparse data structures to handle high-dimensional text analytics at a distributed scale.
External API Ingestion Pipelines - Ingests data directly into big data processing pipelines by calling arbitrary external HTTP web services.
External API Integrations - Connects distributed data pipelines to external pre-trained AI services via HTTP requests for content analysis.
General Machine Learning - Distributed machine learning framework for Apache Spark.
Machine Learning - Distributed ML library with broad support.
Machine Learning Frameworks - Distributed machine learning framework for Apache Spark.
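
As an illustration of the "K-Nearest Neighbor Retrieval" entry above, the following is a minimal single-machine sketch of a k-nearest neighbor query with conditional filtering. It is a brute-force scan in plain Python, not the library's distributed implementation; the function name conditional_knn, the label-based filter, and the sample points are all assumptions made for this example.

```python
import heapq
import math


def conditional_knn(points, query, k, allowed_labels):
    """Return the k nearest (distance, label, vector) tuples to `query`,
    considering only points whose label is in `allowed_labels`."""
    candidates = (
        (math.dist(vec, query), label, vec)
        for vec, label in points
        if label in allowed_labels  # the conditional filter
    )
    return heapq.nsmallest(k, candidates, key=lambda c: c[0])


# Hypothetical labeled points: (vector, label)
points = [
    ((0.0, 0.0), "a"),
    ((1.0, 0.0), "b"),
    ((0.0, 2.0), "a"),
    ((5.0, 5.0), "b"),
    ((0.5, 0.5), "c"),
]

# The point labeled "c" is closer than (1.0, 0.0) but is filtered out
neighbors = conditional_knn(points, (0.0, 0.0), k=2, allowed_labels={"a", "b"})
labels = [label for _, label, _ in neighbors]
```

Applying the filter before ranking, rather than discarding results afterward, guarantees that k matching neighbors come back whenever at least k points satisfy the condition.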

microsoft/SynapseML

5,230 stars

SynapseML is an Apache Spark machine learning library designed for building and scaling machine learning workflows and data pipelines across distributed clusters. It serves as a distributed machine learning pipeline framework and a distributed inference engine for executing hardware-accelerated predictions and deep learning tasks on large-scale datasets. The project functions as a cloud AI integration layer, allowing users to apply pretrained artificial intelligence services for text, vision, and speech within distributed pipelines. It also includes a dedicated suite of tools for distributed

catboost/catboost

8,808 stars

CatBoost is a gradient boosting machine learning library used to train decision tree ensembles for regression, classification, and ranking tasks. It functions as a high-performance framework that provides a categorical data processor for transforming non-numeric features, a distributed trainer for large-scale datasets, and GPU acceleration to speed up model construction. The library distinguishes itself through native handling of categorical data and text features, removing the need for manual encoding. It includes a specialized model interpretability tool that leverages SHAP values and featu

rasbt/python-machine-learning-book-2nd-edition

7,194 stars

This project is a machine learning educational resource and implementation guide for Python. It provides a collection of executable code and notebooks that demonstrate predictive modeling, data analysis workflows, and the implementation of various machine learning algorithms. The repository features practical examples of classification, regression, and clustering tasks using Scikit-Learn, alongside tutorials for building and training deep learning architectures with TensorFlow. These include implementations of convolutional and recurrent networks. The content covers a broad range of capabili

lightgbm-org/LightGBM

18,460 stars

LightGBM is a gradient boosting framework used to train decision tree ensembles for classification, regression, and ranking tasks. It functions as a distributed machine learning library and a decision tree ensemble implementation that utilizes leaf-wise growth and histogram-based feature binning. The framework is distinguished by its ability to offload heavy computations to CUDA or OpenCL devices for GPU acceleration and its capacity to parallelize training across multiple nodes using sockets, MPI, or Dask. It includes a specialized categorical feature processor that optimizes partitions for

Azuremmlspark

Features