Smile

Smile (Statistical Machine Intelligence and Learning Engine) is a machine learning library and statistical computing toolkit for the JVM. It provides algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models.

The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encoding tokenization and an OpenAI-compatible REST API with server-sent event streaming. Additionally, it allows trained models to be wrapped as transformers for integration into Apache Spark pipelines.
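To make the Byte-Pair Encoding step mentioned above concrete, here is a minimal sketch of a single training merge in plain Java. It is not Smile's tokenizer API: the class and method names are hypothetical, and it works on characters for readability, whereas byte-level BPE applies the same idea to raw bytes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: hypothetical names, not Smile's tokenizer API.
final class BpeSketch {

    // Returns the most frequent adjacent pair of symbols, or null when the
    // sequence has fewer than two symbols. On ties, the pair that reaches
    // the top count first wins.
    static List<String> mostFrequentPair(List<String> symbols) {
        Map<List<String>, Integer> counts = new HashMap<>();
        List<String> best = null;
        int bestCount = 0;
        for (int i = 0; i + 1 < symbols.size(); i++) {
            List<String> pair = List.of(symbols.get(i), symbols.get(i + 1));
            int count = counts.merge(pair, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = pair;
            }
        }
        return best;
    }

    // Replaces each non-overlapping occurrence of (left, right) with the
    // merged symbol, scanning from left to right.
    static List<String> merge(List<String> symbols, String left, String right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < symbols.size()) {
            boolean match = i + 1 < symbols.size()
                    && symbols.get(i).equals(left)
                    && symbols.get(i + 1).equals(right);
            if (match) {
                out.add(left + right);
                i += 2;
            } else {
                out.add(symbols.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Start from single-character symbols and apply one merge.
        List<String> symbols = new ArrayList<>(List.of("aaabdaaabac".split("")));
        List<String> pair = mostFrequentPair(symbols);
        System.out.println(pair + " -> " + merge(symbols, pair.get(0), pair.get(1)));
    }
}
```

A full tokenizer would repeat this merge until a target vocabulary size is reached and record the merge order so that new text can be encoded the same way.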

The toolkit covers a broad range of data science tasks, including linear algebra, numerical optimization, and statistical hypothesis testing. It provides tools for data preprocessing, dimensionality reduction, and signal processing, as well as interactive 2D and 3D visualization. For linguistic analysis, it supports part-of-speech tagging, stemming, and keyword extraction.
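As an illustration of the kind of hypothesis test referred to above, the following plain-Java sketch computes Welch's two-sample t statistic and its approximate degrees of freedom. It does not use Smile's API, the names are hypothetical, and it stops short of a p-value, which would need a t-distribution CDF.

```java
// Illustrative sketch only: hypothetical names, not Smile's statistics API.
final class WelchTSketch {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1); needs at least two values.
    static double variance(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    // Returns {t, df} for Welch's unequal-variance t-test, with df from the
    // Welch-Satterthwaite approximation.
    static double[] welch(double[] a, double[] b) {
        double va = variance(a) / a.length;
        double vb = variance(b) / b.length;
        double t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
        double df = (va + vb) * (va + vb)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
        return new double[] {t, df};
    }

    public static void main(String[] args) {
        double[] result = welch(new double[] {1, 2, 3, 4, 5},
                                new double[] {2, 4, 6, 8, 10});
        System.out.println("t = " + result[0] + ", df = " + result[1]);
    }
}
```

Welch's variant is used here because it does not assume equal variances in the two samples.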

The project provides idiomatic JVM language APIs and includes a desktop environment with an interactive shell for exploratory data analysis and model training.

Features

JVM Machine Learning Libraries - Provides a comprehensive suite of machine learning algorithms implemented natively for Java, Scala, and Kotlin.

GPU Acceleration - Accelerates neural network training and inference using native LibTorch bindings for GPU computation.

Inference Engines - Executes text generation and chat completions using BPE tokenization and streaming REST APIs.

Natural Language Processing - Implements core linguistic analysis tasks including tokenization, stemming, part-of-speech tagging, and keyword extraction.

Natural Language Processing Libraries - Offers a full NLP library for tokenization, stemming, part-of-speech tagging, and keyword extraction.

Neural Network Training Toolkits - Develops deep learning models using GPU-accelerated tensors and standard neural network layer types.

Deep Learning Acceleration - Utilizes LibTorch bindings to accelerate tensor operations and neural network training on GPUs.

Linear Algebra - Provides high-performance mathematical routines for vector and matrix operations via BLAS and LAPACK interfaces.

Numerical Computing - Provides a mathematical toolkit for linear algebra, numerical optimization, and statistical computing on the JVM.

Statistical Analysis Libraries - Provides a toolkit for statistical analysis, hypothesis testing, and probability distributions on the JVM.

Native-Free Implementations - Implements statistical and machine learning logic directly in Java and Scala to eliminate external native dependencies.

Association Rule Learning - Implements algorithms for discovering frequent itemsets and association rules within large datasets.

Spectral Clustering - Implements spectral clustering, which uses the spectrum (eigenvalues and eigenvectors) of a similarity matrix to reduce dimensionality before grouping data points.

Dimensionality Reduction Techniques - Provides algorithms for reducing the number of variables in a dataset while preserving essential structural information.

Genetic Algorithms - Applies genetic algorithms to evolve candidate solutions for complex optimization problems.

Keyword and Phrase Extraction - Identifies significant terms and multi-word phrases from text using TF-IDF and BM25.

Clustering Algorithms - Groups similar data points into clusters based on shared characteristics using various unsupervised algorithms.

Kernel Methods - Implements kernel functions for mapping data into higher-dimensional spaces, specifically for support vector machines.

Large Language Model Serving - Runs text generation and chat completions using BPE tokenization and an OpenAI-compatible REST API.

Linear Regression - Implements statistical methods for modeling relationships between variables using linear equations.

Machine Learning Classification - Trains models to categorize data into predefined classes using a variety of supervised classification algorithms.

ONNX Model Exporters - Supports exporting and importing machine learning models using the standardized ONNX format for cross-framework compatibility.

Tensor Memory Management - Automatically allocates and reclaims GPU tensor memory to prevent leaks during deep learning computation.

Byte Pair Encodings - Implements Byte-Pair Encoding for subword tokenization, ensuring compatibility with large language model vocabularies.

Manifold Learning Algorithms - Implements algorithms that model the underlying manifold of high-dimensional data to identify topological structures.

Hyperparameter Optimization - Uses genetic algorithms to automate the search and selection of optimal configuration parameters for models.

Model Performance Evaluators - Quantifies model accuracy and reliability using metrics like AUC and RMSE via cross-validation.

Model Serialization - Saves trained machine learning models to disk for deployment and pipeline integration.

OpenAI-Compatible Model Servers - Hosts trained models via an OpenAI-compatible REST API with server-sent event streaming.

Word Stemming - Reduces words to their root forms using standard stemming and lemmatization algorithms.

Non-Linear Regression - Trains decision trees or neural networks to model complex non-linear relationships for continuous prediction.

Part-of-Speech Taggers - Assigns grammatical tags to words in sentences using Hidden Markov Model taggers.

Sequence Model Training - Fits Hidden Markov Models or Conditional Random Fields to sequential data for tagging tasks.

Sequential Learning - Provides methods for training models on ordered data sequences, including HMMs and CRFs for sequence labeling.

Byte-Level Encoders - Encodes text into tokens using byte-level Byte-Pair Encoding for language model pipelines.

Time Series Forecasting - Provides models for predicting future values of a time series, supported by autocorrelation analysis.

Text Preprocessing - Prepares text for analysis through sentence splitting, tokenization, stemming, and part-of-speech tagging.

Swing-Based Renderers - Renders interactive 2D and 3D plots using the Java Swing framework for desktop data exploration.

Interactive Data Science Environments - Offers a desktop environment with an interactive shell for exploratory data analysis and model training.

Interactive REPLs - Provides an interactive Java and Scala REPL with pre-imported packages for rapid experimentation.

LLaMA Model Support - Generates text responses from LLaMA-3 models with support for chat and streaming API serving.

Interactive Scientific Plot Constructors - Generates interactive 2D and 3D scatter plots and histograms for scientific data exploration.

Idiomatic API Wrappers - Provides concise, idiomatic Scala, Kotlin, and Java wrappers for accessing machine learning algorithms.

Random Number Generation - Provides high-quality pseudorandom number generation as an alternative to standard library generators.

Distance Metrics - Implements algorithms for calculating distances between elements, including Euclidean, Mahalanobis, Hamming, and edit distances.

Nearest Neighbor Searches - Provides efficient algorithms for finding the closest points in multi-dimensional datasets using spatial structures like k-d trees.

Numerical Function Optimization - Provides numerical function optimization using BFGS and L-BFGS algorithms.

Probability Distributions - Implements various probability distributions such as Normal, Poisson, and Beta for statistical analysis.

Numerical Problem Solving - Provides random number generators and unconstrained optimization for mathematical functions.

Hypothesis Testing - Conducts statistical hypothesis tests including t-tests, chi-squared, and ANOVA to validate assumptions.

Data Plotting Components - Renders interactive 2D and 3D statistical plots using the Java Swing framework.

Deep Learning Frameworks - Functions as a deep learning framework featuring GPU acceleration via LibTorch and ONNX model interchange support.

Machine Learning - Statistical Machine Intelligence & Learning Engine.

Machine Learning Libraries - Comprehensive set of pure Java libraries for statistical machine learning.

Science and Data Analysis - Statistical machine intelligence and learning engine.
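The Distance Metrics entry in the list above names Euclidean, Hamming, and edit distances. As a rough illustration of what those measures compute, here is a plain-Java sketch. It does not call Smile, and the class and method names are hypothetical.

```java
// Illustrative sketch only: hypothetical names, not Smile's distance classes.
final class DistanceSketch {

    // Straight-line distance between two vectors of equal length.
    static double euclidean(double[] x, double[] y) {
        if (x.length != y.length) throw new IllegalArgumentException("length mismatch");
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Number of positions at which two equal-length strings differ.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) throw new IllegalArgumentException("length mismatch");
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) count++;
        }
        return count;
    }

    // Levenshtein edit distance: the minimum number of single-character
    // insertions, deletions, and substitutions, computed with two DP rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(euclidean(new double[] {0, 0}, new double[] {3, 4})); // prints 5.0
        System.out.println(hamming("karolin", "kathrin"));                      // prints 3
        System.out.println(editDistance("kitten", "sitting"));                  // prints 3
    }
}
```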
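The Model Performance Evaluators entry in the list above mentions RMSE and AUC. The sketch below shows one way to compute each on a single set of predictions in plain Java. The names are hypothetical and not taken from Smile, the AUC uses a simple quadratic pairwise count rather than an optimized ranking method, and cross-validation is left out.

```java
// Illustrative sketch only: hypothetical names, not Smile's validation API.
final class MetricsSketch {

    // Root mean squared error between true values and predictions.
    static double rmse(double[] truth, double[] prediction) {
        if (truth.length != prediction.length) throw new IllegalArgumentException("length mismatch");
        double sum = 0.0;
        for (int i = 0; i < truth.length; i++) {
            double e = truth[i] - prediction[i];
            sum += e * e;
        }
        return Math.sqrt(sum / truth.length);
    }

    // Area under the ROC curve: the fraction of (positive, negative) pairs in
    // which the positive example gets the higher score; ties count as half.
    // Labels are 1 for positive and 0 for negative.
    static double auc(int[] label, double[] score) {
        if (label.length != score.length) throw new IllegalArgumentException("length mismatch");
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < label.length; i++) {
            if (label[i] != 1) continue;
            for (int j = 0; j < label.length; j++) {
                if (label[j] != 0) continue;
                pairs++;
                if (score[i] > score[j]) wins += 1.0;
                else if (score[i] == score[j]) wins += 0.5;
            }
        }
        if (pairs == 0) throw new IllegalArgumentException("need both classes");
        return wins / pairs;
    }

    public static void main(String[] args) {
        System.out.println(rmse(new double[] {3, -0.5, 2, 7}, new double[] {2.5, 0, 2, 8}));
        System.out.println(auc(new int[] {0, 0, 1, 1}, new double[] {0.1, 0.4, 0.35, 0.8})); // prints 0.75
    }
}
```

In a cross-validated setting, these functions would be applied to the held-out predictions of each fold and the results averaged.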

haifengl/smile
