Why is pola-rs/polars a recommended Categorical Data Optimization repository?

Creates categorical columns that infer categories from data to reduce memory usage and increase speed.

Why is dmlc/xgboost a recommended Categorical Data Optimization repository?

Processes categorical features natively via partition-based splitting to improve efficiency and accuracy.

Why is lightgbm-org/lightgbm a recommended Categorical Data Optimization repository?

Optimizes partitions for categorical variables using native splitting instead of one-hot encoding.

Why is nlp-love/ml-nlp a recommended Categorical Data Optimization repository?

Implements native categorical splitting in decision trees to avoid one-hot encoding.

Why is dask/dask a recommended Categorical Data Optimization repository?

Converts columns to categorical types and tracks category sets to optimize performance across distributed partitions.

Why is iamseancheney/python_for_data_analysis_2nd_chinese_version a recommended Categorical Data Optimization repository?

Implements memory-efficient representations for categorical data to optimize performance during grouping operations.

Why is apache/pinot a recommended Categorical Data Optimization repository?

Classifies columns as dimensions, metrics, or time fields to enable internal optimizations like automated rollups.

Why is jdorn/sql-formatter a recommended Categorical Data Optimization repository?

Splits batches of database commands into individual, executable statements by identifying termination characters.

8 Repos

Awesome GitHub Repositories: Categorical Data Optimization

Memory-efficient representations for categorical data in tabular formats.

Distinguishing note: Focuses on dynamic inference of categories for performance.
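As an illustration of what a memory-efficient categorical representation with inferred categories can look like, here is a minimal dictionary-encoding sketch in plain Python. The function names `encode_categorical` and `decode_categorical` are hypothetical and do not come from any of the listed repositories; real libraries use their own, more elaborate layouts.

```python
from array import array


def encode_categorical(values):
    """Dictionary-encode a column, inferring categories from the data.

    Returns (categories, codes): each distinct value is stored once in
    `categories`, and every row is stored as a small integer in `codes`.
    """
    categories = []      # distinct values in first-seen order
    index = {}           # value -> integer code
    codes = array("i")   # compact integer array, one entry per row
    for value in values:
        code = index.get(value)
        if code is None:
            # First time this value appears: register a new category.
            code = len(categories)
            index[value] = code
            categories.append(value)
        codes.append(code)
    return categories, codes


def decode_categorical(categories, codes):
    """Rebuild the original column from categories and codes."""
    return [categories[code] for code in codes]


column = ["red", "green", "red", "blue", "green", "red"]
categories, codes = encode_categorical(column)
print(categories)   # distinct values, stored once
print(list(codes))  # one integer per row
```

The saving comes from storing each string once and repeating only fixed-width integers, which also makes comparisons and grouping cheaper than working on the strings directly.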

Explore 8 GitHub repositories in Data & Databases · Categorical Data Optimization.


pola-rs/polars
38,855 · View on GitHub
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Creates categorical columns that infer categories from data to reduce memory usage and increase speed.
Rust · arrow · dataframe · dataframe-library
dmlc/xgboost
28,471 · View on GitHub
XGBoost is a distributed machine learning library for implementing scalable gradient boosting decision trees used for regression, classification, and ranking. It functions as a predictive model framework and a cross-language toolkit, providing a core implementation with native bindings for Python, R, Java, Scala, and C++. The system is designed as a GPU-accelerated library that utilizes CUDA and NCCL to speed up the training of decision tree ensembles. It operates as a distributed framework capable of scaling training and prediction across multi-node clusters and GPU environments to process m
Processes categorical features natively via partition-based splitting to improve efficiency and accuracy.
C++ · distributed-systems · gbdt · gbm
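To make "partition-based splitting" concrete, the sketch below shows one classic way to split on a categorical feature without one-hot encoding: order the categories by their mean target, then scan the cut points of that ordering for the best two-way partition. This is an assumption-laden illustration for a squared-error regression target, not XGBoost's or LightGBM's actual implementation, and `best_categorical_split` is a hypothetical name.

```python
from collections import defaultdict


def best_categorical_split(categories, targets):
    """Return the set of categories sent to the left child.

    Categories are ordered by mean target; only the len(ordered) - 1
    cut points of that ordering are evaluated. Returns None when the
    feature has fewer than two distinct categories.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1

    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])
    total_sum = sum(sums.values())
    total_n = sum(counts.values())

    best_score, best_left = float("-inf"), None
    left_sum, left_n = 0.0, 0
    for i, category in enumerate(ordered[:-1]):
        left_sum += sums[category]
        left_n += counts[category]
        right_sum = total_sum - left_sum
        right_n = total_n - left_n
        # Maximizing this score is equivalent to minimizing the
        # summed squared error of the two children.
        score = left_sum ** 2 / left_n + right_sum ** 2 / right_n
        if score > best_score:
            best_score, best_left = score, set(ordered[: i + 1])
    return best_left


feature = ["a", "b", "c", "a", "b", "c", "d", "d"]
target = [1.0, 5.0, 1.2, 0.8, 5.2, 1.0, 4.8, 5.0]
left = best_categorical_split(feature, target)
print(sorted(left))  # categories grouped into the left child
```

A single split here groups several categories at once, whereas one-hot encoding can isolate only one category per split.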
lightgbm-org/LightGBM
18,460 · View on GitHub
LightGBM is a gradient boosting framework used to train decision tree ensembles for classification, regression, and ranking tasks. It functions as a distributed machine learning library and a decision tree ensemble implementation that utilizes leaf-wise growth and histogram-based feature binning. The framework is distinguished by its ability to offload heavy computations to CUDA or OpenCL devices for GPU acceleration and its capacity to parallelize training across multiple nodes using sockets, MPI, or Dask. It includes a specialized categorical feature processor that optimizes partitions for
Optimizes partitions for categorical variables using native splitting instead of one-hot encoding.
C++
NLP-LOVE/ML-NLP
17,725 · View on GitHub
This project is a machine learning algorithm reference and implementation guide that provides theoretical foundations and code for supervised learning, deep learning, and natural language processing. It serves as a comprehensive toolkit for implementing predictive models and a technical reference for algorithm engineering. The project focuses on ensemble learning frameworks, including the construction of decision trees, random forests, and gradient boosting models. It also functions as a probabilistic graphical model library and an NLP algorithm reference, with specific implementations for se
Implements native categorical splitting in decision trees to avoid one-hot encoding.
Jupyter Notebook · deep-learning · machine-learning · nlp
dask/dask
13,746 · View on GitHub
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computation by representing tasks and their dependencies as directed acyclic graphs. This architecture lets the system automate the distribution of workloads across available hardware while managing complex execution requirements. The project is distinguished by a lazy evaluation engine that defers data operations until they are explicitly requested, which enables global graph optimization and efficient resource allocation. It integrates memory-aware data spilling to prevent crashes when processing datasets that exceed available memory, and it uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a broad surface for large-scale data analysis, including support for distributed machine learning, integration with high-performance computing, and parallel data processing. It provides extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments across a range of infrastructures, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Converts columns to categorical types and tracks category sets to optimize performance across distributed partitions.
Python · dask · numpy · pandas
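The idea of tracking category sets across partitions can be sketched in plain Python: each partition may see only some of the values, so a global category list is built first and every partition is then encoded against it. The helpers `union_categories` and `encode_partition` are hypothetical names used for illustration; this is not Dask's API or internal mechanism.

```python
def union_categories(partitions):
    """Collect one global, sorted category list from all partitions."""
    seen = set()
    for part in partitions:
        seen.update(part)
    return sorted(seen)


def encode_partition(part, categories):
    """Encode one partition against the shared category list."""
    lookup = {value: code for code, value in enumerate(categories)}
    return [lookup[value] for value in part]


# Three partitions of the same logical column; none of them
# contains every category on its own.
partitions = [["us", "de", "us"], ["fr", "de"], ["us", "jp"]]
categories = union_categories(partitions)
encoded = [encode_partition(part, categories) for part in partitions]
print(categories)
print(encoded)
```

Because all partitions share one category list, the same integer code means the same value everywhere, so partitions can be concatenated or grouped without re-encoding.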
iamseancheney/python_for_data_analysis_2nd_chinese_version
8,937 · View on GitHub
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Implements memory-efficient representations for categorical data to optimize performance during grouping operations.
matplotlib · numpy · pandas
apache/pinot
6,098 · View on GitHub
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Classifies columns as dimensions, metrics, or time fields to enable internal optimizations like automated rollups.
Java
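A rollup over columns classified as dimensions and metrics can be pictured as follows: rows that share the same dimension values are collapsed into one row whose metrics are aggregated. The sketch assumes a simple sum aggregation and ignores time fields; the `rollup` function and the column names are hypothetical and are not taken from Pinot.

```python
from collections import defaultdict


def rollup(rows, dimensions, metrics):
    """Collapse rows with identical dimension values, summing metrics."""
    totals = defaultdict(lambda: [0] * len(metrics))
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        acc = totals[key]
        for i, metric in enumerate(metrics):
            acc[i] += row[metric]
    # Rebuild one output row per distinct dimension combination.
    return [
        {**dict(zip(dimensions, key)), **dict(zip(metrics, acc))}
        for key, acc in totals.items()
    ]


rows = [
    {"country": "us", "device": "ios", "clicks": 3},
    {"country": "us", "device": "ios", "clicks": 2},
    {"country": "de", "device": "web", "clicks": 1},
]
rolled = rollup(rows, dimensions=["country", "device"], metrics=["clicks"])
print(rolled)
```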
jdorn/sql-formatter
3,857 · View on GitHub
This project is a PHP library designed for parsing, beautifying, and syntax-highlighting SQL queries. It provides a set of utilities to improve the readability of database code, facilitate debugging, and assist in the maintenance of complex query structures. The library distinguishes itself by offering both aesthetic and functional processing capabilities. It can transform raw SQL strings into structured, indented formats for human review, or compress them by removing comments and unnecessary whitespace to optimize them for network transmission and logging. Additionally, it includes a syntax
Splits batches of database commands into individual, executable statements by identifying termination characters.
HTML
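Splitting a batch of commands on a termination character can be sketched as a single pass over the text that ignores terminators inside quoted strings. The example below is written in Python for consistency with the other sketches on this page, not in the library's PHP, and `split_statements` is a hypothetical name. It deliberately does not handle comments or backslash escapes.

```python
def split_statements(sql, terminator=";"):
    """Split a batch of SQL into statements on the terminator.

    Terminators inside single- or double-quoted strings are kept as
    part of the statement. Empty statements are dropped.
    """
    statements, current, quote = [], [], None
    for ch in sql:
        if quote:
            # Inside a quoted string: copy until the closing quote.
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif ch == terminator:
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


batch = "INSERT INTO t VALUES ('a;b'); SELECT * FROM t;"
parts = split_statements(batch)
print(parts)
```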