30 open-source projects similar to clips/pattern, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Pattern alternative.
CoreNLP is a Java natural language processing library designed to convert raw human language text into structured data. It utilizes a suite of linguistic annotators to analyze text through a pipeline, extracting grammatical structures, sentiment, and linguistic patterns. The project includes a coreference resolution engine that links multiple mentions of the same entity to maintain contextual consistency across documents. It also provides tools for named entity recognition to categorize people, companies, and locations, and a part-of-speech tagger to assign grammatical categories and base for
TextBlob is a natural language processing library that provides a unified interface for common linguistic tasks. It operates as a wrapper-based API, simplifying the use of complex processing libraries by delegating core operations to specialized external frameworks. The project features a pluggable processing pipeline that allows for the integration of custom logic and alternative language engines. It supports the extension of processing models through plugins to add specific language support or custom data processing. The library covers a broad range of linguistic capabilities, including se
This is an interactive notebook-based course that teaches machine learning from Python fundamentals through deep learning and natural language processing. It uses real datasets and multiple frameworks within a structured, hands-on curriculum that combines concise explanations with executable code cells, built-in datasets, and embedded exercise checkpoints. Learning progresses through data preparation and exploration, classical machine learning workflows, computer vision with convolutional neural networks, and natural language processing with deep learning, all delivered as a cohesive progressi
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
This project is a comprehensive Python toolkit designed for natural language processing, research, and education. It functions as a linguistic data processor that provides a standardized framework for managing, cleaning, and analyzing large collections of annotated text corpora and lexical resources. The library distinguishes itself through its integration of both symbolic and statistical methods, allowing users to perform complex tasks ranging from rule-based grammar parsing to machine learning-driven classification. It offers a modular pipeline for text processing, enabling the transformati
Flair is a transformer-based natural language processing framework used to build and train models for text classification and sequence tagging. It provides a specialized library for generating contextual text embeddings and performing linguistic analysis. The framework includes dedicated tools for named entity recognition, including the identification of specialized biomedical entities across multiple languages. It further supports entity linking to map identified text mentions to unique entries within general or biomedical knowledge bases. The project covers a broad range of language analys
Compromise is a natural language processing library and rule-based text parser designed to analyze unstructured text. It functions as a toolkit for identifying parts of speech, linguistic patterns, and semantic meaning, while providing specialized engines for named entity recognition and the parsing of temporal and numeric data. The project is distinguished by its linguistic morphological engine, which can conjugate verbs across different tenses and inflect nouns and adjectives. It further allows for linguistic model customization through a plugin system that enables the extension of lexicons
Natural is a natural language processing library for Node.js that provides tools for text analysis, tokenization, and phonetic matching. It functions as a collection of specialized toolsets for word stemming, string similarity quantification, and pattern-based text classification. The library includes a phonetic sound analyzer that converts words into phonetic representations to identify matches based on sound rather than literal spelling. It also features a text classification engine that assigns categories to text inputs using trained models and pattern recognition. Additional capabilities
nlp.js is a JavaScript natural language processing library and development framework used to build natural language understanding engines. It provides a toolkit for creating local machine learning models for intent classification and acts as a multilingual text processor that detects languages and normalizes text across various dialects. The framework distinguishes itself by supporting local execution on both servers and mobile devices, enabling chatbot functionality without an internet connection. It features a specialized system for conversational slot filling to collect mandatory informati
This project is a Chinese text segmentation library and tokenizer designed to split Chinese sentences into individual words. It serves as a natural language processing tool for splitting characters into words, tagging parts of speech, and extracting keywords using statistical analysis. The library distinguishes itself through support for custom dictionary configuration and vocabulary file management, allowing users to override default segmentation rules for domain-specific accuracy. It also includes a TF-IDF keyword extractor to identify significant words and core topics within documents. Th
This project is a support vector machine library implemented in C, providing an engine for classification and regression tasks. It functions as a machine learning kernel library and a statistical model validator used to categorize data points and predict continuous numerical values. The library allows for the definition of custom kernel functions to calculate similarity between data points in specialized datasets. It also includes tools for probabilistic modeling, such as estimating class membership, data density, and distribution boundaries. Broad capabilities cover model training for multi
This project is an educational resource providing practical code examples and implementations of machine learning algorithms using the Python language. It serves as a guide for constructing predictive pipelines, clustering models, and dimensionality reduction within the Scikit-Learn ecosystem. The repository includes comprehensive demonstrations for supervised and unsupervised learning, as well as detailed examples for implementing neural networks and deep architectures. It also provides practical guidance on exporting model parameters to JSON and wrapping trained models in web APIs for produ
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
Compromise is a natural language processing library and rule-based engine designed for English text manipulation, analysis, and parsing. It provides a toolkit for tokenizing text, identifying parts of speech, and performing linguistic analysis to achieve semantic understanding of unstructured strings. The project distinguishes itself through its ability to programmatically transform grammar, such as modifying verb tenses, noun plurality, and adjective forms. It also functions as a named entity recognizer capable of extracting people, places, organizations, dates, and contact information from
YSDA course in Natural Language Processing
HanLP is a natural language processing library and deep learning framework specifically optimized for the Chinese language, while also functioning as a multilingual text processor. It serves as a toolkit for performing linguistic analysis, semantic understanding, and script conversion. The project distinguishes itself through a dedicated focus on Chinese linguistic structures, including a specialized script converter for transforming text between Simplified Chinese, Traditional Chinese, and Pinyin. It further supports domain-specific model training to improve the recognition of professional t
Open Deep Research is an AI-powered web research agent that combines a reasoning model with live web search and data extraction to perform deep, multi-source investigations on any topic. It operates through a dual interface, offering both a command-line tool and a Model Context Protocol server, allowing developers to integrate web capabilities directly into AI agents and coding assistants. The project distinguishes itself by orchestrating an iterative research loop where a reasoning model plans steps, interprets search results, and guides subsequent web interactions. It uses Firecrawl for scr
Spider is a multi-threaded web crawler and link extraction library designed for systematic website traversal and data collection. It functions as a utility for discovering and cataloging hyperlinks across domains, enabling the mapping of site architecture and the gathering of structured information. The tool utilizes a thread-pool concurrency model to fetch and parse multiple web pages simultaneously, maximizing network throughput during the crawling process. It manages navigation through a centralized queue-based frontier and employs in-memory state tracking to ensure efficient deduplication
CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network
This project is an educational course and learning curriculum for implementing and fine-tuning transformer models using the Hugging Face ecosystem. It serves as a structured guide and technical walkthrough for processing multimodal data, adapting pre-trained neural networks, and deploying models. The material includes a guide for managing, versioning, and distributing model weights and datasets through a centralized asset hub. It also provides a practical tutorial on adapting models to specific datasets using parameter-efficient methods and an implementation guide for solving natural language
This repository provides a collection of deep learning models and neural network architectures built for natural language processing tasks. It functions as a library of pre-trained models designed to process, analyze, and generate human language data using the TensorFlow framework. The project utilizes sequence-to-sequence modeling and layered neural architectures to handle variable-length language data. By employing static dataflow graphing and tensor-based representations, the models execute mathematical operations to transform input features into abstract linguistic meanings. Users can loa
Crawler4j is a multi-threaded Java web crawler and spider designed for high-volume web traversal and content extraction. It functions as a polite crawling framework that enables the discovery and indexing of HTML and binary content across multiple websites. The project distinguishes itself through a persistent crawling model that serializes session state to local storage, allowing the engine to resume indexing after a crash or interruption. It includes a politeness controller to regulate request frequency and delays, preventing server overloading and IP blocking. The system covers a broad ra
This repository is a comprehensive collection of instructional guides and practical examples for Python development, focusing on machine learning, data science, and web scraping. It provides implementations for neural networks, reinforcement learning algorithms, and deep learning architectures using PyTorch, alongside detailed manuals for scientific computing and data visualization. The project distinguishes itself by offering specialized tutorials on concurrent programming to optimize CPU performance and guides for setting up Linux development environments. It covers the implementation of ad
This project is a development course and learning curriculum focused on building large language model chatbots. It provides a structured series of tutorials for creating conversational agents through the application of natural language processing and deep learning models. The materials include a technical walkthrough for implementing neural networks and word embeddings to handle automated question-answering tasks. It also provides a guide for constructing large-scale conversation corpora from external text sources to train and evaluate dialogue systems. The curriculum covers core text analys
This project is a comprehensive instructional resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and gen
This is a comprehensive deep learning course delivered entirely through Jupyter Notebooks, designed to teach neural network construction using TensorFlow 2.x. The curriculum follows a sequential-model-first pedagogy, introducing the Sequential API before moving to functional and subclassing approaches, and covers the full spectrum of model building from regression and classification through convolutional neural networks, natural language processing, and time series forecasting. The course is structured around a checkpoint-based training workflow that saves the best model weights during traini
PyText is an extensible PyTorch-based framework for building, training, and deploying custom natural language processing models, including text classifiers, sequence taggers, and intent-slot predictors. It provides a modular toolkit that allows developers to assemble these models using pluggable registries for model architectures, data formats, and tensorizers, all configurable through YAML files without requiring code changes. The framework distinguishes itself through its comprehensive support for the full NLP model lifecycle, from training to production inference. It includes pre-built neu
Stanza is a Python natural language processing library designed for tokenization, lemmatization, and dependency parsing across many human languages using neural models. It provides a neural processing pipeline that converts raw text into structured linguistic data objects, alongside a specialized analyzer for extracting medical insights from clinical and biomedical language. The project includes a wrapper that connects Python scripts to Java-based natural language processing tools and remote annotation servers. This enables a bridge for extracting linguistic annotations and analysis data from
This project is a machine learning algorithm reference and implementation guide that provides theoretical foundations and code for supervised learning, deep learning, and natural language processing. It serves as a comprehensive toolkit for implementing predictive models and a technical reference for algorithm engineering. The project focuses on ensemble learning frameworks, including the construction of decision trees, random forests, and gradient boosting models. It also functions as a probabilistic graphical model library and an NLP algorithm reference, with specific implementations for se