Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interface for submitting Hive, MapReduce, and Pig jobs and managing HCatalog metadata.
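To make the WebHCat interface concrete, the following is a minimal sketch of how a Hive query submission could be assembled as an HTTP request. It assumes the commonly documented WebHCat defaults (port 50111, the `/templeton/v1/hive` resource, and the `execute`, `statusdir`, and `user.name` parameters). The host name, user, and status directory are hypothetical placeholders, and the request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_webhcat_hive_request(host, query, user, status_dir, port=50111):
    """Build (but do not send) a POST request that submits a Hive query via WebHCat.

    Assumptions: WebHCat listens on its default port 50111 and accepts the
    user as a `user.name` query parameter (no Kerberos/SPNEGO in this sketch).
    """
    url = f"http://{host}:{port}/templeton/v1/hive?{urlencode({'user.name': user})}"
    # `execute` carries the HiveQL text; `statusdir` is where job output is written.
    body = urlencode({"execute": query, "statusdir": status_dir}).encode("utf-8")
    return Request(url, data=body, method="POST")


# Hypothetical host, user, and paths, for illustration only.
req = build_webhcat_hive_request(
    "webhcat.example.com",
    "SELECT COUNT(*) FROM sales;",
    "analyst",
    "/tmp/sales_count",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would submit the job. A secured cluster would normally need additional authentication that this sketch leaves out.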
Hive distinguishes itself through multi-engine query execution: queries can run on Apache Tez, Apache Spark, or MapReduce, depending on which engines the installed release supports, to balance performance and resource usage across different workloads. It supports external data federation, enabling direct querying of remote databases, Druid, HBase, and Iceberg tables without moving data. Enterprise security integration provides authentication via Kerberos, LDAP, SAML, JWT, or OAuth2, with fine-grained access control through Apache Ranger. The cost-based optimizer, materialized views, and persistent LLAP daemons work together to cut query latency and can deliver sub-second responses for suitable queries on large datasets.
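As a rough illustration of engine selection, the sketch below prepares the two statements a session might issue to pick an execution engine and inspect the resulting plan. It relies on the `hive.execution.engine` property and the `EXPLAIN` statement. The helper name and the accepted engine list are assumptions made for this example, and a given Hive release may reject some of these values.

```python
# Engine identifiers historically accepted by hive.execution.engine;
# which ones a particular release still supports is an assumption to verify.
SUPPORTED_ENGINES = ("mr", "tez", "spark")


def explain_with_engine(query, engine="tez"):
    """Return the HiveQL statements that select an engine and explain a query.

    Hypothetical helper: it only builds strings and does not contact Hive.
    """
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    # Drop a trailing semicolon so callers can add their own terminator.
    statement = query.strip().rstrip(";").rstrip()
    return [
        f"SET hive.execution.engine={engine}",
        f"EXPLAIN {statement}",
    ]


for stmt in explain_with_engine("SELECT region, SUM(amount) FROM sales GROUP BY region;"):
    print(stmt + ";")
```

Comparing the `EXPLAIN` output under different engines is one way to see how the same query is planned for each of them.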
The platform offers comprehensive data management capabilities including ACID transactions, multiple storage formats such as ORC, Parquet, Avro, and RCFile, and support for cloud storage on S3, Azure Data Lake, and Google Cloud Storage. It includes a pluggable SerDe abstraction layer for custom data formats and a storage handler interface for connecting to external systems like HBase, Druid, Kudu, and JDBC sources. Advanced SQL features cover windowed aggregation, grouping sets, common table expressions, and geospatial calculations, while extensibility is provided through user-defined functions, custom MapReduce scripts, and procedural SQL execution.
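Two of the features above, ACID tables and windowed aggregation, can be sketched as generated HiveQL. The helpers below are hypothetical and only build statement text. They assume a full ACID table is declared as an ORC table with the `transactional` table property, and the table and column names are invented for illustration. Identifiers are not escaped, so this is not suitable for untrusted input.

```python
def create_acid_table_ddl(table, columns):
    """Build a CREATE TABLE statement for a transactional ORC table.

    `columns` is a list of (name, hive_type) pairs.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({cols}) "
        "STORED AS ORC TBLPROPERTIES ('transactional'='true')"
    )


def running_total_query(table, partition_col, order_col, value_col):
    """Build a windowed aggregation: a running total per partition."""
    return (
        f"SELECT {partition_col}, {order_col}, {value_col}, "
        f"SUM({value_col}) OVER (PARTITION BY {partition_col} ORDER BY {order_col}) "
        f"AS running_total FROM {table}"
    )


# Hypothetical table layout, for illustration only.
ddl = create_acid_table_ddl("sales", [("region", "STRING"), ("sold_on", "DATE"), ("amount", "DOUBLE")])
query = running_total_query("sales", "region", "sold_on", "amount")
print(ddl)
print(query)
```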
Hive can be deployed from stable release tarballs, as Docker containers, or on Amazon EMR, and includes command-line tools such as Beeline for interactive and batch query execution and the HCatalog CLI (hcat) for metadata (DDL) operations. Monitoring and observability features allow users to inspect query execution plans, track job status, and view runtime metrics.
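For batch use, a Beeline invocation can be assembled programmatically. The sketch below builds the argument list for a one-off query using Beeline's `-u` (JDBC URL), `-n` (user name), and `-e` (query) options, and assumes HiveServer2 is reachable on its default port 10000. The host and user are hypothetical, and the command is printed rather than executed.

```python
import shlex


def beeline_command(host, query, user, port=10000, database="default"):
    """Return the argv list for running a single query through Beeline."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    return ["beeline", "-u", url, "-n", user, "-e", query]


# Hypothetical connection details, for illustration only.
cmd = beeline_command("hs2.example.com", "SHOW TABLES", "analyst")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where Beeline is installed, which avoids shell-quoting problems in the query text.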