diff --git a/.gitignore b/.gitignore
index b6c1c570..35cfe67d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,3 +5,4 @@ themes/hive/.DS_Store
 themes/hive/static/.DS_Store
 .hugo_build.lock
 public
+target
diff --git a/README.md b/README.md
index 8fcb6faa..8bab76ac 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,21 @@
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License. -->
+
 # Apache Hive Documentation Site
 
 This repository contains the code for generating the Apache Hive web site.
@@ -25,13 +26,14 @@ It's built with Hugo and hosted at https://hive.apache.org.
 
 * Clone this repository.
 * Install [hugo] on macOS:
-  ```brew install hugo```
-* For other OS please refer: [hugo-install] 
+```brew install hugo```
+* For other OS please refer: [hugo-install]
 * To verify your new install:
 ```hugo version```
 * To build and start the Hugo server run:
+
 ```
 >>> hugo server -D
@@ -55,19 +57,20 @@ Running in Fast Render Mode. For full rebuilds on change: hugo server --disableF
 Web Server is available at http://localhost:1313/ (bind address 127.0.0.1)
 Press Ctrl+C to stop
 ```
-* Navigate to `http://localhost:1313/` to view the site locally. 
+* Navigate to `http://localhost:1313/` to view the site locally.
 
-### To Add New Content 
+### To Add New Content
 
-* To add new markdown file : 
-`hugo new general/Downloads.md`
+* To add new markdown file :
+ `hugo new general/Downloads.md`
 * Update `themes/hive/layouts/partials/menu.html` and `config.toml` to add navigation link to the markdown page as needed.
 
 ### Pushing to site
 
-Commit and push the changes to the main branch. The site is automatically deployed from the site directory. 
+Commit and push the changes to the main branch. The site is automatically deployed from the site directory.
 
 [hugo]: https://gohugo.io/getting-started/quick-start/
-[hugo-install]: https://gohugo.io/installation/
\ No newline at end of file
+[hugo-install]: https://gohugo.io/installation/
+
diff --git a/content/Development/_index.md b/content/Development/_index.md
index e314b2e2..b66ad894 100644
--- a/content/Development/_index.md
+++ b/content/Development/_index.md
@@ -2,3 +2,4 @@
 title: "Development"
 date: 2025-07-24
 ---
+
diff --git a/content/Development/desingdocs/_index.md b/content/Development/desingdocs/_index.md
index c00f76ab..31ceadc4 100644
--- a/content/Development/desingdocs/_index.md
+++ b/content/Development/desingdocs/_index.md
@@ -2,3 +2,4 @@
 title: "Design Documents"
 date: 2025-07-24
 ---
+
diff --git a/content/Development/desingdocs/accessserver-design-proposal.md b/content/Development/desingdocs/accessserver-design-proposal.md
index f46f3db0..2966a9b0 100644
--- a/content/Development/desingdocs/accessserver-design-proposal.md
+++ b/content/Development/desingdocs/accessserver-design-proposal.md
@@ -46,18 +46,15 @@ Hive has a powerful data model that allows users to map logical tables and parti
 
 HCatalog's Storage Based Authorization model is explained in more detail in the [HCatalog documentation](http://hive.apache.org/docs/hcat_r0.5.0/authorization.html), but the following set of quotes provides a good high-level overview:
 
->
-> ... when a file system is used for storage, there is a directory corresponding to a database or a table. With this authorization model, **the read/write permissions a user or group has for this directory determine the permissions a user has on the database or table**.
->
-> ...
->
-> For example, an alter table operation would check if the user has permissions on the table directory before allowing the operation, even if it might not change anything on the file system.
->
-> ...
->
+> ... when a file system is used for storage, there is a directory corresponding to a database or a table. With this authorization model, **the read/write permissions a user or group has for this directory determine the permissions a user has on the database or table**.
+>
+> ...
+>
+> For example, an alter table operation would check if the user has permissions on the table directory before allowing the operation, even if it might not change anything on the file system.
+>
+> ...
+>
 > When the database or table is backed by a file system that has a Unix/POSIX-style permissions model (like HDFS), there are read(r) and write(w) permissions you can set for the owner user, group and ‘other’. The file system’s logic for determining if a user has permission **on the directory or file** will be used by Hive.
->
-> 
 
 There are several problems with this approach, the first of which is actually hinted at by the inconsistency highlighted in the preceding quote. To determine whether a particular user has read permission on table `foo`, HCatalog's [HdfsAuthorizationProvider class](http://svn.apache.org/repos/asf/hive/branches/branch-0.11/hcatalog/core/src/main/java/org/apache/hcatalog/security/HdfsAuthorizationProvider.java) checks to see if the user has read permission on the corresponding HDFS directory `/hive/warehouse/foo` that contains the table's data. However, in HDFS having [read permission on a directory](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html) only implies that you have the ability to list the contents of the directory – it doesn't have any affect on your ability to read the files contained in the directory.
@@ -100,7 +97,3 @@ Finally, red is used in the preceding diagram to highlight HCatalog components w
 
 ![](images/icons/bullet_blue.gif)
 
-
-
-
-
diff --git a/content/Development/desingdocs/binary-datatype-proposal.md b/content/Development/desingdocs/binary-datatype-proposal.md
index df906e19..3c68eb13 100644
--- a/content/Development/desingdocs/binary-datatype-proposal.md
+++ b/content/Development/desingdocs/binary-datatype-proposal.md
@@ -21,9 +21,9 @@ create table binary_table (a string, b binary);
 
 ### How is 'binary' represented internally in Hive
 
-Binary type in Hive will map to 'binary' data type in thrift.  
+Binary type in Hive will map to 'binary' data type in thrift. 
 
-Primitive java object for 'binary' type is ByteArrayRef 
+Primitive java object for 'binary' type is ByteArrayRef
 
 PrimitiveWritableObject for 'binary' type is BytesWritable
 
@@ -41,13 +41,13 @@ As with other types, binary data will be sent to transform script in String form
 
 ### Supported Serde:
 
-ColumnarSerde 
+ColumnarSerde
 
-BinarySortableSerde 
+BinarySortableSerde
 
-LazyBinaryColumnarSerde   
+LazyBinaryColumnarSerde  
 
-LazyBinarySerde 
+LazyBinarySerde
 
 LazySimpleSerde
 
@@ -57,7 +57,3 @@ Group-by and unions will be supported on columns with 'binary' type
 
 
-
-
-
-
diff --git a/content/Development/desingdocs/column-statistics-in-hive.md b/content/Development/desingdocs/column-statistics-in-hive.md
index e6a9c3f1..35d4b53b 100644
--- a/content/Development/desingdocs/column-statistics-in-hive.md
+++ b/content/Development/desingdocs/column-statistics-in-hive.md
@@ -30,59 +30,60 @@ To view column stats :
 ```
 describe formatted [table_name] [column_name];
 ```
+
 ### **Metastore Schema**
 
 To persist column level statistics, we propose to add the following new tables,
 
 CREATE TABLE TAB_COL_STATS
- (
- CS_ID NUMBER NOT NULL,
- TBL_ID NUMBER NOT NULL,
- COLUMN_NAME VARCHAR(128) NOT NULL,
- COLUMN_TYPE VARCHAR(128) NOT NULL,
- TABLE_NAME VARCHAR(128) NOT NULL,
- DB_NAME VARCHAR(128) NOT NULL,
+(
+CS_ID NUMBER NOT NULL,
+TBL_ID NUMBER NOT NULL,
+COLUMN_NAME VARCHAR(128) NOT NULL,
+COLUMN_TYPE VARCHAR(128) NOT NULL,
+TABLE_NAME VARCHAR(128) NOT NULL,
+DB_NAME VARCHAR(128) NOT NULL,
 LOW_VALUE RAW,
- HIGH_VALUE RAW,
- NUM_NULLS BIGINT,
- NUM_DISTINCTS BIGINT,
+HIGH_VALUE RAW,
+NUM_NULLS BIGINT,
+NUM_DISTINCTS BIGINT,
 BIT_VECTOR, BLOB,  /* introduced in [HIVE-16997](https://issues.apache.org/jira/browse/HIVE-16997) in Hive 3.0.0 */
 AVG_COL_LEN DOUBLE,
- MAX_COL_LEN BIGINT,
- NUM_TRUES BIGINT,
- NUM_FALSES BIGINT,
- LAST_ANALYZED BIGINT NOT NULL)
+MAX_COL_LEN BIGINT,
+NUM_TRUES BIGINT,
+NUM_FALSES BIGINT,
+LAST_ANALYZED BIGINT NOT NULL)
 
 ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_PK PRIMARY KEY (CS_ID);
 
 ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_FK1 FOREIGN KEY (TBL_ID) REFERENCES TBLS (TBL_ID) INITIALLY DEFERRED ;
 
 CREATE TABLE PART_COL_STATS
- (
- CS_ID NUMBER NOT NULL,
- PART_ID NUMBER NOT NULL,
+(
+CS_ID NUMBER NOT NULL,
+PART_ID NUMBER NOT NULL,
 DB_NAME VARCHAR(128) NOT NULL,
- COLUMN_NAME VARCHAR(128) NOT NULL,
- COLUMN_TYPE VARCHAR(128) NOT NULL,
- TABLE_NAME VARCHAR(128) NOT NULL,
- PART_NAME VARCHAR(128) NOT NULL,
+COLUMN_NAME VARCHAR(128) NOT NULL,
+COLUMN_TYPE VARCHAR(128) NOT NULL,
+TABLE_NAME VARCHAR(128) NOT NULL,
+PART_NAME VARCHAR(128) NOT NULL,
 LOW_VALUE RAW,
- HIGH_VALUE RAW,
- NUM_NULLS BIGINT,
- NUM_DISTINCTS BIGINT,
+HIGH_VALUE RAW,
+NUM_NULLS BIGINT,
+NUM_DISTINCTS BIGINT,
 BIT_VECTOR, BLOB,  /* introduced in [HIVE-16997](https://issues.apache.org/jira/browse/HIVE-16997) in Hive 3.0.0 */
 AVG_COL_LEN DOUBLE,
- MAX_COL_LEN BIGINT,
- NUM_TRUES BIGINT,
- NUM_FALSES BIGINT,
- LAST_ANALYZED BIGINT NOT NULL)
+MAX_COL_LEN BIGINT,
+NUM_TRUES BIGINT,
+NUM_FALSES BIGINT,
+LAST_ANALYZED BIGINT NOT NULL)
 
 ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_PK PRIMARY KEY (CS_ID);
 
@@ -93,44 +94,44 @@ ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_FK1 FOREIGN KEY (
 
 We propose to add the following Thrift structs to transport column statistics:
 
 struct BooleanColumnStatsData {
- 1: required i64 numTrues,
- 2: required i64 numFalses,
- 3: required i64 numNulls
- }
+1: required i64 numTrues,
+2: required i64 numFalses,
+3: required i64 numNulls
+}
 
 struct DoubleColumnStatsData {
- 1: required double lowValue,
- 2: required double highValue,
- 3: required i64 numNulls,
- 4: required i64 numDVs,
+1: required double lowValue,
+2: required double highValue,
+3: required i64 numNulls,
+4: required i64 numDVs,
 5: optional string bitVectors
 }
 
 struct LongColumnStatsData {
- 1: required i64 lowValue,
- 2: required i64 highValue,
- 3: required i64 numNulls,
- 4: required i64 numDVs,
+1: required i64 lowValue,
+2: required i64 highValue,
+3: required i64 numNulls,
+4: required i64 numDVs,
 5: optional string bitVectors
- }
+}
 
 struct StringColumnStatsData {
- 1: required i64 maxColLen,
- 2: required double avgColLen,
- 3: required i64 numNulls,
- 4: required i64 numDVs,
+1: required i64 maxColLen,
+2: required double avgColLen,
+3: required i64 numNulls,
+4: required i64 numDVs,
 5: optional string bitVectors
- }
+}
 
 struct BinaryColumnStatsData {
- 1: required i64 maxColLen,
- 2: required double avgColLen,
- 3: required i64 numNulls
- }
+1: required i64 maxColLen,
+2: required double avgColLen,
+3: required i64 numNulls
+}
 
 struct Decimal {
 1: required binary unscaled,
@@ -168,43 +169,43 @@ union ColumnStatisticsData {
 }
 
 struct ColumnStatisticsObj {
- 1: required string colName,
- 2: required string colType,
- 3: required ColumnStatisticsData statsData
- }
+1: required string colName,
+2: required string colType,
+3: required ColumnStatisticsData statsData
+}
 
 struct ColumnStatisticsDesc {
- 1: required bool isTblLevel,
- 2: required string dbName,
- 3: required string tableName,
- 4: optional string partName,
- 5: optional i64 lastAnalyzed
- }
+1: required bool isTblLevel,
+2: required string dbName,
+3: required string tableName,
+4: optional string partName,
+5: optional i64 lastAnalyzed
+}
 
 struct ColumnStatistics {
- 1: required ColumnStatisticsDesc statsDesc,
- 2: required list statsObj;
- }
+1: required ColumnStatisticsDesc statsDesc,
+2: required list statsObj;
+}
 
 We propose to add the following Thrift APIs to persist, retrieve and delete column statistics:
 
 bool update_table_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1,
- 2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4)
- bool update_partition_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1,
- 2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4)
+2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4)
+bool update_partition_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1,
+2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4)
 
 ColumnStatistics get_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name) throws
- (1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidInputException o3, 4:InvalidObjectException o4)
- ColumnStatistics get_partition_column_statistics(1:string db_name, 2:string tbl_name, 3:string part_name,
- 4:string col_name) throws (1:NoSuchObjectException o1, 2:MetaException o2,
- 3:InvalidInputException o3, 4:InvalidObjectException o4)
+(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidInputException o3, 4:InvalidObjectException o4)
+ColumnStatistics get_partition_column_statistics(1:string db_name, 2:string tbl_name, 3:string part_name,
+4:string col_name) throws (1:NoSuchObjectException o1, 2:MetaException o2,
+3:InvalidInputException o3, 4:InvalidObjectException o4)
 
 bool delete_partition_column_statistics(1:string db_name, 2:string tbl_name, 3:string part_name, 4:string col_name) throws
- (1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
- 4:InvalidInputException o4)
- bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name) throws
- (1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
- 4:InvalidInputException o4)
+(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
+4:InvalidInputException o4)
+bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name) throws
+(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
+4:InvalidInputException o4)
 
 Note that delete_column_statistics is needed to remove the entries from the metastore when a table is dropped. Also note that currently Hive doesn’t support drop column.
diff --git a/content/Development/desingdocs/correlation-optimizer.md b/content/Development/desingdocs/correlation-optimizer.md
index 746a13cb..dc95a9cf 100644
--- a/content/Development/desingdocs/correlation-optimizer.md
+++ b/content/Development/desingdocs/correlation-optimizer.md
@@ -38,15 +38,15 @@ GROUP BY /*AGG2*/ tmp1.key;
 
 The original operator tree generated by Hive is shown below.
 
-![](/attachments/34019487/34177453.png)
-
- **Figure 1: The original operator tree of Example 1 generated by Hive**
+![](/attachments/34019487/34177453.png)
+
+**Figure 1: The original operator tree of Example 1 generated by Hive**
 
 This plan uses three MapReduce jobs to evaluate this query. However, `AGG1`, `JOIN1`, and `AGG2` all require the column `key` to be the partitioning column for shuffling the data. Thus, we do not need to shuffle the data in the same way three times. We only need to shuffle the data once, and thus a single MapReduce job is needed. The optimized operator tree is shown below.
 
-![](/attachments/34019487/34177454.png)
-
- **Figure 2: The optimized operator tree of Example 1**
+![](/attachments/34019487/34177454.png)
+
+**Figure 2: The optimized operator tree of Example 1**
 
 Since the input table of `AGG1` and the left table of `JOIN1` are both `t1`, when we use a single MapReduce job to evaluate this query, Hive only needs to scan `t1` once. While, in the original plan, `t1` is used in two MapReduce jobs, and thus it is scanned twice.
 
@@ -66,15 +66,15 @@ GROUP BY /*AGG2*/ t1.key;
 
 The original operator tree generated by Hive is shown below.
 
-![](/attachments/34019487/34177455.png)
-
- **Figure 3: The original operator tree of Example 2 generated by Hive**
+![](/attachments/34019487/34177455.png)
+
+**Figure 3: The original operator tree of Example 2 generated by Hive**
 
 This example is similar to Example 1. The optimized operator tree only needs a single MapReduce job, which is shown below.
 
-![](/attachments/34019487/34177456.png)
-
- **Figure 4: The optimized operator tree of Example 2**
+![](/attachments/34019487/34177456.png)
+
+**Figure 4: The optimized operator tree of Example 2**
 
 ### 2.3 Example 3
 
@@ -108,15 +108,15 @@ WHERE d.d_date >= '2001-05-01' and
 
 The original operator tree generated by Hive is shown below.
 
-![](/attachments/34019487/34177457.png)
-
- **Figure 5: The original operator tree of Example 3 generated by Hive**
+![](/attachments/34019487/34177457.png)
+
+**Figure 5: The original operator tree of Example 3 generated by Hive**
 
 In this complex query, we will first have several MapJoins (`MJ1`, `MJ2`, and `MJ3`) which can be evaluated in the same Map phase. Since `JOIN1`, `JOIN2`, `JOIN3`, and `JOIN4` use the same column as the join key, we can use a single MapReduce job to evaluate all operators before `AGG1`. The second MapReduce job will generate the final results. The optimized operator tree is shown below.
 
-![](/attachments/34019487/34177458.png)
-
- **Figure 6: The optimized operator tree of Example 3**
+![](/attachments/34019487/34177458.png)
+
+**Figure 6: The optimized operator tree of Example 3**
 
 ## 3. Intra-query Correlations
 
@@ -199,7 +199,3 @@ The umbrella jira is [HIVE-3667](https://issues.apache.org/jira/browse/HIVE-3667
 
 ![](images/icons/bullet_blue.gif)
 
-
-
-
-
diff --git a/content/Development/desingdocs/default-constraint.md b/content/Development/desingdocs/default-constraint.md
index 54ae4790..19c5f92a 100644
--- a/content/Development/desingdocs/default-constraint.md
+++ b/content/Development/desingdocs/default-constraint.md
@@ -28,7 +28,6 @@ CREATE TABLE will be updated to let user specify DEFAULT as follows:
 
 * With column definition
 + CREATE TABLE ( DEFAULT )
-
 * ~~With constraint specification~~
 + ~~CREATE TABLE ( , …, CONSTRAINT DEFAULT ()~~
 
@@ -44,7 +43,7 @@ To be compliant with SQL standards, Hive will only permit default values which f
 
 Anytime user doesn’t specify a value explicitly for a column, its default value will be used if defined. For example:
 
-`INSERT INTO (co1, col3) values( , )` 
+`INSERT INTO (co1, col3) values( , )`
 
 Above statement doesn’t specify a value for col2 so system will use the default value for col2 if it is defined.
 
@@ -97,7 +96,6 @@ Along with this logic change we foresee the following changes:
 * Metastore code will need to be updated to support the DEFAULT constraint.
 + We propose to store/serialize the default value as string after it is evaluated and constant folded.
 + DEFAULT_VALUE will be added to KEY_CONSTRAINTS table in metastore schema.
-
 * Hive Parser will need to be updated to allow new DEFAULT keyword with default value.
 * Error handling/Validation logic needs to be added to make sure DEFAULT value conforms to allowed categories during CREATE TABLE.
 * Type check to make sure DEFAULT VALUE type is compatible with corresponding column type.
@@ -106,7 +104,3 @@ Along with this logic change we foresee the following changes:
 
 [HIVE-19059](https://issues.apache.org/jira/browse/HIVE-19059) adds the keyword DEFAULT to enable users to add DEFAULT values in INSERT and UPDATE statements without specifying the column schema. See [DEFAULT Keyword (HIVE-19059)]({{< ref "default-keyword" >}}).
 
-
-
-
-
diff --git a/content/Development/desingdocs/default-keyword.md b/content/Development/desingdocs/default-keyword.md
index 7d220afe..41171bd2 100644
--- a/content/Development/desingdocs/default-keyword.md
+++ b/content/Development/desingdocs/default-keyword.md
@@ -36,7 +36,3 @@ Example:
 
 During first phase of AST analysis AST for DEFAULT will be replaced with corresponding DEFAULT value AST or NULL AST.
 
-
-
-
-
diff --git a/content/Development/desingdocs/dependent-tables.md b/content/Development/desingdocs/dependent-tables.md
index abc3633f..8e6f3723 100644
--- a/content/Development/desingdocs/dependent-tables.md
+++ b/content/Development/desingdocs/dependent-tables.md
@@ -7,11 +7,11 @@ date: 2024-12-12
 
 Hive supports both partitioned and unpartitioned external tables. In both cases, when a new table/partition is being added, the location is also specified for the new table/partition. Let us consider a specific example:
 
-create table T (key string, value string) partitioned by (ds string, hr string); 
+create table T (key string, value string) partitioned by (ds string, hr string);
 
-insert overwrite table T partition (ds='1', hr='1') ...; 
+insert overwrite table T partition (ds='1', hr='1') ...;
 
-.. 
+..
 
 insert overwrite table T partition (ds='1', hr='24') ...;
 
@@ -23,7 +23,7 @@ When all the hourly partitions are created for a day (ds='1'), the corresponding
 
 alter table Tsignal add partition (ds='1') location 'Location of T'/ds=1;
 
-There is a implicit dependency between Tsignal@ds=1 and T@ds=1/hr=1, T@ds=1/hr=2, .... T@ds=1/hr=24, but that dependency is not captured anywhere 
+There is a implicit dependency between Tsignal@ds=1 and T@ds=1/hr=1, T@ds=1/hr=2, .... T@ds=1/hr=24, but that dependency is not captured anywhere
 in the metastore. It would be useful to have an ability to explicitly create that dependency. This dependency can be used for all kinds of auditing purposes. For eg. when the following query is performed:
 
@@ -37,9 +37,9 @@ create dependency table Tdependent (key string, value string) partitioned by (ds
 
 This is like a external table but also captures the dependency (we can also enhance external tables for the same).
 
-alter table Tdependent add partition (ds='1') location '/T/ds=1' dependent partitions table T partitions (ds='1'); 
+alter table Tdependent add partition (ds='1') location '/T/ds=1' dependent partitions table T partitions (ds='1');
 
-specify the partial partition spec for the dependent partitions. 
+specify the partial partition spec for the dependent partitions.
 
 Note that each table can point to different locations - hive needs to ensure that all the dependent partitions are under the location 'T/ds=1'
 
@@ -47,59 +47,59 @@ Note that each table can point to different locations - hive needs to ensure tha
 
 The metastore can store the dependencies completely or partially.
 
-* + Materialize the dependencies both-ways
- - Tdependent@ds=1 depends on T@ds=1/hr=1 to T@ds=1/hr=24
- - T@ds=1/hr=1 is depended upon by T@ds=1
- - Advantages: if T@ds=1/hr=1 is dropped, T@ds=1 can be notified or it can choose to dis-allow this
- - Any property on Tdependent can be propagated to T
+* + Materialize the dependencies both-ways
+ + Tdependent@ds=1 depends on T@ds=1/hr=1 to T@ds=1/hr=24
+ + T@ds=1/hr=1 is depended upon by T@ds=1
+ + Advantages: if T@ds=1/hr=1 is dropped, T@ds=1 can be notified or it can choose to dis-allow this
+ + Any property on Tdependent can be propagated to T
 * + Is the dependency used for querying ? What happens if T@ds=1/hr=25 gets added ? The query 'select .. from Tdependent where ds = 1' includes T@ds=1/hr=25, but this is not shown in the inputs.
- + Dont use the location for querying - then why have the location ? 
+ + Dont use the location for querying - then why have the location ?
+* + Store partial dependencies
+ + Tdependent@ds=1 depends on T@ds=1 (spec).
+ + At describe time, the spec is evaluated and all the dependent partitions are computed dynamically. At add partition time, verify that the location captures all dependent partitions.
-* + Store partial dependencies
- - Tdependent@ds=1 depends on T@ds=1 (spec).
- - At describe time, the spec is evaluated and all the dependent partitions are computed dynamically. At add partition time, verify that the location captures all dependent partitions.
- - The partial spec is not used for querying - location is used for that. At query time, verify that the location captures all dependent partitions.
+ The partial spec is not used for querying - location is used for that. At query time, verify that the location captures all dependent partitions.
 * The dependent table does not have a location.
- + The list of partitions are computed at query time - think of it like a view, where each partition has its own definition limited to 'select * from T where partial/full partition spec'. Query layer needs to change. Is it possible ? Unlike a view, it does not rewritten at semantic analysis time. After partition pruning is done (on a dependent table), rewrite the
- - tree to contain the base table T - the columns remain the same, so it should be possible.
-With this, it is possible that the partitions point to different tables.
+ + The list of partitions are computed at query time - think of it like a view, where each partition has its own definition limited to 'select * from T where partial/full partition spec'. Query layer needs to change. Is it possible ? Unlike a view, it does not rewritten at semantic analysis time. After partition pruning is done (on a dependent table), rewrite the
+ + tree to contain the base table T - the columns remain the same, so it should be possible.
+
+With this, it is possible that the partitions point to different tables.
 
 For eg:
 
-alter table Tdependent add partition (ds='1') depends on table T1 partition (ds='1'); 
+alter table Tdependent add partition (ds='1') depends on table T1 partition (ds='1');
 
 alter table Tdependent add partition (ds='2') depends on table T2 partition (ds='2');
 
-Something that can be achieved by external tables currently. The dependent partitions are computed dynamically - T1@ds=1/hr=1 does not know the fact that it is dependent upon by Tdependent@ds=1. 
+Something that can be achieved by external tables currently. The dependent partitions are computed dynamically - T1@ds=1/hr=1 does not know the fact that it is dependent upon by Tdependent@ds=1.
 
-T1@ds=1/hr=1 can be dropped anytime, and Tdependent@ds=1 automatically stops depending upon it from that point. 
+T1@ds=1/hr=1 can be dropped anytime, and Tdependent@ds=1 automatically stops depending upon it from that point.
 
-I am leaning towards this - the user need not specify both the location and the dependent partitions. 
+I am leaning towards this - the user need not specify both the location and the dependent partitions.
 
 Can the external tables be enhanced to support this ? Will create a problem for the query layer, since the external tables are handled differently today.
 
-* + The list of dependent partitions are materialized and stored in the metastore, and use that for querying.
- - A query like 'select .. from Tdependent where ds = 1' gets transformed to 'select .. from (select * from T where ((ds = 1 and hr = 1) or (ds = 1 and hr = 2) .... or (ds=1 and hr=24))'
- - Can put a lot of load on the query layer.
+* + The list of dependent partitions are materialized and stored in the metastore, and use that for querying.
+ + A query like 'select .. from Tdependent where ds = 1' gets transformed to 'select .. from (select * from T where ((ds = 1 and hr = 1) or (ds = 1 and hr = 2) .... or (ds=1 and hr=24))'
 
-= Final Proposal = 
+ Can put a lot of load on the query layer.
+
+= Final Proposal =
 
 Instead of enhancing external tables to solve a convoluted usecase, create a new type of tables - dependent tables. The reason for the existence of external tables is to point to some existing data.
 
-== Dependent Tables == 
+== Dependent Tables ==
 
 Create a table which explicitly depends on another table. Consider the following scenario for the first use case mentioned:
 
@@ -168,7 +168,3 @@ The query on Tdep is re-written to access the underlying tables. For eg. for the
 
 * desc extended will be enhanced to show all the dependent partitions.
 * The dependency of an existing partition can be changed: alter table Tdep partition (ds=10) depends on table T2
 
-
-
-
-
diff --git a/content/Development/desingdocs/design.md b/content/Development/desingdocs/design.md
index fdb3d841..7bff3bb6 100644
--- a/content/Development/desingdocs/design.md
+++ b/content/Development/desingdocs/design.md
@@ -67,13 +67,13 @@ HiveQL is an SQL-like query language for Hive. It mostly mimics SQL syntax for c
 
 ## Optimizer
 
-More plan transformations are performed by the optimizer. The optimizer is an evolving component. As of 2011, it was rule-based and performed the following: column pruning and predicate pushdown. However, the infrastructure was in place, and there was work under progress to include other optimizations like map-side join. (Hive 0.11 added several [join optimizations](/docs/latest/language/languagemanual-joinoptimization).) 
- 
- The optimizer can be enhanced to be cost-based (see [Cost-based optimization in Hive](/docs/latest/user/cost-based-optimization-in-hive) and [HIVE-5775](https://issues.apache.org/jira/browse/HIVE-5775)). The sorted nature of output tables can also be preserved and used later on to generate better plans. The query can be performed on a small sample of data to guess the data distribution, which can be used to generate a better plan. 
- 
- A [correlation optimizer](/development/desingdocs/correlation-optimizer) was added in Hive 0.12. 
- 
- The plan is a generic operator tree, and can be easily manipulated.
+More plan transformations are performed by the optimizer. The optimizer is an evolving component. As of 2011, it was rule-based and performed the following: column pruning and predicate pushdown. However, the infrastructure was in place, and there was work under progress to include other optimizations like map-side join. (Hive 0.11 added several [join optimizations](/docs/latest/language/languagemanual-joinoptimization).)
+
+The optimizer can be enhanced to be cost-based (see [Cost-based optimization in Hive](/docs/latest/user/cost-based-optimization-in-hive) and [HIVE-5775](https://issues.apache.org/jira/browse/HIVE-5775)). The sorted nature of output tables can also be preserved and used later on to generate better plans. The query can be performed on a small sample of data to guess the data distribution, which can be used to generate a better plan.
+
+A [correlation optimizer](/development/desingdocs/correlation-optimizer) was added in Hive 0.12.
+
+The plan is a generic operator tree, and can be easily manipulated.
 
 ## Hive APIs
 
@@ -83,7 +83,3 @@ More plan transformations are performed by the optimizer. The optimizer is an ev
 
 ![](images/icons/bullet_blue.gif)
 
-
-
-
-
diff --git a/content/Development/desingdocs/designdocs.md b/content/Development/desingdocs/designdocs.md
index 332f491c..103a6ef3 100644
--- a/content/Development/desingdocs/designdocs.md
+++ b/content/Development/desingdocs/designdocs.md
@@ -33,7 +33,7 @@ Proposals that appear in the "Completed" and "In Progress" sections should inclu
 * [Optimizing Skewed Joins]({{< ref "skewed-join-optimization" >}}) ([HIVE-3086](https://issues.apache.org/jira/browse/HIVE-3086))
 * [Correlation Optimizer]({{< ref "correlation-optimizer" >}}) ([HIVE-2206](https://issues.apache.org/jira/browse/HIVE-2206))
 * [Hive on Tez]({{< ref "hive-on-tez" >}}) ([HIVE-4660](https://issues.apache.org/jira/browse/HIVE-4660))
- + [Hive-Tez Compatibility]({{< ref "hive-tez-compatibility" >}}) 
+ + [Hive-Tez Compatibility]({{< ref "hive-tez-compatibility" >}})
 * [Vectorized Query Execution]({{< ref "vectorized-query-execution" >}}) ([HIVE-4160](https://issues.apache.org/jira/browse/HIVE-4160))
 * [Cost Based Optimizer in Hive](/docs/latest/user/cost-based-optimization-in-hive) ([HIVE-5775](https://issues.apache.org/jira/browse/HIVE-5775))
 * [Atomic Insert/Update/Delete](https://issues.apache.org/jira/browse/HIVE-5317) ([HIVE-5317](https://issues.apache.org/jira/browse/HIVE-5317))
@@ -91,29 +91,19 @@ Proposals that appear in the "Completed" and "In Progress" sections should inclu
 
 ![](images/icons/bullet_blue.gif) [attachments/27362075/34177489.pdf](/attachments/27362075/34177489.pdf) (application/pdf)
-
 ![](images/icons/bullet_blue.gif) [attachments/27362075/35193076.pdf](/attachments/27362075/35193076.pdf) (application/pdf)
-
 ![](images/icons/bullet_blue.gif) [attachments/27362075/35193122.pdf](/attachments/27362075/35193122.pdf) (application/pdf)
-
 ![](images/icons/bullet_blue.gif) [attachments/27362075/35193191-html](/attachments/27362075/35193191-html) (text/html)
-
 ![](images/icons/bullet_blue.gif) [attachments/27362075/34177489.pdf](/attachments/27362075/34177489.pdf) (application/download)
-
 ![](images/icons/bullet_blue.gif) [attachments/27362075/55476344.pdf](/attachments/27362075/55476344.pdf) (application/download)
 
-
-
-
-
-
diff --git a/content/Development/desingdocs/different-timestamp-types.md b/content/Development/desingdocs/different-timestamp-types.md
index 74d3d3a2..25e1502e 100644
--- a/content/Development/desingdocs/different-timestamp-types.md
+++ b/content/Development/desingdocs/different-timestamp-types.md
@@ -39,11 +39,11 @@ Let's summarize the example of how the different semantics described above apply
 
 If the timestamp literal '1969-07-20 16:17:39' is inserted in Washington D.C. and then queried from Paris, it might be shown in the following ways based on timestamp semantics:
 
-| SQL type | Semantics | Result | Explanation |
-| --- | --- | --- | --- |
-| TIMESTAMP [WITHOUT TIME ZONE] | LocalDateTime | 1969-07-20 16:17:39 | Displayed like the original timestamp literal. |
-| TIMESTAMP WITH LOCAL TIME ZONE | Instant | 1969-07-20 21:17:39 | Differs from the original timestamp literal, but refers to the same time instant. |
-| TIMESTAMP WITH TIME ZONE | OffsetDateTime | 1969-07-20 16:17:39 (UTC -04:00) | Displayed like the original literal but showing the time zone offset as well. |
+| SQL type                       | Semantics      | Result                           | Explanation                                                                         |
+|--------------------------------|----------------|----------------------------------|-----------------------------------------------------------------------------------|
+| TIMESTAMP [WITHOUT TIME ZONE]  | LocalDateTime  | 1969-07-20 16:17:39              | Displayed like the original timestamp literal.                                     |
+| TIMESTAMP WITH LOCAL TIME ZONE | Instant        | 1969-07-20 21:17:39              | Differs from the original timestamp literal, but refers to the same time instant.  |
+| TIMESTAMP WITH TIME ZONE       | OffsetDateTime | 1969-07-20 16:17:39 (UTC -04:00) | Displayed like the original literal but showing the time zone offset as well.      |
 
 Of course, the different semantics do not only affect the textual representations but perhaps more importantly SQL function behavior as well. These allow users to take advantage of timestamps in different ways or to explicitly create different textual representations instead of the implicit ones shown above.
 
@@ -56,9 +56,3 @@ Reconstructible details:
 
 | TIMESTAMP WITH LOCAL TIME ZONE | Instant        | | ✓ | |
 | TIMESTAMP WITH TIME ZONE       | OffsetDateTime | ✓ | ✓ | ✓ |
 
-
-
-
-
-
-
diff --git a/content/Development/desingdocs/dynamicpartitions.md b/content/Development/desingdocs/dynamicpartitions.md
index 04675207..37e994c4 100644
--- a/content/Development/desingdocs/dynamicpartitions.md
+++ b/content/Development/desingdocs/dynamicpartitions.md
@@ -7,13 +7,13 @@ date: 2024-12-12
 
 ## Documentation
 
-This is the design document for dynamic partitions in Hive. Usage information is also available: 
+This is the design document for dynamic partitions in Hive. Usage information is also available:
 
 * [Tutorial: Dynamic-Partition Insert]({{< ref "#tutorial:-dynamic-partition-insert" >}})
 * [Hive DML: Dynamic Partition Inserts]({{< ref "#hive-dml:-dynamic-partition-inserts" >}})
 * [HCatalog Dynamic Partitioning]({{< ref "hcatalog-dynamicpartitions" >}})
- + [Usage with Pig]({{< ref "#usage-with-pig" >}}) 
- + [Usage from MapReduce]({{< ref "#usage-from-mapreduce" >}}) 
+ + [Usage with Pig]({{< ref "#usage-with-pig" >}})
+ + [Usage from MapReduce]({{< ref "#usage-from-mapreduce" >}})
 
 References:
 
@@ -27,7 +27,7 @@ References:
 
 ## Syntax
 
-DP columns are specified the same way as it is for SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.
+DP columns are specified the same way as it is for SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do.
In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns. In INSERT ... SELECT ... queries, the dynamic partition columns must be **specified last** among the columns in the SELECT statement and **in the same order** in which they appear in the PARTITION() clause. @@ -80,7 +80,7 @@ In INSERT ... SELECT ... queries, the dynamic partition columns must be **specif ``` -The above example shows the case of all DP columns in CTAS. If you want put some constant for some partitioning column, you can specify it in the select-clause. e.g, +The above example shows the case of all DP columns in CTAS. If you want to put a constant for some partitioning column, you can specify it in the select-clause, e.g., ``` @@ -101,9 +101,9 @@ The above example shows the case of all DP columns in CTAS. If you want put some ## Design issues - 1) Data type of the dynamic partitioning column: +1) Data type of the dynamic partitioning column: - A dynamic partitioning column could be the result of an expression. For example: +A dynamic partitioning column could be the result of an expression. For example: ``` @@ -112,23 +112,23 @@ The above example shows the case of all DP columns in CTAS. If you want put some ``` -Although currently there is not restriction on the data type of the partitioning column, allowing non-primitive columns to be partitioning column probably doesn't make sense. The dynamic partitioning column's type should be derived from the expression. The data type has to be able to be converted to a string in order to be saved as a directory name in HDFS. +Although currently there is no restriction on the data type of the partitioning column, allowing non-primitive columns to be partitioning columns probably doesn't make sense. The dynamic partitioning column's type should be derived from the expression. The data type has to be convertible to a string in order to be saved as a directory name in HDFS.
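For illustration only, a small Python sketch (ordinary Python with invented names, not Hive code) of the two rules above: dynamic partition values come from the trailing select-list columns, in PARTITION() clause order, and each value is rendered as a string to become an HDFS directory name.

```python
def partition_path(part_cols, row, n_data_cols):
    """Build the partition directory for one output row.

    row holds the data columns first, then one value per dynamic
    partition column, in the same order as the PARTITION() clause."""
    part_vals = row[n_data_cols:]
    assert len(part_vals) == len(part_cols), "every partition column needs a value"
    # each value is rendered as a string to form a directory component
    return "/".join("{}={}".format(col, val) for col, val in zip(part_cols, part_vals))

# e.g. partitioned by (ds string, dept int):
print(partition_path(["ds", "dept"], ("alice", 100, "2009-02-26", 2), 2))
# ds=2009-02-26/dept=2
```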
- 2) Partitioning column value to directory name conversion: +2) Partitioning column value to directory name conversion: - After converting column value to string, we still need to convert the string value to a valid directory name. Some reasons are: +After converting the column value to a string, we still need to convert the string value to a valid directory name. Some reasons are: * string length is unlimited in theory, but HDFS/local FS directory name length is limited. * string value could contains special characters that is reserved in FS path names (such as '/' or '..'). * what should we do for partition column ObjectInspector? - We need to define a UDF (say hive_qname_partition(T.part_col)) to take a primitive typed value and convert it to a qualified partition name. +We need to define a UDF (say hive_qname_partition(T.part_col)) to take a primitive typed value and convert it to a qualified partition name. - 3) Due to 2), this dynamic partitioning scheme qualifies as a hash-based partitioning scheme, except that we define the hash function to be as close as +3) Due to 2), this dynamic partitioning scheme qualifies as a hash-based partitioning scheme, except that we define the hash function to be as close as -the input value. We should allow users to plugin their own UDF for the partition hash function. Will file a follow up JIRA if there is sufficient interests. +possible to the input value. We should allow users to plug in their own UDF for the partition hash function. Will file a follow-up JIRA if there is sufficient interest. - 4) If there are multiple partitioning columns, their order is significant since that translates to the directory structure in HDFS: partitioned by (ds string, dept int) implies a directory structure of ds=2009-02-26/dept=2. In a DML or DDL involving partitioned table, So if a subset of partitioning columns are specified (static), we should throw an error if a dynamic partitioning column is lower.
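To make point 2) concrete, here is a toy sanitizer — an invented stand-in for the proposed hive_qname_partition() UDF, not Hive's actual escaping rules. It percent-escapes characters reserved in filesystem paths and truncates to an assumed per-component length limit.

```python
import string

SAFE_CHARS = set(string.ascii_letters + string.digits + "_.-")
MAX_COMPONENT_LEN = 255  # assumed limit for illustration; real FS limits vary

def sanitize_partition_value(value):
    """Render a primitive value as a directory-name-safe string."""
    escaped = "".join(
        c if c in SAFE_CHARS else "%{:02X}".format(ord(c))
        for c in str(value)
    )
    return escaped[:MAX_COMPONENT_LEN]

print(sanitize_partition_value("2009/02/26"))  # '/' is reserved -> 2009%2F02%2F26
```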
Example: +4) If there are multiple partitioning columns, their order is significant since that translates to the directory structure in HDFS: partitioned by (ds string, dept int) implies a directory structure of ds=2009-02-26/dept=2. In a DML or DDL involving a partitioned table, if a subset of the partitioning columns is specified (static), we should throw an error if a dynamic partitioning column comes before a static one. Example: ``` @@ -137,7 +137,3 @@ the input value. We should allow users to plugin their own UDF for the partition ``` - - - - diff --git a/content/Development/desingdocs/enabling-grpc-in-hive-metastore.md b/content/Development/desingdocs/enabling-grpc-in-hive-metastore.md index 30a80485..02076841 100644 --- a/content/Development/desingdocs/enabling-grpc-in-hive-metastore.md +++ b/content/Development/desingdocs/enabling-grpc-in-hive-metastore.md @@ -6,6 +6,7 @@ date: 2024-12-12 # Apache Hive : Enabling gRPC in Hive/Hive Metastore (Proposal) ## Contacts + Cameron Moberg (Google), Zhou Fang (Google), Feng Lu (Google), Thejas Nair (Cloudera), Vihang Karajgaonkar (Cloudera), Naveen Gangam (Cloudera) # Objective @@ -22,7 +23,7 @@ Hive Metastore is the central repository of Apache Hive (among others like [Pres [gRPC](https://grpc.io/) is a modern open source high performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in the last mile of distributed computing to connect devices, mobile applications and browsers to backend services. -Providing gRPC as an option to access Metastore brings us many benefits. Compared to Thrift, gRPC supports streaming that provides better performance for large requests. In addition, it is extensible to more advanced authentication features and is fully compatible with Google’s IAM service that supports fine grained permission checks.
A path to integrate gRPC with Hive Metastore is sketched out by this proposal. +Providing gRPC as an option to access Metastore brings us many benefits. Compared to Thrift, gRPC supports streaming that provides better performance for large requests. In addition, it is extensible to more advanced authentication features and is fully compatible with Google’s IAM service that supports fine grained permission checks. A path to integrate gRPC with Hive Metastore is sketched out by this proposal. # Design @@ -64,8 +65,6 @@ service HiveMetaStoreGrpc { } ``` - - Once the service methods are defined, the gRPC server can be created so it can be instantiated by the driver class. The *HiveMetaStoreGrpcServer.java* class signature would [implement the gRPC server](https://grpc.io/docs/languages/java/basics/#simple-rpc) interface like below: **class** @@ -104,7 +103,6 @@ Parsing *metastore.custom.server.class* is implemented in Hive Metastore reposit * *metastore.grpc.service.account.keyfile* - string, optional; the path to the JSON keyfile that the Metastore server will run as. * *metastore.grpc.authentication.class* - string, optional; the gRPC class will use this class to perform authn/authz against the gRPC requests. + The detailed implementation of auth support is not in scope for this design proposal. - * Additional gRPC server configs; maximal request size, max connections, port, etc. ### Hive Metastore Client @@ -146,7 +144,6 @@ Similar to the changes to the server config, a user can populate the following f * [a separate ASF licensed repo]: + Add an additional *HiveMetaStoreGrpcServer* class that implements the logic of the gRPC service methods that translates & explicitly calls the predefined Thrift implementation. 
+ Add an additional *HiveMetaStoreGrpcClient* class that implements [IMetaStoreClient](https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java) that opens a gRPC connection with the gRPC metastore server and translates Thrift API requests to gRPC and sends to server - * Separate logic of *HiveMetaStore.java* so that a gRPC server, in addition to the Thrift server, can be initialized to making a clear distinction between Thrift and gRPC implementations. * Add required configuration values and implement the dynamical gRPC class loading/instantiation wiring code inside Hive Metastore Server and Client (e.g., *SessionHiveMetastoreClient.java* and *RetryingMetastoreClient.java*). @@ -159,7 +156,3 @@ Similar to the changes to the server config, a user can populate the following f ![](images/icons/bullet_blue.gif) - - - - diff --git a/content/Development/desingdocs/filterpushdowndev.md b/content/Development/desingdocs/filterpushdowndev.md index fa74e597..435c01e2 100644 --- a/content/Development/desingdocs/filterpushdowndev.md +++ b/content/Development/desingdocs/filterpushdowndev.md @@ -54,7 +54,7 @@ Column names in this string are unqualified references to the columns of the tab As mentioned above, we want to avoid duplication in code which interprets the filter string (e.g. parsing). As a first cut, we will provide access to the `ExprNodeDesc` tree by passing it along in serialized form as an optional companion to the filter string. In followups, we will provide parsing utilities for the string form. We will also provide an IndexPredicateAnalyzer class capable of detecting simple [sargable](http://en.wikipedia.org/wiki/Sargable) - subexpressions in an `ExprNodeDesc` tree. In followups, we will provide support for discriminating and combining more complex indexable subexpressions. +subexpressions in an `ExprNodeDesc` tree. 
In followups, we will provide support for discriminating and combining more complex indexable subexpressions. ``` public class IndexPredicateAnalyzer @@ -174,11 +174,11 @@ x > 3 AND upper(y) = 'XYZ' ``` Suppose a storage handler is capable of implementing the range scanfor `x > 3`, but does not have a facility for evaluating {{upper(y) = - 'XYZ'}}. In this case, the optimal plan would involve decomposing the filter, pushing just the first part down into the storage handler, and - leaving only the remainder for Hive to evaluate via its own executor. +'XYZ'}}. In this case, the optimal plan would involve decomposing the filter, pushing just the first part down into the storage handler, and +leaving only the remainder for Hive to evaluate via its own executor. In order for this to be possible, the storage handler needs to be able to negotiate the decomposition with Hive. This means that Hive gives - the storage handler the entire filter, and the storage handler passes back a "residual": the portion that needs to be evaluated by Hive. A null residual indicates that the storage handler was able to deal with the entire filter on its own (in which case no `FilterOperator` is needed). +the storage handler the entire filter, and the storage handler passes back a "residual": the portion that needs to be evaluated by Hive. A null residual indicates that the storage handler was able to deal with the entire filter on its own (in which case no `FilterOperator` is needed). In order to support this interaction, we will introduce a new (optional) interface to be implemented by storage handlers: @@ -203,7 +203,3 @@ It is assumed that storage handlers which are sophisticated enough to implement Again, this interface is optional, and pushdown is still possible even without it. 
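The negotiation described above — the storage handler keeps the part of the filter it can evaluate and hands back a residual — can be sketched as follows. This is a toy model with invented names, not the actual decomposePredicate API: predicates are (column, op, constant) triples, and the handler accepts only simple comparisons on its own columns.

```python
PUSHABLE_OPS = {"=", "<", "<=", ">", ">="}

def decompose(conjuncts, handler_columns):
    """Split an AND-ed predicate list into a pushed part and a residual.

    A None residual means the handler dealt with the entire filter,
    so no FilterOperator is needed on the Hive side."""
    pushed, residual = [], []
    for col, op, val in conjuncts:
        if op in PUSHABLE_OPS and col in handler_columns:
            pushed.append((col, op, val))
        else:
            residual.append((col, op, val))  # e.g. function calls stay with Hive
    return pushed, residual or None

# x > 3 AND upper(y) = 'XYZ', where the handler only understands column x:
print(decompose([("x", ">", 3), ("upper(y)", "=", "XYZ")], {"x"}))
# ([('x', '>', 3)], [('upper(y)', '=', 'XYZ')])
```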
If the storage handler does not implement this interface, Hive will always implement the entire expression in the `FilterOperator`, but it will still provide the expression to the storage handler's input format; the storage handler is free to implement as much or as little as it wants. - - - - diff --git a/content/Development/desingdocs/groupbywithrollup.md b/content/Development/desingdocs/groupbywithrollup.md index 1050ec95..3aecbe10 100644 --- a/content/Development/desingdocs/groupbywithrollup.md +++ b/content/Development/desingdocs/groupbywithrollup.md @@ -18,13 +18,13 @@ Before the rollup option was added to the group by operator, there were 4 differ This plan remains the same, only the implementation of the map-side hash-based aggregation operator was modified to handle the extra rows needed for rollup. The plan is as follows: -Mapper: +Mapper: -*Hash-based group by operator to perform partial aggregations +*Hash-based group by operator to perform partial aggregations *Reduce sink operator, performs some partial aggregations -Reducer: +Reducer: *MergePartial (list-based) group by operator to perform final aggregations @@ -32,21 +32,21 @@ Reducer: Again, this plan remains the same, only the implementation of the map-side hash-based aggregation operator was modified to handle the extra rows needed for rollup. 
The plan is as follows: -Mapper 1: +Mapper 1: -*Hash-based group by operator to perform partial aggregations +*Hash-based group by operator to perform partial aggregations *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise -Reducer 1: +Reducer 1: *Partials (list-based) group by operator to perform further partial aggregations -Mapper 2: +Mapper 2: *Reduce sink operator, performs some partial aggregations -Reducer 2: +Reducer 2: *Final (list-based) group by operator to perform final aggregations @@ -56,11 +56,11 @@ Note that if there are no group by keys or distinct keys, Reducer 1 and Mapper 2 This plan is the case from pre-rollup version of group by where there is no Map Aggr and No Skew, I included it for completeness as it remains an option if rollup is not used. The plan is as follows: -Mapper: +Mapper: *Reduce sink operator, performs some partial aggregations -Reducer: +Reducer: *Complete (list-based) group by operator to perform all aggregations @@ -68,19 +68,19 @@ Reducer: The plan is as follows: -Mapper 1: +Mapper 1: *Reduce sink operator, does not perform any partial aggregations -Reducer 1: +Reducer 1: -*Hash-based group by operator, much like the one used in the mappers of previous cases +*Hash-based group by operator, much like the one used in the mappers of previous cases -Mapper 2: +Mapper 2: *Reduce sink operator, performs some partial aggregations -Reducer 2: +Reducer 2: *MergePartial (list-based) group by operator to perform remaining aggregations @@ -88,19 +88,19 @@ Reducer 2: This plan is the same as was used for the case of No Map Aggr and Skew in the pre-rollup version of group by, for this cads when rollup is not used, or none of the aggregations make use of a distinct key. The implementation of the list-based group by operator was modified to handle the extra rows required for rollup if rollup is being used. 
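The "extra rows needed for rollup" that these plans keep referring to can be sketched in a few lines of illustrative Python (not the operator implementation): for group-by keys (a, b), ROLLUP aggregates every key prefix, padding the remaining positions with NULLs.

```python
def rollup_keys(group_key):
    """All aggregation keys a ROLLUP of this row contributes to,
    from the full key down to the grand total."""
    n = len(group_key)
    return [group_key[:i] + (None,) * (n - i) for i in range(n, -1, -1)]

print(rollup_keys(("2009-02-26", 2)))
# [('2009-02-26', 2), ('2009-02-26', None), (None, None)]
```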
The plan is as follows: -Mapper 1: +Mapper 1: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise -Reducer 1: +Reducer 1: *Partial1 (list-based) group by operator to perform partial aggregations, it makes use of the new list-based group by operator implementation for rollup if necessary -Mapper 2: +Mapper 2: *Reduce sink operator, performs some partial aggregations -Reducer 2: +Reducer 2: *Final (list-based) group by operator to perform remaining aggregations @@ -108,27 +108,27 @@ Reducer 2: This plan is used when there is No Map Aggr and Skew and there is an aggregation that involves a distinct key and rollup is being used. The plan is as follows: -Mapper 1: +Mapper 1: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise -Reducer 1: +Reducer 1: -*Hash-based group by operator, much like the one used in the mappers of previous cases +*Hash-based group by operator, much like the one used in the mappers of previous cases -Mapper 2: +Mapper 2: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise -Reducer 2: +Reducer 2: *Partials (list-based) group by operator to perform further partial aggregations -Mapper 3: +Mapper 3: *Reduce sink operator, performs some partial aggregations -Reducer 3: +Reducer 3: *Final (list-based) group by operator to perform final aggregations @@ -138,3 +138,4 @@ Note that if there are no group by keys or distinct keys, Reducer 2 and Mapper 3 * [Original design doc](https://issues.apache.org/jira/secure/attachment/12437909/dp_design.txt) * [HIVE-2397](https://issues.apache.org/jira/browse/HIVE-2397) + diff --git a/content/Development/desingdocs/hadoop-compatible-input-output-format-for-hive.md b/content/Development/desingdocs/hadoop-compatible-input-output-format-for-hive.md index 1c738837..92e8bfc0 100644 --- 
a/content/Development/desingdocs/hadoop-compatible-input-output-format-for-hive.md +++ b/content/Development/desingdocs/hadoop-compatible-input-output-format-for-hive.md @@ -44,7 +44,3 @@ Usage: 3. Initialize HiveApiOutputFormat with the information. 4. Go to town using HiveApiOutputFormat with your Hadoop-compatible writing system. - - - - diff --git a/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md b/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md index 9cef21f3..7f2bdeb7 100644 --- a/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md +++ b/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md @@ -7,58 +7,41 @@ date: 2024-12-12 (Apologies for this doc being organized properly, I thought something is better than nothing - Thejas) - This is part of metastore on hbase work -  [![](https://issues.apache.org/jira/secure/viewavatar?size=xsmall&avatarId=21140&avatarType=issuetype)HIVE-9452](https://issues.apache.org/jira/browse/HIVE-9452?src=confmacro) - - - Use HBase to store Hive metadata +- +Use HBase to store Hive metadata Open Functionality needed - - -RawStore functions that support partition filtering are the following - +RawStore functions that support partition filtering are the following - * getPartitionsByExpr * getPartitionsByFilter (takes filter string as argument, used from hcatalog) - - We need to generate a query execution plan in terms of Hbase scan api calls for a given filter condition. - - ## Notes about the api to be supported - - getPartitionsByExpr - Current partition expression evaluation path ExprNodeGenericFuncDesc represents the partition filter expression in the plan 1. It is serialized into byte[] and Metastore api is invoked with the byte[]. 2. ObjectStore processing of expression - -1. deserializes the byte[], prints it to convert it to Filter string -2. 
Converts Filter string to ExpressionTree using parser (Filter.g) -3. Walk ExpressionTree to create sql query (in direct sql) - - +3. deserializes the byte[], prints it to convert it to Filter string +4. Converts Filter string to ExpressionTree using parser (Filter.g) +5. Walk ExpressionTree to create sql query (in direct sql) getPartitionsByFilter - Evaluation of it is similar, it just skips the steps required to create the filter string. We certainly need the ability to work with filter string to support this function. - - Why do we convert from ExprNodeGenericFuncDesc to kryo serialized byte[] and not to the filter string ? - - Filter expressions supported currently - Leaf Operators : =, >, <, <=, >=, LIKE, != - - Logical Operators : AND, OR +Leaf Operators : =, >, <, <=, >=, LIKE, != - +Logical Operators : AND, OR Partition table in hbase @@ -66,24 +49,10 @@ Partition information is stored in with the key as a delimited string consisting The value contains rest of the partition information. (side note: do we need the partition values in the value part?) - - - - - - - - - - # Implementation - - Serialization format of partition table key in hbase - - Desirable properties for key serialization format - 1. It should be possible to perform filter operations on the keys without deserializing the fields (LIKE operator is not common, so its ok if we have to deserialize for that one) @@ -92,14 +61,9 @@ Desirable properties for key serialization format - BinarySortableSerDe satisfies these requirements except for number 3. Meeting requirement 3 might need some index information to be stored in end of the serialized key. - - Limitations with current storage format (no secondary keys) -If there are multiple partition keys for a table, and partition filter condition does not have a condition on the first partition key, we would end up scanning all partitions for the table to find the matches. For this case, we need support for secondary indexes on the table. 
While we could implement this using a second table, the lack of support for atomic operations across rows/tables is a problem. We would need some level of transaction support in hbase to be able to create secondary indexes reliably. - - - +If there are multiple partition keys for a table, and partition filter condition does not have a condition on the first partition key, we would end up scanning all partitions for the table to find the matches. For this case, we need support for secondary indexes on the table. While we could implement this using a second table, the lack of support for atomic operations across rows/tables is a problem. We would need some level of transaction support in hbase to be able to create secondary indexes reliably. Filtering the partitions @@ -109,13 +73,8 @@ The hbase api’s used will depend on the filtering condition - 2. In case of more complex queries involving additional partition columns, we need to use a scan filter with conditions on remaining columns as well. ie, new Scan(byte[] startRow, byte[] stopRow) + Scan.setFilter(..) 3. If there are no conditions on the first partition column, then all partitions on the table would need to be scanned. In that case, start and end rows will be based only on the db+table prefix of the key. - - Filters with top level “OR” conditions - Each of the conditions under OR should be evaluated to see which of the above api call pattern suits them. If any one of the conditions requires no 3 call pattern, it makes sense to represent the entire filter condition using api call pattern 3. - - - Examples of conversion of query plan to hbase api calls * merge function below does a set-union @@ -123,9 +82,6 @@ Examples of conversion of query plan to hbase api calls * The scan(startRow, endRow) scans from startRow to row before endRow. ie, it represents rows where (r >= startRow and r < endRow). 
But it can be made to represent (r > startRow) by adding a zero byte to startRow, and made to represent (r <= endRow) by adding zero byte to endRow. ie, the plans for >= and > are similar, <= and = are similar. * All keys corresponding to a partitions of a table have a common prefix of “db + tablename”. That is referred to as “X” in following examples. - - -   | Filter expression | HBase calls | @@ -147,106 +103,75 @@ Examples of conversion of query plan to hbase api calls   - - - Relevant classes : - - Input: ExpressionTree (existing) - TreeNodes for AND/OR expressions. Leaf Node for leaf expressions with  =,< ... - - Output: -  public static abstract class FilterPlan { + public static abstract class FilterPlan { -    abstract FilterPlan and(FilterPlan other); +   abstract FilterPlan and(FilterPlan other); -    abstract FilterPlan or(FilterPlan other); +   abstract FilterPlan or(FilterPlan other); -    abstract List getPlans(); +   abstract List getPlans(); -  } - - + } // represents a union of multiple ScanPlan MultiScanPlan extends FilterPlan - - - ScanPlan extends FilterPlan -    // represent Scan start - -    private ScanMarker startMarker ; +   // represent Scan start -    // represent Scan end +   private ScanMarker startMarker ; -    private ScanMarker endMarker ; +   // represent Scan end -    private ScanFilter filter; +   private ScanMarker endMarker ; - +   private ScanFilter filter; public FilterPlan and(FilterPlan other) { - // calls this.and(otherScanPlan) on each scan plan in other +// calls this.and(otherScanPlan) on each scan plan in other } private ScanPlan and(ScanPlan other) { -   // combines start marker and end marker and filters of this and other +  // combines start marker and end marker and filters of this and other } public FilterPlan or(FilterPlan other) { -   // just create a new FilterPlan from other, with this additional plan +  // just create a new FilterPlan from other, with this additional plan } - - - PartitionFilterGenerator - -  
/** - -   * Visitor for ExpressionTree. + /** -   * It first generates the ScanPlan for the leaf nodes. The higher level nodes are +  * Visitor for ExpressionTree. -   * either AND or OR operations. It then calls FilterPlan.and and FilterPlan.or with +  * It first generates the ScanPlan for the leaf nodes. The higher level nodes are -   * the child nodes to generate the plans for higher level nodes. +  * either AND or OR operations. It then calls FilterPlan.and and FilterPlan.or with -   */ +  * the child nodes to generate the plans for higher level nodes. - - - +  */ Initial implementation: Convert from from ExpressionTree to Hbase filter, thereby implementing both getPartitionsByFilter and getPartitionsByExpr - - -A new custom Filter class implementation needs to be created. Filter class implements Writable, and the hbase expression to be evaluated is serialized - - +A new custom Filter class implementation needs to be created. Filter class implements Writable, and the hbase expression to be evaluated is serialized We can potentially create the filter directly from ExprNodeGenericFuncDesc in case of the new fastpath config is set. - - - - - - diff --git a/content/Development/desingdocs/hbasebulkload.md b/content/Development/desingdocs/hbasebulkload.md index d4a0ff43..bb32055f 100644 --- a/content/Development/desingdocs/hbasebulkload.md +++ b/content/Development/desingdocs/hbasebulkload.md @@ -229,7 +229,3 @@ TBLPROPERTIES("hbase.table.name" = "transactions"); * Support loading into existing tables once HBASE-1923 is implemented * Wrap it all up into the ideal single-INSERT-with-auto-sampling job... 
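The FilterPlan pseudocode earlier in this diff (a ScanPlan with start/end markers, and MultiScanPlan as a union of scans) reduces to interval arithmetic on row keys. A minimal sketch with invented names, where a range is a half-open [start, end) pair:

```python
def and_ranges(a, b):
    """AND of two ScanPlans: intersect their [start, end) key ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end)  # empty whenever start >= end

def or_plans(plans_a, plans_b):
    """OR: a MultiScanPlan is simply the union of the individual scans."""
    return plans_a + plans_b

print(and_ranges(("k3", "k9"), ("k5", "kZ")))  # ('k5', 'k9')
```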
- - - - diff --git a/content/Development/desingdocs/hbasemetastoredevelopmentguide.md b/content/Development/desingdocs/hbasemetastoredevelopmentguide.md index 5ebefd55..e5f17d92 100644 --- a/content/Development/desingdocs/hbasemetastoredevelopmentguide.md +++ b/content/Development/desingdocs/hbasemetastoredevelopmentguide.md @@ -23,18 +23,17 @@ Once you’ve built the code from the HBase metastore branch (hbase-metastore), 1. Install HBase, preferably HBase 1.1.1 as that’s what is being used for testing. 2. Copy following jars into $HBASE_HOME/lib - 1. hive-common-.*.jar - 2. hive-metastore-.*.jar - 3. hive-serde-.*.jar + 1. hive-common-.*.jar + 2. hive-metastore-.*.jar + 3. hive-serde-.*.jar 3. Setup HBase,   I run it in stand alone mode, so you have to set a couple of values in hbase-site.xml for this to work. 4. Set HADOOP_HOME if you’re not in a cluster where hadoop is already on your path. 5. Start HBase: $HBASE_HOME/bin/start-hbase.sh 6. Set it up so that HBase jars and conf file are picked up by Hive -1. export HIVE_AUX_JARS_PATH=$HBASE_HOME/lib/ -2. export AUX_CLASSPATH=$HBASE_HOME/conf - -8. Create the metastore tables in HBase: hive --service hbaseschematool --install -9. Configure Hive to use HBase as its metastore, in hive-site.xml: +7. export HIVE_AUX_JARS_PATH=$HBASE_HOME/lib/ +8. export AUX_CLASSPATH=$HBASE_HOME/conf +9. Create the metastore tables in HBase: hive --service hbaseschematool --install +10. 
Configure Hive to use HBase as its metastore, in hive-site.xml: ``` @@ -67,8 +66,6 @@ The following command will import the metadata from the rdbms to hbase: hive --service hbaseimport  - - # Design Docs [Overall Approach](https://issues.apache.org/jira/secure/attachment/12697601/HBaseMetastoreApproach.pdf) @@ -77,7 +74,3 @@ hive --service hbaseimport    - - - - diff --git a/content/Development/desingdocs/hive-metadata-caching-proposal.md b/content/Development/desingdocs/hive-metadata-caching-proposal.md index 712757e3..44b0288b 100644 --- a/content/Development/desingdocs/hive-metadata-caching-proposal.md +++ b/content/Development/desingdocs/hive-metadata-caching-proposal.md @@ -11,7 +11,7 @@ During Hive 2 benchmark, we find Hive metastore operation take a lot of time and ## Server side vs client side cache -We are thinking about two possible locations of cache. One is on metastore client side, the other is on metastore server side. Both client side and server side cache needs to be a singleton and shared within the JVM. Let’s take Metastore server side cache as an example and illustrated below: +We are thinking about two possible locations of cache. One is on metastore client side, the other is on metastore server side. Both client side and server side cache needs to be a singleton and shared within the JVM. Let’s take Metastore server side cache as an example and illustrated below: ![ms2.png](https://lh6.googleusercontent.com/qtleiaHa_6m5Qv8VdvdVzAO23lThXljKODtu0uNJDanrRteYOfR-ss6HhBnByFz4XjmYbXUzqKRRExgM1t56xrBUP2sEwVsncMTT2zVrxwlI-63NMQUeqCErWN4DRkTz7wEHmn_5) @@ -19,8 +19,6 @@ Here we show two HiveServer2 instances using a single remote metastore. The meta ![ms1.png](https://lh6.googleusercontent.com/yDtScj5Ls99DYNBW6Z5KAqxFQscGsnfSoT7o20TZkA4OYYoiaFdJjqKwBa437pmygEx72e7KWmkeqFm-0Z2I2c-sWeYYi8YdAU1oSiCIPVOPDPhB8yNpGepO1jbgH0kE7Bq8_8KR) - - On the other hand, Metastore client side lives in client JVM and will go away once the client is gone. 
![ms3.png](https://lh3.googleusercontent.com/tlk0QJCfWASyjKpDkwdU1qur5f1DI0MIFuVgy-3vrmiXFtOe7ztZQH7gA-PkpW58FAFnNrVkfhd7FtywPzl2wCTvhD9rj2vFuykUArz6XDBw1zmRNF2oKaBQQgs51Vvio1GOdwsN) @@ -39,8 +37,8 @@ To address this issue, we envision several approaches to invalidate stale cache 2. Metastore has an event log (currently used for implementing replication v2). The event log captures all the changes to the metadata object. So we shall be able to monitor the event log on every cache instance and invalidate changed entries ( [![](https://issues.apache.org/jira/secure/viewavatar?size=xsmall&avatarId=21146&avatarType=issuetype)](https://issues.apache.org/jira/browse/HIVE-18056?src=confmacro) - - - CachedStore: Have a whitelist/blacklist config to allow selective caching of tables/partitions and allow read while prewarming +- +CachedStore: Have a whitelist/blacklist config to allow selective caching of tables/partitions and allow read while prewarming Closed ). This might have a minor lag due to the event propagation, but that should be much shorter than the cache eviction. 
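The event-log invalidation scheme described above can be sketched as follows. This is a minimal, illustrative Python sketch (Hive's actual implementation is Java, and the event shape and helper names here are assumptions, not Hive's API): each cache instance remembers the last event id it processed and evicts entries touched by newer events.

```python
# Hypothetical event-log-driven invalidation: each cache instance remembers the
# last event id it applied and invalidates entries changed by newer events.
def apply_events(cache, event_log, last_seen_id):
    for event in event_log:
        if event["id"] <= last_seen_id:
            continue  # already applied on a previous poll
        if event["op"] in ("ALTER_TABLE", "DROP_TABLE"):
            cache.pop(event["table"], None)  # invalidate the stale entry
        last_seen_id = event["id"]
    return last_seen_id

cache = {"db.t1": "v1", "db.t2": "v1"}
log = [{"id": 10, "op": "ALTER_TABLE", "table": "db.t1"},
       {"id": 20, "op": "CREATE_TABLE", "table": "db.t3"}]
last = apply_events(cache, log, last_seen_id=5)
assert last == 20 and "db.t1" not in cache and "db.t2" in cache
```

Because the event ids are monotonically increasing, each instance can poll independently; the propagation lag mentioned above is the polling interval.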
@@ -61,13 +59,11 @@ Presto has the following cache: + partitionCache + userRolesCache + userTablePrivileges - * Range scan cache + databaseNamesCache: regex -> database names, facilitates database search + tableNamesCache + viewNamesCache + partitionNamesCache: table name -> partition names - * Other + partitionFilterCache: PS -> partition names, facilitates partition pruning @@ -75,43 +71,43 @@ For every partition filter condition, Presto breaks it down into tupleDomain and AddExchanges.planTableScan: -            DomainTranslator.ExtractionResult decomposedPredicate = DomainTranslator.fromPredicate( +           DomainTranslator.ExtractionResult decomposedPredicate = DomainTranslator.fromPredicate( -                    metadata, +                   metadata, -                    session, +                   session, -                    deterministicPredicate, +                   deterministicPredicate, -                    types); +                   types); -    public static class ExtractionResult +   public static class ExtractionResult -    { +   { -        private final TupleDomain tupleDomain; +       private final TupleDomain tupleDomain; -        private final Expression remainingExpression; +       private final Expression remainingExpression; -    } +   } tupleDomain is a mapping of column -> range or exact value. 
When converting to PS, any range will be converted into a wildcard and only exact values will be considered:

HivePartitionManager.getFilteredPartitionNames:

-        for (HiveColumnHandle partitionKey : partitionKeys) {
+       for (HiveColumnHandle partitionKey : partitionKeys) {
-            if (domain != null && domain.isNullableSingleValue()) {
+           if (domain != null && domain.isNullableSingleValue()) {
-                    filter.add(((Slice) value).toStringUtf8());
+                   filter.add(((Slice) value).toStringUtf8());
-            else {
+           else {
-                filter.add(PARTITION_VALUE_WILDCARD);
+               filter.add(PARTITION_VALUE_WILDCARD);
-            }
+           }
-        }
+       }

For example, the expression “state = CA and date between ‘201612’ and ‘201701’” will be broken down to PS (state = CA) and remainder date between ‘201612’ and ‘201701’. Presto will retrieve the partitions with state = CA from the PS -> partition name cache and partition object cache, and evaluate “date between ‘201612’ and ‘201701’” for every partition returned. This is a good balance compared to caching partition names for every expression.

@@ -122,8 +118,8 @@ Our design is a metastore server side cache and we will do metastore invalidation
 Further, in our design, the metastore will read all metastore objects once at startup time (prewarm) and there is no eviction of the metastore objects ever since. The only time we change the cache is when a user requests a change through the metastore client (eg, alter table, alter partition), and upon receiving a metastore event of changes made by another metastore server. Note that during prewarm (which can take a long time if the metadata size is large), we will allow the metastore to serve requests. If a table has already been cached, the requests for that table (and its partitions and statistics) can be served from the cache.
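Returning to the Presto decomposition described above, the same idea can be sketched in a few lines of Python. The types and names below are illustrative, not Presto's API: exact single values go into the partition-name filter, while ranges degrade to wildcards and remain in the remaining expression.

```python
# Sketch of decomposing a partition filter into a partition-name filter (PS)
# plus a remainder, as described for Presto above. All names are illustrative.
WILDCARD = "*"

def to_partition_filter(partition_keys, domains):
    """domains: column -> ("eq", value) or ("range", lo, hi)."""
    filter_values, remainder = [], []
    for key in partition_keys:
        domain = domains.get(key)
        if domain and domain[0] == "eq":
            filter_values.append(domain[1])       # exact value: usable in the PS
        else:
            filter_values.append(WILDCARD)        # range or absent: wildcard
            if domain:
                remainder.append((key, domain))   # evaluated per returned partition
    return filter_values, remainder

f, rem = to_partition_filter(
    ["state", "date"],
    {"state": ("eq", "CA"), "date": ("range", "201612", "201701")})
assert f == ["CA", "*"]
assert rem == [("date", ("range", "201612", "201701"))]
```

The PS `["CA", "*"]` keys the partition-name cache, and the remainder is evaluated against each partition object fetched from the partition cache.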
If the table has not been prewarmed yet, the requests for that table will be served from the database ( [![](https://issues.apache.org/jira/secure/viewavatar?size=xsmall&avatarId=21146&avatarType=issuetype)](https://issues.apache.org/jira/browse/HIVE-18264?src=confmacro) - - - CachedStore: Store cached partitions/col stats within the table cache and make prewarm non-blocking +- +CachedStore: Store cached partitions/col stats within the table cache and make prewarm non-blocking Resolved ). @@ -131,8 +127,8 @@ Resolved Currently, the size of the metastore cache can be restricted by a combination of cache whitelist and blacklist patterns ( [![](https://issues.apache.org/jira/secure/viewavatar?size=xsmall&avatarId=21146&avatarType=issuetype)](https://issues.apache.org/jira/browse/HIVE-18056?src=confmacro) - - - CachedStore: Have a whitelist/blacklist config to allow selective caching of tables/partitions and allow read while prewarming +- +CachedStore: Have a whitelist/blacklist config to allow selective caching of tables/partitions and allow read while prewarming Closed ). Before a table is cached, it is checked against these filters to decide if it can be cached or not. Similarly, when a table is read, if it does not pass the above filters, it is read from the database and not the cache. 
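One possible reading of the whitelist/blacklist check described above is sketched below in Python. The exact pattern semantics are an assumption (see HIVE-18056 for the real configuration); this only illustrates the decision of whether a table may enter the cache.

```python
import re

# Illustrative whitelist/blacklist check; pattern semantics are an assumption,
# not Hive's actual config behaviour.
def should_cache(table_name, whitelist_patterns, blacklist_patterns):
    if any(re.fullmatch(p, table_name) for p in blacklist_patterns):
        return False  # blacklisted tables are never cached
    if whitelist_patterns:
        return any(re.fullmatch(p, table_name) for p in whitelist_patterns)
    return True  # an empty whitelist means "cache everything not blacklisted"

assert should_cache("sales.orders", [r"sales\..*"], [r".*\.tmp_.*"])
assert not should_cache("sales.tmp_scratch", [r"sales\..*"], [r".*\.tmp_.*"])
assert not should_cache("hr.people", [r"sales\..*"], [])
```

A table failing this check is read from the database on every request, exactly as the text describes.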
@@ -145,12 +141,12 @@ In our experiments, we adopted some memory optimizations discussed below, which   -| | | | -| --- | --- | --- | -| object | count | Avg size (byte) | -| table | 895 | 1576 | -| partition | 97,863 | 591 | -| storagedescriptor | 412 | 680 | +| | | | +|-------------------|--------|-----------------| +| object | count | Avg size (byte) | +| table | 895 | 1576 | +| partition | 97,863 | 591 | +| storagedescriptor | 412 | 680 |   @@ -188,8 +184,8 @@ For local metastore request that changes an object, such as alter table/alter pa For remote metastore updates, we will either use a periodical synchronization (current approach), or monitor event log and fetch affected objects from SQL database ( [![](https://issues.apache.org/jira/secure/viewavatar?size=xsmall&avatarId=21146&avatarType=issuetype)HIVE-18661](https://issues.apache.org/jira/browse/HIVE-18661?src=confmacro) - - - CachedStore: Use metastore notification log events to update cache +- +CachedStore: Use metastore notification log events to update cache Resolved ). Both options are discussed already in “Cache Consistency” section. @@ -199,8 +195,8 @@ Resolved We already have aggregated stats module in ObjectStore ( [![](https://issues.apache.org/jira/secure/viewavatar?size=xsmall&avatarId=21140&avatarType=issuetype)HIVE-10382](https://issues.apache.org/jira/browse/HIVE-10382?src=confmacro) - - - Aggregate stats cache for RDBMS based metastore codepath +- +Aggregate stats cache for RDBMS based metastore codepath Closed ). However, the base column statistics is not cached and needs to fetch from SQL database everytime needed. We plan to port aggregated stats module to CachedStore to use cached column statistics to do the calculation. One design choice yet to make is whether we need to cache aggregated stats, or calculate them on the fly in the CachedStore assuming all column stats are in memory. 
But in either case, once we turn on aggregate stats in CachedStore, we shall turn it off in ObjectStore (we already have a switch) so we don’t do it twice.

@@ -211,7 +207,7 @@ This is one of the most important operations in Hive metastore we want to optimize.

 ### Architecture

-CachedStore will implement a RawStore interface. CachedStore internally wraps a real RawStore implementation which could be anything (either ObjectStore, or HBaseStore). In HiveServer2 embedded metastore or standalone metastore setting, we will set hive.metastore.rawstore.impl to CachedStore, and hive.metastore.cached.rawstore.impl (the wrapped RawStore) to ObjectStore. If we are using HiveCli with embedded metastore, we might want to skip CachedStore since we might not want prewarm latency.
+CachedStore will implement the RawStore interface. CachedStore internally wraps a real RawStore implementation, which could be anything (either ObjectStore or HBaseStore). In a HiveServer2 embedded metastore or standalone metastore setting, we will set hive.metastore.rawstore.impl to CachedStore, and hive.metastore.cached.rawstore.impl (the wrapped RawStore) to ObjectStore. If we are using HiveCli with an embedded metastore, we might want to skip CachedStore since we might not want prewarm latency.

 ### Potential Issues

@@ -227,9 +223,3 @@ There may be some potential issues or unimplemented features in the initial version

 In our design, we sacrifice prewarm time and memory footprint in exchange for simplicity and better runtime performance. By monitoring the event queue, we can solve the remote metastore consistency issue, which is missing in Presto. At the architecture level, CachedStore is a lightweight cache layer wrapping the real RawStore; with this design, there is nothing preventing us from implementing alternative cache strategies in addition to our current approach.
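The wrapper architecture above can be sketched as a simple decorator over a backing store. This is an illustrative Python sketch, not Hive's Java RawStore interface; the class names merely mirror the description.

```python
# Decorator-style sketch of CachedStore wrapping a backing store, as described
# above. Class and method names are illustrative, not Hive's actual interfaces.
class ObjectStoreStub:
    """Stands in for the real RawStore backed by the RDBMS."""
    def __init__(self):
        self.reads = 0
        self._tables = {"db.t1": {"owner": "hive"}}

    def get_table(self, name):
        self.reads += 1          # counts round-trips to the "database"
        return self._tables.get(name)

class CachedStore:
    """Wraps any RawStore-like object and serves repeated reads from memory."""
    def __init__(self, raw_store):
        self._raw = raw_store
        self._cache = {}

    def get_table(self, name):
        if name not in self._cache:              # miss: fall through to the real store
            self._cache[name] = self._raw.get_table(name)
        return self._cache[name]

raw = ObjectStoreStub()
store = CachedStore(raw)
store.get_table("db.t1")
store.get_table("db.t1")
assert raw.reads == 1  # the second lookup was served from the cache
```

Because the wrapped store is injected, swapping ObjectStore for another implementation (or another cache strategy) requires no change to callers, which is the flexibility the conclusion above points at.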
- - - - - - diff --git a/content/Development/desingdocs/hive-on-tez.md b/content/Development/desingdocs/hive-on-tez.md index b8119ec8..f1f9bcbd 100644 --- a/content/Development/desingdocs/hive-on-tez.md +++ b/content/Development/desingdocs/hive-on-tez.md @@ -67,9 +67,9 @@ Limiting the integration to the fairly simple MRR/MPJ pattern will require minim # Functional requirements of phase I * Hive continues to work **as is** on clusters that do not have TEZ. - + MR revisions 20, 20S, 23 continue to work unchanged. + + MR revisions 20, 20S, 23 continue to work unchanged. * Hive can optionally submit MR jobs to TEZ without any additional improvements. - + Hive can treat TEZ like just another Hadoop 23 instance. + + Hive can treat TEZ like just another Hadoop 23 instance. * Hive can optionally detect chains of MR jobs and optimize them to a single DAG of the form MR* and submit it to TEZ. * Hive can optionally detect when a join has multiple parent tasks and combine them into a single DAG of a tree shape. * Hive will display the MRR optimization in explain plans. @@ -86,11 +86,11 @@ The following things are out of scope for the first phase: One new configuration variable will be introduced: * ~~hive.optimize.tez~~  -hive.execution.engine (changed in [HIVE-6103](https://issues.apache.org/jira/browse/HIVE-6103)) - + ~~True~~  - tez: Submit native TEZ dags, optimized for MRR/MPJ - + ~~False~~  - mr (default): Submit single map, single reduce plans + hive.execution.engine (changed in [HIVE-6103](https://issues.apache.org/jira/browse/HIVE-6103)) + + ~~True~~  + tez: Submit native TEZ dags, optimized for MRR/MPJ + + ~~False~~  + mr (default): Submit single map, single reduce plans * Update:  Several configuration variables were introduced in Hive 0.13.0.  See the [Tez section]({{< ref "#tez-section" >}}) in Configuration Properties. Note: It is possible to execute an MR plan against TEZ. 
In order to do so, one simply has to change the following variable (assuming Tez is installed on the cluster): @@ -305,7 +305,3 @@ For information about how to configure Hive 0.13.0+ for Tez, see the release not For a list of Hive and Tez releases that are compatible with each other, see [Hive-Tez Compatibility]({{< ref "hive-tez-compatibility" >}}). - - - - diff --git a/content/Development/desingdocs/hive-remote-databases-tables.md b/content/Development/desingdocs/hive-remote-databases-tables.md index f3cfc521..2fa16636 100644 --- a/content/Development/desingdocs/hive-remote-databases-tables.md +++ b/content/Development/desingdocs/hive-remote-databases-tables.md @@ -58,13 +58,14 @@ While Waggle Dance is working well for us, its design was highly influenced by o We therefore propose that the concept of remotes be added to Hive. Practically this would encapsulate and deliver the proven functionality and utility of Waggle Dance while simultaneously overcoming the deficiencies in the Waggle Dance design. Before exploring the full scope of this idea, let's consider the anatomy of the most typical use case from a user's perspective; creating a link to a table in a remote cluster to enable local access: ``` - CREATE REMOTE TABLE local_db.local_tbl - CONNECTED TO remote_db.remote_tbl - VIA 'org.apache.hadoop.hive.metastore.ThriftHiveMetastoreClientFactory' - WITH TBLPROPERTIES ( - 'hive.metastore.uris' = 'thrift://remote-hms:9083' - ); +CREATE REMOTE TABLE local_db.local_tbl +CONNECTED TO remote_db.remote_tbl +VIA 'org.apache.hadoop.hive.metastore.ThriftHiveMetastoreClientFactory' +WITH TBLPROPERTIES ( + 'hive.metastore.uris' = 'thrift://remote-hms:9083' +); ``` + Once created the user can expect to access the table `remote_db.remote_tbl`, located in a remote Hive cluster, as if it were a cluster local entity, using the synonym `local_db.local_tbl`. * Firstly notice that this can be an entirely user-driven process, managed from within a Hive client. 
If desired, the creation of remotes could possibly be authorized by appropriate `GRANTs`. @@ -88,23 +89,25 @@ Our first example dealt with the simple federating of a single table from one re Waggle Dance actually federates databases, and hence sets of tables. We could achieve a similar feat with a `CREATE REMOTE DATABASE` (CRD) statement. This would expose all tables in the remote database to the local Hive cluster ``` - CREATE REMOTE DATABASE local_db_name - CONNECTED TO remote_db_name - VIA 'org.apache.hadoop.hive.metastore.ThriftHiveMetastoreClientFactory' - WITH DBPROPERTIES ( - 'hive.metastore.uris' = 'thrift://remote-hms:9083' - ); +CREATE REMOTE DATABASE local_db_name +CONNECTED TO remote_db_name +VIA 'org.apache.hadoop.hive.metastore.ThriftHiveMetastoreClientFactory' +WITH DBPROPERTIES ( + 'hive.metastore.uris' = 'thrift://remote-hms:9083' +); ``` + ##### Statement defaults The CRT and CRD statements can be simplified if we assume some sensible defaults. Here we assume that if a `VIA` stanza is not supplied, we'll default to the HMS Thrift implementation. 
If the `CONNECT TO` stanza is omitted, the remote database name is assumed to be equal to user supplied local name: ``` - CREATE REMOTE DATABASE db_name - WITH DBPROPERTIES ( - 'hive.metastore.uris' = 'thrift://remote-hms:9083' - ); +CREATE REMOTE DATABASE db_name +WITH DBPROPERTIES ( + 'hive.metastore.uris' = 'thrift://remote-hms:9083' +); ``` + Now, for a remote table we can also derive the local database name from the user's currently selected database, and expect that the remote table name is equal to the user supplied local name: ``` @@ -113,20 +116,22 @@ CREATE REMOTE TABLE tbl_name 'hive.metastore.uris' = 'thrift://remote-hms:9083' ); ``` + ##### SSH Tunneling and bastion hosts With a suitable connector, remotes could be configured to use a SSH tunnel to access a remote Hive metastore in cases where certain network restrictions prevent a direct connection from the local cluster to the machine running the Thrift Hive metastore service. A SSH tunnel consists of one or more hops or jump-boxes. The connection between each pair of nodes requires a user and a private key to establish the SSH connection. 
``` - CREATE REMOTE TABLE tbl_name - VIA 'org.apache.hadoop.hive.metastore.SSHThriftHiveMetastoreClientFactory' - WITH TBLPROPERTIES ( - 'hive.metastore.uris' = 'thrift://metastore.domain:9083' - 'ssh.tunnel.route' = 'bastionuser@bastion-host.domain -> user@cluster-node.domain' - 'ssh.tunnel.private.keys' = '/home/user/.ssh/bastionuser-key-pair.pem,/home/user/.ssh/user-key-pair.pem' - 'ssh.tunnel.known.hosts' = '/home/user/.ssh/known_hosts' - ); +CREATE REMOTE TABLE tbl_name +VIA 'org.apache.hadoop.hive.metastore.SSHThriftHiveMetastoreClientFactory' +WITH TBLPROPERTIES ( + 'hive.metastore.uris' = 'thrift://metastore.domain:9083' + 'ssh.tunnel.route' = 'bastionuser@bastion-host.domain -> user@cluster-node.domain' + 'ssh.tunnel.private.keys' = '/home/user/.ssh/bastionuser-key-pair.pem,/home/user/.ssh/user-key-pair.pem' + 'ssh.tunnel.known.hosts' = '/home/user/.ssh/known_hosts' +); ``` + ##### Non-Thrift catalog integrations Using different `HiveMetastoreClientFactory` we can import database and table entities for other catalog implementations, or HMS endpoints that use alternative protocols such as REST or GRPC. 
Consider these illustrative examples: @@ -134,21 +139,23 @@ Using different `HiveMetastoreClientFactory` we can import database and table en ###### AWS Glue ``` - CREATE REMOTE TABLE tbl_name - VIA 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory' - WITH TBLPROPERTIES ( - -- Glue endpoint configuration - ); +CREATE REMOTE TABLE tbl_name +VIA 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory' +WITH TBLPROPERTIES ( + -- Glue endpoint configuration +); ``` + ###### Netflix iceberg ``` - CREATE REMOTE TABLE tbl_name - VIA 'xxx.yyy.iceberg.hive.zzz.IcebergTableHiveClientFactory' - WITH TBLPROPERTIES ( - 'iceberg.table.path' = 'an-atomic-store:/tables/tbl_name' - ) +CREATE REMOTE TABLE tbl_name +VIA 'xxx.yyy.iceberg.hive.zzz.IcebergTableHiveClientFactory' +WITH TBLPROPERTIES ( + 'iceberg.table.path' = 'an-atomic-store:/tables/tbl_name' +) ``` + ##### Behaviour of `DESCRIBE` and `SHOW` operations On executing `DESCRIBE` operations on remote tables and databases, we envisage that the user be returned the description from the remote catalog to which the remote configuration is appended. @@ -181,7 +188,3 @@ Waggle Dance is primarily used for read only access of tables in remote Hive clu Remote tables should work in the context of read only access. To read ACID, one needs only the `ValidTxnList` from the remote metastore and access to the set of base and delta files. Writing of remote ACID tables does not seem practical as there is no global transaction manager in this architecture. Note that at this time ACID does not function reliably on S3, although this capability has been promised. 
- - - - diff --git a/content/Development/desingdocs/hive-tez-compatibility.md b/content/Development/desingdocs/hive-tez-compatibility.md index 31cae566..6db040d8 100644 --- a/content/Development/desingdocs/hive-tez-compatibility.md +++ b/content/Development/desingdocs/hive-tez-compatibility.md @@ -7,18 +7,14 @@ date: 2024-12-12 This is derived from the pom files of the respective releases. Other releases with compatibility are listed in parenthesis. -| Hive | (Works with) Tez | -| --- | --- | -| 0.13 | 0.4.0-incubating | +| Hive | (Works with) Tez | +|------|-------------------------| +| 0.13 | 0.4.0-incubating | | 0.14 | 0.5.2+, (through 0.7.0) | -| 1.0 | 0.5.2, (through 0.7.0) | -| 1.1 | 0.5.2, (through 0.7.0) | -| 1.2* | 0.5.3, (through 0.7.0) | -| 2.0 | 0.8.2 | +| 1.0 | 0.5.2, (through 0.7.0) | +| 1.1 | 0.5.2, (through 0.7.0) | +| 1.2* | 0.5.3, (through 0.7.0) | +| 2.0 | 0.8.2 | *Hive-1.2 is the latest release of Hive as of 07/2015. - - - - diff --git a/content/Development/desingdocs/hivereplicationdevelopment.md b/content/Development/desingdocs/hivereplicationdevelopment.md index 2bc5e1fc..17a898bc 100644 --- a/content/Development/desingdocs/hivereplicationdevelopment.md +++ b/content/Development/desingdocs/hivereplicationdevelopment.md @@ -120,18 +120,18 @@ As mentioned above, each event is tagged with an event sequence ID. In addition Each event is handled differently depending on its event type, which is a combination of the object (database, table, partition) and the operation (create, add, alter, insert, drop). Each event may include a source command, a copy, and a destination command. The following chart describes the ten event types and how they are handled with descriptions below. -| Event | Source Command | Needs Copy? 
| Destination Command | -| --- | --- | --- | --- | -| CreateDatabase | No-op | No | No-op | -| DropDatabase | No-op | No | `DROP DATABASE CASCADE` | -| AlterDatabase | *(not implemented)* | *(not implemented)* | *(not implemented)* | -| CreateTable | `EXPORT … FOR REPLICATION` | Yes | `IMPORT` | -| DropTable | No-op | No | `DROP TABLE … FOR REPLICATION(‘id’)` | -| AlterTable | `EXPORT … FOR METADATA REPLICATION` | Yes (metadata only) | ``IMPORT`` | -| AddPartition | (multi) `EXPORT … FOR REPLICATION` | Yes | (multi) ``IMPORT`` | -| DropPartition | No-op | No | (multi) `ALTER TABLE … DROP PARTITION(…) FOR REPLICATION(‘id’)` | -| AlterPartition | `EXPORT … FOR METADATA REPLICATION` | Yes (metadata only) | ``IMPORT`` | -| Insert | `EXPORT … FOR REPLICATION` | Yes (dumb copy) | ``IMPORT`` | +| Event | Source Command | Needs Copy? | Destination Command | +|----------------|-------------------------------------|---------------------|-----------------------------------------------------------------| +| CreateDatabase | No-op | No | No-op | +| DropDatabase | No-op | No | `DROP DATABASE CASCADE` | +| AlterDatabase | *(not implemented)* | *(not implemented)* | *(not implemented)* | +| CreateTable | `EXPORT … FOR REPLICATION` | Yes | `IMPORT` | +| DropTable | No-op | No | `DROP TABLE … FOR REPLICATION(‘id’)` | +| AlterTable | `EXPORT … FOR METADATA REPLICATION` | Yes (metadata only) | ``IMPORT`` | +| AddPartition | (multi) `EXPORT … FOR REPLICATION` | Yes | (multi) ``IMPORT`` | +| DropPartition | No-op | No | (multi) `ALTER TABLE … DROP PARTITION(…) FOR REPLICATION(‘id’)` | +| AlterPartition | `EXPORT … FOR METADATA REPLICATION` | Yes (metadata only) | ``IMPORT`` | +| Insert | `EXPORT … FOR REPLICATION` | Yes (dumb copy) | ``IMPORT`` | ##### CreateDatabase diff --git a/content/Development/desingdocs/hivereplicationv2development.md b/content/Development/desingdocs/hivereplicationv2development.md index 936198a9..6e882c46 100644 --- 
a/content/Development/desingdocs/hivereplicationv2development.md +++ b/content/Development/desingdocs/hivereplicationv2development.md @@ -77,7 +77,6 @@ One of the primary ways this consideration affects us is that we drifted towards # Rubberbanding - Consider the following series of operations: ``` @@ -86,7 +85,7 @@ INSERT INTO TABLE blah [PARTITION (p="a") VALUES 5; INSERT INTO TABLE blah [PARTITION (p="b") VALUES 10; INSERT INTO TABLE blah [PARTITION (p="a") VALUES 15; ``` - + Now, for each operation that occurs, a monotonically increasing state-id is provided by DbNotificationListener, so that we have an ability to order those events by when they occurred. For the sake of simplicity, let's say they occurred at states 10,20,30,40 respectively, in order. Now, if there were another thread running "SELECT * from blah;" from another thread, then depending on when the SELECT command ran, it would have differing results: @@ -105,8 +104,8 @@ Now, let us look at the same select * behaviour we observed on the source as it 1. If the select * runs before PROC(10), then we get an error, since the table has not yet been created. 2. If the select * runs between PROC(10) & PROC(20), then it will result in the partition p="a") being impressed over. - 1. If PROC(20) occurs before 40 has occurred, then it will return { (a,5) } - 2. If PROC(20) occurs after 40 has occurred, then it will return { (a,5) , (a,15) } - This is because the partition state captured by PROC(20) will occur after 40, and thus contain (a,15), but partition p="b" has not yet been re-impressed because we haven't yet re-impressed that partition, which will occur only at PROC(30). + 1. If PROC(20) occurs before 40 has occurred, then it will return { (a,5) } + 2. 
If PROC(20) occurs after 40 has occurred, then it will return { (a,5) , (a,15) } - This is because the partition state captured by PROC(20) will occur after 40, and thus contain (a,15), but partition p="b" has not yet been re-impressed; that will happen only at PROC(30).

We stop our examination at this point, because we see one possible outcome from the select * on the destination which was impossible at the source. This is the problem introduced by state-transfer that we term rubber-banding - nomenclature borrowed from online games, which deal with each individual player having different latencies, and the server having to reconcile updates in a staggered/stuttering fashion.

@@ -124,8 +123,6 @@ Let us now consider a base part of a replication workflow. It would need to have
 4. The requisite data is copied over from the source warehouse to the destination warehouse
 5. The destination then performs whatever task is needed to restate
-
-
 Now, so far, our primary problem seems to be that we can only capture "latest" state, and not the original state at the time the event occurred. That is to say that at the time we process the notification, we get the state of the object at that time, t3, instead of the state of the object at time t1. In the time between t1 and t3, the object may have changed substantially, and if we go ahead and take the state at t3, and then apply it to the destination in an idempotent fashion, always taking only updates, we get our current implementation, with the rubberbanding problem. Fundamentally, this is the core of our problem. To not have rubberbanding, one of the following must be true of that time period between t1 & t3:

@@ -137,7 +134,6 @@ Route (1) is how we should approach ACID tables, and should be the way hopefully
 In the meanwhile, however, we must try to solve (2) as well.
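The latest-state problem can be demonstrated with a toy simulation. This is an entirely illustrative Python sketch, not Hive code: the destination replays events but re-impresses each partition from the source's current contents, producing a state that never existed at the source.

```python
# Toy simulation of rubberbanding: the destination replays events but reads the
# source's *latest* state at processing time, not the state at event time.
source = {"a": [5, 15], "b": [10]}   # final partition contents at the source

def replay_with_latest_state(events, processed_upto):
    """Re-impress each touched partition from the source's current contents."""
    dest = {}
    for part in events[:processed_upto]:
        dest[part] = list(source[part])  # copies final state, not state-at-event
    return dest

# Events, in order: create p=a (value 5), create p=b (value 10), insert p=a (value 15)
events = ["a", "b", "a"]
dest = replay_with_latest_state(events, processed_upto=1)
# After processing only the first event, the destination already shows (a, 15)
# alongside "no partition b" - a combination that never existed at the source.
assert dest == {"a": [5, 15]}
```

Capturing the object state at event time (route 1 or 2 above) is exactly what prevents this impossible intermediate state.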
To this end, our goal with replv2 is to make sure that if there is any hive access that makes any change to an object, we capture the original state. There are two aspects to the original state - the metadata and the data. The metadata is easily solvable, since t1 & t2 can be done in the context of a single hive operation, and we can impress the metadata for the notification and our change to the metadata in the same metastore transaction. This now leaves us the question of what happens with the backing filesystem data for the object.
-
Now, in addition to this problem that we're solving of tracking what the filesystem state was at the time we did our dump, we have one more problem we want to solve, and that is the 4x copy problem. We've already solved the problem with the extra copy on the destination. Now, we need to somehow prevent the extra copy on the source to make this work. Essentially, to prevent making an extra copy of the entire data on the source, we need a "stable" way of determining what the FS backing state for the object was at the time the event occurred. Both of these problems - the 4x copy problem, and making sure that we know what FS state existed at t1 to prevent rubberbanding - are solvable if we have a snapshot of the source filesystem at the time the event occurred. This led us at first to look at HDFS snapshots as the way to solve this problem. Unfortunately, HDFS snapshots, while they would solve our problem, are, per discussion with HDFS folks, not something we can create a large number of, and we might very well need a snapshot for every single event that comes along.
@@ -168,6 +164,7 @@ Event 100: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION ; Event 110: ALTER TABLE tbl DROP PARTITION (p=1); Event 120: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION ; ``` + When loading the dump on the destination side (at a much later point), when the event 100 is replayed, the load task on the destination will try to pull the files from the (the _files contains the path of ), which may contain new or different data. To replicate the exact state of the source at the time event 100 occurred at the source, we do the following: 1. When Event 100 occurs at the source, in the notification event, we store the checksum of the file(s) in the newly added partition along with the file path(s). @@ -211,7 +208,6 @@ The current implementation of replication is built upon existing commands EXPORT This is better described via various examples of each of the pieces of the command syntax, as follows: - (a) REPL DUMP sales;       REPL DUMP sales.['.*?']Replicates out sales database for bootstrap, from =0 (bootstrap case) to = with a batch size of 0, i.e. no batching. (b) REPL DUMP sales.['T3', '[a-z]+']; @@ -242,7 +238,7 @@ Similar to cases (d) & (e), with the addition of a batch size of =100       REPL DUMP sales.['[a-z]+', 'Q5'] REPLACE sales.['[a-z]+'] FROM 500; -This is an example of changing the replication policy/scope dynamically during incremental replication cycle. +This is an example of changing the replication policy/scope dynamically during incremental replication cycle. In first case, a full DB replication policy "sales" is changed to a replication policy that includes only table/view names with only alphabets "sales.['[a-z]+']" such as "stores", "products" etc. The REPL LOAD using this dump would intelligently drops the tables which are excluded as per the new policy. For instance, table with name 'T5' would be automatically dropped during REPL LOAD if it is already there in target cluster. 
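The checksum idea from step 1 above can be sketched as follows. This is illustrative Python; the helper names and the fallback policy are assumptions, and Hive's actual implementation records file checksums in the notification event itself.

```python
import hashlib

# Sketch of checksum capture/verification for replication events: record a
# checksum at event time, verify it before copying during load. Illustrative
# names; not Hive's API.
def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def record_event(path, data):
    """Capture the file path and its checksum in the notification event."""
    return {"path": path, "checksum": checksum(data)}

def safe_to_copy(event, current_data) -> bool:
    # If the file changed after the event was recorded, the loader must fall
    # back to some recovery path rather than copy stale/new data.
    return checksum(current_data) == event["checksum"]

event = record_event("/warehouse/tbl/p=1/f0", b"row-1\n")
assert safe_to_copy(event, b"row-1\n")
assert not safe_to_copy(event, b"row-1\nrow-2\n")
```

The verification step is what lets the destination detect that the partition's location was reused (as in events 110/120 above) and avoid replaying the wrong data for event 100.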
@@ -252,14 +248,12 @@ In second case, policy is again changed to include table/view 'Q5' and in this c
 The REPL DUMP command has an optional WITH clause to set command-specific configurations to be used when trying to dump. These configurations are only used by the corresponding REPL DUMP command and won't be used for other queries running in the same session. In this example, we set the configurations to exclude external tables and also include only metadata and don't dump data.
-
-
 #### Return values:

 1. Error codes are returned as return codes (and over jdbc if with HS2)
 2. Returns 2 columns in the ResultSet:
- 1. - the directory to which it has dumped info.
- 2. - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.
+ 1. - the directory to which it has dumped info.
+ 2. - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.

 #### Note:

@@ -276,10 +270,8 @@ When bootstrap dump is in progress, it blocks rename table/partition operations
 Look up the HiveServer logs for the below pair of log messages.

> REPL DUMP:: Set property for Database: , Property: , Value: ACTIVE
->
+>
> REPL DUMP:: Reset property for Database: , Property:
->
->

If the Reset property log is not found for the corresponding Set property log, then the user needs to manually reset the database property with value as "IDLE" using the ALTER DATABASE command.

@@ -287,7 +279,6 @@ If Reset property log is not found for the corresponding Set property log, then
 `REPL LOAD {} FROM {WITH ('key1'='value1', 'key2'='value2')};`
-
 This causes a REPL DUMP present in (which is to be a fully qualified HDFS URL) to be pulled and loaded. If is specified, and the original dump was a database-level dump, this allows Hive to do db-rename-mapping on import.
If dbname is not specified, the original dbname as recorded in the dump would be used. The REPL LOAD command has an optional WITH clause to set command-specific configurations to be used when trying to copy from the source cluster. These configurations are only used by the corresponding REPL LOAD command and won't be used for other queries running in the same session.

#### Return values:

@@ -299,12 +290,13 @@ This causes a REPL DUMP present in (which is to be a fully qualified HDFS URL)
 `REPL STATUS ;`
-
-Will return the same output that REPL LOAD returns, allows REPL LOAD to be run asynchronously. If no knowledge of a replication associated with that db is present, i.e., there are no known replications for that, we return an empty set. Note that for cases where a destination db or table exists, but no known repl exists for it, this should be considered an error condition for tools calling REPL LOAD to pass on to the end-user, to alert them that they may be overwriting an existing db with another.
+Will return the same output that REPL LOAD returns, allowing REPL LOAD to be run asynchronously. If no knowledge of a replication associated with that db is present, i.e., there are no known replications for that, we return an empty set. Note that for cases where a destination db or table exists, but no known repl exists for it, this should be considered an error condition for tools calling REPL LOAD to pass on to the end-user, to alert them that they may be overwriting an existing db with another.
+
 #### Return values:

 1. Error codes returned as normal.
 2. Returns the last replication state (event ID) for the given database.
+
 # Bootstrap, Revisited

 When we introduced the notion of a need for bootstrap, we said that the problem of time passing during the bootstrap was something of a problem that needed solving separately.
@@ -313,16 +305,12 @@ Let us say that we begin the dump at evid=170, and by the time we finish the dum Let us consider the case of a table T1, which was dumped out around evid=200. Now, let us say that the following operations have occurred on the two tables during the time the dump has been proceeding: - - -| event id | operation | -| --- | --- | -| 184 | ALTER TABLE T1 DROP PARTITION(Px) | -| 196 | ALTER TABLE T1 ADD PARTITION(Pa) | -| 204 | ALTER TABLE T1 ADD PARTITION(Pb) | -| 216 | ALTER TABLE T1 DROP PARTITION(Py) | - - +| event id | operation | +|----------|-----------------------------------| +| 184 | ALTER TABLE T1 DROP PARTITION(Px) | +| 196 | ALTER TABLE T1 ADD PARTITION(Pa) | +| 204 | ALTER TABLE T1 ADD PARTITION(Pb) | +| 216 | ALTER TABLE T1 DROP PARTITION(Py) | Basically, let us try to understand what happens when partitions are added(Pa & Pb) and dropped(Px & Py) both before and after a table is dumped. So, for our bootstrap, we go through 2 phases - first an object dump of all the objects we're expected to dump, and then a consolidation phase where we go through all the events that occurred during our object dump. @@ -336,8 +324,6 @@ Now, one approach to handle this would be to simply say that we say that the dum While this can work, the problem with this approach is that the destination can now have tables at differing states as a result of the dump - i.e. a table T2 that was dumped at about evid=220 will have newer info than T1 that was dumped about evid=200, and this is a sort of mini-rubberbanding in itself, since different parts of a whole are at different states. This problem is actually a little worse, since different partitions of a table can actually be at different states. Thus, we will not follow this approach. - - Approach 2 : Consolidate at source. 
The alternate approach, then, is to go through each of the events from evid=170 to evid=230 in our example, which are the current-event-ids at the beginning of the object dump phase and the end of the object dump phase respectively, and to use that to modify the object dumps that we've just made. Any drops will result in the dumped object being changed/deleted, and any creates will result in additional dumped objects being added. Alters will result in dumped objects being replaced by their newer equivalent. At the end of this consolidation, all objects dumped should be capable of being restored on the destination as if the state for them was 230, and incremental replication can then take over, processing event 230 onwards.
@@ -357,9 +343,9 @@ The related hive config parameter is "hive.metastore.event.db.notification.api.a
The auth mechanism works as below:
1. Skip auth in embedded metastore mode regardless of the "hive.metastore.event.db.notification.api.auth" setting
-The reason is that we know the metastore calls are made from hive as opposed to other un-authorized processes that are running metastore client.
+ The reason is that we know the metastore calls are made from Hive, as opposed to other unauthorized processes that are running a metastore client.
2. Enable auth in remote metastore mode if "hive.metastore.event.db.notification.api.auth" is set to true
-The UGI of the remote metastore client is always set on metastore server. We retrieve this user info and check if this user has proxy privilege according to the [proxy user](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html) settings. For example, the UGI is user "hive" and "hive" been configured to have the proxy privilege against a list of hosts. Then the auth will pass for the notification related calls from those hosts. If a user "foo" is performing repl operations (e.g.
through HS2 with doAs=true), then the auth will fail unless user "foo" is configured to have the proxy privilege.
+ The UGI of the remote metastore client is always set on the metastore server. We retrieve this user info and check whether this user has the proxy privilege according to the [proxy user](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html) settings. For example, if the UGI is user "hive" and "hive" has been configured to have the proxy privilege against a list of hosts, then the auth will pass for the notification-related calls from those hosts. If a user "foo" is performing repl operations (e.g. through HS2 with doAs=true), then the auth will fail unless user "foo" is configured to have the proxy privilege.
# Setup/Configuration
@@ -373,7 +359,6 @@ hive.metastore.dml.events = true
hive.repl.cm.enabled = true
-
There are additional replication-related parameters (with their default values). These are relevant only to the cluster that acts as the source cluster.
The defaults should work for these in most cases -
REPLDIR("hive.repl.rootdir","/user/hive/repl/", "HDFS root dir for all replication dumps."),
@@ -381,11 +366,3 @@ REPLCMDIR("hive.repl.cmrootdir","/user/hive/cmroot/", "Root dir for ChangeManag
REPLCMRETIAN("hive.repl.cm.retain","24h", new TimeValidator(TimeUnit.HOURS),"Time to retain removed files in cmrootdir."),
REPLCMINTERVAL("hive.repl.cm.interval","3600s",new TimeValidator(TimeUnit.SECONDS),"Inteval for cmroot cleanup thread."),
-
-
-
-
-
-
-
-
diff --git a/content/Development/desingdocs/hiveserver2-thrift-api.md b/content/Development/desingdocs/hiveserver2-thrift-api.md
index 282fd4f5..58759be5 100644
--- a/content/Development/desingdocs/hiveserver2-thrift-api.md
+++ b/content/Development/desingdocs/hiveserver2-thrift-api.md
@@ -770,7 +770,3 @@ service TSQLService {
```
-
-
-
-
diff --git a/content/Development/desingdocs/howl.md b/content/Development/desingdocs/howl.md
index af96560f..36f2723b 100644
--- a/content/Development/desingdocs/howl.md
+++ b/content/Development/desingdocs/howl.md
@@ -13,7 +13,3 @@ This page collects some pointers to resources about Howl (an effort to create a
* [Howl CLI functional spec](http://wiki.apache.org/pig/Howl/HowlCliFuncSpec)
* [Original plans for Owl (predecessor to Howl)](http://wiki.apache.org/pig/owl)
-
-
-
-
diff --git a/content/Development/desingdocs/hybrid-grace-hash-join-v1-0.md b/content/Development/desingdocs/hybrid-grace-hash-join-v1-0.md
index c5ec281a..728a7a53 100644
--- a/content/Development/desingdocs/hybrid-grace-hash-join-v1-0.md
+++ b/content/Development/desingdocs/hybrid-grace-hash-join-v1-0.md
@@ -83,39 +83,28 @@ When the entire small table can fit in memory, everything is similar to Classic
![](https://lh6.googleusercontent.com/iVa2tD1kTSoWQKxWxwPhPxJHnvnlr1uJa4HwrkHqNPB4osG3xzVU-sjmCUKfI0c9f3I3rpNqx9M_qmjelxDfWYUp7165k7k2O44C1Y2A8ajc9eK0SonMgD-y5Z6ia0Ydfg)
-
![](https://lh3.googleusercontent.com/VnwA074O47g4crdCqIxW3CFyxVpWgfKD6-2iKAbdQGn2u_IR1LWW-nT5t261ljhY8Y_zOmm2Mj3GGJY6yhxkUbNcwSh5-r4Pwo3mUoj8Yqo3C9a4ZaFYwRQgGisKuWZ42A) - - When the small table cannot fit in memory, things are a little more complicated. Small table keys keep getting hashed into different partitions until at some point the memory limit is reached. Then the biggest partition in memory will be moved to disk so that memory is freed to some extent. The new key that should have been put into that partition will be put in the corresponding sidefile. ![](https://lh4.googleusercontent.com/DahwmR-yyQSks0veoua0FOsmp7XoV_8fS1KIb3UGo8cdjcPFJ4ani_40TzYmIQHzXmaERJduVwkFqn8bZd7bCTz-bcxy66Rhn4dQ5OUrcQ09s7AIQax7UfyGlTzQOV_eDA) - - During probing, if matches are found from in-memory partitions, they will be directly put into the result. ![](https://lh5.googleusercontent.com/zGTjywxVuEClExTogYw2oe16Wn3CVjc1l64CTGUUKUlHB0SN74LVJe3h57udeghBzswaQtcztekGgH4st8-DIzLRKMAaoMmZv8KmqkhxK-DqzYBlABAwRrM2yolCS-Ad_w) - - If a possible match is found, i.e., the key hash of big table falls into an on-disk partition, it will be put into an on-disk matchfile. No matching is done at this moment, but will be done later. ![](https://lh5.googleusercontent.com/hPDypZAfTtCBI1Ocdvp6kkS40MLnCH_yadfef7kgk5firhV5EJ9g1DLr4CWrdgxc1nsGvc1HnY4US7TwOiAnT1oEQ1snrXH2Tp0sneJaxXWITp2JIB-rntB9A0aMV5C5FQ) - - After the big table has been exhausted, all the in-memory partitions are no longer needed and purged. Purging is appropriate because those keys that should match have been output to result, and those that don’t match won’t be relevant to the on-disk structures. -Now there may be many on-disk triplets (partition, sidefile and matchfile). For each triplet, merge the partition and sidefile and put them back in memory to form a new hash table. Then scan the matchfile against the newly created hash table looking for matches. The logic here is similar to Classic Hash Join. 
+Now there may be many on-disk triplets (partition, sidefile and matchfile). For each triplet, merge the partition and sidefile and put them back in memory to form a new hash table. Then scan the matchfile against the newly created hash table looking for matches. The logic here is similar to Classic Hash Join. ![](https://lh4.googleusercontent.com/Gpbx1cZh37iFIvqMUBWjgce2GxAPvrJ-XldCMs6NgvqatlGa4x4try2eED-VLgdYiAW7tjZElEhmyzcWzFLo_k1t5j2kT-CGM0eRPHm_cV0gcA3oPUEyf-bu6fABiE3xeQ) - - Repeat until all on-disk triplets have been exhausted. At this moment, the join between small table and big table is completed. # Recursive Hashing and Spilling @@ -124,38 +113,24 @@ There are cases when the hash function is not working well for distributing valu Assume there’s only 1GB memory space available for the join. Hash function h1 distributes keys into two partitions HT1 and HT2. At some point in time, the memory is full and HT1 is moved to disk. Assume all the future keys all hash to HT1, so they go to Sidefile1. - - [![](https://lh3.googleusercontent.com/ppKL4CGZnDUQWJRT7CGepJ8I5_WDCe2FlQhSEoZ135-YCvbZkeRtmj3Yl8zcoQytCD9ZVQ3bkujvj30V9gOuYDqSRiOKHHxXa1k5fqg2RGWBIEuMB3Y3rqveW5w2gb8Kog)](https://drive.draw.io/#G0B72RYNz9voGccmlLNEhjcW4wUFU) - - As normal, during the probe phase big table values will be either matched and put into result or not matched and put into a Matchfile. - - [![](https://lh6.googleusercontent.com/a-QU_o2xvnLaVn7hKuhHX2VnmK0EYmi_0BaY-1H32wEAho0pJ1wmSaGOoO-VmNpwvj-MlVNdNA30JUEKGkuWydAKho6igYEo2D72W9aTRArWBvFRja91prI7I664CiSLvw)](https://drive.draw.io/#G0B72RYNz9voGcVnVzeWwzR0hyUDA) - - After the big table is exhausted, all the values from big table should be either output (matched) to the result, or staged into matchfiles. Then it is time to merge back pairs of on-disk hash partition and sidefile, and build a new in-memory hash table, and probe the corresponding matchfile against it. 
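The build/probe/merge-back cycle just described can be sketched in a few dozen lines. This is a toy model: the memory limit, hash functions, and data are illustrative stand-ins, not Hive's implementation.

```python
# Sketch of the hybrid grace hash join spill logic described above.
# Toy model; sizes and hash functions are illustrative stand-ins.

MEMORY_LIMIT = 4          # max small-table rows held in memory (toy number)

def hybrid_join(small, big, hash_fn, depth=0):
    """Join two lists of (key, value) rows; returns (key, small_val, big_val) matches."""
    # Build phase: partition the small table, spilling the biggest
    # in-memory partition whenever the memory limit is exceeded.
    parts, spilled, sidefiles = {}, {}, {}
    for k, v in small:
        p = hash_fn(k, depth)
        if p in spilled:
            sidefiles[p].append((k, v))       # partition already on disk
            continue
        parts.setdefault(p, []).append((k, v))
        if sum(len(rows) for rows in parts.values()) > MEMORY_LIMIT:
            biggest = max(parts, key=lambda q: len(parts[q]))
            spilled[biggest] = parts.pop(biggest)
            sidefiles[biggest] = []
    # Probe phase: match against in-memory partitions directly,
    # stage possible matches for on-disk partitions in matchfiles.
    result, matchfiles = [], {p: [] for p in spilled}
    for k, v in big:
        p = hash_fn(k, depth)
        if p in spilled:
            matchfiles[p].append((k, v))
        else:
            for sk, sv in parts.get(p, []):
                if sk == k:
                    result.append((k, sv, v))
    # Merge-back phase: for each on-disk triplet, merge partition + sidefile;
    # recurse with the next hash function if it still won't fit in memory.
    for p in spilled:
        merged = spilled[p] + sidefiles[p]
        if len(merged) > MEMORY_LIMIT:
            result += hybrid_join(merged, matchfiles[p], hash_fn, depth + 1)
        else:
            table = {}
            for sk, sv in merged:
                table.setdefault(sk, []).append(sv)
            for k, v in matchfiles[p]:
                for sv in table.get(k, []):
                    result.append((k, sv, v))
    return result

small = [(i, f"s{i}") for i in range(8)]
big = [(i % 8, f"b{i}") for i in range(16)]
out = hybrid_join(small, big, lambda k, d: (k >> d) & 1)
print(len(out))   # 16: every big-table row finds its small-table match
```

With these toy sizes, one partition spills during the build, its probe rows are staged in a matchfile, and the merge-back step resolves them, the same three phases as in the diagrams above.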
In this example, since the predicted size of the merge is greater than the memory limit (800MB + 300MB > 1GB), we cannot simply merge them back to memory. Instead, we need to rehash them by using a different hash function h2. - - [![](https://lh6.googleusercontent.com/9-F7k7Auay7iqSl6gdwA8ImAwKMIyaixtcrHHJB65tmld3Yw_DASBXdpTlNfx21X0rAjnL7KvA_KVWEshvRWRTlCldMeztPMLINjJVO4gvMSLv2WxTGY450LwQYtCOlM5w)](https://drive.draw.io/#G0B72RYNz9voGcVnVzeWwzR0hyUDA)   Now we probe using Matchfile 1 against HT 3 (in memory) and HT 4 (on disk). Matching values for HT 3 go into result. Possibly matching values for HT 4 go to Matchfile 4. - - ![](https://lh4.googleusercontent.com/BMAqCXS1nBeb-v_XdLfVswpMP4S7cRD0x976n7sGcmrF3zsPfD7OTd-SvmQYkTp7M6kbLxIp_LEdOQMZTakG3oRkSs0SCYEYF-izz10dLn3wOUz20eAOMgBXJWgD4THamg) - - This process can continue recursively if the size of HT 4 plus size of Sidefile 4 is still greater than the memory limit, i.e., hashing HT 4 and Sidefile 4 using a third hash function h3, and probing Matchfile 4 using h3. In this example, the size of HT 4 plus size of Sidefile 4 is smaller than memory limit, so we are done. # Skewed Data Distribution @@ -166,7 +141,7 @@ Several approaches can be considered to handle this problem. If we have reliable # Bloom Filter -As of [Hive 2.0.0](https://issues.apache.org/jira/browse/HIVE-11306), a cheap Bloom filter is built during the build phase of the Hybrid hashtable, which is consulted against before spilling a row into the matchfile. The goal is to minimize the number of records which end up being spilled to disk, which may not have any matches in the spilled hashtables. The optimization also benefits left outer joins since the row which entered the hybrid join can be immediately generated as output with appropriate nulls indicating a lack of match, while without the filter it would have to be serialized onto disk only to be reloaded without a match at the end of the probe. 
+As of [Hive 2.0.0](https://issues.apache.org/jira/browse/HIVE-11306), a cheap Bloom filter is built during the build phase of the Hybrid hashtable, which is consulted before spilling a row into the matchfile. The goal is to minimize the number of records which end up being spilled to disk but may not have any matches in the spilled hashtables. The optimization also benefits left outer joins, since a row entering the hybrid join can be immediately emitted with appropriate nulls indicating a lack of match; without the filter it would have to be serialized onto disk only to be reloaded without a match at the end of the probe.
# References
@@ -180,7 +155,3 @@ As of [Hive 2.0.0](https://issues.apache.org/jira/browse/HIVE-11306), a cheap Bl
* Dewitt, David J.  Implementation techniques for main memory database systems
* Jimmy Lin and Chris Dyer  Data-Intensive Text Processing with MapReduce
-
-
-
-
diff --git a/content/Development/desingdocs/indexdev-bitmap.md b/content/Development/desingdocs/indexdev-bitmap.md
index 5ae88a28..349f7a33 100644
--- a/content/Development/desingdocs/indexdev-bitmap.md
+++ b/content/Development/desingdocs/indexdev-bitmap.md
@@ -7,15 +7,15 @@ date: 2024-12-12
## Introduction
-This document explains the proposed design for adding a bitmap index handler ().
+This document explains the proposed design for adding a bitmap index handler.
-Bitmap indexing () is a standard technique for indexing columns with few distinct
+Bitmap indexing is a standard technique for indexing columns with few distinct
values, such as gender.
## Approach
-We want to develop a bitmap index that can reuse as much of the existing Compact Index code as possible.
+We want to develop a bitmap index that can reuse as much of the existing Compact Index code as possible.
## Proposal
@@ -45,7 +45,3 @@ For the second iteration, the first entry will be:
This one uses 1-byte array entries, so each value in the array stores 8 rows.
If an entry is 0x00 or 0xFF, it represents 1 or more consecutive bytes of zeros, (in this case 5 and 4, respectively) - - - - diff --git a/content/Development/desingdocs/indexdev.md b/content/Development/desingdocs/indexdev.md index a24afa97..e33d64d5 100644 --- a/content/Development/desingdocs/indexdev.md +++ b/content/Development/desingdocs/indexdev.md @@ -314,34 +314,29 @@ TBD: validation on index table format (can be any managed table format?) ## Current Status (JIRA) -| Type | Key | Summary | Assignee | Reporter | Priority | Status | Resolution | Created | Updated | Due | -| --- | --- | ---| --- | --- | --- | --- | --- | --- | --- | --- | -| [New Feature](https://issues.apache.org/jira/browse/HIVE-21792?src=confmacro) | [HIVE-21792](https://issues.apache.org/jira/browse/HIVE-21792?src=confmacro) | [Hive Indexes... Again](https://issues.apache.org/jira/browse/HIVE-21792?src=confmacro) | Unassigned | David Mollitor | Major | Open | Unresolved | May 24, 2019 | Feb 27, 2024 | | -| [Improvement](https://issues.apache.org/jira/browse/HIVE-18448?src=confmacro) | [HIVE-18448](https://issues.apache.org/jira/browse/HIVE-18448?src=confmacro) | [Drop Support For Indexes From Apache Hive](https://issues.apache.org/jira/browse/HIVE-18448?src=confmacro) | Zoltan Haindrich | David Mollitor | Minor | Closed | Fixed | Jan 12, 2018 | May 28, 2022 | -| [Bug](https://issues.apache.org/jira/browse/HIVE-18035?src=confmacro) | [HIVE-18035](https://issues.apache.org/jira/browse/HIVE-18035?src=confmacro) | [NullPointerException on querying a table with a compact index](https://issues.apache.org/jira/browse/HIVE-18035?src=confmacro) | Unassigned | Brecht Machiels | Major | Open | Unresolved | Nov 09, 2017 | Nov 13, 2017 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-15282?src=confmacro) | [HIVE-15282](https://issues.apache.org/jira/browse/HIVE-15282?src=confmacro) | [Different modification times are used when an index is built and when its staleness is 
checked](https://issues.apache.org/jira/browse/HIVE-15282?src=confmacro) | Marta Kuczora | Marta Kuczora | Major | Resolved | Fixed | Nov 24, 2016 | Jul 21, 2017 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-13844?src=confmacro) | [HIVE-13844](https://issues.apache.org/jira/browse/HIVE-13844?src=confmacro) | [Invalid index handler in org.apache.hadoop.hive.ql.index.HiveIndex class](https://issues.apache.org/jira/browse/HIVE-13844?src=confmacro) | Svetozar Ivanov | Svetozar Ivanov | Minor | Closed | Fixed | May 25, 2016 | Feb 27, 2024 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-13377?src=confmacro) | [HIVE-13377](https://issues.apache.org/jira/browse/HIVE-13377?src=confmacro) | [Lost rows when using compact index on parquet table](https://issues.apache.org/jira/browse/HIVE-13377?src=confmacro) | Unassigned | Gabriel C Balan | Minor | Open | Unresolved | Mar 29, 2016 | Mar 29, 2016 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-12877?src=confmacro) | [HIVE-12877](https://issues.apache.org/jira/browse/HIVE-12877?src=confmacro) | [Hive use index for queries will lose some data if the Query file is compressed.](https://issues.apache.org/jira/browse/HIVE-12877?src=confmacro) | Unassigned | yangfang | Major | Patch Available | Unresolved | Jan 15, 2016 | Feb 01, 2016 | Jan 15, 2016 | -| [Bug](https://issues.apache.org/jira/browse/HIVE-11227?src=confmacro) | [HIVE-11227](https://issues.apache.org/jira/browse/HIVE-11227?src=confmacro) | [Kryo exception during table creation in Hive](https://issues.apache.org/jira/browse/HIVE-11227?src=confmacro) | Unassigned | Akamai | Major | Open | Unresolved | Jul 10, 2015 | Oct 21, 2022 | Jul 12, 2015 | -| [Bug](https://issues.apache.org/jira/browse/HIVE-11154?src=confmacro) | [HIVE-11154](https://issues.apache.org/jira/browse/HIVE-11154?src=confmacro) | [Indexing not activated with left outer join and where clause](https://issues.apache.org/jira/browse/HIVE-11154?src=confmacro) | Bennie Can | Bennie Can | 
Major | Open | Unresolved | Jun 30, 2015 | Jun 30, 2015 | Jul 11, 2015 | -| [Bug](https://issues.apache.org/jira/browse/HIVE-10021?src=confmacro) | [HIVE-10021](https://issues.apache.org/jira/browse/HIVE-10021?src=confmacro) | ["Alter index rebuild" statements submitted through HiveServer2 fail when Sentry is enabled](https://issues.apache.org/jira/browse/HIVE-10021?src=confmacro) | Aihua Xu | Richard Williams | Major | Closed | Fixed | Mar 19, 2015 | Feb 16, 2016 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-9656?src=confmacro) | [HIVE-9656](https://issues.apache.org/jira/browse/HIVE-9656?src=confmacro) | [Create Index Failed without WITH DEFERRED REBUILD](https://issues.apache.org/jira/browse/HIVE-9656?src=confmacro) | Chaoyu Tang | Will Du | Major | Open | Unresolved | Feb 11, 2015 | Sep 28, 2015 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-9639?src=confmacro) | [HIVE-9639](https://issues.apache.org/jira/browse/HIVE-9639?src=confmacro) | [Create Index failed in Multiple version of Hive running](https://issues.apache.org/jira/browse/HIVE-9639?src=confmacro) | Unassigned | Will Du | Major | Open | Unresolved | Feb 10, 2015 | Mar 14, 2015 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-8475?src=confmacro) | [HIVE-8475](https://issues.apache.org/jira/browse/HIVE-8475?src=confmacro) | [add test case for use of index from not-current database](https://issues.apache.org/jira/browse/HIVE-8475?src=confmacro) | Thejas Nair | Thejas Nair | Major | Closed | Fixed | Oct 15, 2014 | Nov 13, 2014 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-7692?src=confmacro) | [HIVE-7692](https://issues.apache.org/jira/browse/HIVE-7692?src=confmacro) | [when table is dropped associated indexes also should be dropped](https://issues.apache.org/jira/browse/HIVE-7692?src=confmacro) | Thejas Nair | Thejas Nair | Major | Resolved | Not A Problem | Aug 12, 2014 | Nov 04, 2014 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-7239?src=confmacro) | 
[HIVE-7239](https://issues.apache.org/jira/browse/HIVE-7239?src=confmacro) | [Fix bug in HiveIndexedInputFormat implementation that causes incorrect query result when input backed by Sequence/RC files](https://issues.apache.org/jira/browse/HIVE-7239?src=confmacro) | Illya Yalovyy | Sumit Kumar | Major | Closed | Fixed | Jun 16, 2014 | Jul 26, 2017 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-6996?src=confmacro) | [HIVE-6996](https://issues.apache.org/jira/browse/HIVE-6996?src=confmacro) | [FS based stats broken with indexed tables](https://issues.apache.org/jira/browse/HIVE-6996?src=confmacro) | Ashutosh Chauhan | Ashutosh Chauhan | Major | Closed | Fixed | Apr 30, 2014 | Jun 09, 2014 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-6921?src=confmacro) | [HIVE-6921](https://issues.apache.org/jira/browse/HIVE-6921?src=confmacro) | [index creation fails with sql std auth turned on](https://issues.apache.org/jira/browse/HIVE-6921?src=confmacro) | Ashutosh Chauhan | Ashutosh Chauhan | Major | Closed | Fixed | Apr 16, 2014 | Jun 09, 2014 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-5902?src=confmacro) | [HIVE-5902](https://issues.apache.org/jira/browse/HIVE-5902?src=confmacro) | [Cannot create INDEX on TABLE in HIVE 0.12](https://issues.apache.org/jira/browse/HIVE-5902?src=confmacro) | Unassigned | Juraj Volentier | Major | Open | Unresolved | Nov 27, 2013 | Mar 14, 2015 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-5664?src=confmacro) | [HIVE-5664](https://issues.apache.org/jira/browse/HIVE-5664?src=confmacro) | [Drop cascade database fails when the db has any tables with indexes](https://issues.apache.org/jira/browse/HIVE-5664?src=confmacro) | Venki Korukanti | Venki Korukanti | Major | Closed | Fixed | Oct 28, 2013 | Feb 19, 2015 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-5631?src=confmacro) | [HIVE-5631](https://issues.apache.org/jira/browse/HIVE-5631?src=confmacro) | [Index creation on a skew table 
fails](https://issues.apache.org/jira/browse/HIVE-5631?src=confmacro) | Venki Korukanti | Venki Korukanti | Major | Closed | Fixed | Oct 23, 2013 | Feb 19, 2015 | | - +| Type | Key | Summary | Assignee | Reporter | Priority | Status | Resolution | Created | Updated | Due | +|-------------------------------------------------------------------------------|------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------|----------|-----------------|---------------|--------------|--------------|--------------| +| [New Feature](https://issues.apache.org/jira/browse/HIVE-21792?src=confmacro) | [HIVE-21792](https://issues.apache.org/jira/browse/HIVE-21792?src=confmacro) | [Hive Indexes... Again](https://issues.apache.org/jira/browse/HIVE-21792?src=confmacro) | Unassigned | David Mollitor | Major | Open | Unresolved | May 24, 2019 | Feb 27, 2024 | | +| [Improvement](https://issues.apache.org/jira/browse/HIVE-18448?src=confmacro) | [HIVE-18448](https://issues.apache.org/jira/browse/HIVE-18448?src=confmacro) | [Drop Support For Indexes From Apache Hive](https://issues.apache.org/jira/browse/HIVE-18448?src=confmacro) | Zoltan Haindrich | David Mollitor | Minor | Closed | Fixed | Jan 12, 2018 | May 28, 2022 | +| [Bug](https://issues.apache.org/jira/browse/HIVE-18035?src=confmacro) | [HIVE-18035](https://issues.apache.org/jira/browse/HIVE-18035?src=confmacro) | [NullPointerException on querying a table with a compact index](https://issues.apache.org/jira/browse/HIVE-18035?src=confmacro) | Unassigned | Brecht Machiels | Major | Open | Unresolved | Nov 09, 2017 | Nov 13, 2017 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-15282?src=confmacro) | [HIVE-15282](https://issues.apache.org/jira/browse/HIVE-15282?src=confmacro) | [Different 
modification times are used when an index is built and when its staleness is checked](https://issues.apache.org/jira/browse/HIVE-15282?src=confmacro) | Marta Kuczora | Marta Kuczora | Major | Resolved | Fixed | Nov 24, 2016 | Jul 21, 2017 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-13844?src=confmacro) | [HIVE-13844](https://issues.apache.org/jira/browse/HIVE-13844?src=confmacro) | [Invalid index handler in org.apache.hadoop.hive.ql.index.HiveIndex class](https://issues.apache.org/jira/browse/HIVE-13844?src=confmacro) | Svetozar Ivanov | Svetozar Ivanov | Minor | Closed | Fixed | May 25, 2016 | Feb 27, 2024 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-13377?src=confmacro) | [HIVE-13377](https://issues.apache.org/jira/browse/HIVE-13377?src=confmacro) | [Lost rows when using compact index on parquet table](https://issues.apache.org/jira/browse/HIVE-13377?src=confmacro) | Unassigned | Gabriel C Balan | Minor | Open | Unresolved | Mar 29, 2016 | Mar 29, 2016 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-12877?src=confmacro) | [HIVE-12877](https://issues.apache.org/jira/browse/HIVE-12877?src=confmacro) | [Hive use index for queries will lose some data if the Query file is compressed.](https://issues.apache.org/jira/browse/HIVE-12877?src=confmacro) | Unassigned | yangfang | Major | Patch Available | Unresolved | Jan 15, 2016 | Feb 01, 2016 | Jan 15, 2016 | +| [Bug](https://issues.apache.org/jira/browse/HIVE-11227?src=confmacro) | [HIVE-11227](https://issues.apache.org/jira/browse/HIVE-11227?src=confmacro) | [Kryo exception during table creation in Hive](https://issues.apache.org/jira/browse/HIVE-11227?src=confmacro) | Unassigned | Akamai | Major | Open | Unresolved | Jul 10, 2015 | Oct 21, 2022 | Jul 12, 2015 | +| [Bug](https://issues.apache.org/jira/browse/HIVE-11154?src=confmacro) | [HIVE-11154](https://issues.apache.org/jira/browse/HIVE-11154?src=confmacro) | [Indexing not activated with left outer join and where 
clause](https://issues.apache.org/jira/browse/HIVE-11154?src=confmacro) | Bennie Can | Bennie Can | Major | Open | Unresolved | Jun 30, 2015 | Jun 30, 2015 | Jul 11, 2015 | +| [Bug](https://issues.apache.org/jira/browse/HIVE-10021?src=confmacro) | [HIVE-10021](https://issues.apache.org/jira/browse/HIVE-10021?src=confmacro) | ["Alter index rebuild" statements submitted through HiveServer2 fail when Sentry is enabled](https://issues.apache.org/jira/browse/HIVE-10021?src=confmacro) | Aihua Xu | Richard Williams | Major | Closed | Fixed | Mar 19, 2015 | Feb 16, 2016 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-9656?src=confmacro) | [HIVE-9656](https://issues.apache.org/jira/browse/HIVE-9656?src=confmacro) | [Create Index Failed without WITH DEFERRED REBUILD](https://issues.apache.org/jira/browse/HIVE-9656?src=confmacro) | Chaoyu Tang | Will Du | Major | Open | Unresolved | Feb 11, 2015 | Sep 28, 2015 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-9639?src=confmacro) | [HIVE-9639](https://issues.apache.org/jira/browse/HIVE-9639?src=confmacro) | [Create Index failed in Multiple version of Hive running](https://issues.apache.org/jira/browse/HIVE-9639?src=confmacro) | Unassigned | Will Du | Major | Open | Unresolved | Feb 10, 2015 | Mar 14, 2015 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-8475?src=confmacro) | [HIVE-8475](https://issues.apache.org/jira/browse/HIVE-8475?src=confmacro) | [add test case for use of index from not-current database](https://issues.apache.org/jira/browse/HIVE-8475?src=confmacro) | Thejas Nair | Thejas Nair | Major | Closed | Fixed | Oct 15, 2014 | Nov 13, 2014 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-7692?src=confmacro) | [HIVE-7692](https://issues.apache.org/jira/browse/HIVE-7692?src=confmacro) | [when table is dropped associated indexes also should be dropped](https://issues.apache.org/jira/browse/HIVE-7692?src=confmacro) | Thejas Nair | Thejas Nair | Major | Resolved | Not A Problem | Aug 12, 2014 
| Nov 04, 2014 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-7239?src=confmacro) | [HIVE-7239](https://issues.apache.org/jira/browse/HIVE-7239?src=confmacro) | [Fix bug in HiveIndexedInputFormat implementation that causes incorrect query result when input backed by Sequence/RC files](https://issues.apache.org/jira/browse/HIVE-7239?src=confmacro) | Illya Yalovyy | Sumit Kumar | Major | Closed | Fixed | Jun 16, 2014 | Jul 26, 2017 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-6996?src=confmacro) | [HIVE-6996](https://issues.apache.org/jira/browse/HIVE-6996?src=confmacro) | [FS based stats broken with indexed tables](https://issues.apache.org/jira/browse/HIVE-6996?src=confmacro) | Ashutosh Chauhan | Ashutosh Chauhan | Major | Closed | Fixed | Apr 30, 2014 | Jun 09, 2014 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-6921?src=confmacro) | [HIVE-6921](https://issues.apache.org/jira/browse/HIVE-6921?src=confmacro) | [index creation fails with sql std auth turned on](https://issues.apache.org/jira/browse/HIVE-6921?src=confmacro) | Ashutosh Chauhan | Ashutosh Chauhan | Major | Closed | Fixed | Apr 16, 2014 | Jun 09, 2014 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-5902?src=confmacro) | [HIVE-5902](https://issues.apache.org/jira/browse/HIVE-5902?src=confmacro) | [Cannot create INDEX on TABLE in HIVE 0.12](https://issues.apache.org/jira/browse/HIVE-5902?src=confmacro) | Unassigned | Juraj Volentier | Major | Open | Unresolved | Nov 27, 2013 | Mar 14, 2015 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-5664?src=confmacro) | [HIVE-5664](https://issues.apache.org/jira/browse/HIVE-5664?src=confmacro) | [Drop cascade database fails when the db has any tables with indexes](https://issues.apache.org/jira/browse/HIVE-5664?src=confmacro) | Venki Korukanti | Venki Korukanti | Major | Closed | Fixed | Oct 28, 2013 | Feb 19, 2015 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-5631?src=confmacro) | 
[HIVE-5631](https://issues.apache.org/jira/browse/HIVE-5631?src=confmacro) | [Index creation on a skew table fails](https://issues.apache.org/jira/browse/HIVE-5631?src=confmacro) | Venki Korukanti | Venki Korukanti | Major | Closed | Fixed | Oct 23, 2013 | Feb 19, 2015 | | Showing 20 out of -[57 issues](https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project%20=%20HIVE%20AND%20Component%20in%20%28indexing%29&tempMax=1000&src=confmacro) - - - - +[57 issues](https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project%20=%20HIVE%20AND%20Component%20in%20%28indexing%29&tempMax=1000&src=confmacro) diff --git a/content/Development/desingdocs/links.md b/content/Development/desingdocs/links.md index 1780ab2d..126adce3 100644 --- a/content/Development/desingdocs/links.md +++ b/content/Development/desingdocs/links.md @@ -93,7 +93,7 @@ The idea here is to tag tables/partitions with the namespaces that they belong t * Capacity tracking and management will be more complex than using databases for this purpose. * Data migration is more complex since the data is not contained with the root folder of one database -None of these are insurmountable problems, but using databases to model namespaces is a cleaner approach. +None of these are insurmountable problems, but using databases to model namespaces is a cleaner approach. (Taking this idea further, a database in Hive could itself have been implemented using tags in a single global namespace which would have not been as elegant as the current implementation of a database being a first class concept in Hive.) @@ -101,9 +101,9 @@ None of these are insurmountable problems, but using databases to model namespac * The view would be a simple select * using Y.T syntax. It’s a degenerate case of view. * We would need a registry of all views which import tables/partitions from other databases for namespace accounting. 
This requires adding metadata to these views to distinguish them from other user-created views. -* It would be harder to single instance imports using views (same table/partitions imported twice into the same namespace). Views are too opaque. +* It would be harder to single instance imports using views (same table/partitions imported twice into the same namespace). Views are too opaque. -Using partitioned views: +Using partitioned views: By definition, there isn't a one-one mapping between a view partition and a table partition. In fact, hive today does not even know about this dependency between view partitions and table partitions. Partitioned views is just a metadata concept - it is not something that the query layer understands. For e.g: if a view V partitioned on ds had 2 partitions: 1 and 2, then a query like select … from V where ds = 3 may still give valid results if the ds=3 is satisfied by the table underlying V. This means that: * View metadata doesn’t stay in sync with source partitions (as partitions get added and dropped). The user has to explicitly do this, which won't work for our case. diff --git a/content/Development/desingdocs/listbucketing.md b/content/Development/desingdocs/listbucketing.md index 81d7bdd6..2f8c8116 100644 --- a/content/Development/desingdocs/listbucketing.md +++ b/content/Development/desingdocs/listbucketing.md @@ -27,11 +27,11 @@ Create a partition per value of 'x'. * `create table T(a,b,c, .......) partitioned by (ds, x);` * Advantages - + Existing Hive is good enough. + + Existing Hive is good enough. * Disadvantages - + HDFS scalability: Number of files in HDFS increases. - + HDFS scalability: Number of intermediate files in HDFS increases. For example if there are 1000 mappers and 1000 partitions, and each mapper gets at least 1 row for each key, we will end up creating 1 million intermediate files. - + Metastore scalability: Will the metastore scale with the number of partitions. 
+ + HDFS scalability: Number of files in HDFS increases. + + HDFS scalability: Number of intermediate files in HDFS increases. For example if there are 1000 mappers and 1000 partitions, and each mapper gets at least 1 row for each key, we will end up creating 1 million intermediate files. + + Metastore scalability: Will the metastore scale with the number of partitions. ## List Bucketing @@ -64,7 +64,7 @@ This approach can be extended to the scenario when there are more than one clust A query with all the clustering keys specified can be optimized easily. However, queries with some of the clustering keys specified: * + `select ... from T where x = 10;` - + `select ... from T where y = 'b';` + + `select ... from T where y = 'b';` can only be used to prune very few directories. It does not really matter if the prefix of the clustering keys is specified or not. For example for x=10, the Hive compiler can prune the file corresponding to (20, 'c'). And for y='b', the files corresponding to (10, 'a') and (20, 'c') can be pruned. Hashing for others does not really help, when the complete key is not specified. diff --git a/content/Development/desingdocs/llap.md b/content/Development/desingdocs/llap.md index 857cb62b..e1039990 100644 --- a/content/Development/desingdocs/llap.md +++ b/content/Development/desingdocs/llap.md @@ -23,7 +23,7 @@ Similar to the DataNode, LLAP daemons can be used by other applications as well, Last, but not least, fine-grained column-level access control – a key requirement for mainstream adoption of Hive – fits nicely into this model. -The diagram below shows an example execution with LLAP. Tez AM orchestrates overall execution. The initial stage of the query is pushed into LLAP. In the reduce stage, large shuffles are performed in separate containers. Multiple queries and applications can access LLAP concurrently. +The diagram below shows an example execution with LLAP. Tez AM orchestrates overall execution. 
The initial stage of the query is pushed into LLAP. In the reduce stage, large shuffles are performed in separate containers. Multiple queries and applications can access LLAP concurrently. ![](/attachments/62689557/65871225.png) @@ -83,8 +83,8 @@ LLAP servers are a natural place to enforce access control at a more fine-graine ## Monitoring -Configurations for LLAP monitoring are stored in resources.json, appConfig.json, metainfo.xml which are embedded into [templates.py](https://github.com/apache/hive/blob/master/llap-server/src/main/resources/templates.py) used by Slider.  - +Configurations for LLAP monitoring are stored in resources.json, appConfig.json, metainfo.xml which are embedded into [templates.py](https://github.com/apache/hive/blob/master/llap-server/src/main/resources/templates.py) used by Slider.  + LLAP Monitor Daemon runs on YARN container, similar to LLAP Daemon, and listens on the same port.   The LLAP Metrics Collection Server collects JMX metrics from all LLAP Daemons periodically.   The list of LLAP Daemons are extracted from the Zookeeper server which launched in the cluster.  @@ -159,19 +159,19 @@ Example usage. ``` ``` - -f,--findAppTimeout Amount of time(s) that the tool will sleep to wait for the YARN application to start. negative values=wait - forever, 0=Do not wait. default=20s - -H,--help Print help information - --hiveconf Use value for given property. Overridden by explicit parameters - -i,--refreshInterval Amount of time in seconds to wait until subsequent status checks in watch mode. Valid only for watch mode. - (Default 1s) - -n,--name LLAP cluster name - -o,--outputFile File to which output should be written (Default stdout) - -r,--runningNodesThreshold When watch mode is enabled (-w), wait until the specified threshold of nodes are running (Default 1.0 - which means 100% nodes are running) - -t,--watchTimeout Exit watch mode if the desired state is not attained until the specified timeout. 
(Default 300s)
- -w,--watch Watch mode waits until all LLAP daemons are running or subset of the nodes are running (threshold can be
- specified via -r option) (Default wait until all nodes are running)
+-f,--findAppTimeout Amount of time(s) that the tool will sleep to wait for the YARN application to start. negative values=wait
+ forever, 0=Do not wait. default=20s
+-H,--help Print help information
+ --hiveconf Use value for given property. Overridden by explicit parameters
+-i,--refreshInterval Amount of time in seconds to wait until subsequent status checks in watch mode. Valid only for watch mode.
+ (Default 1s)
+-n,--name LLAP cluster name
+-o,--outputFile File to which output should be written (Default stdout)
+-r,--runningNodesThreshold When watch mode is enabled (-w), wait until the specified threshold of nodes are running (Default 1.0
+ which means 100% nodes are running)
+-t,--watchTimeout Exit watch mode if the desired state is not attained until the specified timeout. (Default 300s)
+-w,--watch Watch mode waits until all LLAP daemons are running or subset of the nodes are running (threshold can be
+ specified via -r option) (Default wait until all nodes are running)
```

Version information

@@ -192,7 +192,3 @@ The watch and running nodes options were added in release 2.2.0 with [HIVE-15217

![](images/icons/bullet_blue.gif)

-
-
-
-
diff --git a/content/Development/desingdocs/locking.md b/content/Development/desingdocs/locking.md
index 284cb99e..4e61e560 100644
--- a/content/Development/desingdocs/locking.md
+++ b/content/Development/desingdocs/locking.md
@@ -24,7 +24,7 @@ The compatibility matrix is as follows:

| **Lock** **Compatibility**  | **Existing Lock** | | **S**  | **X** |
| **Requested** **Lock** | **S** | **True** | **False** |
-| **X** | **False** | **False** |
+| **X** | **False** | **False** |

For some operations, locks are hierarchical in nature -- for example for some partition operations, the table is also locked (to make sure that the
table cannot be dropped while a new partition is being created). @@ -38,25 +38,25 @@ A 'S' lock on table and relevant partition is acquired when a read is being perf Based on this, the lock acquired for an operation is as follows: -| **Hive Command** | **Locks Acquired** | -| --- | --- | -| **select .. T1 partition P1** | **S on T1, T1.P1** | -| **insert into T2(partition P2) select .. T1 partition P1** | **S on T2, T1, T1.P1 and X on T2.P2** | +| **Hive Command** | **Locks Acquired** | +|-------------------------------------------------------------|----------------------------------------------| +| **select .. T1 partition P1** | **S on T1, T1.P1** | +| **insert into T2(partition P2) select .. T1 partition P1** | **S on T2, T1, T1.P1 and X on T2.P2** | | **insert into T2(partition P.Q) select .. T1 partition P1** | **S on T2, T2.P, T1, T1.P1 and X on T2.P.Q** | -| **alter table T1 rename T2** | **X on T1** | -| **alter table T1 add cols** | **X on T1** | -| **alter table T1 replace cols** | **X on T1** | -| **alter table T1 change cols** | **X on T1** | -| **alter table T1 **concatenate**** | **X on T1** | -| **alter table T1 add partition P1** | **S on T1, X on T1.P1** | -| **alter table T1 drop partition P1** | **S on T1, X on T1.P1** | -| **alter table T1 touch partition P1** | **S on T1, X on T1.P1** | -| **alter table T1 set serdeproperties** | **S on T1** | -| **alter table T1 set serializer** | **S on T1** | -| **alter table T1 set file format** | **S on T1** | -| **alter table T1 set tblproperties** | **X on T1** | -| **alter table T1 partition P1 concatenate** | **X on T1.P1** | -| **drop table T1** | **X on T1** | +| **alter table T1 rename T2** | **X on T1** | +| **alter table T1 add cols** | **X on T1** | +| **alter table T1 replace cols** | **X on T1** | +| **alter table T1 change cols** | **X on T1** | +| **alter table T1 **concatenate**** | **X on T1** | +| **alter table T1 add partition P1** | **S on T1, X on T1.P1** | +| **alter table T1 drop 
partition P1** | **S on T1, X on T1.P1** | +| **alter table T1 touch partition P1** | **S on T1, X on T1.P1** | +| **alter table T1 set serdeproperties** | **S on T1** | +| **alter table T1 set serializer** | **S on T1** | +| **alter table T1 set file format** | **S on T1** | +| **alter table T1 set tblproperties** | **X on T1** | +| **alter table T1 partition P1 concatenate** | **X on T1.P1** | +| **drop table T1** | **X on T1** | In order to avoid deadlocks, a very simple scheme is proposed here. All the objects to be locked are sorted lexicographically, and the required mode lock is acquired. Note that in some cases, the list of objects may not be known -- for example in case of dynamic partitions, the list of partitions being modified is not known at compile time -- so, the list is generated conservatively. Since the number of partitions may not be known, an exclusive lock is supposed to be taken (but currently not due to [HIVE-3509](https://issues.apache.org/jira/browse/HIVE-3509) bug) on the table, or the prefix that is known. @@ -108,7 +108,3 @@ Hive [0.13.0](https://issues.apache.org/jira/browse/HIVE-5317) adds transactions * [ACID and Transactions in Hive]({{< ref "hive-transactions" >}}) * [Lock Manager]({{< ref "#lock-manager" >}}) - - - - diff --git a/content/Development/desingdocs/mapjoin-and-partition-pruning.md b/content/Development/desingdocs/mapjoin-and-partition-pruning.md index 63ddf5e6..f6bf1a85 100644 --- a/content/Development/desingdocs/mapjoin-and-partition-pruning.md +++ b/content/Development/desingdocs/mapjoin-and-partition-pruning.md @@ -11,7 +11,7 @@ In Hive, Map-Join is a technique that materializes data for all tables involved ### Problem -Map-Join predicates where the joining columns from big table (streamed table) are partition columns and corresponding columns from small table is not partitioned, the join would not prune the unnecessary partitions from big table. 
Since data for all small tables is materialized before big table is streamed, theoretically it would be possible to prune the unnecessary partitions from big table.
+For Map-Join predicates where the joining columns from the big (streamed) table are partition columns but the corresponding small-table columns are not partitioned, the join does not prune the unnecessary partitions from the big table. Since data for all small tables is materialized before the big table is streamed, it would theoretically be possible to prune those partitions.

HIVE-5119 has been created to track this feature improvement.

@@ -21,7 +21,7 @@ Figure out the set of values from all small tables for each join column from big

### Possible Extensions

-• If LHS and RHS of join predicate are partitioned then for tables from inner side, Partitions can be decided statically at compile time. 
+• If LHS and RHS of join predicate are partitioned then for tables from inner side, partitions can be decided statically at compile time.

• Even if the Big Table columns are not partitioned, the set of values generated from small tables could be pushed down as a predicate on the big table. Storage Handlers like ORC, which can handle predicate push down, could take advantage of this.

@@ -31,39 +31,39 @@ This optimization has compile time and run/execution time pieces to it. Compile

### Compile Time

-1. Identify Map Join Operators that can participate in partition pruning. 
+1. Identify Map Join Operators that can participate in partition pruning.

2. For each Map-Join operator in the task, identify columns from big table that can participate in the partition pruning.
-    Columns that are identified from big table has following characteristics: +    Columns that are identified from big table has following characteristics: -      • They are part of join condition +      • They are part of join condition -      • Big table is on the inner side of the join +      • Big table is on the inner side of the join -      • Columns are not involved in any functions in the join conditions +      • Columns are not involved in any functions in the join conditions -      • Column value is not mutated (no function) before value reaches join condition from Table Scan. +      • Column value is not mutated (no function) before value reaches join condition from Table Scan.       • Column is a partition column. -3. Identify small tables and columns from small table that can participate in partition pruning. +3. Identify small tables and columns from small table that can participate in partition pruning. -    Columns that are identified from small table has following characteristics: +    Columns that are identified from small table has following characteristics: -      • Column is the other side of predicate in the join condition and Big Table column is identified as a target for partition pruning. +      • Column is the other side of predicate in the join condition and Big Table column is identified as a target for partition pruning. -      • Column is not part of any function on the join predicate. +      • Column is not part of any function on the join predicate.       • Column is part of join in which big table is on the outer side. 4. Modify MapRedLocalTask to assemble set of values for each of the column from small tables that participate in partition pruning and to generate PartitionDesc for big table. -**NOTE:** +**NOTE:** -• This requires adding a terminal operator to the operator DAG in the MapRedLocalTask. +• This requires adding a terminal operator to the operator DAG in the MapRedLocalTask. 
-• Note that the new terminal operator would get tuples from all small tables of interest (just like HashTableSink Operator). +• Note that the new terminal operator would get tuples from all small tables of interest (just like HashTableSink Operator). • Cascading Map-Join operators (joining on different keys in the same task using same big table) would still use the same terminal operator in the MapRedLocalTask. @@ -71,35 +71,35 @@ This optimization has compile time and run/execution time pieces to it. Compile 1. As tuples flow in to the new terminal operator in MapRedLocal task, it would extract columns of interest and would add it to a set of values for that column. -2. When close is called on the new terminal operator it would generate partitions of big table by consulting Meta Store (using values generated at #1). +2. When close is called on the new terminal operator it would generate partitions of big table by consulting Meta Store (using values generated at #1). -    **NOTE:** +    **NOTE:**     • Meta Store would need to answer queries with in clauses. Ex: give me all partitions for Table R where column x in (1,2,3) and column y in (5,6,7).     • In case of cascading MapJoinOperators the big table would be pruned based on multiple keys (& hence set generation needs to handle it). -3. Modify the PartitionDesc for BigTable in the MapRedTask with the list from #2. +3. Modify the PartitionDesc for BigTable in the MapRedTask with the list from #2. -**NOTE:** +**NOTE:** -    • PartitionDesc from #2 should be merged with existing PartitionDesc for the Big Table by finding the intersection. +    • PartitionDesc from #2 should be merged with existing PartitionDesc for the Big Table by finding the intersection.     • This modification of partition descriptor is designed as a prelaunch activity on each task. Task in turn would call prelaunch on associated work. Work may keep an ordered list of operators on which prelaunch needs to be called. 
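The run-time flow described above (accumulate a set of distinct join-key values as small-table tuples arrive, then prune the big table's partition list on close) can be sketched as follows. This is only an illustrative sketch: `PartitionPrunerSink` and its methods are hypothetical names for the proposed terminal operator, not existing Hive classes, and the metastore IN-clause lookup is stood in for by an in-memory filter.

```python
# Hypothetical sketch of the proposed terminal operator, not an existing Hive
# class. It mimics the run-time steps: accumulate a value set per join-key
# column as small-table tuples flow in, then prune the big table's partitions
# on close (a real implementation would ask the metastore with an IN clause).

class PartitionPrunerSink:
    def __init__(self, key_columns):
        self.key_columns = key_columns
        # one set of observed values per big-table partition column
        self.value_sets = {col: set() for col in key_columns}

    def process(self, row):
        # called for every small-table tuple flowing through the local task
        for col in self.key_columns:
            self.value_sets[col].add(row[col])

    def close(self, partitions):
        # keep only the big-table partitions whose key values were observed
        return [p for p in partitions
                if all(p[col] in self.value_sets[col] for col in self.key_columns)]

sink = PartitionPrunerSink(["x"])
for row in [{"x": 1}, {"x": 3}]:                 # small-table tuples
    sink.process(row)
pruned = sink.close([{"x": 1, "path": "/t/x=1"},
                     {"x": 2, "path": "/t/x=2"},
                     {"x": 3, "path": "/t/x=3"}])
print([p["path"] for p in pruned])               # prints ['/t/x=1', '/t/x=3']
```

The merge-by-intersection note above corresponds to intersecting this pruned list with any PartitionDesc already computed for the big table.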
-**Assumptions**: +**Assumptions**: -• In HIVE currently Join predicates can only include conjunctions. +• In HIVE currently Join predicates can only include conjunctions. • Hive only supports Equijoin ### Pseudo Code -1. Walk through Task DAG looking for MapredTask. Perform #2 - #6 for each such MapRedTask. +1. Walk through Task DAG looking for MapredTask. Perform #2 - #6 for each such MapRedTask. -2. Skip Task if it contains backup join plan (i.e if not MAPJOIN_ONLY_NOBACKUP or if backupTask is not null). +2. Skip Task if it contains backup join plan (i.e if not MAPJOIN_ONLY_NOBACKUP or if backupTask is not null). -**NOTE:** +**NOTE:**     This is aggressive; in my limited exposure to the hive code, it seemed like conditional tasks are currently set only for joins. @@ -107,41 +107,41 @@ This optimization has compile time and run/execution time pieces to it. Compile 4. Flag a Map-Join Operator as candidate for Partition Pruning -   4.1 Collect small tables that might participate in Big Table pruning +   4.1 Collect small tables that might participate in Big Table pruning -        a. Walk the join conditions. If Join Type is “outer” then check if big-table is on the outer side. If so then bailout. +        a. Walk the join conditions. If Join Type is “outer” then check if big-table is on the outer side. If so then bailout.         b. If big-table is on inner side then add the position of small table in to the set. -  4.2 If set from #4.1 is empty then bailout. Otherwise collect join keys from big table which is not wrapped in a functions +  4.2 If set from #4.1 is empty then bailout. 
Otherwise collect join keys from big table which is not wrapped in a functions -        a) Get the join key from “MapJoinDesc.getKeys().get(MapJoinDesc .getPosBigTable)” +        a) Get the join key from “MapJoinDesc.getKeys().get(MapJoinDesc .getPosBigTable)”         b) Walk through list of “ExpressionNodeDesc”; if “ExprNodeDesc” is of type “ExprNodeGenericFuncDesc” then check if any of partition pruner candidate key is contained with in it (“ExprNodeDescUtils.containsPredicate”). If any candidate key is contained within the function then remove it from the partition-pruner-bigtable-candidate list.        c) Create a pair of “ExprNodeColumnDesc position Integer within the list from #b” and “ExprNodecolumnDesc” and add to partition-pruner-bigtable-candidate list. -4.3 If partition-pruner-bigtable-candidate list is empty then bailout. Otherwise find join keys from #4.1 that is not wrapped in function using partition pruner candidate set. +4.3 If partition-pruner-bigtable-candidate list is empty then bailout. Otherwise find join keys from #4.1 that is not wrapped in function using partition pruner candidate set. -      a) Walk the set from 4.1 +      a) Walk the set from 4.1 -      b) Get the join key for each element from 4.1 +      b) Get the join key for each element from 4.1 -      c) Walk the join key list from #b checking if any of it is a function +      c) Walk the join key list from #b checking if any of it is a function -      d) If any of the element from #c is a function then check if it contains any element from partition-pruner-bigtable-candidate list. If yes then remove that element from partition-pruner-bigtable-candidate List and set-generation-key-map. +      d) If any of the element from #c is a function then check if it contains any element from partition-pruner-bigtable-candidate list. If yes then remove that element from partition-pruner-bigtable-candidate List and set-generation-key-map. 
-      e) Create a pair of table position and join key element from #d. +      e) Create a pair of table position and join key element from #d.       f) Add element to set-generation-key-map where key is the position of element within the partition-pruner-bigtable-candidate list and value is element from #e. -4.4 If partition-pruner-bigtable-candidate set is empty then bail out. Otherwise find BigTable Columns from partition-pruner-bigtable-candidate set that is partitioned. +4.4 If partition-pruner-bigtable-candidate set is empty then bail out. Otherwise find BigTable Columns from partition-pruner-bigtable-candidate set that is partitioned. -     a) Construct list of “ExprNodeDesc” from the set of #4.2 +     a) Construct list of “ExprNodeDesc” from the set of #4.2 -     b) Find out the root table column descriptors for #a (“ExprNodeDescUtils.backtrack”) +     b) Find out the root table column descriptors for #a (“ExprNodeDescUtils.backtrack”) -     c) From Hive get Table metadata for big table +     c) From Hive get Table metadata for big table      d) Walk through the list from #b & check with Table meta data to see if any of those columns is partitioned (“Table.isPartitionKey”). If column is not partition key then remove it from the partition pruner candidate list. @@ -149,19 +149,19 @@ This optimization has compile time and run/execution time pieces to it. Compile 4.6 If partition-pruner-bigtable-candidate list from #4.5 is empty then bail out. Otherwise add partition-pruner-bigtable-candidate list and set-generation-key-map from #4.5 to the existing list of values in the PhysicalCtx. -    a) Create a pair of partition-pruner-bigtable-candidate list & set-generation-key-map. +    a) Create a pair of partition-pruner-bigtable-candidate list & set-generation-key-map.     b) Add it to the existing list in the physical context (this is to handle cascading mapjoin operators in the same MapRedTask) -5. 
If partition-pruner-bigtable-candidate set and set-generation-keys are non empty then Modify corresponding LocalMRTask to introduce the new PartitionPrunerSink Operator (if not already). +5. If partition-pruner-bigtable-candidate set and set-generation-keys are non empty then Modify corresponding LocalMRTask to introduce the new PartitionPrunerSink Operator (if not already). -   a) Add to Physical Context a map of MapJoinOperator – HashTableSink Operator. This needs to happen during HashTableSink generation time. +   a) Add to Physical Context a map of MapJoinOperator – HashTableSink Operator. This needs to happen during HashTableSink generation time. -   b) From physical context get the HashTableSinkOperator corresponding to the MapJoinOperator. +   b) From physical context get the HashTableSinkOperator corresponding to the MapJoinOperator. -   c) From all the parents of MapJoin Operator identify the ones representing small tables in the set-generation-key-map. +   c) From all the parents of MapJoin Operator identify the ones representing small tables in the set-generation-key-map. -   d) Create a new PartitionDescGenSinkOp (with set-generation-key-map) +   d) Create a new PartitionDescGenSinkOp (with set-generation-key-map)    e) Add it as child of elements from #c. @@ -169,15 +169,15 @@ This optimization has compile time and run/execution time pieces to it. Compile   Two different MapRedTask (that contains MapJoin Operators) would result in two different MapRedLocalTask even if they share the same set of small tables. -  Implementation of PartitionDescGenSink +  Implementation of PartitionDescGenSink -     a) A map is maintained between BigTable column and HashSet. +     a) A map is maintained between BigTable column and HashSet. -     b) From each tuple extract values corresponding to each column with in set-generation-key. +     b) From each tuple extract values corresponding to each column with in set-generation-key. 
-     c) Add these to a HashSet +     c) Add these to a HashSet -     d) On Close of PartitionDescGenSink consult Metadata to get partitions for the key columns corresponding. This requires potential enhancements to Hive Metadata handling to provide an api “Get all partitions where column1 has set1 of values, or column2 has set2 of values. +     d) On Close of PartitionDescGenSink consult Metadata to get partitions for the key columns corresponding. This requires potential enhancements to Hive Metadata handling to provide an api “Get all partitions where column1 has set1 of values, or column2 has set2 of values.      e) Write the partition info to file. The file name & location needs to be finalized. @@ -185,13 +185,9 @@ This optimization has compile time and run/execution time pieces to it. Compile 7. At execution time call prelaunch on each task. Task will call prelaunch on the work. Work will call prelaunch on the operators in the list in order. For TableScan, prelaunch will result in reading the PartitionDescriptor info and would find intersection of existing PartitionDesc and the new list produced by PartitionDescGenSink. Partition state info kept in MapWork would be updated with the new partitons (“MapWork.pathToAliases”, “MapWork.aliasToPartnInfo”, “MapWork.pathToPartitionInfo”). This would be then picked up by “ExecDriver.execute” to setup input paths for InputFormat. -**NOTE:** +**NOTE:** -   • In Mapwork, we may need to maintain a map of Table alias to List. One choice is to introduce a new “addPathToPartitionInfo” method and switch current callers to use the new convenience method; this method then could maintain a Map of table alias to list of PartitionDesc. +   • In Mapwork, we may need to maintain a map of Table alias to List. One choice is to introduce a new “addPathToPartitionInfo” method and switch current callers to use the new convenience method; this method then could maintain a Map of table alias to list of PartitionDesc.   
• Current design assumes the partition descriptor info generated by Local Task would be communicated to MapRed Task through files. This is obviously sub optimal. As an enhancement different mechanisms can be brought in to pass this info. - - - - diff --git a/content/Development/desingdocs/mapjoinoptimization.md b/content/Development/desingdocs/mapjoinoptimization.md index 74ed9438..9d41cd6f 100644 --- a/content/Development/desingdocs/mapjoinoptimization.md +++ b/content/Development/desingdocs/mapjoinoptimization.md @@ -51,9 +51,9 @@ As shown in Table1, the optimized Map Join will be 12 ~ 26 times faster than the Since map join is faster than the common join, it would be better to run the map join whenever possible. Previously, Hive users need to give a hint in the query to assign which table the small table is. For example, ***select /****+mapjoin(a)**/ * from src1 x join src2y on x.key=y.key***; It is not a good way for user experience and query performance, because sometimes user may give a wrong hint and also users may not give any hints. It would be much better to convert the Common Join into Map Join without users' hint. - ([HIVE-1642](http://issues.apache.org/jira/browse/HIVE-1642)) has solved the problem by converting the Common Join into Map Join automatically. For the Map Join, the query processor should know which input table the big table is. The other input tables will be recognize as the small tables during the execution stage and these tables need to be held in the memory. However, in general, the query processor has no idea of input file size during compiling time (even with statistics) because some of the table may be intermediate tables generated from sub queries. So the query processor can only figure out the input file size during the execution time. +([HIVE-1642](http://issues.apache.org/jira/browse/HIVE-1642)) has solved the problem by converting the Common Join into Map Join automatically. 
For the Map Join, the query processor should know which input table the big table is. The other input tables will be recognized as the small tables during the execution stage and these tables need to be held in the memory. However, in general, the query processor has no idea of input file size at compile time (even with statistics) because some of the tables may be intermediate tables generated from sub-queries. So the query processor can only figure out the input file size during execution time.

-Right now, users need to enable this feature by **set hive.auto.convert.join = true;** 
+Right now, users need to enable this feature by **set hive.auto.convert.join = true;**

This would become the default in Hive 0.11 with ([HIVE-3297](http://issues.apache.org/jira/browse/HIVE-3297))

@@ -93,7 +93,3 @@ For the previous common join, the experiment only calculates the average time of

From the result, if the new common join can be converted into map join, it will get 57% ~ 163% performance improvement.

-
-
-
-
diff --git a/content/Development/desingdocs/metastore-tlp-proposal.md b/content/Development/desingdocs/metastore-tlp-proposal.md
index 414da3e3..96a57e59 100644
--- a/content/Development/desingdocs/metastore-tlp-proposal.md
+++ b/content/Development/desingdocs/metastore-tlp-proposal.md
@@ -43,7 +43,7 @@ Moving the code from Hive into a new project is not straightforward and will tak

1. A new TLP is established.  As mentioned above, any existing Hive PMC members will be welcome to join the PMC, and any existing Hive committers will be granted committership in the new project.
2. Hive begins the process of detangling the metastore code inside the Hive project.  This will be done inside Hive to avoid a time where the code is in both Hive and the new project that would require double patching of any new features or bugs.
-In order to enable the new project to begin adding layers around the core metastore and make releases, Hive can make source-only releases of only the metastore code during this interim period, similar to how the storage-api is released now.  The new project can then depend on those releases. + In order to enable the new project to begin adding layers around the core metastore and make releases, Hive can make source-only releases of only the metastore code during this interim period, similar to how the storage-api is released now.  The new project can then depend on those releases. 3. Once the detangling is complete and Hive is satisfied that the result works, the code will be moved from Hive to the new project. There are many technical questions of how to separate out the code.  These mainly center around which pieces of code should be moved into the new project, and whether the new project continues to depend on Hive’s storage-api (as ORC does today) or whether it copies any code that both it and Hive require (such as parts of the shim layer) in order to avoid any Hive dependencies.  Also there are places where metastore "calls back" into QL via reflection (e.g. partition expression evaluation).  We will need to determine how to continue this without pulling a dependency on all of Hive into the new project.  Discussions and decisions on this will happen throughout the process via the normal methods. @@ -63,7 +63,3 @@ The following have been suggested as a name for this project:   - - - - diff --git a/content/Development/desingdocs/outerjoinbehavior.md b/content/Development/desingdocs/outerjoinbehavior.md index 842cf610..39a91a33 100644 --- a/content/Development/desingdocs/outerjoinbehavior.md +++ b/content/Development/desingdocs/outerjoinbehavior.md @@ -11,12 +11,12 @@ This document is based on a writeup of [DB2 Outer Join Behavior](http://www.ibm. 
## Definitions -| | | -| --- | --- | -| **Preserved Row table** | The table in an Outer Join that must return all rows. For left outer joins this is the *Left* table, for right outer joins it is the *Right* table, and for full outer joins both tables are *Preserved Row* tables. | -| **Null Supplying table** | This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also *Null Supplying* tables. | -| **During Join predicate** | A predicate that is in the JOIN **ON** clause. For example, in '`R1 join R2 on R1.x = 5`' the predicate '`R1.x = 5`' is a *During Join predicate*. | -| **After Join predicate** | A predicate that is in the WHERE clause. | +| | | +|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **Preserved Row table** | The table in an Outer Join that must return all rows. For left outer joins this is the *Left* table, for right outer joins it is the *Right* table, and for full outer joins both tables are *Preserved Row* tables. | +| **Null Supplying table** | This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also *Null Supplying* tables. | +| **During Join predicate** | A predicate that is in the JOIN **ON** clause. For example, in '`R1 join R2 on R1.x = 5`' the predicate '`R1.x = 5`' is a *During Join predicate*. | +| **After Join predicate** | A predicate that is in the WHERE clause. 
| ## Predicate Pushdown Rules @@ -27,10 +27,10 @@ The logic can be summarized by these two rules: This is captured in the following table: -|   | Preserved Row Table | Null Supplying Table | -| --- | --- | --- | -| Join Predicate | Case J1: Not Pushed | Case J2: Pushed | -| Where Predicate | Case W1: Pushed | Case W2: Not Pushed | +|   | Preserved Row Table | Null Supplying Table | +|------------------|----------------------|----------------------| +| Join Predicate | Case J1: Not Pushed | Case J2: Pushed | +| Where Predicate | Case W1: Pushed | Case W2: Not Pushed | See [Examples]({{< ref "#examples" >}}) below for illustrations of cases J1, J2, W1, and W2. @@ -38,7 +38,7 @@ See [Examples]({{< ref "#examples" >}}) below for illustrations of cases J1, J2, Hive enforces the rules by these methods in the SemanticAnalyzer and JoinPPD classes: -**Rule 1:** During **QBJoinTree** construction in Plan Gen, the `parseJoinCondition()` logic applies this rule. +**Rule 1:** During **QBJoinTree** construction in Plan Gen, the `parseJoinCondition()` logic applies this rule. **Rule 2:** During **JoinPPD** (Join Predicate PushDown) the `getQualifiedAliases()` logic applies this rule. @@ -303,7 +303,3 @@ STAGE PLANS: ``` - - - - diff --git a/content/Development/desingdocs/partitionedviews.md b/content/Development/desingdocs/partitionedviews.md index bb06d3f0..1faf0e95 100644 --- a/content/Development/desingdocs/partitionedviews.md +++ b/content/Development/desingdocs/partitionedviews.md @@ -76,7 +76,3 @@ and then capturing the table/partition inputs for this query and passing them on This allows applications to track the dependencies themselves. In the future, Hive will automatically populate these dependencies into the metastore as part of [HIVE-1073](https://issues.apache.org/jira/browse/HIVE-1073).
- - - - diff --git a/content/Development/desingdocs/query-results-caching.md b/content/Development/desingdocs/query-results-caching.md index 075769c9..a1a6500f 100644 --- a/content/Development/desingdocs/query-results-caching.md +++ b/content/Development/desingdocs/query-results-caching.md @@ -37,7 +37,6 @@ During query execution: * If the results cache can be used for this query: + The query will simply be the FetchTask reading from the cached results directory. No cluster tasks will be required. - * If the results cache cannot be used: + Run the cluster tasks as normal + Check if the query results that have been computed are eligible to add to the results cache. @@ -86,6 +85,7 @@ Currently each Hive instance has its own separate results cache. Originally the  Other conditions:  * If the query has no cluster tasks (fetch-only query), no need to cache + ### Saving query results from getting deleted by query cleanup To keep the query results for possible re-use, we need to make sure that the results directory is not deleted as part of query cleanup. Note that the Driver Context cleans up not only the query results directory but the parent directories of the results directory, which may complicate trying to save the query results directory while also performing the rest of the query cleanup. @@ -105,13 +105,13 @@ The query rewrites that occur for Hive masking/filtering are done on the AST (Se  1. Allow entries in the results cache to be valid for only a configured length of time. This is a simpler implementation which does not have to rely on detecting when the underlying tables of a cached query result have been changed. The disadvantage is that when the underlying tables are changed, any stale results would be served by the results cache until the cached result expires. 2. Add a command to clear the results cache. This can give the user a chance to prevent stale results from being served by the cache if they know that a table has been updated.
The disadvantage is that this requires manual intervention by the user to prevent stale results from being served. 3. Expire results cache entries if there are updates to the underlying tables. It may be possible to reuse the same mechanism that Materialized Views use to determine if any of the underlying tables of a cached query have been modified. + ### Cleanup of cache directories Each Hive instance can keep track of its cached results directories, and set them to be deleted on process exit. Under normal conditions this may be able to take care of cache directory cleanup, but in the case that the Hive process is terminated without a chance to perform graceful cleanup, these directories may still be left around. - +Hive also has a cleanup mechanism for scratch directories (ClearDanglingScratchDir). It may be possible to reuse this for results cache cleanup. This cleanup works by creating a lockfile in the directory, and keeping the file open for the duration of the Hive instance. The cleanup thread will not delete the directory as long as the lock file is held by the Hive instance. This mechanism would work as long as each Hive instance is responsible for its own results cache directory (which is the current plan), as opposed to having a cache that is shared among different Hive instances. -Hive also has a cleanup mechanism for scratch directories (ClearDanglingScratchDir). It may be possible to reuse this for results cache cleanup. This cleanup works by creating a lockfile in the directory, and keeping the file open for the duration of the Hive instance. The cleanup thread will not delete the directory as long as the lock file is held by the Hive instance. This mechanism would work as long as each Hive instance is responsible for its own results cache directory (which is the current plan), as opposed to having a cache that is shared among different Hive instances.
### Table Locks At the start of query execution the Hive Driver will acquire locks for all tables being queried. This behavior should still remain even for queries that are using the cache - the read lock will prevent other users from being able to write to the underlying tables and invalidating the cache at the time the results cache is being checked for the query results. @@ -120,7 +120,3 @@ At the start of query execution the Hive Driver will acquire locks for all table   - - - - diff --git a/content/Development/desingdocs/security.md b/content/Development/desingdocs/security.md index 706493c8..b2f70eec 100644 --- a/content/Development/desingdocs/security.md +++ b/content/Development/desingdocs/security.md @@ -21,7 +21,3 @@ The links below refer to the [original Hive authorization mode]({{< ref "hive-de Note that Howl was the precursor to [HCatalog]({{< ref "hcatalog-usinghcat" >}}). - - - - diff --git a/content/Development/desingdocs/skewed-join-optimization.md b/content/Development/desingdocs/skewed-join-optimization.md index 95d39952..65bd08a0 100644 --- a/content/Development/desingdocs/skewed-join-optimization.md +++ b/content/Development/desingdocs/skewed-join-optimization.md @@ -12,7 +12,7 @@ date: 2024-12-12 A join of 2 large data tables is done by a set of MapReduce jobs which first sorts the tables based on the join key and then joins them. The Mapper gives all rows with a particular key to the same Reducer. e.g., Suppose we have table A with a key column, "id" which has values 1, 2, 3 and 4, and table B with a similar column, which has values 1, 2 and 3. - We want to do a join corresponding to the following query +We want to do a join corresponding to the following query * select A.id from A join B on A.id = B.id @@ -28,11 +28,11 @@ Do two separate queries The first query will not have any skew, so all the Reducers will finish at roughly the same time. If we assume that B has only a few rows with B.id = 1, then it will fit into memory.
So the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join can be done by the Mapper itself and the data do not have to go to a Reducer. The partial results of the two queries can then be merged to get the final results. * Advantages - + If a small number of skewed keys make up for a significant percentage of the data, they will not become bottlenecks. + + If a small number of skewed keys make up for a significant percentage of the data, they will not become bottlenecks. * Disadvantages - + The tables A and B have to be read and processed twice. - + Because of the partial results, the results also have to be read and written twice. - + The user needs to be aware of the skew in the data and manually do the above process. + + The tables A and B have to be read and processed twice. + + Because of the partial results, the results also have to be read and written twice. + + The user needs to be aware of the skew in the data and manually do the above process. We can improve this further by trying to reduce the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following: diff --git a/content/Development/desingdocs/spatial-queries.md b/content/Development/desingdocs/spatial-queries.md index 7dbcbd1b..b60aa0b2 100644 --- a/content/Development/desingdocs/spatial-queries.md +++ b/content/Development/desingdocs/spatial-queries.md @@ -88,7 +88,3 @@ Changes are mostly at the language, and query optimization layer. The RESQUE library will be deployed as a shared library, and a path to this library will be provided to Hive to invoke functions in the library via the TRANSFORM mechanism. 
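The improved skewed-key handling described in the skewed-join-optimization.md hunk above (read B once, hold its rows with the skewed key in an in-memory hash table, join those map-side, and send everything else down the normal Reducer path) can be sketched roughly as follows. This is an illustrative sketch only, not Hive code: the function and variable names are invented, and the shuffle to the Reducer is stubbed out as a plain list.

```python
# Illustrative sketch: rows are dicts with an "id" join key, matching the
# example query "select A.id from A join B on A.id = B.id" with skewed key 1.
def skewed_key_join(rows_a, rows_b, skewed_key=1):
    # Read B once and keep only the rows with the skewed key in memory.
    small_side = [b for b in rows_b if b["id"] == skewed_key]
    map_side_output = []   # joined in the mapper, no Reducer involved
    sent_to_reducer = []   # stand-in for the normal sort/shuffle path
    for a in rows_a:       # a single pass over A, as a mapper would do
        if a["id"] == skewed_key:
            # Emit one output row per matching B row held in memory.
            map_side_output.extend(a["id"] for _ in small_side)
        else:
            sent_to_reducer.append(a)
    return map_side_output, sent_to_reducer
```

With A containing ids 1, 1, 2, 4 and B containing ids 1, 2, 3, the two skewed rows are joined entirely in the mapper and only the rows with ids 2 and 4 continue to the Reducer path.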
- - - - diff --git a/content/Development/desingdocs/statsdev.md b/content/Development/desingdocs/statsdev.md index 79848979..d8122b62 100644 --- a/content/Development/desingdocs/statsdev.md +++ b/content/Development/desingdocs/statsdev.md @@ -33,20 +33,16 @@ The second milestone was to support column level statistics. See [Column Statist Supported column stats are: -| **BooleanColumnStatsData** | **DoubleColumnStatsData** | **LongColumnStatsData** | **StringColumnStatsData** | **BinaryColumnStatsData** | **DecimalColumnStatsData** | **Date** | **DateColumnStatsData** | **Timestamp** | **TimestampColumnStatsData** | **union ColumnStatisticsData** | -| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | -| 1: required i64 numTrues, | 1: optional double lowValue, | 1: optional i64 lowValue, | 1: required i64 maxColLen, | 1: required i64 maxColLen, | 1: optional Decimal lowValue, | 1: required i64 daysSinceEpoch | 1: optional Date lowValue, | 1: required i64 secondsSinceEpoch | 1: optional Timestamp lowValue, | 1: BooleanColumnStatsData booleanStats, | -| 2: required i64 numFalses, | 2: optional double highValue, | 2: optional i64 highValue, | 2: required double avgColLen, | 2: required double avgColLen, | 2: optional Decimal highValue, | | 2: optional Date highValue, | | 2: optional Timestamp highValue, | 2: LongColumnStatsData longStats, | -| 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | | 3: required i64 numNulls, | | 3: required i64 numNulls, | 3: DoubleColumnStatsData doubleStats, | -| 4: optional binary bitVectors | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: optional binary bitVectors | 4: required i64 numDVs, | | 4: required i64 numDVs, | | 4: required i64 numDVs, | 4: StringColumnStatsData stringStats, | -| | 5: optional binary bitVectors, | 5: optional binary bitVectors, | 5: optional 
binary bitVectors | | 5: optional binary bitVectors, | | 5: optional binary bitVectors, | | 5: optional binary bitVectors, | 5: BinaryColumnStatsData binaryStats, | -| | 6: optional binary histogram | 6: optional binary histogram | | | 6: optional binary histogram | | 6: optional binary histogram | | 6: optional binary histogram | 6: DecimalColumnStatsData decimalStats, | -| | | | | | | | | | | 7: DateColumnStatsData dateStats, | -| | | | | | | | | | | 8: TimestampColumnStatsData timestampStats | - - - - +| **BooleanColumnStatsData** | **DoubleColumnStatsData** | **LongColumnStatsData** | **StringColumnStatsData** | **BinaryColumnStatsData** | **DecimalColumnStatsData** | **Date** | **DateColumnStatsData** | **Timestamp** | **TimestampColumnStatsData** | **union ColumnStatisticsData** | +|-------------------------------|--------------------------------|--------------------------------|-------------------------------|-------------------------------|--------------------------------|--------------------------------|--------------------------------|-----------------------------------|----------------------------------|--------------------------------------------| +| 1: required i64 numTrues, | 1: optional double lowValue, | 1: optional i64 lowValue, | 1: required i64 maxColLen, | 1: required i64 maxColLen, | 1: optional Decimal lowValue, | 1: required i64 daysSinceEpoch | 1: optional Date lowValue, | 1: required i64 secondsSinceEpoch | 1: optional Timestamp lowValue, | 1: BooleanColumnStatsData booleanStats, | +| 2: required i64 numFalses, | 2: optional double highValue, | 2: optional i64 highValue, | 2: required double avgColLen, | 2: required double avgColLen, | 2: optional Decimal highValue, | | 2: optional Date highValue, | | 2: optional Timestamp highValue, | 2: LongColumnStatsData longStats, | +| 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | | 
3: required i64 numNulls, | | 3: required i64 numNulls, | 3: DoubleColumnStatsData doubleStats, | +| 4: optional binary bitVectors | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: optional binary bitVectors | 4: required i64 numDVs, | | 4: required i64 numDVs, | | 4: required i64 numDVs, | 4: StringColumnStatsData stringStats, | +| | 5: optional binary bitVectors, | 5: optional binary bitVectors, | 5: optional binary bitVectors | | 5: optional binary bitVectors, | | 5: optional binary bitVectors, | | 5: optional binary bitVectors, | 5: BinaryColumnStatsData binaryStats, | +| | 6: optional binary histogram | 6: optional binary histogram | | | 6: optional binary histogram | | 6: optional binary histogram | | 6: optional binary histogram | 6: DecimalColumnStatsData decimalStats, | +| | | | | | | | | | | 7: DateColumnStatsData dateStats, | +| | | | | | | | | | | 8: TimestampColumnStatsData timestampStats | Version: Column statistics @@ -58,16 +54,14 @@ Column level statistics were added in Hive 0.10.0 by [HIVE-1362](https://issues. ## Quick overview -| Description | Stored in | Collected by | Since | -| --- | --- | --- | --- | -| Number of partition the dataset consists of | Fictional metastore property: **numPartitions** | computed during displaying the properties of a partitioned table | [Hive 2.3](https://issues.apache.org/jira/browse/HIVE-16315) | -| Number of files the dataset consists of | Metastore table property: **numFiles** | Automatically during Metastore operations | | -| Total size of the dataset as its seen at the filesystem level | Metastore table property: **totalSize** | | -| Uncompressed size of the dataset | Metastore table property: **rawDataSize** | Computed, these are the basic statistics. Calculated automatically when [hive.stats.autogather]({{< ref "#hive-stats-autogather" >}}) is enabled.Can be collected manually by: ANALYZE TABLE ... 
COMPUTE STATISTICS | [Hive 0.8](https://issues.apache.org/jira/browse/HIVE-2185) | -| Number of rows the dataset consist of | Metastore table property: **numRows** | | -| Column level statistics | Metastore; TAB_COL_STATS table | Computed, Calculated automatically when [hive.stats.column.autogather]({{< ref "#hive-stats-column-autogather" >}}) is enabled.Can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS | | - - +| Description | Stored in | Collected by | Since | +|---------------------------------------------------------------|-------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------| +| Number of partition the dataset consists of | Fictional metastore property: **numPartitions** | computed during displaying the properties of a partitioned table | [Hive 2.3](https://issues.apache.org/jira/browse/HIVE-16315) | +| Number of files the dataset consists of | Metastore table property: **numFiles** | Automatically during Metastore operations | | +| Total size of the dataset as its seen at the filesystem level | Metastore table property: **totalSize** | | +| Uncompressed size of the dataset | Metastore table property: **rawDataSize** | Computed, these are the basic statistics. Calculated automatically when [hive.stats.autogather]({{< ref "#hive-stats-autogather" >}}) is enabled.Can be collected manually by: ANALYZE TABLE ... 
COMPUTE STATISTICS | [Hive 0.8](https://issues.apache.org/jira/browse/HIVE-2185) | +| Number of rows the dataset consist of | Metastore table property: **numRows** | | +| Column level statistics | Metastore; TAB_COL_STATS table | Computed, Calculated automatically when [hive.stats.column.autogather]({{< ref "#hive-stats-column-autogather" >}}) is enabled.Can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS | | ## Implementation @@ -365,8 +359,6 @@ Feature not implemented Hive Metastore on HBase was discontinued and removed in Hive 3.0.0. See [HBaseMetastoreDevelopmentGuide]({{< ref "hbasemetastoredevelopmentguide" >}}) - - When Hive metastore is configured to use HBase, this command explicitly caches file metadata in HBase metastore.   *The goal of this feature is to cache file metadata (e.g. ORC file footers) to avoid reading lots of files from HDFS at split generation time, as well as potentially cache some information about splits (e.g. grouping based on location that would be good for some short time) to further speed up the generation and achieve better cache locality with consistent splits.* @@ -379,34 +371,29 @@ See feature details in [HBase Metastore Split Cache](https://issues.apache.org/j ## Current Status (JIRA) -| T | Key | Summary | Assignee | Reporter | P | Status | Resolution | Created | Updated | Due | -| --- | --- | --- | --- | --- | --- | --- | --- | --- | ---| ---| -| [Improvement](https://issues.apache.org/jira/browse/HIVE-28363?src=confmacro) | [HIVE-28363](https://issues.apache.org/jira/browse/HIVE-28363?src=confmacro) | [Improve heuristics of FilterStatsRule without column stats](https://issues.apache.org/jira/browse/HIVE-28363?src=confmacro) | Shohei Okumiya | Shohei Okumiya | Major | Resolved | Fixed | Jul 08, 2024 | Sep 28, 2024 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-28124?src=confmacro) | [HIVE-28124](https://issues.apache.org/jira/browse/HIVE-28124?src=confmacro) | [Do not allow non-numeric 
values in Hive table stats during an alter table](https://issues.apache.org/jira/browse/HIVE-28124?src=confmacro) | Miklos Szurap | Miklos Szurap | Major | Open | Unresolved | Mar 18, 2024 | Mar 18, 2024 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-27479?src=confmacro) | [HIVE-27479](https://issues.apache.org/jira/browse/HIVE-27479?src=confmacro) | [Incorrect filter selectivity of BETWEEN expressions when using histograms](https://issues.apache.org/jira/browse/HIVE-27479?src=confmacro) | Ryan Johnson | Ryan Johnson | Major | Resolved | Fixed | Jul 01, 2023 | Jul 03, 2023 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-27142?src=confmacro) | [HIVE-27142](https://issues.apache.org/jira/browse/HIVE-27142?src=confmacro) | [Map Join not working as expected when joining non-native tables with native tables](https://issues.apache.org/jira/browse/HIVE-27142?src=confmacro) | Syed Shameerur Rahman | Syed Shameerur Rahman | Major | Open | Unresolved | Mar 15, 2023 | Jul 10, 2024 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-27065?src=confmacro) | [HIVE-27065](https://issues.apache.org/jira/browse/HIVE-27065?src=confmacro) | [Exception in partition column statistics update with SQL Server db when histogram statistics is not enabled](https://issues.apache.org/jira/browse/HIVE-27065?src=confmacro) | Venugopal Reddy K | Venugopal Reddy K | Major | Closed | Fixed | Feb 10, 2023 | Aug 15, 2023 | | -| [Improvement](https://issues.apache.org/jira/browse/HIVE-27000?src=confmacro) | [HIVE-27000](https://issues.apache.org/jira/browse/HIVE-27000?src=confmacro) | [Improve the modularity of the *ColumnStatsMerger classes](https://issues.apache.org/jira/browse/HIVE-27000?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Major | Closed | Fixed | Jan 30, 2023 | Aug 15, 2023 | | -| [Improvement](https://issues.apache.org/jira/browse/HIVE-26772?src=confmacro) | [HIVE-26772](https://issues.apache.org/jira/browse/HIVE-26772?src=confmacro) | [Add support 
for specific column statistics to ANALYZE TABLE command](https://issues.apache.org/jira/browse/HIVE-26772?src=confmacro) | Unassigned | Alessandro Solimando | Major | Open | Unresolved | Nov 23, 2022 | Nov 23, 2022 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-26370?src=confmacro) | [HIVE-26370](https://issues.apache.org/jira/browse/HIVE-26370?src=confmacro) | [Check stats are up-to-date when getting materialized view state](https://issues.apache.org/jira/browse/HIVE-26370?src=confmacro) | shahbaz | Krisztian Kasa | Major | Resolved | Won't Fix | Jul 05, 2022 | Oct 21, 2022 | | -| [Improvement](https://issues.apache.org/jira/browse/HIVE-26313?src=confmacro) | [HIVE-26313](https://issues.apache.org/jira/browse/HIVE-26313?src=confmacro) | [Aggregate all column statistics into a single field in metastore](https://issues.apache.org/jira/browse/HIVE-26313?src=confmacro) | Unassigned | Alessandro Solimando | Major | In Progress | Unresolved | Jun 10, 2022 | Mar 20, 2023 | | -| [Sub-task](https://issues.apache.org/jira/browse/HIVE-26297?src=confmacro) | [HIVE-26297](https://issues.apache.org/jira/browse/HIVE-26297?src=confmacro) | [Refactoring ColumnStatsAggregator classes to reduce warnings](https://issues.apache.org/jira/browse/HIVE-26297?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Minor | Resolved | Abandoned | Jun 07, 2022 | Dec 15, 2022 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-26277?src=confmacro) | [HIVE-26277](https://issues.apache.org/jira/browse/HIVE-26277?src=confmacro) | [NPEs and rounding issues in ColumnStatsAggregator classes](https://issues.apache.org/jira/browse/HIVE-26277?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Major | Closed | Fixed | Jun 01, 2022 | Nov 16, 2022 | | -| [Improvement](https://issues.apache.org/jira/browse/HIVE-26221?src=confmacro) | [HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221?src=confmacro) | [Add histogram-based column 
statistics](https://issues.apache.org/jira/browse/HIVE-26221?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Major | Closed | Fixed | May 11, 2022 | Aug 15, 2023 | | -| [Task](https://issues.apache.org/jira/browse/HIVE-26066?src=confmacro) | [HIVE-26066](https://issues.apache.org/jira/browse/HIVE-26066?src=confmacro) | [Remove deprecated GenericUDAFComputeStats](https://issues.apache.org/jira/browse/HIVE-26066?src=confmacro) | Unassigned | Alessandro Solimando | Minor | Resolved | Duplicate | Mar 24, 2022 | Mar 24, 2022 | | -| [Sub-task](https://issues.apache.org/jira/browse/HIVE-25918?src=confmacro) | [HIVE-25918](https://issues.apache.org/jira/browse/HIVE-25918?src=confmacro) | [Invalid stats after multi inserting into the same partition](https://issues.apache.org/jira/browse/HIVE-25918?src=confmacro) | Krisztian Kasa | Krisztian Kasa | Major | Resolved | Fixed | Feb 01, 2022 | Feb 22, 2022 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-25771?src=confmacro) | [HIVE-25771](https://issues.apache.org/jira/browse/HIVE-25771?src=confmacro) | [Stats may be incorrect under concurrent inserts if direct-insert is Off](https://issues.apache.org/jira/browse/HIVE-25771?src=confmacro) | Krisztian Kasa | Krisztian Kasa | Major | Open | Unresolved | Dec 03, 2021 | Dec 03, 2021 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-25654?src=confmacro) | [HIVE-25654](https://issues.apache.org/jira/browse/HIVE-25654?src=confmacro) | [Stats of transactional table updated when transaction is rolled back](https://issues.apache.org/jira/browse/HIVE-25654?src=confmacro) | Unassigned | Krisztian Kasa | Major | Open | Unresolved | Oct 27, 2021 | Oct 27, 2021 | | -| [Improvement](https://issues.apache.org/jira/browse/HIVE-24056?src=confmacro) | [HIVE-24056](https://issues.apache.org/jira/browse/HIVE-24056?src=confmacro) | [Column stats gather stage as part of import table command plan generation](https://issues.apache.org/jira/browse/HIVE-24056?src=confmacro) | 
Ashish Sharma | Ashish Sharma | Major | In Progress | Unresolved | Aug 21, 2020 | Aug 21, 2020 | | -| [Improvement](https://issues.apache.org/jira/browse/HIVE-23901?src=confmacro) | [HIVE-23901](https://issues.apache.org/jira/browse/HIVE-23901?src=confmacro) | [Overhead of Logger in ColumnStatsMerger damage the performance](https://issues.apache.org/jira/browse/HIVE-23901?src=confmacro) | Yu-Wen Lai | Yu-Wen Lai | Major | Closed | Fixed | Jul 23, 2020 | Nov 17, 2022 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-23887?src=confmacro) | [HIVE-23887](https://issues.apache.org/jira/browse/HIVE-23887?src=confmacro) | [Reset table level basic/column stats during import.](https://issues.apache.org/jira/browse/HIVE-23887?src=confmacro) | Ashish Sharma | Ashish Sharma | Minor | Closed | Fixed | Jul 21, 2020 | Nov 17, 2022 | | -| [Bug](https://issues.apache.org/jira/browse/HIVE-23796?src=confmacro) | [HIVE-23796](https://issues.apache.org/jira/browse/HIVE-23796?src=confmacro) | [Multiple insert overwrite into a partitioned table doesn't gather column statistics for all partitions](https://issues.apache.org/jira/browse/HIVE-23796?src=confmacro) | Unassigned | Yu-Wen Lai | Major | Open | Unresolved | Jul 02, 2020 | Jul 02, 2020 | | - +| T | Key | Summary | Assignee | Reporter | P | Status | Resolution | Created | Updated | Due | +|-------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|-----------------------|-------|-------------|------------|--------------|--------------|-----| +| [Improvement](https://issues.apache.org/jira/browse/HIVE-28363?src=confmacro) | [HIVE-28363](https://issues.apache.org/jira/browse/HIVE-28363?src=confmacro) | [Improve heuristics of 
FilterStatsRule without column stats](https://issues.apache.org/jira/browse/HIVE-28363?src=confmacro) | Shohei Okumiya | Shohei Okumiya | Major | Resolved | Fixed | Jul 08, 2024 | Sep 28, 2024 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-28124?src=confmacro) | [HIVE-28124](https://issues.apache.org/jira/browse/HIVE-28124?src=confmacro) | [Do not allow non-numeric values in Hive table stats during an alter table](https://issues.apache.org/jira/browse/HIVE-28124?src=confmacro) | Miklos Szurap | Miklos Szurap | Major | Open | Unresolved | Mar 18, 2024 | Mar 18, 2024 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-27479?src=confmacro) | [HIVE-27479](https://issues.apache.org/jira/browse/HIVE-27479?src=confmacro) | [Incorrect filter selectivity of BETWEEN expressions when using histograms](https://issues.apache.org/jira/browse/HIVE-27479?src=confmacro) | Ryan Johnson | Ryan Johnson | Major | Resolved | Fixed | Jul 01, 2023 | Jul 03, 2023 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-27142?src=confmacro) | [HIVE-27142](https://issues.apache.org/jira/browse/HIVE-27142?src=confmacro) | [Map Join not working as expected when joining non-native tables with native tables](https://issues.apache.org/jira/browse/HIVE-27142?src=confmacro) | Syed Shameerur Rahman | Syed Shameerur Rahman | Major | Open | Unresolved | Mar 15, 2023 | Jul 10, 2024 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-27065?src=confmacro) | [HIVE-27065](https://issues.apache.org/jira/browse/HIVE-27065?src=confmacro) | [Exception in partition column statistics update with SQL Server db when histogram statistics is not enabled](https://issues.apache.org/jira/browse/HIVE-27065?src=confmacro) | Venugopal Reddy K | Venugopal Reddy K | Major | Closed | Fixed | Feb 10, 2023 | Aug 15, 2023 | | +| [Improvement](https://issues.apache.org/jira/browse/HIVE-27000?src=confmacro) | [HIVE-27000](https://issues.apache.org/jira/browse/HIVE-27000?src=confmacro) | [Improve the modularity of 
the *ColumnStatsMerger classes](https://issues.apache.org/jira/browse/HIVE-27000?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Major | Closed | Fixed | Jan 30, 2023 | Aug 15, 2023 | | +| [Improvement](https://issues.apache.org/jira/browse/HIVE-26772?src=confmacro) | [HIVE-26772](https://issues.apache.org/jira/browse/HIVE-26772?src=confmacro) | [Add support for specific column statistics to ANALYZE TABLE command](https://issues.apache.org/jira/browse/HIVE-26772?src=confmacro) | Unassigned | Alessandro Solimando | Major | Open | Unresolved | Nov 23, 2022 | Nov 23, 2022 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-26370?src=confmacro) | [HIVE-26370](https://issues.apache.org/jira/browse/HIVE-26370?src=confmacro) | [Check stats are up-to-date when getting materialized view state](https://issues.apache.org/jira/browse/HIVE-26370?src=confmacro) | shahbaz | Krisztian Kasa | Major | Resolved | Won't Fix | Jul 05, 2022 | Oct 21, 2022 | | +| [Improvement](https://issues.apache.org/jira/browse/HIVE-26313?src=confmacro) | [HIVE-26313](https://issues.apache.org/jira/browse/HIVE-26313?src=confmacro) | [Aggregate all column statistics into a single field in metastore](https://issues.apache.org/jira/browse/HIVE-26313?src=confmacro) | Unassigned | Alessandro Solimando | Major | In Progress | Unresolved | Jun 10, 2022 | Mar 20, 2023 | | +| [Sub-task](https://issues.apache.org/jira/browse/HIVE-26297?src=confmacro) | [HIVE-26297](https://issues.apache.org/jira/browse/HIVE-26297?src=confmacro) | [Refactoring ColumnStatsAggregator classes to reduce warnings](https://issues.apache.org/jira/browse/HIVE-26297?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Minor | Resolved | Abandoned | Jun 07, 2022 | Dec 15, 2022 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-26277?src=confmacro) | [HIVE-26277](https://issues.apache.org/jira/browse/HIVE-26277?src=confmacro) | [NPEs and rounding issues in ColumnStatsAggregator 
classes](https://issues.apache.org/jira/browse/HIVE-26277?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Major | Closed | Fixed | Jun 01, 2022 | Nov 16, 2022 | | +| [Improvement](https://issues.apache.org/jira/browse/HIVE-26221?src=confmacro) | [HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221?src=confmacro) | [Add histogram-based column statistics](https://issues.apache.org/jira/browse/HIVE-26221?src=confmacro) | Alessandro Solimando | Alessandro Solimando | Major | Closed | Fixed | May 11, 2022 | Aug 15, 2023 | | +| [Task](https://issues.apache.org/jira/browse/HIVE-26066?src=confmacro) | [HIVE-26066](https://issues.apache.org/jira/browse/HIVE-26066?src=confmacro) | [Remove deprecated GenericUDAFComputeStats](https://issues.apache.org/jira/browse/HIVE-26066?src=confmacro) | Unassigned | Alessandro Solimando | Minor | Resolved | Duplicate | Mar 24, 2022 | Mar 24, 2022 | | +| [Sub-task](https://issues.apache.org/jira/browse/HIVE-25918?src=confmacro) | [HIVE-25918](https://issues.apache.org/jira/browse/HIVE-25918?src=confmacro) | [Invalid stats after multi inserting into the same partition](https://issues.apache.org/jira/browse/HIVE-25918?src=confmacro) | Krisztian Kasa | Krisztian Kasa | Major | Resolved | Fixed | Feb 01, 2022 | Feb 22, 2022 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-25771?src=confmacro) | [HIVE-25771](https://issues.apache.org/jira/browse/HIVE-25771?src=confmacro) | [Stats may be incorrect under concurrent inserts if direct-insert is Off](https://issues.apache.org/jira/browse/HIVE-25771?src=confmacro) | Krisztian Kasa | Krisztian Kasa | Major | Open | Unresolved | Dec 03, 2021 | Dec 03, 2021 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-25654?src=confmacro) | [HIVE-25654](https://issues.apache.org/jira/browse/HIVE-25654?src=confmacro) | [Stats of transactional table updated when transaction is rolled back](https://issues.apache.org/jira/browse/HIVE-25654?src=confmacro) | Unassigned | Krisztian Kasa 
| Major | Open | Unresolved | Oct 27, 2021 | Oct 27, 2021 | | +| [Improvement](https://issues.apache.org/jira/browse/HIVE-24056?src=confmacro) | [HIVE-24056](https://issues.apache.org/jira/browse/HIVE-24056?src=confmacro) | [Column stats gather stage as part of import table command plan generation](https://issues.apache.org/jira/browse/HIVE-24056?src=confmacro) | Ashish Sharma | Ashish Sharma | Major | In Progress | Unresolved | Aug 21, 2020 | Aug 21, 2020 | | +| [Improvement](https://issues.apache.org/jira/browse/HIVE-23901?src=confmacro) | [HIVE-23901](https://issues.apache.org/jira/browse/HIVE-23901?src=confmacro) | [Overhead of Logger in ColumnStatsMerger damage the performance](https://issues.apache.org/jira/browse/HIVE-23901?src=confmacro) | Yu-Wen Lai | Yu-Wen Lai | Major | Closed | Fixed | Jul 23, 2020 | Nov 17, 2022 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-23887?src=confmacro) | [HIVE-23887](https://issues.apache.org/jira/browse/HIVE-23887?src=confmacro) | [Reset table level basic/column stats during import.](https://issues.apache.org/jira/browse/HIVE-23887?src=confmacro) | Ashish Sharma | Ashish Sharma | Minor | Closed | Fixed | Jul 21, 2020 | Nov 17, 2022 | | +| [Bug](https://issues.apache.org/jira/browse/HIVE-23796?src=confmacro) | [HIVE-23796](https://issues.apache.org/jira/browse/HIVE-23796?src=confmacro) | [Multiple insert overwrite into a partitioned table doesn't gather column statistics for all partitions](https://issues.apache.org/jira/browse/HIVE-23796?src=confmacro) | Unassigned | Yu-Wen Lai | Major | Open | Unresolved | Jul 02, 2020 | Jul 02, 2020 | | Showing 20 out of -[306 issues](https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project%20=%20HIVE%20AND%20component%20in%20%28%22Statistics%22%29&tempMax=1000&src=confmacro) - - - - +[306 
issues](https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project%20=%20HIVE%20AND%20component%20in%20%28%22Statistics%22%29&tempMax=1000&src=confmacro) diff --git a/content/Development/desingdocs/storage-api-release-proposal.md b/content/Development/desingdocs/storage-api-release-proposal.md index 151c62ff..40961ea3 100644 --- a/content/Development/desingdocs/storage-api-release-proposal.md +++ b/content/Development/desingdocs/storage-api-release-proposal.md @@ -23,7 +23,3 @@ To commit the change, the developer will need to commit the Storage API change o   - - - - diff --git a/content/Development/desingdocs/storagehandlers.md b/content/Development/desingdocs/storagehandlers.md index 468762cb..63f5aa97 100644 --- a/content/Development/desingdocs/storagehandlers.md +++ b/content/Development/desingdocs/storagehandlers.md @@ -8,12 +8,12 @@ date: 2024-12-12 # Hive Storage Handlers * [Hive Storage Handlers]({{< ref "#hive-storage-handlers" >}}) - + [Introduction]({{< ref "#introduction" >}}) - + [Terminology]({{< ref "#terminology" >}}) - + [DDL]({{< ref "#ddl" >}}) - + [Storage Handler Interface]({{< ref "#storage-handler-interface" >}}) - + [HiveMetaHook Interface]({{< ref "#hivemetahook-interface" >}}) - + [Open Issues]({{< ref "#open-issues" >}}) + + [Introduction]({{< ref "#introduction" >}}) + + [Terminology]({{< ref "#terminology" >}}) + + [DDL]({{< ref "#ddl" >}}) + + [Storage Handler Interface]({{< ref "#storage-handler-interface" >}}) + + [HiveMetaHook Interface]({{< ref "#hivemetahook-interface" >}}) + + [Open Issues]({{< ref "#open-issues" >}}) ## Introduction @@ -157,7 +157,3 @@ Also note that there is no facility for two-phase commit in metadata transaction * The CREATE TABLE grammar isn't quite as strict as the one given above; some changes are needed in order to prevent STORED BY and row_format both being specified at once * CREATE TABLE AS SELECT is currently prohibited for creating a non-native table. 
It should be possible to support this, although it may not make sense for all storage handlers. For example, for HBase, it won't make sense until the storage handler is capable of automatically filling in column mappings. - - - - diff --git a/content/Development/desingdocs/subqueries-in-select.md b/content/Development/desingdocs/subqueries-in-select.md index 9c7d50ab..f8fc1514 100644 --- a/content/Development/desingdocs/subqueries-in-select.md +++ b/content/Development/desingdocs/subqueries-in-select.md @@ -44,6 +44,7 @@ SELECT CASE WHEN (select count(*) from store_sales FROM reason WHERE r_reason_sk = 1 ``` + * Scalar subqueries can only return at most one row. Hive will check for this case at runtime and throw an error if not satisfied. For example the following query is invalid: **Not Supported** @@ -56,6 +57,7 @@ SELECT customer.customer_num, ) AS total_ship_chg FROM customer ``` + * Scalar subqueries can only have one column. Hive will check for this case during compilation and throw an error. For example the following query is invalid: **Not Supported** @@ -67,6 +69,7 @@ SELECT customer.customer_num, ) AS total_ship_chg FROM customer ``` + * Correlated variables are only permitted in a filter, that is, a WHERE or HAVING clause. For example the following query is invalid: **Not Supported** @@ -79,6 +82,7 @@ SELECT customer.customer_num, ) AS total_ship_chg FROM customer ``` + * Subqueries with DISTINCT are not allowed. Since DISTINCT will be evaluated as GROUP BY , subqueries with DISTINCT are disallowed for now. # Design @@ -95,6 +99,7 @@ SELECT customer.customer_num, ) AS total_ship_chg FROM customer ``` + * IN subqueries, for example: ``` @@ -102,6 +107,7 @@ SELECT p_size IN ( SELECT MAX(p_size) FROM part) FROM part ``` + * EXISTS subqueries, for example: ``` @@ -114,14 +120,10 @@ All of the above queries could be **correlated** or **uncorrelated**. 
Design for this will be similar to the work done in [HIVE-15456](https://issues.apache.org/jira/browse/HIVE-15456). * genLogicalPlan will go over the select list to do the following: - + If subquery is not a top-level expression, throw an error. - + Otherwise, generate an appropriate plan by using RexSubquery to represent the subquery. + + If subquery is not a top-level expression, throw an error. + + Otherwise, generate an appropriate plan by using RexSubquery to represent the subquery. * HiveSubqueryRemoveRule will then be applied to remove the RexSubquery node and rewrite the query into a join. * HiveRelDecorrelator::decorrelateQuery will then be used to decorrelate correlated queries.  [HIVE-16091](https://issues.apache.org/jira/browse/HIVE-16091) covers the initial work for supporting subqueries in SELECT. - - - - diff --git a/content/Development/desingdocs/suggestion-for-ddl-commands-in-hms-schema-upgrade-scripts.md b/content/Development/desingdocs/suggestion-for-ddl-commands-in-hms-schema-upgrade-scripts.md index 6b2d7f1c..3e0c45b8 100644 --- a/content/Development/desingdocs/suggestion-for-ddl-commands-in-hms-schema-upgrade-scripts.md +++ b/content/Development/desingdocs/suggestion-for-ddl-commands-in-hms-schema-upgrade-scripts.md @@ -11,11 +11,3 @@ In this page, I would like to share the information I learned from Braintree's B * PostgreSQL at Scale: Database Schema Changes Without Downtime:  * Understanding PostgreSQL locks: [http://shiroyasha.io/understanding-postgresql-locks.html#:~:targetText=ROW%20SHARE%20%E2%80%94%20Acquired%20by%20the,of%20the%20alter%20table%20commands](http://shiroyasha.io/understanding-postgresql-locks.html#:~:targetText=ROW%20SHARE%20%E2%80%94%20Acquired%20by%20the,of%20the%20alter%20table%20commands.) 
- - - - - - - - diff --git a/content/Development/desingdocs/support-saml-2-0-authentication-mode.md b/content/Development/desingdocs/support-saml-2-0-authentication-mode.md index 5bd9c3ea..a6fb66a9 100644 --- a/content/Development/desingdocs/support-saml-2-0-authentication-mode.md +++ b/content/Development/desingdocs/support-saml-2-0-authentication-mode.md @@ -27,8 +27,6 @@ In order to make sure that the SAML assertions received by HiveServer2 are valid ![](/attachments/170266662/170266668.png) - - #### Authentication Flow Details 1. **JDBC Driver**: Receives a connection url from the user-facing clients (Beeline, Tableau). The connection URL uses http transport and has auth=browser @@ -67,8 +65,8 @@ This configuration will provide a path to the IDP metadata xml file.   hive.server2.saml2.sp.entity.id   test_sp_entity_id - - + + This configuration should be same the service provider entity id as configured in the IDP. Some identity providers require this to be same as the ACS URL. @@ -104,11 +102,13 @@ For example, in case of browser the URL will look like ``` jdbc:hive2://HiveServer2-host:10001/default;transportMode=http;httpPath=cliservice;auth=browser ``` + A token based URL will look like: ``` jdbc:hive2://HiveServer2-host:10001/default;transportMode=http;httpPath=cliservice;auth=token;token= ``` + The Jdbc connection parameters will be passed in over the connection URL.  ##### SSO mode URL validations @@ -133,6 +133,7 @@ The driver will have a configurable timeout value which will be used to error ou ``` jdbc:hive2://HiveServer2-host:10001/default;transportMode=http;httpPath=cliservice;auth=browser;samlResponsePort=12345;samlResponseTimeout=120 ``` + ##### Token Expiry and renewal In the initial version the token returned by the server will be used for a one-time validation within the default period of 30 seconds (token will be valid for only 30 seconds) which could be configurable. 
The token will be used by the server to set a cookie which will be used for further requests. However, this is a server-side implementation detail which the client does not need to be aware of. When the session expires the server will return an HTTP 401 status code which will be used by the clients to re-authenticate using the browser flow again. Unfortunately, [RFC 7231](https://tools.ietf.org/html/rfc7231) does not have an explicit status code for session expiration. However, the clients can detect whether a 401 response code is due to session expiry by keeping track of whether a token was sent in the header of the previous request. In such a case the JDBC application can choose to close the connection or re-authenticate as per the application logic. @@ -141,7 +142,3 @@ In the initial version the token returned by the server will be used for a one-t ![](images/icons/bullet_blue.gif) - - - - diff --git a/content/Development/desingdocs/synchronized-metastore-cache.md b/content/Development/desingdocs/synchronized-metastore-cache.md index 71605858..81aba161 100644 --- a/content/Development/desingdocs/synchronized-metastore-cache.md +++ b/content/Development/desingdocs/synchronized-metastore-cache.md @@ -19,7 +19,7 @@ The problem we try to solve here is the cache consistency issue. We already buil The only data structure change is adding ValidWriteIdList into SharedCache.TableWrapper, which represents the transaction state of the cached table. -![](/attachments/110692851/110692854.png) +![](/attachments/110692851/110692854.png) Note there is no db table structure change, and we don’t store extra information in db. We don’t update TBLS.WRITE_ID field as we will use db as the fact of truth. We assume db always carry the latest copy and every time we fetch from db, we will tag it with the transaction state of the query. @@ -32,7 +32,7 @@ Metastore read request will compare ValidWriteIdList parameter with the cached o Here is the example for a get_table request: 1.
At the beginning of the query, Hive will retrieve the global transaction state and store in config (ValidTxnList.VALID_TXNS_KEY) -2. Hive translate ValidTxnList to ValidWriteIdList of the table [12:7,8,12] (The format for writeid is [hwm:exceptions], all writeids from 1 to hwm minus exceptions, are committed. In this example, writeid 1..6,9,10,11 are committed) +2. Hive translate ValidTxnList to ValidWriteIdList of the table [12:7,8,12](The format for writeid is [hwm:exceptions], all writeids from 1 to hwm minus exceptions, are committed. In this example, writeid 1..6,9,10,11 are committed) 3. Hive pass the ValidWriteIdList to HMS 4. HMS compare ValidWriteIdList [12:7,8,12] with the cached one [11:7,8] using TxnIdUtils.compare, if it is fresh or newer (Fresh or newer means no transaction committed between two states. In this example, [11:7,8] means writeid 1..6,9,10,11 are committed, the same as the requested writeid [12:7,8,12]), HMS return cached table entry 5. If the cached ValidWriteIdList is [12:7,12], the comparison fails because writeid 8 is committed since then. HMS will fetch the table from ObjectStore @@ -49,7 +49,7 @@ Every write request will advance the write id for the table for both DML/DDL. Th ## Cache update -In the previous discussion, we know if the cache is stale, HMS will serve the request from ObjectStore. We need to catch up the cache with the latest change. This can be done by the existing notification log based cache update mechanism. A thread in HMS constantly poll from notification log, update the cache with the entries from notification log. The interesting entries in notification log are table/partition writes, and corresponding commit transaction message. When processing table/partition writes, HMS will put the table/partition entry in cache. However, the entry is not immediately usable until the commit message of the corresponding writes is processed, and mark writeid of corresponding table entry committed. 
+In the previous discussion, we know that if the cache is stale, HMS will serve the request from ObjectStore. We need to catch the cache up with the latest changes. This can be done by the existing notification log based cache update mechanism. A thread in HMS constantly polls the notification log and updates the cache with the entries from the notification log. The interesting entries in the notification log are table/partition writes and the corresponding commit transaction messages. When processing table/partition writes, HMS will put the table/partition entry in the cache. However, the entry is not immediately usable until the commit message of the corresponding writes is processed, which marks the writeid of the corresponding table entry as committed. Here is a complete flow for a cache update when write happen (and illustrated in the diagram): @@ -61,7 +61,7 @@ Here is a complete flow for a cache update when write happen (and illustrated in 6. The cache update thread will further read commit event from notification log, mark writeid 12 as committed, the tag of cached table entry changed to [12:7,8] 7. The next read from HMS 2 will serve from cache -![](/attachments/110692851/110692855.png) +![](/attachments/110692851/110692855.png) ## Bootstrap @@ -149,15 +149,7 @@ For every managed table write, advance the writeid for the table: AcidUtils.advanceWriteId(conf, tbl); ``` - - - - ## Attachments: ![](images/icons/bullet_blue.gif) - - - - diff --git a/content/Development/desingdocs/theta-join.md b/content/Development/desingdocs/theta-join.md index 4168c8a9..f34af8c7 100644 --- a/content/Development/desingdocs/theta-join.md +++ b/content/Development/desingdocs/theta-join.md @@ -74,12 +74,12 @@ As previously mention a detailed description of 1-Bucket-Theta is located [3]. A The matrix is partitioned by r, the number of reducers.
An example join matrix follows, with four reducers 1-4 each a separate color: -| **Row Ids** | **T 1** | **T 2** | **T 3** | **T 4** | -| --- | --- | --- | --- | --- | -| **S 1** | 1 | 1 | 2 | 2 | -| **S 2** | 1 | 1 | 2 | 2 | -| **S 3** | 3 | 3 | 4 | 4 | -| **S 4** | 3 | 3 | 4 | 4 | +| **Row Ids** | **T 1** | **T 2** | **T 3** | **T 4** | +|-------------|---------|---------|---------|---------| +| **S 1** | 1 | 1 | 2 | 2 | +| **S 2** | 1 | 1 | 2 | 2 | +| **S 3** | 3 | 3 | 4 | 4 | +| **S 4** | 3 | 3 | 4 | 4 | In the map phase, each tuple in S is sent to all reducers which intersect the tuples’ row id. For example the S-tuple with the row id of 2, is sent to reducers 1, and 2. Similarly each tuple in T is sent to all reducers which intersect the tuples’ row id. For example, the tuple with rowid 4, is sent to reducers 2 and 4. @@ -101,7 +101,3 @@ The reducer is fairly simple, it buffers up the S relation and then performs the 4. [Efficient Multi-way Theta-Join Processing Using MapReduce](http://vldb.org/pvldb/vol5/p1184_xiaofeizhang_vldb2012.pdf) 5. [HIVE-2206](https://issues.apache.org/jira/browse/HIVE-2206) - - - - diff --git a/content/Development/desingdocs/top-k-stats.md b/content/Development/desingdocs/top-k-stats.md index 9d9fbb45..df74b8bb 100644 --- a/content/Development/desingdocs/top-k-stats.md +++ b/content/Development/desingdocs/top-k-stats.md @@ -4,14 +4,14 @@ date: 2024-12-12 --- # Apache Hive : Column Level Top K Statistics - + This document is an addition to [Statistics in Hive](/development/desingdocs/statsdev). It describes the support of collecting column level top K values for Hive tables (see [HIVE-3421](https://issues.apache.org/jira/browse/HIVE-3421)). ## Scope In addition to the partition statistics, column level top K values can also be estimated for Hive tables. 
- The name and top K values of the most skewed column is stored in the partition or non-partitioned table’s skewed information, if user did not specify [skew](/development/desingdocs/listbucketing). This works for both newly created and existing tables. - The algorithm for computing top K is based on this paper: [Efficient Computation of Frequent and Top-k Elements in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.114.9563&rep=rep1&type=pdf). +The name and top K values of the most skewed column are stored in the partition or non-partitioned table’s skewed information, if the user did not specify [skew](/development/desingdocs/listbucketing). This works for both newly created and existing tables. +The algorithm for computing top K is based on this paper: [Efficient Computation of Frequent and Top-k Elements in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.114.9563&rep=rep1&type=pdf). ## Implementation @@ -181,7 +181,3 @@ Top K works the same way for ANALYZE commands as for INSERT commands. See [HIVE-3421](https://issues.apache.org/jira/browse/HIVE-3421). - - - - diff --git a/content/Development/desingdocs/type-qualifiers-in-hive.md b/content/Development/desingdocs/type-qualifiers-in-hive.md index eed01d3a..fa787806 100644 --- a/content/Development/desingdocs/type-qualifiers-in-hive.md +++ b/content/Development/desingdocs/type-qualifiers-in-hive.md @@ -7,24 +7,22 @@ date: 2024-12-12 --- ### Intro -Hive will need to support some kind of type qualifiers/parameters in its type metadata to be able to enforce type features such as decimal precision/scale or char/varchar length and collation.
This involves changes to the PrimitiveTypeEntry/TypeInfo/ObjectInspectors, possibly metastore changes, My impression is that the actual enforcement of the type qualifiers should be done by the ObjectInspectors/Converters/casts operations.  It should be ok to do col * col when col is a decimal(2) value of 99, it would fail if you try to cast the result to decimal(2) or try to insert it to a decimal(2) column.   ### Initial prototype work -There is some initial work on this in an initial patch for HIVE-4844. There is a BaseTypeParams object to represent type parameters, with VarcharTypeParams as a varchar-specific subclass containing the string length. The PrimitiveTypeEntryTypeInfo/ObjectInspectors are augmented to contain this BaseTypeParams object if the column/expression has type parameters. There also needed to be additional PrimitiveTypeEntryTypeInfo/ObjectInspectors factory methods which take a BaseTypeParams parameter. +There is some initial work on this in an initial patch for HIVE-4844. There is a BaseTypeParams object to represent type parameters, with VarcharTypeParams as a varchar-specific subclass containing the string length. The PrimitiveTypeEntryTypeInfo/ObjectInspectors are augmented to contain this BaseTypeParams object if the column/expression has type parameters. There also needed to be additional PrimitiveTypeEntryTypeInfo/ObjectInspectors factory methods which take a BaseTypeParams parameter. Some issues/questions from this: * There are some cases where Hive is trying to create a PrimitiveTypeEntry or ObjectInspector based on a Java class type. Such as ObjectInspectorFactory.getReflectionObjectInspector(). In these cases, there would be no data type params available to add to the PrimitiveTypeEntry/ObjectInspector, in which case we might have to default to some kind of type attributes - max precision decimal or max length char/varchar. This happens in a few places: - + TypedSerDe (used by ThriftByteStreamTypedSerDe). 
Might be ok since if it's just using Thrift types. - + S3LogDeserializer (in contrib). Might be ok, looks like it is only a deserializer, and for a custom S3 struct. - + MetadataTypedColumnsetSerDe. Might be ok, looks it might just use strings. - + GenericUDFUtils.ConversionHelper.ConversionHelper(), as well as GenericUDFBridge. This is used by old-style UDFs, in particular for the return type of the UDF. So in the general case it is not always possible to have type parameters for the return type of UDFs. GenericUDFs would be required if we want to be able to return a char length/decimal precision as part of the return type metadata, since they can customize the return type ObjectInspector. - + + TypedSerDe (used by ThriftByteStreamTypedSerDe). Might be ok since it's just using Thrift types. + + S3LogDeserializer (in contrib). Might be ok, looks like it is only a deserializer, and for a custom S3 struct. + + MetadataTypedColumnsetSerDe. Might be ok, looks like it might just use strings. + + GenericUDFUtils.ConversionHelper.ConversionHelper(), as well as GenericUDFBridge. This is used by old-style UDFs, in particular for the return type of the UDF. So in the general case it is not always possible to have type parameters for the return type of UDFs. GenericUDFs would be required if we want to be able to return a char length/decimal precision as part of the return type metadata, since they can customize the return type ObjectInspector. * If cast operators remain implemented as UDFs, then the UDF should probably be implemented as a Generic UDF so that the return type ObjectInspector can be set with the type params. In addition, the type parameters need to be somehow passed into the cast UDF before its initialize() method is called. - * Hive code does a lot of pointer-based equality using PrimitiveTypeEntry/TypeInfo/ObjectInspector objects. So a varchar(15) object inspector is not equal to a varchar(10).
This may have some advantages such as requiring conversion/length enforcement in this case, but it seems like this may not always be desirable behavior. ### MetaStore Changes @@ -50,13 +48,9 @@ This approach would be similar to the attributes in the INFORMATION_SCHEMA.COLUM -We could add new columns to the COLUMNS_V2 table for any type qualifiers we are trying to support (initially looks like CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE). Advantages to this would be that it is easier to query these parameters than the first approach, though types with no parameters would still have these columns (set to null). +We could add new columns to the COLUMNS_V2 table for any type qualifiers we are trying to support (initially looks like CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE). Advantages to this would be that it is easier to query these parameters than the first approach, though types with no parameters would still have these columns (set to null). #### New table with type qualifiers in metastore -Rather than having to change the COLUMNS_V2 table we could have a new table to hold the type qualifier information. This would mean no additions to the existing COLUMNS_V2 table, and non-parameterized types would have no rows in this new table. But it would mean an extra query to this new table any time we are fetching column metadata from the metastore. - - - - +Rather than having to change the COLUMNS_V2 table we could have a new table to hold the type qualifier information. This would mean no additions to the existing COLUMNS_V2 table, and non-parameterized types would have no rows in this new table. But it would mean an extra query to this new table any time we are fetching column metadata from the metastore.
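The cast-time enforcement described in the type-qualifiers Intro above (computing col * col on a decimal(2) value of 99 is fine; casting the 9801 result back to decimal(2) fails) can be sketched concretely. The following is a hypothetical Python stand-in for the ObjectInspector/Converter logic, not Hive's actual implementation; the `cast_to_decimal` helper name is invented for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

def cast_to_decimal(value, precision, scale=0):
    """Hypothetical helper: enforce decimal(precision, scale) at cast time.

    Intermediate arithmetic is left unchecked; only the cast rejects values
    whose integer part no longer fits the declared precision.
    """
    # Round the value to the declared scale first.
    d = Decimal(value).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    # After quantizing, len(digits) - scale is the count of integer digits.
    int_digits = len(d.as_tuple().digits) - scale
    if int_digits > precision - scale:
        raise ValueError(f"{value} does not fit decimal({precision},{scale})")
    return d

col = 99                       # fits decimal(2)
product = col * col            # 9801: the intermediate result is unchecked
cast_to_decimal(col, 2)        # ok -> Decimal('99')
# cast_to_decimal(product, 2)  # would raise ValueError
```

Keeping the check in the cast/converter path, as the proposal suggests, means intermediate expressions stay unconstrained and only materialization points (casts, inserts) pay the enforcement cost.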
diff --git a/content/Development/desingdocs/updatableviews.md b/content/Development/desingdocs/updatableviews.md index 98ed3aae..f3ef0144 100644 --- a/content/Development/desingdocs/updatableviews.md +++ b/content/Development/desingdocs/updatableviews.md @@ -37,7 +37,3 @@ Notes: See [Hive Views]({{< ref "viewdev" >}}) for general information about views. - - - - diff --git a/content/Development/desingdocs/vectorized-query-execution.md b/content/Development/desingdocs/vectorized-query-execution.md index 655c4ce7..c36406f1 100644 --- a/content/Development/desingdocs/vectorized-query-execution.md +++ b/content/Development/desingdocs/vectorized-query-execution.md @@ -139,7 +139,3 @@ set hive.fetch.task.conversion=none Vectorized execution is available in Hive 0.13.0 and later ([HIVE-5283](https://issues.apache.org/jira/browse/HIVE-5283)). - - - - diff --git a/content/Development/desingdocs/viewdev.md b/content/Development/desingdocs/viewdev.md index 151abeec..97ef0505 100644 --- a/content/Development/desingdocs/viewdev.md +++ b/content/Development/desingdocs/viewdev.md @@ -11,16 +11,17 @@ Views () are a standard DBMS feat ## Scope -At a minimum, we want to +At a minimum, we want to * add queryable view support at the SQL language level (specifics of the scoping are under discussion in the Issues section below) - + updatable views will not be supported (see the [Updatable Views]({{< ref "updatableviews" >}}) proposal) + + updatable views will not be supported (see the [Updatable Views]({{< ref "updatableviews" >}}) proposal) * make sure views and their definitions show up anywhere tables can currently be enumerated/searched/described * where relevant, provide additional metadata to allow views to be distinguished from tables Beyond this, we may want to * expose metadata about view definitions and dependencies (at table-level or column-level) in a way that makes them consumable by metadata-driven tools + ## Syntax ``` @@ -39,6 +40,7 @@ The basics of view implementation 
are very easy due to the fact that Hive alread * For **CREATE VIEW v AS view-def-select**, we extend SemanticAnalyzer to behave similarly to **CREATE TABLE t AS select**, except that we don't actually execute the query (we stop after plan generation). It's necessary to perform all of plan generation (even though we're not actually going to execute the plan) since currently some validations such as type compatibility-checking are only performed during plan generation. After successful validation, the text of the view is saved in the metastore (the simplest approach snips out the text from the parser's token stream, but this approach introduces problems described in the issues section below). * For **select ... from view-reference**, we detect the view reference in SemanticAnalyzer.getMetaData, load the text of its definition from the metastore, parse it back into an AST, prepare a QBExpr to hold it, and then plug this into the referencing query's QB, resulting in a tree equivalent to **select ... from (view-def-select)**; plan generation can then be carried out on the combined tree. + ## Issues Some of these are related to functionality/scope; others are related to implementation approaches. Opinions are welcome on all of them. @@ -53,7 +55,7 @@ Implementing this typically requires expanding the view definition into an expli However, storing both the expanded form and the original view definition text as well can also be useful for both DESCRIBE readability as well as functionality (see later section on ALTER VIEW v RECOMPILE). -**Update 7-Jan-2010**: Rather than adding full-blown unparse support to the AST model, I'm taking a parser-dependent shortcut. ANTLR's TokenRewriteStream provides a way to substitute text for token subsequences from the original token stream and then regenerate a transformed version of the parsed text. So, during column resolution, we map an expression such as "t.*" to replacement text "t.c1, t.c2, t.c3". 
Then once all columns have been resolved, we regenerate the view definition using these mapped replacements. Likewise, an unqualified column reference such as "c" gets replaced with the qualified reference "t.c". The rest of the parsed text remains unchanged. +**Update 7-Jan-2010**: Rather than adding full-blown unparse support to the AST model, I'm taking a parser-dependent shortcut. ANTLR's TokenRewriteStream provides a way to substitute text for token subsequences from the original token stream and then regenerate a transformed version of the parsed text. So, during column resolution, we map an expression such as "t.*" to replacement text "t.c1, t.c2, t.c3". Then once all columns have been resolved, we regenerate the view definition using these mapped replacements. Likewise, an unqualified column reference such as "c" gets replaced with the qualified reference "t.c". The rest of the parsed text remains unchanged. This approach will break if we ever need to perform more drastic (AST-based) rewrites as part of view expansion in the future. 
@@ -77,11 +79,11 @@ Alternately, if we choose to avoid inheritance, then we could just add a new vie Comparison of the two approaches: -|   | **Inheritance Model** | **Flat Model** | -| --- | --- | --- | -| *JDO Support* | Need to investigate how well inheritance works for our purposes | Nothing special | -| *Metadata queries from existing code/tools* | Existing queries for tables will NOT include views in results; those that need to will have to be modified to reference base class instead | Existing queries for tables WILL include views in results; those that are not supposed to will need to filter them out | -| *Metastore upgrade on deployment* | Need to test carefully to make sure introducing inheritance doesn't corrupt existing metastore instances | Nothing special, just adding a new attribute | +|   | **Inheritance Model** | **Flat Model** | +|---------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------| +| *JDO Support* | Need to investigate how well inheritance works for our purposes | Nothing special | +| *Metadata queries from existing code/tools* | Existing queries for tables will NOT include views in results; those that need to will have to be modified to reference base class instead | Existing queries for tables WILL include views in results; those that are not supposed to will need to filter them out | +| *Metastore upgrade on deployment* | Need to test carefully to make sure introducing inheritance doesn't corrupt existing metastore instances | Nothing special, just adding a new attribute | **Update 30-Dec-2009**: Based on a design review meeting, we're going to go with the flat model. 
Prasad pointed out that in the future, for materialized views, we may need the view definition to be tracked at the partition level as well, so that when we change the view definition, we don't have to discard existing materialized partitions if the new view result can be derived from the old one. So it may make sense to add the view definition as a new attribute of StorageDescriptor (since that is already present at both table and partition level). @@ -135,7 +137,7 @@ For **select * from t**, hive supports fast-path execution (skipping Map/Reduce) **Update 30-Dec-2009**: Based on feedback in JIRA, we'll leave this as dependent on getting the fast-path working for the underlying filters and projections. -**Update 6-Dec-2010**: This one is addressed by Hive's new "auto local mode" feature. +**Update 6-Dec-2010**: This one is addressed by Hive's new "auto local mode" feature. ### ORDER BY and LIMIT in view definition @@ -212,7 +214,3 @@ WHERE EXISTS( For MySQL, note that the "safe updates" feature will need to be disabled since these are full-table updates. - - - - diff --git a/content/Development/gettingStarted.md b/content/Development/gettingStarted.md index 68aeddcb..02788c0e 100644 --- a/content/Development/gettingStarted.md +++ b/content/Development/gettingStarted.md @@ -7,23 +7,22 @@ aliases: --- +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. --> The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage @@ -31,20 +30,25 @@ using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive. 
- # Getting Started With Apache Hive Software + --- + * Check out the [Getting Started Guide][GETTING_STARTED]. * Learn more [About Hive's Functionality][HIVE_DETAILS]. * Read the [Getting Started Guide][GETTING_STARTED] to learn how to install Hive * The [User and Hive SQL documentation][HIVE_QL] shows how to program Hive ## Quick start with Docker + --- + Checkout the quickstart with Docker here: [DOCKER_QUICKSTART] # Getting Involved With The Apache Hive Community + --- + Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Previously it was a subproject of [Apache® Hadoop®][APACHE_HADOOP], but has now graduated to become a @@ -69,4 +73,3 @@ project and contribute your expertise. [HIVE_TWITTER]: https://twitter.com/apachehive [DOCKER_QUICKSTART]: /development/quickstart/ - diff --git a/content/Development/gettingstarted-latest.md b/content/Development/gettingstarted-latest.md index 72b2184a..a5b138b1 100644 --- a/content/Development/gettingstarted-latest.md +++ b/content/Development/gettingstarted-latest.md @@ -14,9 +14,9 @@ You can install a stable release of Hive by downloading a tarball, or you can do ### Requirements * Java 1.7 -*Note:*  Hive versions [1.2](https://issues.apache.org/jira/browse/HIVE/fixforversion/12329345/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel) onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see [HIVE-8607](https://issues.apache.org/jira/browse/HIVE-8607)). + *Note:*  Hive versions [1.2](https://issues.apache.org/jira/browse/HIVE/fixforversion/12329345/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel) onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see [HIVE-8607](https://issues.apache.org/jira/browse/HIVE-8607)). 
* Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward). -Hive versions up to 0.13 also supported Hadoop 0.20.x, 0.23.x. + Hive versions up to 0.13 also supported Hadoop 0.20.x, 0.23.x. * Hive is commonly used in production Linux and Windows environment. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps. ### Installing Hive from a Stable Release @@ -58,21 +58,21 @@ As of 0.13, Hive is built using [Apache Maven](http://maven.apache.org). To build the current Hive code from the master branch: ``` - $ git clone https://git-wip-us.apache.org/repos/asf/hive.git - $ cd hive - $ mvn clean package -Pdist [-DskipTests -Dmaven.javadoc.skip=true] - $ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin - $ ls - LICENSE - NOTICE - README.txt - RELEASE_NOTES.txt -  bin/ (all the shell scripts) - lib/ (required jar files) - conf/ (configuration files) - examples/ (sample input and query files) - hcatalog / (hcatalog installation) - scripts / (upgrade scripts for hive-metastore) + $ git clone https://git-wip-us.apache.org/repos/asf/hive.git + $ cd hive + $ mvn clean package -Pdist [-DskipTests -Dmaven.javadoc.skip=true] + $ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin + $ ls + LICENSE + NOTICE + README.txt + RELEASE_NOTES.txt + bin/ (all the shell scripts) + lib/ (required jar files) + conf/ (configuration files) + examples/ (sample input and query files) + hcatalog / (hcatalog installation) + scripts / (upgrade scripts for hive-metastore) ``` Here, {version} refers to the current Hive version. @@ -93,21 +93,21 @@ In branch-1, Hive supports both Hadoop 1.x and 2.x.  You will need to specify w Prior to Hive 0.13, Hive was built using [Apache Ant](http://ant.apache.org/).  
To build an older version of Hive on Hadoop 0.20: ``` - $ svn co http://svn.apache.org/repos/asf/hive/branches/branch-{version} hive - $ cd hive - $ ant clean package - $ cd build/dist - # ls - LICENSE - NOTICE - README.txt - RELEASE_NOTES.txt - bin/ (all the shell scripts) - lib/ (required jar files) - conf/ (configuration files) - examples/ (sample input and query files) - hcatalog / (hcatalog installation) - scripts / (upgrade scripts for hive-metastore) +$ svn co http://svn.apache.org/repos/asf/hive/branches/branch-{version} hive +$ cd hive +$ ant clean package +$ cd build/dist +# ls +LICENSE +NOTICE +README.txt +RELEASE_NOTES.txt +bin/ (all the shell scripts) +lib/ (required jar files) +conf/ (configuration files) +examples/ (sample input and query files) +hcatalog / (hcatalog installation) +scripts / (upgrade scripts for hive-metastore) ``` If using Ant, we will refer to the directory "`build/dist`" as ``. @@ -218,13 +218,13 @@ For more information, see [WebHCat Installation]({{< ref "webhcat-installwebhcat * Log4j configuration is stored in `/conf/hive-log4j.properties` * Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default. * Hive configuration can be manipulated by: - + Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it - + Using the set command (see next section) - + Invoking Hive (deprecated), Beeline or HiveServer2 using the syntax: - - `$ bin/hive --hiveconf x1=y1 --hiveconf x2=y2  //this sets the variables x1 and x2 to y1 and y2 respectively` - - $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2  //this sets server-side variables x1 and x2 to y1 and y2 respectively - - $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2  //this sets client-side variables x1 and x2 to y1 and y2 respectively. - + Setting the `HIVE_OPTS` environment variable to "`--hiveconf x1=y1 --hiveconf x2=y2`" which does the same as above. 
+ + Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it + + Using the set command (see next section) + + Invoking Hive (deprecated), Beeline or HiveServer2 using the syntax: + - `$ bin/hive --hiveconf x1=y1 --hiveconf x2=y2  //this sets the variables x1 and x2 to y1 and y2 respectively` + - $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2  //this sets server-side variables x1 and x2 to y1 and y2 respectively + - $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2  //this sets client-side variables x1 and x2 to y1 and y2 respectively. + + Setting the `HIVE_OPTS` environment variable to "`--hiveconf x1=y1 --hiveconf x2=y2`" which does the same as above. ### Runtime Configuration @@ -253,7 +253,7 @@ While this usually points to a map-reduce cluster with multiple nodes, Hadoop al Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can enable the following option: ``` - hive> SET mapreduce.framework.name=local; +hive> SET mapreduce.framework.name=local; ``` In addition, `mapred.local.dir` should point to a path that's valid on the local machine (for example `/tmp//mapred/local`). (Otherwise, the user will get an exception allocating local disk space.) @@ -282,7 +282,7 @@ Hive uses log4j for logging. By default logs are not emitted to the console by t The logs are stored in the directory `/tmp/<*user.name*>`: * `/tmp/<*user.name*>/hive.log` -Note: In [local mode](/development/gettingstarted-latest#GettingStarted-Hive,Map-ReduceandLocal-Mode), prior to Hive 0.13.0 the log file name was "`.log`" instead of "`hive.log`". This bug was fixed in release 0.13.0 (see [HIVE-5528](https://issues.apache.org/jira/browse/HIVE-5528) and [HIVE-5676](https://issues.apache.org/jira/browse/HIVE-5676)). + Note: In [local mode](/development/gettingstarted-latest#GettingStarted-Hive,Map-ReduceandLocal-Mode), prior to Hive 0.13.0 the log file name was "`.log`" instead of "`hive.log`". 
This bug was fixed in release 0.13.0 (see [HIVE-5528](https://issues.apache.org/jira/browse/HIVE-5528) and [HIVE-5676](https://issues.apache.org/jira/browse/HIVE-5676)). To configure a different log location, set `hive.log.dir` in $HIVE_HOME/conf/hive-log4j.properties. Make sure the directory has the sticky bit set (`chmod 1777 <*dir*>`). @@ -439,7 +439,7 @@ NOTES: * NO verification of data against the schema is performed by the load command. * If the file is in hdfs, it is moved into the Hive-controlled file system namespace. -The root of the Hive directory is specified by the option `hive.metastore.warehouse.dir` in `hive-default.xml`. We advise users to create this directory before trying to create tables via Hive. + The root of the Hive directory is specified by the option `hive.metastore.warehouse.dir` in `hive-default.xml`. We advise users to create this directory before trying to create tables via Hive. ``` hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15'); @@ -670,7 +670,3 @@ STORED AS TEXTFILE; ``` - - - - diff --git a/content/Development/qtest.md b/content/Development/qtest.md index 076c786a..80d19550 100644 --- a/content/Development/qtest.md +++ b/content/Development/qtest.md @@ -5,22 +5,22 @@ draft: false --- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. 
See the License for the +specific language governing permissions and limitations +under the License. --> # Query File Test(qtest) @@ -201,3 +201,4 @@ Following are a few rules of thumb that should be followed when adding new test * When you do need to use a `SELECT` statement, make sure you use the `ORDER BY` clause to minimize the chances of spurious diffs due to output order differences leading to test failures. * Limit your test to one table unless you require multiple tables specifically. * Make sure that you name your query file appropriately with a descriptive name. + diff --git a/content/Development/quickStart.md b/content/Development/quickStart.md index 19a4720f..c8fc192e 100644 --- a/content/Development/quickStart.md +++ b/content/Development/quickStart.md @@ -9,164 +9,214 @@ aliases: --- ### Introduction --- -Run Apache Hive inside docker container in pseudo-distributed mode, inorder to provide the following Quick-start/Debugging/Prepare a test env for Hive +Run Apache Hive inside a Docker container in pseudo-distributed mode, in order to provide a quick start, a debugging setup, or a test environment for Hive ### Quickstart --- -##### **STEP 1: Pull the image** +##### **STEP 1: Pull the image** - Pull the image from DockerHub: https://hub.docker.com/r/apache/hive/tags. Here are the latest images: - 4.0.0 - 3.1.3 + ```shell docker pull apache/hive:4.0.0 ``` + ` ` + ##### **STEP 2: Export the Hive version** + ```shell export HIVE_VERSION=4.0.0 ``` + ` ` + ##### **STEP 3: Launch the HiveServer2 with an embedded Metastore.** + This is lightweight and suited to a quick setup; it uses Derby as the metastore database.
+ ```shell docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION} ``` + ` ` + ##### **STEP 4: Connect to beeline** ```shell docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/' ``` + ` ` + ##### Note: Launch Standalone Metastore To use standalone Metastore with Derby, ```shell docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --name metastore-standalone apache/hive:${HIVE_VERSION} ``` + ` ` + ## Detailed Setup --- + ##### - Build image Apache Hive relies on Hadoop, Tez and some others to facilitate reading, writing, and managing large datasets. The [/packaging/src/docker/build.sh] provides ways to build the image against specified versions of these dependencies, as well as to build from source. ##### - Build from source + ```shell mvn clean package -pl packaging -DskipTests -Pdocker ``` + ` ` + ##### - Build with specified version There are some arguments to specify the component version: + ```shell -hadoop -tez -hive ``` + If the version is not provided, it will read the version from the current `pom.xml`: `project.version`, `hadoop.version` and `tez.version` for Hive, Hadoop and Tez respectively. For example, the following command uses Hive 4.0.0, Hadoop `hadoop.version` and Tez `tez.version` to build the image, + ```shell ./build.sh -hive 4.0.0 ``` + If the command does not specify the Hive version, it will use the local `apache-hive-${project.version}-bin.tar.gz` (will trigger a build if it doesn't exist), together with Hadoop 3.3.6 and Tez 0.10.3 to build the image, + ```shell ./build.sh -hadoop 3.3.6 -tez 0.10.3 ``` + After building successfully, we get a Docker image named `apache/hive` by default; the image is tagged with the provided Hive version. ### Run services + --- + Before going further, we should define the environment variable `HIVE_VERSION` first.
For example, if `-hive 4.0.0` is specified to build the image, + ```shell export HIVE_VERSION=4.0.0 ``` + or assuming that you're relying on current `project.version` from pom.xml, + ```shell export HIVE_VERSION=$(mvn -f pom.xml -q help:evaluate -Dexpression=project.version -DforceStdout) ``` + ` ` + ##### **- Metastore** For a quick start, launch the Metastore with Derby, - ```shell - docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --name metastore-standalone apache/hive:${HIVE_VERSION} - ``` + +```shell +docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --name metastore-standalone apache/hive:${HIVE_VERSION} +``` + Everything would be lost when the service is down. In order to save the Hive table's schema and data, start the container with an external Postgres and Volume to keep them, - ```shell - docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --env DB_DRIVER=postgres \ - --env SERVICE_OPTS="-Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/metastore_db -Djavax.jdo.option.ConnectionUserName=hive -Djavax.jdo.option.ConnectionPassword=password" \ - --mount source=warehouse,target=/opt/hive/data/warehouse \ - --mount type=bind,source=`mvn help:evaluate -Dexpression=settings.localRepository -q -DforceStdout`/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar,target=/opt/hive/lib/postgres.jar \ - --name metastore-standalone apache/hive:${HIVE_VERSION} - ``` +```shell +docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --env DB_DRIVER=postgres \ + --env SERVICE_OPTS="-Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/metastore_db -Djavax.jdo.option.ConnectionUserName=hive -Djavax.jdo.option.ConnectionPassword=password" \ + --mount source=warehouse,target=/opt/hive/data/warehouse \ + --mount type=bind,source=`mvn help:evaluate -Dexpression=settings.localRepository -q 
-DforceStdout`/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar,target=/opt/hive/lib/postgres.jar \ + --name metastore-standalone apache/hive:${HIVE_VERSION} +``` + If you want to use your own `hdfs-site.xml` or `yarn-site.xml` for the service, you can provide the environment variable `HIVE_CUSTOM_CONF_DIR` for the command. For instance, put the custom configuration file under the directory `/opt/hive/conf`, then, - ```shell - docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --env DB_DRIVER=postgres \ - -v /opt/hive/conf:/hive_custom_conf --env HIVE_CUSTOM_CONF_DIR=/hive_custom_conf \ - --mount type=bind,source=`mvn help:evaluate -Dexpression=settings.localRepository -q -DforceStdout`/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar,target=/opt/hive/lib/postgres.jar \ - --name metastore apache/hive:${HIVE_VERSION} - ``` +```shell +docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --env DB_DRIVER=postgres \ + -v /opt/hive/conf:/hive_custom_conf --env HIVE_CUSTOM_CONF_DIR=/hive_custom_conf \ + --mount type=bind,source=`mvn help:evaluate -Dexpression=settings.localRepository -q -DforceStdout`/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar,target=/opt/hive/lib/postgres.jar \ + --name metastore apache/hive:${HIVE_VERSION} +``` + For Hive releases before 4.0, if you want to upgrade the existing external Metastore schema to the target version, then add `--env SCHEMA_COMMAND=upgradeSchema` to the command. To skip schematool initialisation or upgrade for metastore use `--env IS_RESUME="true"`, for verbose logging set `--env VERBOSE="true"`. 
` ` -##### **- HiveServer2** + +##### **- HiveServer2** Launch the HiveServer2 with an embedded Metastore, - ```shell - docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hiveserver2-standalone apache/hive:${HIVE_VERSION} - ``` + +```shell +docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hiveserver2-standalone apache/hive:${HIVE_VERSION} +``` + or specify a remote Metastore if it's available, - ```shell - docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 \ - --env SERVICE_OPTS="-Dhive.metastore.uris=thrift://metastore:9083" \ - --env IS_RESUME="true" \ - --name hiveserver2-standalone apache/hive:${HIVE_VERSION} - ``` + +```shell +docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 \ + --env SERVICE_OPTS="-Dhive.metastore.uris=thrift://metastore:9083" \ + --env IS_RESUME="true" \ + --name hiveserver2-standalone apache/hive:${HIVE_VERSION} +``` + To save the data between container restarts, you can start the HiveServer2 with a Volume, - ```shell - docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 \ - --env SERVICE_OPTS="-Dhive.metastore.uris=thrift://metastore:9083" \ - --mount source=warehouse,target=/opt/hive/data/warehouse \ - --env IS_RESUME="true" \ - --name hiveserver2 apache/hive:${HIVE_VERSION} - ``` + +```shell +docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 \ + --env SERVICE_OPTS="-Dhive.metastore.uris=thrift://metastore:9083" \ + --mount source=warehouse,target=/opt/hive/data/warehouse \ + --env IS_RESUME="true" \ + --name hiveserver2 apache/hive:${HIVE_VERSION} +``` + ` ` + ##### **- HiveServer2, Metastore** To get a quick overview of both HiveServer2 and Metastore, there is a `docker-compose.yml` placed under `packaging/src/docker` for this purpose, specify the `POSTGRES_LOCAL_PATH` first: + ```shell export POSTGRES_LOCAL_PATH=your_local_path_to_postgres_driver ``` + Example: + ```shell mvn 
dependency:copy -Dartifact="org.postgresql:postgresql:42.5.1" && \ export POSTGRES_LOCAL_PATH=`mvn help:evaluate -Dexpression=settings.localRepository -q -DforceStdout`/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar ``` + If you don't have Maven installed or have problems resolving the Postgres driver, you can always download this jar yourself and change `POSTGRES_LOCAL_PATH` to the path of the downloaded jar. Then, + ```shell docker compose up -d ``` + HiveServer2, Metastore and Postgres services will be started as a consequence. Volumes are used to persist data generated by Hive inside Postgres and HiveServer2 containers: -- hive_db - - The volume persists the metadata of Hive tables inside Postgres container. -- warehouse - - The volume stores tables' files inside HiveServer2 container. +- hive_db +  - The volume persists the metadata of Hive tables inside the Postgres container. +- warehouse +  - The volume stores tables' files inside the HiveServer2 container. To stop/remove them all, + ```shell docker compose down ``` @@ -176,21 +226,23 @@ docker compose down --- - HiveServer2 web - - Accessed on browser at http://localhost:10002/ + - Accessed on browser at http://localhost:10002/ - Beeline: + ```shell - docker exec -it hiveserver2 beeline -u 'jdbc:hive2://hiveserver2:10000/' - # If beeline is installed on host machine, HiveServer2 can be simply reached via: - beeline -u 'jdbc:hive2://localhost:10000/' + docker exec -it hiveserver2 beeline -u 'jdbc:hive2://hiveserver2:10000/' + # If beeline is installed on the host machine, HiveServer2 can simply be reached via: + beeline -u 'jdbc:hive2://localhost:10000/' ``` - Run some queries + ```sql - show tables; - create table hive_example(a string, b int) partitioned by(c int); - alter table hive_example add partition(c=1); - insert into hive_example partition(c=1) values('a', 1), ('a', 2),('b',3); - select count(distinct a) from hive_example; - select sum(b) from hive_example; + show tables; + create table hive_example(a string,
b int) partitioned by(c int); + alter table hive_example add partition(c=1); + insert into hive_example partition(c=1) values('a', 1), ('a', 2),('b',3); + select count(distinct a) from hive_example; + select sum(b) from hive_example; ``` #### `sys` Schema and `information_schema` Schema @@ -290,6 +342,7 @@ docker compose exec hiveserver2-standalone /bin/bash /opt/hive/bin/schematool -initSchema -dbType hive -metaDbType postgres -url jdbc:hive2://localhost:10000/default exit ``` - + ### Quick Start with REST Catalog Integration + Checkout the quickstart of REST Catalog Integration with Docker here: [REST Catalog Integration](/docs/latest/quickstart-rest-catalog) diff --git a/content/Development/versionControl.md b/content/Development/versionControl.md index 22248ce1..174c4e09 100644 --- a/content/Development/versionControl.md +++ b/content/Development/versionControl.md @@ -8,24 +8,25 @@ aliases: --- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
--> # Version Control + --- The Hive source code resides in Apache's [Hive GitHub](https://github.com/apache/hive) @@ -33,3 +34,4 @@ The Hive source code resides in Apache's [Hive GitHub](https://github.com/apache * Anonymous clone via http - * Authenticated clone via ssh - git@github.com:apache/hive.git * Instructions: [Apache committer git instructions](https://git.apache.org/) + diff --git a/content/_index.md b/content/_index.md index a6a4c8ab..682b441a 100644 --- a/content/_index.md +++ b/content/_index.md @@ -5,22 +5,22 @@ draft: false --- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. --> The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage @@ -60,4 +60,3 @@ project and contribute your expertise. 
[CONTRIBUTOR]: {{ site.wiki }}/Home#Home-ResourcesforContributors [HIVE_TWITTER]: https://twitter.com/apachehive - diff --git a/content/community/becomingcommitter/index.md b/content/community/becomingcommitter/index.md index ff21ce93..2bfeeb17 100644 --- a/content/community/becomingcommitter/index.md +++ b/content/community/becomingcommitter/index.md @@ -48,7 +48,3 @@ It should go without saying, but here it is anyway: your participation in the pr ![](images/icons/bullet_blue.gif) - - - - diff --git a/content/community/bylaws.md b/content/community/bylaws.md index 25501670..dc27dce5 100644 --- a/content/community/bylaws.md +++ b/content/community/bylaws.md @@ -70,12 +70,12 @@ Within the Hive project, different types of decisions require different forms of Decisions regarding the project are made by votes on the primary project development mailing list ([user@hive.apache.org]({{< ref "mailto:user@hive-apache-org" >}})). Where necessary, PMC voting may take place on the private Hive PMC mailing list. Votes are clearly indicated by subject line starting with [VOTE]. Votes may contain multiple items for approval and these should be clearly separated. Voting is carried out by replying to the vote mail. Voting may take four flavors -| Vote | Meaning | -| --- | --- | -| +1 | 'Yes,' 'Agree,' or 'the action should be performed.' In general, this vote also indicates a willingness on the behalf of the voter in 'making it happen'. | -| +0 | This vote indicates a willingness for the action under consideration to go ahead. The voter, however will not be able to help. | -| -0 | This vote indicates that the voter does not, in general, agree with the proposed action but is not concerned enough to prevent the action going ahead. | -| -1 | This is a negative vote. On issues where consensus is required, this vote counts as a **veto**. All vetoes must contain an explanation of why the veto is appropriate. Vetoes with no explanation are void.
It may also be appropriate for a -1 vote to include an alternative course of action. | +| Vote | Meaning | +|------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| +1 | 'Yes,' 'Agree,' or 'the action should be performed.' In general, this vote also indicates a willingness on behalf of the voter to 'make it happen'. | +| +0 | This vote indicates a willingness for the action under consideration to go ahead. The voter, however, will not be able to help. | +| -0 | This vote indicates that the voter does not, in general, agree with the proposed action but is not concerned enough to prevent the action going ahead. | +| -1 | This is a negative vote. On issues where consensus is required, this vote counts as a **veto**. All vetoes must contain an explanation of why the veto is appropriate. Vetoes with no explanation are void. It may also be appropriate for a -1 vote to include an alternative course of action. | All participants in the Hive project are encouraged to show their agreement with or against a particular action by voting. For technical decisions, only the votes of active committers are binding. Non binding votes are still useful for those with binding votes to understand the perception of an action in the wider Hive community. For PMC decisions, only the votes of PMC members are binding. @@ -85,13 +85,13 @@ Voting can also be applied to changes already made to the Hive codebase. These t These are the types of approvals that can be sought. Different actions require different types of approvals. -| Approval Type | Definition | -| --- | --- | -| Consensus | For this to pass, all voters with binding votes must vote and there can be no binding vetoes (-1).
Consensus votes are rarely required due to the impracticality of getting all eligible voters to cast a vote. | -| Lazy Consensus | Lazy consensus requires 3 binding +1 votes and no binding vetoes. | -| Lazy Majority | A lazy majority vote requires 3 binding +1 votes and more binding +1 votes that -1 votes. | -| Lazy Approval | An action with lazy approval is implicitly allowed unless a -1 vote is received, at which time, depending on the type of action, either lazy majority or lazy consensus approval must be obtained. | -| 2/3 Majority | Some actions require a 2/3 majority of active committers or PMC members to pass. Such actions typically affect the foundation of the project (e.g. adopting a new codebase to replace an existing product). The higher threshold is designed to ensure such changes are strongly supported. To pass this vote requires at least 2/3 of binding vote holders to vote +1. | +| Approval Type | Definition | +|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Consensus | For this to pass, all voters with binding votes must vote and there can be no binding vetoes (-1). Consensus votes are rarely required due to the impracticality of getting all eligible voters to cast a vote. | +| Lazy Consensus | Lazy consensus requires 3 binding +1 votes and no binding vetoes. | +| Lazy Majority | A lazy majority vote requires 3 binding +1 votes and more binding +1 votes than -1 votes. | +| Lazy Approval | An action with lazy approval is implicitly allowed unless a -1 vote is received, at which time, depending on the type of action, either lazy majority or lazy consensus approval must be obtained.
| +| 2/3 Majority | Some actions require a 2/3 majority of active committers or PMC members to pass. Such actions typically affect the foundation of the project (e.g. adopting a new codebase to replace an existing product). The higher threshold is designed to ensure such changes are strongly supported. To pass this vote requires at least 2/3 of binding vote holders to vote +1. | ### Vetoes @@ -101,21 +101,17 @@ If you disagree with a valid veto, you must lobby the person casting the veto to ### Actions -| Actions | Description | Approval | Binding Votes | Minimum Length | Mailing List | -| --- | --- | --- | --- | --- | --- | -| Code Change | A change made to a codebase of the project and committed by a committer. This includes source code, documentation, website content, etc. | one +1 from a committer who has not authored the patch followed by a Lazy approval (not counting the vote of the contributor), moving to lazy majority if a -1 is receivedMinor issues (e.g. typos, code style issues, JavaDoc changes. At committer's discretion) can be committed after soliciting feedback/review on the mailing list and not receiving feedback within 2 days. | Active committers | 1 | JIRA (dev@hive.apache.org) | -| Release Plan | Defines the timetable and actions for a release. The plan also nominates a Release Manager. | Lazy majority | Active committers | 3 | user@hive.apache.org | -| Product Release | When a release of one of the project's products is ready, a vote is required to accept the release as an official release of the project. | Lazy Majority | Active PMC members | 3 | user@hive.apache.org | -| Adoption of New Codebase | When the codebase for an existing, released product is to be replaced with an alternative codebase. If such a vote fails to gain approval, the existing code base will continue. This also covers the creation of new sub-projects *and submodules* within the project. 
| 2/3 majority | Active PMC members | 6 | dev@hive.apache.org | -| New Committer | When a new committer is proposed for the project. | Lazy consensus | Active PMC members | 3 | private@hive.apache.org | -| New PMC Member | When a committer is proposed for the PMC. | Lazy consensus | Active PMC members | 3 | private@hive.apache.org | -| Committer Removal | When removal of commit privileges is sought. **Note:** Such actions will also be referred to the ASF board by the PMC chair. | Consensus | Active PMC members (excluding the committer in question if a member of the PMC). | 6 | private@hive.apache.org | -| PMC Member Removal | When removal of a PMC member is sought. **Note:** Such actions will also be referred to the ASF board by the PMC chair. | Consensus | Active PMC members (excluding the member in question). | 6 | private@hive.apache.org | -| Modifying Bylaws | Modifying this document. | 2/3 majority | Active PMC members | 6 | user@hive.apache.org | -| New Branch Committer | When a new branch committer is proposed for the project. | Lazy Consensus | Active PMC members | 3 | private@hive.apache.org | -| Removal of Branch Committer | When a branch committer is removed from the project. | Consensus | Active PMC members excluding the committer in question if they are PMC members too. 
| 6 | private@hive.apache.org | - - - - +| Actions | Description | Approval | Binding Votes | Minimum Length | Mailing List | +|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|----------------|----------------------------| +| Code Change | A change made to a codebase of the project and committed by a committer. This includes source code, documentation, website content, etc. | one +1 from a committer who has not authored the patch followed by a Lazy approval (not counting the vote of the contributor), moving to lazy majority if a -1 is received. Minor issues (e.g. typos, code style issues, JavaDoc changes. At committer's discretion) can be committed after soliciting feedback/review on the mailing list and not receiving feedback within 2 days. | Active committers | 1 | JIRA (dev@hive.apache.org) | +| Release Plan | Defines the timetable and actions for a release. The plan also nominates a Release Manager. | Lazy majority | Active committers | 3 | user@hive.apache.org | +| Product Release | When a release of one of the project's products is ready, a vote is required to accept the release as an official release of the project.
| Lazy Majority | Active PMC members | 3 | user@hive.apache.org | +| Adoption of New Codebase | When the codebase for an existing, released product is to be replaced with an alternative codebase. If such a vote fails to gain approval, the existing code base will continue. This also covers the creation of new sub-projects *and submodules* within the project. | 2/3 majority | Active PMC members | 6 | dev@hive.apache.org | +| New Committer | When a new committer is proposed for the project. | Lazy consensus | Active PMC members | 3 | private@hive.apache.org | +| New PMC Member | When a committer is proposed for the PMC. | Lazy consensus | Active PMC members | 3 | private@hive.apache.org | +| Committer Removal | When removal of commit privileges is sought. **Note:** Such actions will also be referred to the ASF board by the PMC chair. | Consensus | Active PMC members (excluding the committer in question if a member of the PMC). | 6 | private@hive.apache.org | +| PMC Member Removal | When removal of a PMC member is sought. **Note:** Such actions will also be referred to the ASF board by the PMC chair. | Consensus | Active PMC members (excluding the member in question). | 6 | private@hive.apache.org | +| Modifying Bylaws | Modifying this document. | 2/3 majority | Active PMC members | 6 | user@hive.apache.org | +| New Branch Committer | When a new branch committer is proposed for the project. | Lazy Consensus | Active PMC members | 3 | private@hive.apache.org | +| Removal of Branch Committer | When a branch committer is removed from the project. | Consensus | Active PMC members excluding the committer in question if they are PMC members too. 
| 6 | private@hive.apache.org | diff --git a/content/community/issueTracking.md b/content/community/issueTracking.md index 655c44b6..c96cd773 100644 --- a/content/community/issueTracking.md +++ b/content/community/issueTracking.md @@ -6,24 +6,25 @@ aliases: [/issue_tracking.html] --- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. --> # Issue Tracking + --- Hive tracks both bugs and enhancement requests using [Apache @@ -35,4 +36,5 @@ following: * Check the [user mailing list][mailingLists], both by searching the archives and by asking questions. [JIRA]:https://issues.apache.org/jira/browse/HIVE -[mailingLists]: /community/mailinglists/ \ No newline at end of file +[mailingLists]: /community/mailinglists/ + diff --git a/content/community/mailingLists.md b/content/community/mailingLists.md index 002bc5e8..6d7d42e0 100644 --- a/content/community/mailingLists.md +++ b/content/community/mailingLists.md @@ -7,31 +7,34 @@ aliases: [/mailing_lists.html] --- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. 
The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. --> # Mailing Lists + --- We welcome you to join our mailing lists and let us know about your thoughts or ideas about Hive. ## User Mailing List + --- + The user list is for general discussion or questions on using Hive. Hive developers monitor this list and provide assistance when needed. @@ -41,7 +44,9 @@ developers monitor this list and provide assistance when needed. * Archives: [Apache][user_apache] ## Developer Mailing List + --- + The developer list is for Hive developers to discuss ongoing work, make decisions, and vote on technical issues. @@ -51,7 +56,9 @@ decisions, and vote on technical issues. * Archives: [Apache][dev_apache] ## Issues Mailing List + --- + The issues list receives all notifications from the [JIRA] issue tracker. * Subscribe: @@ -60,7 +67,9 @@ The issues list receives all notifications from the [JIRA] issue tracker. * Archives: [Apache][issues_apache] ## Commits Mailing List + --- + The commits list receives notifications with diffs when changes are committed to the Hive source tree. @@ -69,7 +78,9 @@ to the Hive source tree. * Archives: [Apache][commits_apache] ## Security Mailing List + --- + The security mailing list is a private list for discussion of potential security vulnerabilities. Please post potential security vulnerabilities to this list so that they may be investigated and fixed before the vulnerabilities are published.
__Note: This mailing list is NOT for end-user questions and discussion on security. Please use the user mailing list for such issues.__ @@ -79,11 +90,8 @@ The Hive security mailing list is : . In order to post to the list, it is __NOT__ necessary to first subscribe to it. [user_apache]: http://mail-archives.apache.org/mod_mbox/hive-user - [dev_apache]: http://mail-archives.apache.org/mod_mbox/hive-dev - [JIRA]: https://issues.apache.org/jira/browse/HIVE - [issues_apache]: http://mail-archives.apache.org/mod_mbox/hive-issues - [commits_apache]: http://mail-archives.apache.org/mod_mbox/hive-commits + diff --git a/content/community/meetings/_index.md b/content/community/meetings/_index.md index de61657d..0e44f73a 100644 --- a/content/community/meetings/_index.md +++ b/content/community/meetings/_index.md @@ -2,3 +2,4 @@ title: "Contributor Meetings" date: 2025-07-24 --- + diff --git a/content/community/meetings/contributorday2011.md b/content/community/meetings/contributorday2011.md index 368b223a..a40e1e4f 100644 --- a/content/community/meetings/contributorday2011.md +++ b/content/community/meetings/contributorday2011.md @@ -13,3 +13,4 @@ Resources for mini-hackathon: * you'll need a Mac or Linux development environment with Hive+Hadoop already installed on it per [these instructions]({{< ref "gettingstarted-latest" >}}); for Hive, use the snapshot * you'll also need [Apache ant](http://ant.apache.org) installed. * [HIVE-1545](https://issues.apache.org/jira/browse/HIVE-1545) has the UDF libraries we'd like to get cleaned up for inclusion in Hive or extension libraries (download core.tar.gz and/or ext.tar.gz) + diff --git a/content/community/meetings/contributorminutes20110907.md b/content/community/meetings/contributorminutes20110907.md index eb3e8f05..baff89c5 100644 --- a/content/community/meetings/contributorminutes20110907.md +++ b/content/community/meetings/contributorminutes20110907.md @@ -31,7 +31,3 @@ A question was posted to the meetup page by Amareshwari. 
She wanted to know if a Amareshwari posted a second question on moving Hive from the old mapred interface to MapReduce to the newer mapreduce. There was consensus amongst the committers present that it was better to stay on mapred for now since it was guaranteed to be stable even in 0.23, while mapreduce is evolving. - - - - diff --git a/content/community/meetings/contributorminutes20111205.md b/content/community/meetings/contributorminutes20111205.md index 7b73ece3..2eddf720 100644 --- a/content/community/meetings/contributorminutes20111205.md +++ b/content/community/meetings/contributorminutes20111205.md @@ -23,7 +23,3 @@ Ashutosh asked about a registry of available Hive storage handlers, and John ref Code walkthroughs were carried out for HIVE-2616 and HIVE-2589. - - - - diff --git a/content/community/meetings/contributorminutes20120418.md b/content/community/meetings/contributorminutes20120418.md index b138526b..d3ddc2e0 100644 --- a/content/community/meetings/contributorminutes20120418.md +++ b/content/community/meetings/contributorminutes20120418.md @@ -17,13 +17,9 @@ Carl said that he is organizing the Hive BoF session at this year's Hadoop Summi The discussion next turned to problems with Arc and Phabricator. Carl expressed concern that bugs have crept in over the past couple of months, and that it's no longer clear who is responsible for making sure Hive works with Arc/Phabricator. John pointed out that the issues which were raised on the dev mailing list last week have already been resolved. There was general consensus that when it works, Arc/Phabricator is an improvement on ReviewBoard. John proposed that we continue using Arc/Phabricator, and raise any problems with it on the dev mailing list. There were no objections. -Harish gave a short [presentation](https://github.com/hbutani/SQLWindowing/wiki/MoveToHive) on the [SQL Windowing library](https://github.com/hbutani/SQLWindowing) he wrote for Hive and how it might be integrated into Hive.
Everyone agreed that adding this functionality to Hive makes sense. Several people suggested adding the toolkit to the contrib module as-is and using it to generate interest with users, but concerns were raised that this might be painful to support/deprecate in the future. The discussion ended with general agreement that we should start work now to incrementally push this capability into Hive's query compiler. +Harish gave a short [presentation](https://github.com/hbutani/SQLWindowing/wiki/MoveToHive) on the [SQL Windowing library](https://github.com/hbutani/SQLWindowing) he wrote for Hive and how it might be integrated into Hive. Everyone agreed that adding this functionality to Hive makes sense. Several people suggested adding the toolkit to the contrib module as-is and using it to generate interest with users, but concerns were raised that this might be painful to support/deprecate in the future. The discussion ended with general agreement that we should start work now to incrementally push this capability into Hive's query compiler. Carl explained the motivations and design decisions behind the HiveServer2 API proposal. The main motivations are supporting concurrency and providing a better foundation on which to build ODBC and JDBC drivers. Work on this project has started and is being tracked in [HIVE-2935](https://issues.apache.org/jira/browse/HIVE-2935). Namit offered to host the next contrib meeting at Facebook. - - - - diff --git a/content/community/meetings/contributorsminutes110726.md b/content/community/meetings/contributorsminutes110726.md index 1bc7907d..565d13d6 100644 --- a/content/community/meetings/contributorsminutes110726.md +++ b/content/community/meetings/contributorsminutes110726.md @@ -25,7 +25,3 @@ Syed Albiz: working on automatic usage of indexes for (a) bitmap indexes and (b) Charles Chen: working on improving view support (ALTER VIEW RENAME, CREATE OR REPLACE VIEW, CREATE table LIKE view). 
Will follow up with support for partitioned join views and simple updatable views. - - - - diff --git a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100601.md b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100601.md index 3662f4ef..a778a1fa 100644 --- a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100601.md +++ b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100601.md @@ -18,27 +18,23 @@ The following were the main meeting minutes: 1. We should have these meetings more often, say every month. Cloudera will host the next meeting. - 2. We should try to have a release every 4 months. We should try to push out 0.6 before end of June, For the new release, Cloudera will take a lead on the release management issues and also help with documentation. Documentation for Hive leaves a lot to be desired. +2. We should try to have a release every 4 months. We should try to push out 0.6 before end of June. For the new release, Cloudera will take a lead on the release management issues and also help with documentation. Documentation for Hive leaves a lot to be desired. - 3. The test framework is pretty brittle, and it is pretty difficult for new people to do big contributions without having a very sound test-plan. Ideally, facebook should host a test cluster so that everyone can run tests there. +3. The test framework is pretty brittle, and it is pretty difficult for new people to do big contributions without having a very sound test-plan. Ideally, facebook should host a test cluster so that everyone can run tests there. - 4. A lot of external customers are asking for ODBC/JDBC support on top of Hive. Cloudera will take the lead on that. +4. A lot of external customers are asking for ODBC/JDBC support on top of Hive. Cloudera will take the lead on that. - 5. The process of making a new committer should be more transparent.
In order to grow the community, it would be very desirable to add more committers outside Facebook. +5. The process of making a new committer should be more transparent. In order to grow the community, it would be very desirable to add more committers outside Facebook. - 6. Create new components for Drivers (ODBC/JDBC) and UDFs. +6. Create new components for Drivers (ODBC/JDBC) and UDFs. - 7. Yahoo will take the lead of making Hive work on top of Zebra +7. Yahoo will take the lead of making Hive work on top of Zebra Some new tasks were identified, but they can change if new priorities come in. - 8. Carl will focusing on 'having' support and co-related sub-queries. +8. Carl will be focusing on 'having' support and correlated sub-queries. - 9. Arvind will be focusing on the cost-based optimizer +9. Arvind will be focusing on the cost-based optimizer The main idea was that we should meet more often and share our ideas. Time-based release will be very desirable. - - - - diff --git a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100706.md b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100706.md index 8ab4222a..509052b6 100644 --- a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100706.md +++ b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100706.md @@ -8,23 +8,19 @@ date: 2024-12-12 Attendees: Amr Awadallah, John Sichi, Paul Yang, Olga Natkovich, Ajay Kidave, Yongqiang He, Basab Malik, Vinithra Varadharajan, bc Wong, Arvind Prabhakar, Carl Steinbach * bc Wong gave a live demo of Cloudera's Hue framework and the Beeswax Hive web interface. - + Slides from this talk are available here: - + Hue was recently released as open source. The code is available on Github here: + + Slides from this talk are available here: + + Hue was recently released as open source.
The code is available on Github here: * Olga Natkovich gave a whiteboard talk on HOwl. - + HOwl = Hive !MetaStore + Owl = shared metadata system between Pig, Hive, and Map Reduce - + HOwl will likely leverage the !MetaStore schema and ORM layer. - + A somewhat outdated Owl design document is available here: + + HOwl = Hive !MetaStore + Owl = shared metadata system between Pig, Hive, and Map Reduce + + HOwl will likely leverage the !MetaStore schema and ORM layer. + + A somewhat outdated Owl design document is available here: * Carl gave an update on progress with the 0.6.0 release. - + There was a discussion about the plan to move the documentation off of the wiki and into version control. - + Several people voiced concerns that developers/users are less likely to update the documentation if doing so requires them to submit a patch. - + The new proposal for documentation reached at the meeting is as follows: - - The trunk version of the documentation will be maintained on the wiki. - - As part of the release process the documentation will be copied off of the wiki and converted to xdoc, and then checked into svn. - - HTML documentation generated from the xdoc will be posted to the Hive webpage when the new release is posted. - + Carl is going to investigate the feasibility of writing a tool that converts documentation directly from !MoinMoin wiki markup to xdoc. + + There was a discussion about the plan to move the documentation off of the wiki and into version control. + + Several people voiced concerns that developers/users are less likely to update the documentation if doing so requires them to submit a patch. + + The new proposal for documentation reached at the meeting is as follows: + - The trunk version of the documentation will be maintained on the wiki. + - As part of the release process the documentation will be copied off of the wiki and converted to xdoc, and then checked into svn. 
+ - HTML documentation generated from the xdoc will be posted to the Hive webpage when the new release is posted. + + Carl is going to investigate the feasibility of writing a tool that converts documentation directly from !MoinMoin wiki markup to xdoc. * John agreed to host the next contributors meeting at Facebook. - - - - diff --git a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100808.md b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100808.md index b2f4f67a..a238b6f8 100644 --- a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100808.md +++ b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100808.md @@ -8,22 +8,18 @@ date: 2024-12-12 August 8th, 2010 * Yongqiang He gave a presentation about his work on index support in Hive. - + Slides are available here: + + Slides are available here: * John Sichi talked about his work on filter-pushdown optimizations. This is applicable to the HBase storage handler and the new index infrastructure. * Pradeep Kamath gave an update on progress with Howl. - + The Howl source code is available on GitHub here: - + Starting to work on security for Howl. For the first iteration the plan is to base it on DFS permissions. + + The Howl source code is available on GitHub here: + + Starting to work on security for Howl. For the first iteration the plan is to base it on DFS permissions. * General agreement that we should aim to desupport pre-0.20.0 versions of Hadoop in Hive 0.7.0. This will allow us to remove the shim layer and will make it easier to transition to the new mapreduce APIs. But we also want to get a better idea of how many users are stuck on pre-0.20 versions of Hadoop. * Remove Thrift generated code from repository. - + Pro: reduce noise in diffs during reviews. - + Con: requires developers to install Thrift compiler. + + Pro: reduce noise in diffs during reviews. 
+ + Con: requires developers to install Thrift compiler. * Discussed moving the documentation from the wiki to version control. - + Probably not practical to maintain the trunk version of the docs on the wiki and roll over to version control at release time, so trunk version of docs will be maintained in vcs. - + It was agreed that feature patches should include updates to the docs, but it is also acceptable to file a doc ticket if there is time pressure to commit.j - + Will maintain an errata page on the wiki for collecting updates/corrections from users. These notes will be rolled into the documentation in vcs on a monthly basis. + + Probably not practical to maintain the trunk version of the docs on the wiki and roll over to version control at release time, so trunk version of docs will be maintained in vcs. + + It was agreed that feature patches should include updates to the docs, but it is also acceptable to file a doc ticket if there is time pressure to commit. + + Will maintain an errata page on the wiki for collecting updates/corrections from users. These notes will be rolled into the documentation in vcs on a monthly basis. * The next meeting will be held in September at Cloudera's office in Palo Alto. - - - - diff --git a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100913.md b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100913.md index 2230cc18..f2e9d91f 100644 --- a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100913.md +++ b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes100913.md @@ -21,7 +21,3 @@ Finally, [HIVE-1476](https://issues.apache.org/jira/browse/HIVE-1476) (metastore The October meetup will be at Facebook HQ in Palo Alto.
- - - - diff --git a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes101025.md b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes101025.md index 98b89230..40e8df48 100644 --- a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes101025.md +++ b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes101025.md @@ -35,7 +35,3 @@ Shyam mentioned some ongoing work on real-time Hadoop that might be an interesti Next meeting will be at Cloudera. - - - - diff --git a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes110425.md b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes110425.md index 0770c6f6..83cae3d7 100644 --- a/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes110425.md +++ b/content/community/meetings/development-contributorsmeetings-hivecontributorsminutes110425.md @@ -39,7 +39,3 @@ There was discussion around using some Yahoo QA machines (or the OSUOSL cluster) We ended with some review and discussion of HIVE-2038 (metastore listener). It was decided that generic events are going to be dealt with in a followup JIRA. 
- - - - diff --git a/content/community/meetings/development-contributorsmeetings.md b/content/community/meetings/development-contributorsmeetings.md index 0feefc3a..72e99175 100644 --- a/content/community/meetings/development-contributorsmeetings.md +++ b/content/community/meetings/development-contributorsmeetings.md @@ -24,7 +24,3 @@ Active contributors to the Hive project are invited to attend the monthly Hive C * [July 6, 2010]({{< ref "development-contributorsmeetings-hivecontributorsminutes100706" >}}) * [June 1, 2010]({{< ref "development-contributorsmeetings-hivecontributorsminutes100601" >}}) - - - - diff --git a/content/community/meetings/hivecontributorsminutes100601.md b/content/community/meetings/hivecontributorsminutes100601.md index a942619c..9553570f 100644 --- a/content/community/meetings/hivecontributorsminutes100601.md +++ b/content/community/meetings/hivecontributorsminutes100601.md @@ -18,27 +18,23 @@ The following were the main meeting minutes: 1. We should have these meetings more often, say every month. Cloudera will host the next meeting. - 2. We should try to have a release every 4 months. We should try to push out 0.6 before end of June, For the new release, Cloudera will take a lead on the release management issues and also help with documentation. Documentation for Hive leaves a lot to be desired. +2. We should try to have a release every 4 months. We should try to push out 0.6 before end of June. For the new release, Cloudera will take a lead on the release management issues and also help with documentation. Documentation for Hive leaves a lot to be desired. - 3. The test framework is pretty brittle, and it is pretty difficult for new people to do big contributions without having a very sound test-plan.
Ideally, facebook should host a test cluster so that everyone can run tests there. +3. The test framework is pretty brittle, and it is pretty difficult for new people to do big contributions without having a very sound test-plan. Ideally, facebook should host a test cluster so that everyone can run tests there. - 4. A lot of external customers are asking for ODBC/JDBC support on top of Hive. Cloudera will take the lead on that. +4. A lot of external customers are asking for ODBC/JDBC support on top of Hive. Cloudera will take the lead on that. - 5. The process of making a new committer should be more transparent. In order to grow the community, it would be very desirable to add more committers outside Facebook. +5. The process of making a new committer should be more transparent. In order to grow the community, it would be very desirable to add more committers outside Facebook. - 6. Create new components for Drivers (ODBC/JDBC) and UDFs. +6. Create new components for Drivers (ODBC/JDBC) and UDFs. - 7. Yahoo will take the lead of making Hive work on top of Zebra +7. Yahoo will take the lead of making Hive work on top of Zebra Some new tasks were identified, but they can change if new priorities come in. - 8. Carl will focusing on 'having' support and co-related sub-queries. +8. Carl will be focusing on 'having' support and correlated sub-queries. - 9. Arvind will be focusing on the cost-based optimizer +9. Arvind will be focusing on the cost-based optimizer The main idea was that we should meet more often and share our ideas. Time-based release will be very desirable. - - - - diff --git a/content/community/meetings/hivecontributorsminutes100706.md b/content/community/meetings/hivecontributorsminutes100706.md index 1ec95d17..813bfe1b 100644 --- a/content/community/meetings/hivecontributorsminutes100706.md +++ b/content/community/meetings/hivecontributorsminutes100706.md @@ -8,23 +8,19 @@ date: 2024-12-12 Attendees: Amr Awadallah, John Sichi, Paul Yang, Olga Natkovich, Ajay Kidave, Yongqiang He, Basab Malik, Vinithra Varadharajan, bc Wong, Arvind Prabhakar, Carl Steinbach * bc Wong gave a live demo of Cloudera's Hue framework and the Beeswax Hive web interface.
- + Slides from this talk are available here: - + Hue was recently released as open source. The code is available on Github here: + + Slides from this talk are available here: + + Hue was recently released as open source. The code is available on Github here: * Olga Natkovich gave a whiteboard talk on HOwl. - + HOwl = Hive !MetaStore + Owl = shared metadata system between Pig, Hive, and Map Reduce - + HOwl will likely leverage the !MetaStore schema and ORM layer. - + A somewhat outdated Owl design document is available here: + + HOwl = Hive !MetaStore + Owl = shared metadata system between Pig, Hive, and Map Reduce + + HOwl will likely leverage the !MetaStore schema and ORM layer. + + A somewhat outdated Owl design document is available here: * Carl gave an update on progress with the 0.6.0 release. - + There was a discussion about the plan to move the documentation off of the wiki and into version control. - + Several people voiced concerns that developers/users are less likely to update the documentation if doing so requires them to submit a patch. - + The new proposal for documentation reached at the meeting is as follows: - - The trunk version of the documentation will be maintained on the wiki. - - As part of the release process the documentation will be copied off of the wiki and converted to xdoc, and then checked into svn. - - HTML documentation generated from the xdoc will be posted to the Hive webpage when the new release is posted. - + Carl is going to investigate the feasibility of writing a tool that converts documentation directly from !MoinMoin wiki markup to xdoc. + + There was a discussion about the plan to move the documentation off of the wiki and into version control. + + Several people voiced concerns that developers/users are less likely to update the documentation if doing so requires them to submit a patch. 
+ + The new proposal for documentation reached at the meeting is as follows: + - The trunk version of the documentation will be maintained on the wiki. + - As part of the release process the documentation will be copied off of the wiki and converted to xdoc, and then checked into svn. + - HTML documentation generated from the xdoc will be posted to the Hive webpage when the new release is posted. + + Carl is going to investigate the feasibility of writing a tool that converts documentation directly from !MoinMoin wiki markup to xdoc. * John agreed to host the next contributors meeting at Facebook. - - - - diff --git a/content/community/resources/_index.md b/content/community/resources/_index.md index 9e710a4b..e784def4 100644 --- a/content/community/resources/_index.md +++ b/content/community/resources/_index.md @@ -2,3 +2,4 @@ title: "Resources" date: 2025-07-24 --- + diff --git a/content/community/resources/books-about-hive.md b/content/community/resources/books-about-hive.md index b1096c11..ac8388a0 100644 --- a/content/community/resources/books-about-hive.md +++ b/content/community/resources/books-about-hive.md @@ -29,7 +29,3 @@ Related books: * [Apache Iceberg: The Definitive Guide: Data Lakehouse Functionality, Performance, and Scalability on the Data Lake](https://www.amazon.com/dp/1098148622) by Tomer Shiran, Jason Hughes, Alex Merced - - - - diff --git a/content/community/resources/books-blogs-talks.md b/content/community/resources/books-blogs-talks.md index 2d684a99..b652cef9 100644 --- a/content/community/resources/books-blogs-talks.md +++ b/content/community/resources/books-blogs-talks.md @@ -7,37 +7,33 @@ date: 2024-12-12 * **Books:** * + *[Programming Hive](http://shop.oreilly.com/product/0636920023555.do)* by Edward Capriolo, Dean Wampler, and Jason Rutherglen – O'Reilly Media, 2012 - + *[Apache Hive Essentials](https://www.packtpub.com/application-development/apache-hive-essentials-second-edition)* by Dayong Du – Packt Publishing, 
[2015](http://bit.ly/1QVANQA) and [2018 (second edition)](https://www.packtpub.com/application-development/apache-hive-essentials-second-edition) - + [*Apache Hive Cookbook*](https://www.packtpub.com/big-data-and-business-intelligence/apache-hive-cookbook) by Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra – Packt Publishing, 2016 - + *[Instant Apache Hive Essentials How-to](http://bit.ly/1iKrQhV)* by Darren Lee – Packt Publishing, 2013 - + *[Practical Hive](https://www.apress.com/us/book/9781484202722)* by Scott Shaw, Andreas François Vermeulen, Ankur Gupta, and David Kjerrumgaard – Apress, 2016 - + [*The Ultimate Guide to Programming Apache Hive*](https://www.goodreads.com/book/show/25948488-the-ultimate-guide-to-programming-apache-hive) by Fru Nde – NextGen Publishing, 2015 - + [*Learn Hive in 1 Day*](https://www.amazon.com/Learn-Hive-Day-Complete-Master/dp/1521596778/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=1545009451&sr=1-1) by Krishna Rungta – independently published, 2017 + + *[Apache Hive Essentials](https://www.packtpub.com/application-development/apache-hive-essentials-second-edition)* by Dayong Du – Packt Publishing, [2015](http://bit.ly/1QVANQA) and [2018 (second edition)](https://www.packtpub.com/application-development/apache-hive-essentials-second-edition) + + [*Apache Hive Cookbook*](https://www.packtpub.com/big-data-and-business-intelligence/apache-hive-cookbook) by Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra – Packt Publishing, 2016 + + *[Instant Apache Hive Essentials How-to](http://bit.ly/1iKrQhV)* by Darren Lee – Packt Publishing, 2013 + + *[Practical Hive](https://www.apress.com/us/book/9781484202722)* by Scott Shaw, Andreas François Vermeulen, Ankur Gupta, and David Kjerrumgaard – Apress, 2016 + + [*The Ultimate Guide to Programming Apache Hive*](https://www.goodreads.com/book/show/25948488-the-ultimate-guide-to-programming-apache-hive) by Fru Nde – NextGen Publishing, 2015 + + [*Learn Hive in 1 
Day*](https://www.amazon.com/Learn-Hive-Day-Complete-Master/dp/1521596778/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=1545009451&sr=1-1) by Krishna Rungta – independently published, 2017 Books primarily about Hadoop, with some coverage of Hive: * + *[Hadoop: The Definitive Guide](http://shop.oreilly.com/product/0636920033448.do)* by Tom White (one chapter on Hive) – O'Reilly Media, [2009](http://shop.oreilly.com/product/9780596521981.do), [2010](http://shop.oreilly.com/product/0636920010388.do), [2012](http://shop.oreilly.com/product/0636920021773.do), and [2015 (fourth edition)](http://shop.oreilly.com/product/0636920033448.do) - + [*Hadoop in Action*](https://www.manning.com/books/hadoop-in-action) by Chuck Lam (one chapter on Hive) – Manning Publications, 2010 - + [*Hadoop in Practice*](https://www.manning.com/books/hadoop-in-practice) by Alex Holmes (one chapter on Hive) – Manning Publications, 2012 + + [*Hadoop in Action*](https://www.manning.com/books/hadoop-in-action) by Chuck Lam (one chapter on Hive) – Manning Publications, 2010 + + [*Hadoop in Practice*](https://www.manning.com/books/hadoop-in-practice) by Alex Holmes (one chapter on Hive) – Manning Publications, 2012 Online book: * + *[The Free Hive Book](https://github.com/Prokopp/the-free-hive-book)* by Christian Prokopp – GitHub * **Blogs:** - + [Apache Hive 4.x With Apache Iceberg (Part-I)](https://link.medium.com/WUuuzhdtxBb) by Ayush Saxena - + [Apache Hive-4.x with Iceberg Branches & Tags](https://medium.com/@ayushtkn/apache-hive-4-x-with-iceberg-branches-tags-3d52293ac0bf) by Ayush Saxena - + [Data Federation with Apache Hive](https://link.medium.com/JKRi6qMGmBb) by Akshat Mathur - + [Apache Hive: ESRI GeoSpatial Support](http://link.medium.com/i1p4vwWODAb) by Ayush Saxena - + [Open Data Lakehouse powered by Iceberg for all your Data Warehouse needs](https://blog.cloudera.com/open-data-lakehouse-powered-by-iceberg-for-all-your-data-warehouse-needs/) by Zoltan Borok-Nagy, Ayush Saxena, Tamas Mate, 
Simhadri Govindappa - + [Optimizing Hive on Tez Performance](https://blog.cloudera.com/optimizing-hive-on-tez-performance/) by Jay Desai - + [How to run queries periodically in Apache Hive](https://blog.cloudera.com/how-to-run-queries-periodically-in-apache-hive/) by Zoltan Haindrich and Jesus Camacho Rodriguez - + [Why We Need Hive Metastore](https://blog.jetbrains.com/big-data-tools/2022/07/01/why-we-need-hive-metastore/) by Pasha Finkelshteyn + + [Apache Hive 4.x With Apache Iceberg (Part-I)](https://link.medium.com/WUuuzhdtxBb) by Ayush Saxena + + [Apache Hive-4.x with Iceberg Branches & Tags](https://medium.com/@ayushtkn/apache-hive-4-x-with-iceberg-branches-tags-3d52293ac0bf) by Ayush Saxena + + [Data Federation with Apache Hive](https://link.medium.com/JKRi6qMGmBb) by Akshat Mathur + + [Apache Hive: ESRI GeoSpatial Support](http://link.medium.com/i1p4vwWODAb) by Ayush Saxena + + [Open Data Lakehouse powered by Iceberg for all your Data Warehouse needs](https://blog.cloudera.com/open-data-lakehouse-powered-by-iceberg-for-all-your-data-warehouse-needs/) by Zoltan Borok-Nagy, Ayush Saxena, Tamas Mate, Simhadri Govindappa + + [Optimizing Hive on Tez Performance](https://blog.cloudera.com/optimizing-hive-on-tez-performance/) by Jay Desai + + [How to run queries periodically in Apache Hive](https://blog.cloudera.com/how-to-run-queries-periodically-in-apache-hive/) by Zoltan Haindrich and Jesus Camacho Rodriguez + + [Why We Need Hive Metastore](https://blog.jetbrains.com/big-data-tools/2022/07/01/why-we-need-hive-metastore/) by Pasha Finkelshteyn * **Talks:** - + [Apache Hive Replication V3(ApacheCon)](https://www.youtube.com/watch?v=pzyaGC6i7t4) by Pravin Kumar Sinha - + [Transactional SQL in Apache Hive](https://www.youtube.com/watch?v=Rk8irGDjpuI) by Eugene  Koifman - + [Transactional operations in Apache Hive: present and future](https://www.youtube.com/watch?v=GyzU9wG0cFQ&t=834s) by Eugene  Koifman - - - - + + [Apache Hive Replication 
V3(ApacheCon)](https://www.youtube.com/watch?v=pzyaGC6i7t4) by Pravin Kumar Sinha + + [Transactional SQL in Apache Hive](https://www.youtube.com/watch?v=Rk8irGDjpuI) by Eugene Koifman + + [Transactional operations in Apache Hive: present and future](https://www.youtube.com/watch?v=GyzU9wG0cFQ&t=834s) by Eugene Koifman diff --git a/content/community/resources/building-hive-from-source.md index 72cd5d1e..6818fb73 100644 --- a/content/community/resources/building-hive-from-source.md +++ b/content/community/resources/building-hive-from-source.md @@ -6,18 +6,14 @@ date: 2024-12-12 # Apache Hive : Building Hive from Source * **Fetching the source code** - + Using the source tar: Download the source tar from [TODO: Put link post release] and untar - + From Git tag: Checkout the release tag using git clone --branch rel/release-4.0.0 + + Using the source tar: Download the source tar from [TODO: Put link post release] and untar + + From Git tag: Checkout the release tag using git clone --branch rel/release-4.0.0 * **Building Distribution** - + Run: mvn clean install -DskipTests -Pdist -Piceberg -Pitests - + Find the built tar under packaging/target/apache-hive-* + + Run: mvn clean install -DskipTests -Pdist -Piceberg -Pitests + + Find the built tar under packaging/target/apache-hive-* * **Running Unit Tests** - + Run: mvn clean install -Piceberg + + Run: mvn clean install -Piceberg * **Running Integration Tests** - + GoTo itests directory - + Run: mvn clean test -pl itest -Piceberg + + Go to the itests directory + + Run: mvn clean test -pl itest -Piceberg diff --git a/content/community/resources/developerguide-udtf.md index 1cb77fa0..e941c3a8 100644 --- a/content/community/resources/developerguide-udtf.md +++ b/content/community/resources/developerguide-udtf.md @@ -134,7 +134,3 @@ public abstract class GenericUDTF { ``` - - - - diff --git 
a/content/community/resources/developerguide.md b/content/community/resources/developerguide.md index e31fe648..bcbc5ec0 100644 --- a/content/community/resources/developerguide.md +++ b/content/community/resources/developerguide.md @@ -108,14 +108,14 @@ As of [Hive 0.14](https://issues.apache.org/jira/browse/HIVE-5976)a registration The following mappings have been added through this registration mechanism: -| Syntax | Equivalent | -| --- | --- | -| STORED AS AVRO /STORED AS AVROFILE | `ROW FORMAT SERDE``'org.apache.hadoop.hive.serde2.avro.AvroSerDe'``STORED AS INPUTFORMAT``'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'` | -| STORED AS ORC /STORED AS ORCFILE | `ROW FORMAT SERDE````'org.apache.hadoop.hive.ql.io.orc.OrcSerde````'``STORED AS INPUTFORMAT````'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat````'``OUTPUTFORMAT````'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat````'` | +| Syntax | Equivalent | +|------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| STORED AS AVRO /STORED AS AVROFILE | `ROW FORMAT SERDE``'org.apache.hadoop.hive.serde2.avro.AvroSerDe'``STORED AS INPUTFORMAT``'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'` | +| STORED AS ORC /STORED AS ORCFILE | `ROW FORMAT SERDE````'org.apache.hadoop.hive.ql.io.orc.OrcSerde````'``STORED AS INPUTFORMAT````'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat````'``OUTPUTFORMAT````'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat````'` | | STORED AS PARQUET /STORED AS PARQUETFILE | `ROW FORMAT SERDE```'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe```'``STORED AS 
INPUTFORMAT```'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat```'``OUTPUTFORMAT```'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat```'` | -| STORED AS RCFILE | `STORED AS INPUTFORMAT``'org.apache.hadoop.hive.ql.io.RCFileInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'` | -| STORED AS SEQUENCEFILE | `STORED AS INPUTFORMAT``'org.apache.hadoop.mapred.SequenceFileInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.mapred.SequenceFileOutputFormat'` | -| STORED AS TEXTFILE | `STORED AS INPUTFORMAT``'org.apache.hadoop.mapred.TextInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'` | +| STORED AS RCFILE | `STORED AS INPUTFORMAT``'org.apache.hadoop.hive.ql.io.RCFileInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'` | +| STORED AS SEQUENCEFILE | `STORED AS INPUTFORMAT``'org.apache.hadoop.mapred.SequenceFileInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.mapred.SequenceFileOutputFormat'` | +| STORED AS TEXTFILE | `STORED AS INPUTFORMAT``'org.apache.hadoop.mapred.TextInputFormat'``OUTPUTFORMAT``'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'` | To add a new native SerDe with STORED AS keyword, follow these steps: @@ -251,8 +251,7 @@ Hive uses [JUnit](http://junit.org/) for unit tests. Each of the 3 main componen ### Debugging Hive Code - - Hive code includes both client-side code (e.g., compiler, semantic analyzer, and optimizer of HiveQL) and server-side code (e.g., operator/task/SerDe implementations). Debugging is different for client-side and server-side code, as described below. +Hive code includes both client-side code (e.g., compiler, semantic analyzer, and optimizer of HiveQL) and server-side code (e.g., operator/task/SerDe implementations). Debugging is different for client-side and server-side code, as described below. 
#### Debugging Client-Side Code @@ -260,13 +259,13 @@ The client-side code runs on your local machine so you can easily debug it using * Make sure that you have run `ant model-jar` in hive/metastore and `ant gen-test` in hive since the last time you ran `ant clean`. * To run all of the unit tests for the CLI: - + Open up TestCliDriver.java - + Click Run->Debug Configurations, select TestCliDriver, and click Debug. + + Open up TestCliDriver.java + + Click Run->Debug Configurations, select TestCliDriver, and click Debug. * To run a single test within TestCliDriver.java: - + Begin running the whole TestCli suite as before. - + Once it finishes the setup and starts executing the JUnit tests, stop the test execution. - + Find the desired test in the JUnit pane, - + Right click on that test and select Debug. + + Begin running the whole TestCli suite as before. + + Once it finishes the setup and starts executing the JUnit tests, stop the test execution. + + Find the desired test in the JUnit pane, + + Right click on that test and select Debug. #### Debugging Server-Side Code @@ -286,6 +285,7 @@ If you have already built Hive without javac.debug=on, you can clean the build a > ant -Djavac.debug=on package ``` + * Run ant test with additional options to tell the Java VM that is running Hive server-side code to wait for the debugger to attach. First define some convenient macros for debugging. You can put it in your .bashrc or .cshrc. ``` @@ -308,6 +308,7 @@ The unit test will run until it shows: [junit] Listening for transport dt_socket at address: 8000 ``` + * Now, you can use jdb to attach to port 8000 to debug ``` @@ -317,46 +318,42 @@ The unit test will run until it shows: or if you are running Eclipse and the Hive projects are already imported, you can debug with Eclipse. Under Eclipse Run -> Debug Configurations, find "Remote Java Application" at the bottom of the left panel. There should be a MapRedTask configuration already. 
If there is no such configuration, you can create one with the following property: - + Name: any task such as MapRedTask - + Project: the Hive project that you imported. - + Connection Type: Standard (Socket Attach) - + Connection Properties: - - Host: localhost - - Port: 8000 - Then hit the "Debug" button and Eclipse will attach to the JVM listening on port 8000 and continue running till the end. If you define breakpoints in the source code before hitting the "Debug" button, it will stop there. The rest is the same as debugging client-side Hive. ++ Name: any task such as MapRedTask ++ Project: the Hive project that you imported. ++ Connection Type: Standard (Socket Attach) ++ Connection Properties: + - Host: localhost + - Port: 8000 + Then hit the "Debug" button and Eclipse will attach to the JVM listening on port 8000 and continue running till the end. If you define breakpoints in the source code before hitting the "Debug" button, it will stop there. The rest is the same as debugging client-side Hive. #### Debugging without Ant (Client and Server Side) There is another way of debugging Hive code without going through Ant. - You need to install Hadoop and set the environment variable HADOOP_HOME to that. +You need to install Hadoop and set the environment variable HADOOP_HOME to that. ``` - > export HADOOP_HOME= - +> export HADOOP_HOME= ``` Then, start Hive: ``` - > ./build/dist/bin/hive --debug - +> ./build/dist/bin/hive --debug ``` It will then act similar to the debugging steps outlines in Debugging Hive code. It is faster since there is no need to compile Hive code, - and go through Ant. It can be used to debug both client side and server side Hive. +and go through Ant. It can be used to debug both client side and server side Hive. If you want to debug a particular query, start Hive and perform the steps needed before that query. Then start Hive again in debug to debug that query. 
``` - > ./build/dist/bin/hive - > perform steps before the query - +> ./build/dist/bin/hive +> perform steps before the query ``` ``` - > ./build/dist/bin/hive --debug - > run the query - +> ./build/dist/bin/hive --debug +> run the query ``` Note that the local file system will be used, so the space on your machine will not be released automatically (unlike debugging via Ant, where the tables created in test are automatically dropped at the end of the test). Make sure to either drop the tables explicitly, or drop the data from /User/hive/warehouse. @@ -381,7 +378,3 @@ Please refer to [Hive User Group Meeting August 2009](http://www.slideshare.net/   - - - - diff --git a/content/community/resources/hivedeveloperfaq.md b/content/community/resources/hivedeveloperfaq.md index 8c6b2ff2..3e9e152a 100644 --- a/content/community/resources/hivedeveloperfaq.md +++ b/content/community/resources/hivedeveloperfaq.md @@ -145,13 +145,11 @@ mvn clean install -DskipTests -Pprotobuf mvn clean install -Pthriftif -DskipTests -Dthrift.home=/usr/local ``` - - -Don’t forget to update `hive_metastore.proto` when changing  `hive_metastore.thrift +Don’t forget to update `hive_metastore.proto` when changing  `hive_metastore.thrift [![](https://issues.apache.org/jira/secure/viewavatar?size=xsmall&avatarId=21141&avatarType=issuetype)HIVE-26769](https://issues.apache.org/jira/browse/HIVE-26769?src=confmacro) - - - [TRACKING] gRPC support for Hive metastore +- +[TRACKING] gRPC support for Hive metastore Open`### How to run findbugs after a change? ``` diff --git a/content/community/resources/howtocommit.md b/content/community/resources/howtocommit.md index f436536e..0fdb45c9 100644 --- a/content/community/resources/howtocommit.md +++ b/content/community/resources/howtocommit.md @@ -51,6 +51,7 @@ When you commit/merge a Pull Request, please: 6. Thank the contributor(s), the reviewers, and the reporter of the issue (if different from the contributor). 
It is easier to thank the people in GitHub by mentioning their GitHub ids under the respective Pull Request. Below you can find a sample commit message that adheres to the guidelines outlined here. + ``` HIVE-27424: Display dependency:tree in GitHub actions (#5756) ``` @@ -76,7 +77,3 @@ Note: Committers or individuals with Apache Id can directly join the #hive slack Instructions to add folks to ASF hive channel: - - - - diff --git a/content/community/resources/howtocontribute.md b/content/community/resources/howtocontribute.md index 3e7be745..9c96ef3f 100644 --- a/content/community/resources/howtocontribute.md +++ b/content/community/resources/howtocontribute.md @@ -29,8 +29,8 @@ This is an optional step. Eclipse has a lot of advanced features for Java develo This checklist tells you how to create accounts and obtain permissions needed by Hive contributors. See the [Hive website](http://hive.apache.org/) for additional information. * Request an Apache Software Foundation [JIRA account](/community/resources/howtocontribute#request-account), if you do not already have one. - + The ASF JIRA system dashboard is [here](https://issues.apache.org/jira/secure/Dashboard.jspa). - + The Hive JIRA is [here](https://issues.apache.org/jira/browse/HIVE). + + The ASF JIRA system dashboard is [here](https://issues.apache.org/jira/secure/Dashboard.jspa). + + The Hive JIRA is [here](https://issues.apache.org/jira/browse/HIVE). * To review patches check the open [pull requests on GitHub](https://github.com/apache/hive/pulls) * To contribute to the Hive wiki, follow the instructions in [About This Wiki]({{< ref "#about-this-wiki" >}}). * To edit the Hive website, follow the instructions in [How to edit the website](https://github.com/apache/hive-site/blob/main/README.md). @@ -49,24 +49,24 @@ Modify the source code and add some features using your favorite IDE. Please take care about the following points. 
* All public classes and methods should have informative [Javadoc comments](http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html). - + Do not use @author tags. + + Do not use @author tags. * Code should be formatted according to [Sun's conventions](http://web.archive.org/web/20140228225807/http://www.oracle.com/technetwork/java/codeconventions-150003.pdf), with two exceptions: - + Indent two (2) spaces per level, not four (4). - + Line length limit is 120 chars, instead of 80 chars. + + Indent two (2) spaces per level, not four (4). + + Line length limit is 120 chars, instead of 80 chars. * An Eclipse [formatter](https://github.com/apache/hive/blob/master/dev-support/eclipse-styles.xml) is provided in the dev-support folder – this can be used with both Eclipse and Intellij. Please consider importing this before editing the source code. - + For Eclipse: - - Go to Preferences -> Java -> Code Style -> Formatter; Import eclipse-styles.xml; Apply. - - In addition update save actions: Java -> Editor -> Save Actions; Check the following: Perform the following actions on save; Format Source Code; Format edited lines. - + For Intellij: - - Go to Settings -> Editor -> Code Style -> Java -> Scheme; Click manage; Import eclipse-styles.xml; Apply. + + For Eclipse: + - Go to Preferences -> Java -> Code Style -> Formatter; Import eclipse-styles.xml; Apply. + - In addition update save actions: Java -> Editor -> Save Actions; Check the following: Perform the following actions on save; Format Source Code; Format edited lines. + + For Intellij: + - Go to Settings -> Editor -> Code Style -> Java -> Scheme; Click manage; Import eclipse-styles.xml; Apply. * Contributions should not introduce new Checkstyle violations. - + Check for new [Checkstyle](http://checkstyle.sourceforge.net/) violations by running `mvn checkstyle:checkstyle-aggregate`, and then inspect the results in the `target/site` directory. 
It is possible to run the checks for a specific module, if the  `mvn` command is issued in the root directory of the module. - + If you use Eclipse you should install the [eclipse-cs Checkstyle plugin](http://eclipse-cs.sourceforge.net/). This plugin highlights violations in your code and is also able to automatically correct some types of violations. + + Check for new [Checkstyle](http://checkstyle.sourceforge.net/) violations by running `mvn checkstyle:checkstyle-aggregate`, and then inspect the results in the `target/site` directory. It is possible to run the checks for a specific module, if the `mvn` command is issued in the root directory of the module. + + If you use Eclipse you should install the [eclipse-cs Checkstyle plugin](http://eclipse-cs.sourceforge.net/). This plugin highlights violations in your code and is also able to automatically correct some types of violations. * Contributions should pass existing unit tests. * New unit tests should be provided to demonstrate bugs and fixes. [JUnit](http://www.junit.org) is our test framework: - + You should create test classes for junit4, whose class name must start with a 'Test' prefix. - + You can run all the unit tests with the command `mvn test`, or you can run a specific unit test with the command `mvn test -Dtest=` (for example: `mvn test -Dtest=TestFileSystem`). - + After uploading your patch, it might worthwhile to check if your new test has been executed in the precommit job. + + You should create test classes for junit4, whose class name must start with a 'Test' prefix. + + You can run all the unit tests with the command `mvn test`, or you can run a specific unit test with the command `mvn test -Dtest=` (for example: `mvn test -Dtest=TestFileSystem`). + + After uploading your patch, it might be worthwhile to check if your new test has been executed in the precommit job. 
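The "'Test' prefix" naming convention above can be checked mechanically before uploading a patch. A hypothetical sketch (the directory and class names here are invented for illustration, not taken from the Hive tree):

```shell
# Illustrative check: flag JUnit classes whose file name does not start with
# the 'Test' prefix required by the contribution guidelines. The demo source
# tree under /tmp is fabricated for this example.
mkdir -p /tmp/hive-demo/src
printf 'public class TestFileSystem {}\n' > /tmp/hive-demo/src/TestFileSystem.java
printf 'public class FileSystemSuite {}\n' > /tmp/hive-demo/src/FileSystemSuite.java

# Report any *.java file whose basename lacks the 'Test' prefix.
find /tmp/hive-demo/src -name '*.java' ! -name 'Test*'
```

In a real checkout one would point `find` at the module's test source directory instead of the fabricated `/tmp` tree.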
### Understanding Maven @@ -226,7 +226,7 @@ Committers: for non-trivial changes, it is best to get another committer to revi ## JIRA -Hive uses [JIRA](https://issues.apache.org/jira/browse/HIVE) for issues/case management. You must have a JIRA account in order to log cases and issues. +Hive uses [JIRA](https://issues.apache.org/jira/browse/HIVE) for issues/case management. You must have a JIRA account in order to log cases and issues. Requests for the creation of new accounts can be submitted via the following form: @@ -265,71 +265,73 @@ Here are the steps relevant to `hive_metastore.thrift`: 1. Don't make any changes to `hive_metastore.thrift` until instructed below. 2. Use the approved version of Thrift. This is currently `thrift-0.14.1`, which you can obtain from . - 1. For Mac via Homebrew (since the version we need is not available by default): - - ``` - brew tap-new $USER/local-tap - brew extract --version='0.14.1' thrift $USER/local-tap - brew install thrift@0.14.1 - mkdir -p /usr/local/share/fb303/if - cp /usr/local/Cellar/thrift@0.14.1/0.14.1/share/fb303/if/fb303.thrift /usr/local/share/fb303/if - ``` - - 2. 
For Mac, building from sources: - - ``` - wget http://archive.apache.org/dist/thrift/0.14.1/thrift-0.14.1.tar.gz - - tar xzf thrift-0.14.1.tar.gz - - - brew install libtool - brew install automake - - #If configure fails with "syntax error near unexpected token `QT5", then run "brew install pkg-config" - - ./bootstrap.sh - - sudo ./configure --with-openssl=/usr/local/Cellar/openssl@1.1/1.1.1j --without-erlang --without-nodejs --without-python --without-py3 --without-perl --without-php --without-php_extension --without-ruby --without-haskell --without-go --without-swift --without-dotnetcore --without-qt5 - - brew install openssl - - sudo ln -s /usr/local/opt/openssl/include/openssl/ /usr/local/include/ - - sudo make - - sudo make install - - mkdir -p /usr/local/share/fb303/if - - cp path/to/thrift-0.14.1/contrib/fb303/if/fb303.thrift /usr/local/share/fb303/if/fb303.thrift - # or alternatively the following command - curl -o /usr/local/share/fb303/if/fb303.thrift https://raw.githubusercontent.com/apache/thrift/master/contrib/fb303/if/fb303.thrift - ``` - - 3. For Linux: - - ``` - cd /path/to/thrift-0.14.1 - /configure -without-erlang --without-nodejs --without-python --without-py3 --without-perl --without-php --without-php_extension --without-ruby --without-haskell --without-go --without-swift --without-dotnetcore --without-qt5 - sudo make - sudo make install - sudo mkdir -p /usr/local/share/fb303/if - sudo cp /path/to/thrift-0.14.1/contrib/fb303/if/fb303.thrift /usr/local/share/fb303/if/fb303.thrift - ``` + 1. For Mac via Homebrew (since the version we need is not available by default): + ``` + brew tap-new $USER/local-tap + brew extract --version='0.14.1' thrift $USER/local-tap + brew install thrift@0.14.1 + mkdir -p /usr/local/share/fb303/if + cp /usr/local/Cellar/thrift@0.14.1/0.14.1/share/fb303/if/fb303.thrift /usr/local/share/fb303/if + ``` + + 2. 
For Mac, building from sources: + + ``` + wget http://archive.apache.org/dist/thrift/0.14.1/thrift-0.14.1.tar.gz + + tar xzf thrift-0.14.1.tar.gz + + + brew install libtool + brew install automake + + #If configure fails with "syntax error near unexpected token `QT5", then run "brew install pkg-config" + + ./bootstrap.sh + + sudo ./configure --with-openssl=/usr/local/Cellar/openssl@1.1/1.1.1j --without-erlang --without-nodejs --without-python --without-py3 --without-perl --without-php --without-php_extension --without-ruby --without-haskell --without-go --without-swift --without-dotnetcore --without-qt5 + + brew install openssl + + sudo ln -s /usr/local/opt/openssl/include/openssl/ /usr/local/include/ + + sudo make + + sudo make install + + mkdir -p /usr/local/share/fb303/if + + cp path/to/thrift-0.14.1/contrib/fb303/if/fb303.thrift /usr/local/share/fb303/if/fb303.thrift + # or alternatively the following command + curl -o /usr/local/share/fb303/if/fb303.thrift https://raw.githubusercontent.com/apache/thrift/master/contrib/fb303/if/fb303.thrift + ``` + + 3. For Linux: + + ``` + cd /path/to/thrift-0.14.1 + ./configure --without-erlang --without-nodejs --without-python --without-py3 --without-perl --without-php --without-php_extension --without-ruby --without-haskell --without-go --without-swift --without-dotnetcore --without-qt5 + sudo make + sudo make install + sudo mkdir -p /usr/local/share/fb303/if + sudo cp /path/to/thrift-0.14.1/contrib/fb303/if/fb303.thrift /usr/local/share/fb303/if/fb303.thrift + ``` 3. Before proceeding, verify that `which thrift` returns the build of Thrift you just installed (typically `/usr/local/bin` on Linux); if not, edit your PATH and repeat the verification. Also verify that the command 'thrift -version' returns the expected version number of Thrift. 4. Now you can run the Maven 'thriftif' profile to generate the Thrift code: - 1. 
`cd /path/to/hive/` - 2. `mvn clean install -Pthriftif -DskipTests -Dthrift.home=/usr/local` + 1. `cd /path/to/hive/` + 2. `mvn clean install -Pthriftif -DskipTests -Dthrift.home=/usr/local` 5. Verify that the code generation was a no-op, which should be the case if you have the correct Thrift version and everyone has been following these instructions. You may use `git status` for the same. If you can't figure out what is going wrong, ask for help from a committer. 6. Now make your changes to `hive_metastore.thrift`, and then run the compiler again, from /path/to/hive/: - `mvn clean install -Pthriftif -DskipTests -Dthrift.home=/usr/local` + `mvn clean install -Pthriftif -DskipTests -Dthrift.home=/usr/local` 7. Now use `git status and git diff` to verify that the regenerated code corresponds only to the changes you made to `hive_metastore.thrift`. You may also need `git add` if new files were generated (and or `git rm` if some files are now obsoleted). + 8. `cd /path/to/hive` + 9. `mvn clean package -DskipTests (at the time of writing also "-Dmaven.javadoc.skip" is needed)` + 10. Verify that Hive is still working correctly with both embedded and remote metastore configurations. ## Stay Involved @@ -341,3 +343,4 @@ Contributors should join the [Hive mailing lists](/community/mailinglists/). In * [Apache contributor documentation](https://infra.apache.org/contributors.html) * [Apache voting documentation](http://www.apache.org/foundation/voting.html) * [GitHub repository of this website](https://github.com/apache/hive-site) + diff --git a/content/community/resources/howtorelease.md index 4c2a9ad9..9575bb92 100644 --- a/content/community/resources/howtorelease.md +++ b/content/community/resources/howtorelease.md @@ -7,7 +7,7 @@ date: 2024-12-12 ## Introduction -This page is prepared for Hive committers. You need committer rights to create a new Hive release. +This page is prepared for Hive committers. You need committer rights to create a new Hive release. 
This page assumes you are releasing from the master branch, and thus omits the use of Maven profiles to determine which version of Hadoop you are building against. If you are releasing from branch-1, you will need to add `-Phadoop-2` to most of your Maven commands. @@ -29,6 +29,7 @@ Skip this section if this is NOT the first release in a series (i.e., release X. ``` git checkout master ``` + 2. Increment the value of the `version` property in the storage-api/pom.xml file. For example, if the current value is 2.5.0`-SNAPSHOT`, the new value should be 2.6.0`-SNAPSHOT`. Please note that the `SNAPSHOT` suffix is required in order to indicate that this is an unreleased development branch. 3. Update the storage-api.version property in the root pom.xml and standalone-metastore/pom.xml to the new value from the step above. 4. Verify that the build is working with changes. @@ -44,6 +45,7 @@ Skip this section if this is NOT the first release in a series (i.e., release X. ``` git checkout -b storage-branch-X.Y origin/master ``` + 3. Update the `version` property value in the storage-api/pom.xml file. You should remove the `SNAPSHOT` suffix and set `version` equal to `X.Y.Z` where `Z` is the point release number in this release series (0 for the first one, in which case this step is a no-op since you already did this above when creating the branch). Use [Maven's Versions plugin](http://mojo.codehaus.org/versions-maven-plugin/set-mojo.html) to do this as follows: 4. Verify that the build is working with changes. 5. Commit these changes with a comment "Preparing for storage-api X.Y.Z release". @@ -65,6 +67,7 @@ git push origin storage-release-X.Y.Z-rcR git tag storage-release-X.Y.Z-rcR -m "Hive Storage API X.Y.Z-rcR release." git push origin storage-release-X.Y.Z-rcR ``` + 3. Build the release (binary and source versions) after running unit tests. Manually create the sha file. 
``` @@ -74,14 +77,15 @@ git push origin storage-release-X.Y.Z-rcR % tar czvf hive-storage-X.Y.Z-rcR.tar.gz hive-storage-X.Y.Z % shasum -a 256 hive-storage-X.Y.Z-rcR.tar.gz > hive-storage-X.Y.Z-rcR.tar.gz.sha256 ``` -4. Setup your PGP keys for signing the release, if you don't have them already. - 1. See . +4. Setup your PGP keys for signing the release, if you don't have them already. + 1. See . 5. Sign the release (see [Step-By-Step Guide to Mirroring Releases](http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step) for more information). ``` % gpg --armor --detach-sig hive-storage-X.Y.Z-rcR.tar.gz ``` + 6. Check the signatures. ``` @@ -93,6 +97,7 @@ gpg: assuming signed data in `hive-storage-X.Y.Z-rcR.tar.gz' gpg: Signature made Fri Apr 28 12:50:03 2017 PDT using RSA key ID YOUR-KEY-ID gpg: Good signature from "Your Name " ``` + 7. Copy release files to a public place. ``` @@ -104,6 +109,7 @@ sftp> cd hive-storage-X.Y.Z sftp> put hive-storage-X.Y.Z-rcR.tar.gz* sftp> quit ``` + 8. Send email to [dev@hive.apache.org]({{< ref "mailto:dev@hive-apache-org" >}}) calling the vote. ### Publishing the Storage API Artifacts @@ -115,6 +121,7 @@ sftp> quit % cd storage-api % mvn clean deploy -Papache-release -DskipTests ``` + 2. Login to Nexus and close the repository. Mark the repository as released. 3. Create the final tag (be very careful, tags in "rel/" are not changeable). @@ -123,8 +130,9 @@ sftp> quit % git tag -s rel/storage-release-X.Y.Z -m "Hive Storage API X.Y.Z" % git push origin rel/storage-release-X.Y.Z ``` -4. Add the artifacts to Hive's dist area. There might be a problem with the size of the artifact. -[INFRA-23055](https://issues.apache.org/jira/browse/INFRA-23055) solved the issue. + +4. Add the artifacts to Hive's dist area. There might be a problem with the size of the artifact. + [INFRA-23055](https://issues.apache.org/jira/browse/INFRA-23055) solved the issue. 
``` % svn checkout https://dist.apache.org/repos/dist/release/hive hive-dist @@ -157,8 +165,8 @@ sftp> quit ### Preparation 1. Bulk update Jira to unassign from this release all issues that are open non-blockers and send follow-up notification to the developer list that this was done. There are two kinds of JIRAs that need to be taken care of: - 1. Unresolved JIRAs with Target Version/s or Fix Version/s (legacy) set to the release in question. - 2. Resolved/closed(!) JIRAs with Target Version/s, but not Fix Version/s set to the release in question (e.g. a JIRA targets 2.0.0 and 1.3.0, but was only committed to 2.0.0). + 1. Unresolved JIRAs with Target Version/s or Fix Version/s (legacy) set to the release in question. + 2. Resolved/closed(!) JIRAs with Target Version/s, but not Fix Version/s set to the release in question (e.g. a JIRA targets 2.0.0 and 1.3.0, but was only committed to 2.0.0). 2. Run 'mvn clean apache-rat:check' and examine the generated report for any files, especially .java files which should all have Apache license headers. Note also, that each individual component will have a rat.txt inside it when you run this – be sure to check ql/target/rat.txt, for example. Add the license header to any file that is missing it (open a jira and submit a patch). 3. Update copyright date in NOTICE. If any components mentioned in them have updated versions, you would need to update the copyright dates for those. (Thejas comment: It sounds like entries are needed in NOTICE only if the license requires such attribution. See .) @@ -173,11 +181,13 @@ Skip this section if this is NOT the first release in a series (i.e., release X. git checkout -b branch-X.Y origin/master git push -u origin branch-X.Y ``` + 3. Increment the value of the `version` property in all pom.xml files. For example, if the current value is `0.7.0-SNAPSHOT`, the new value should be `0.8.0-SNAPSHOT`. 
Please note that the `SNAPSHOT` suffix is required in order to indicate that this is an unreleased development branch. This can be accomplished with a single command using [Maven's Versions plugin](http://mojo.codehaus.org/versions-maven-plugin/set-mojo.html) as follows:  ``` mvn versions:set -DnewVersion=X.Y.0-SNAPSHOT -DgenerateBackupPoms=false ``` + 4. Make changes to metastore upgrade scripts. See [HIVE-6555](https://issues.apache.org/jira/browse/HIVE-6555) on how this was done for HIVE 0.13. 5. Verify that the build is working with changes. 6. Commit these changes to master with a comment "Preparing for X.Y+1.0 development". @@ -193,6 +203,7 @@ git clone https://git-wip-us.apache.org/repos/asf/hive.git/ cd git checkout branch-X.Y ``` + 2. Update the `version` property value in all pom.xml files. You should remove the `SNAPSHOT` suffix and set `version` equal to `hive-X.Y.Z` where Z is the point release number in this release series (0 for the first one, in which case this step is a no-op since you already did this above when creating the branch). Use [Maven's Versions plugin](http://mojo.codehaus.org/versions-maven-plugin/set-mojo.html) to do this as follows: ``` @@ -207,9 +218,9 @@ Make sure to update the version property in standalone-metastore/pom.xml and upg 6. Commit these changes with a comment "Preparing for X.Y.Z release". 7. If not already done, merge desired patches from trunk into the branch and commit these changes. Avoid usage of "git merge" to avoid too many merge commits. Either request the committer who committed that patch in master to commit to this branch, or commit it yourself, or try doing a git cherry-pick for trivial patches. Specifics of this step can be laid down by the release manager. 8. You probably also want to commit a patch (on both trunk and branch) which updates README.txt to bring it up to date (at a minimum, search+replacing references to the version number). 
Also check NOTICE to see if anything needs to be updated for recent library dependency changes or additions. - 1. Select all of the JIRAs for the current release that aren't FIXED and do bulk update to clear the 'Fixed Version' field. - 2. Likewise, use JIRA's [Release Notes](https://issues.apache.org/jira/secure/ConfigureReleaseNote.jspa?projectId=12310843) link to generate content for the RELEASE_NOTES.txt file. Be sure to select 'Text' format. (It's OK to do this with a direct commit rather than a patch.) - 3. Update the release notes in trunk with the release notes in branch. + 1. Select all of the JIRAs for the current release that aren't FIXED and do bulk update to clear the 'Fixed Version' field. + 2. Likewise, use JIRA's [Release Notes](https://issues.apache.org/jira/secure/ConfigureReleaseNote.jspa?projectId=12310843) link to generate content for the RELEASE_NOTES.txt file. Be sure to select 'Text' format. (It's OK to do this with a direct commit rather than a patch.) + 3. Update the release notes in trunk with the release notes in branch. 9. Tag the release candidate (R is the release candidate number, and also starts from 0): ``` @@ -251,9 +262,10 @@ hive-standalone-metastore-X.Y.Z-bin.tar.gz: OK % shasum -a 256 -c hive-standalone-metastore-X.Y.Z-src.tar.gz.sha256 hive-standalone-metastore-X.Y.Z-src.tar.gz: OK ``` + 4. Check that release file looks ok -- e.g., install it and run examples from tutorial. 5. Setup your PGP keys for signing the release, if you don't have them already. - 1. See , . + 1. See , . ``` % gpg --full-generate-key @@ -264,6 +276,7 @@ hive-standalone-metastore-X.Y.Z-src.tar.gz: OK % svn add KEYS % svn commit -m 'Adding 's key' ``` + 6. Sign the release (see [Step-By-Step Guide to Mirroring Releases](http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step) for more information). 
``` @@ -272,6 +285,7 @@ hive-standalone-metastore-X.Y.Z-src.tar.gz: OK % gpg --armor --output hive-standalone-metastore-X.Y.Z-bin.tar.gz.asc --detach-sig hive-standalone-metastore-X.Y.Z-bin.tar.gz % gpg --armor --output hive-standalone-metastore-X.Y.Z-src.tar.gz.asc --detach-sig hive-standalone-metastore-X.Y.Z-src.tar.gz ``` + 7. Follow instructions in to push the new release artifacts (tar.gz, tar.gz.asc, tar.gz.sha256) to the SVN staging area of the project (). Make sure to create a new directory for the release candidate. ``` @@ -290,6 +304,7 @@ svn add dev/hive/hive-X.Y.Z svn add dev/hive/hive-standalone-metastore-X.Y.Z svn commit -m "Hive X.Y.Z release" ``` + 8. Publish Maven artifacts to the Apache staging repository. Make sure to have this [setup](http://www.apache.org/dev/publishing-maven-artifacts.html#dev-env) for Apache releases. Use committer [setting.xml](https://maven.apache.org/developers/committer-settings.html). **Note**: If you get an error `gpg: signing failed: Inappropriate ioctl for device,` try doing ``export GPG_TTY=$(tty)`` @@ -299,6 +314,7 @@ svn commit -m "Hive X.Y.Z release" ``` % mvn deploy -Papache-release -DskipTests -Dmaven.javadoc.skip=true ``` + 9. Login to the [Apache Nexus server](https://repository.apache.org/index.html#stagingRepositories) and "close" the staged repository. This makes the artifacts available at a temporary URL. ### Voting @@ -411,8 +427,9 @@ gpg --verify apache-hive-X.Y.Z-src.tar.gz.asc apache-hive-X.Y.Z-src.tar.gz gpg --verify hive-standalone-metastore-X.Y.Z-bin.tar.gz.asc hive-standalone-metastore-X.Y.Z-bin.tar.gz gpg --verify hive-standalone-metastore-X.Y.Z-src.tar.gz.asc hive-standalone-metastore-X.Y.Z-src.tar.gz ``` + 2. Verifying the sha256 checksum: -See the step under Building. + See the step under Building. 
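The sha256 workflow used throughout this page — generate a `.sha256` file next to the tarball, then verify it — can be exercised end-to-end on a scratch file. This is an illustrative sketch, not part of the release tooling: the file names are placeholders, and it uses GNU `sha256sum`; on macOS, substitute `shasum -a 256` as shown elsewhere on this page.

```shell
# Stand-in for the release tarball: create a scratch artifact.
workdir=$(mktemp -d)
cd "$workdir"
echo "release contents" > hive-demo.txt
tar czf hive-demo.tar.gz hive-demo.txt

# Generate the checksum file that sits next to the artifact.
sha256sum hive-demo.tar.gz > hive-demo.tar.gz.sha256

# Verify it, exactly as a release voter would.
sha256sum -c hive-demo.tar.gz.sha256   # prints: hive-demo.tar.gz: OK
```

The same `-c` check is what voters run against the real `hive-X.Y.Z-*.tar.gz.sha256` files in the staging area.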
### Publishing @@ -426,6 +443,7 @@ git push origin rel/release-X.Y.Z git tag -d release-X.Y.Z-rcR git push origin :release-X.Y.Z-rcR ``` + **NOTE:** If errors happen while "git tag -s", try to configure the git signing key by "git config user.signingkey your_gpg_key_id" then rerun the command. This step (`git push origin rel/release-X.Y.Z`) will trigger the Hive Docker image build and upload to Docker Hub on the [Hive Action Page](https://github.com/apache/hive/actions/workflows/docker-GA-images.yml). If the image build fails, click **Re-run** on the Actions page to retry or manually build and upload it. Finally, verify whether the image has been successfully uploaded by checking the [Docker Hub](https://hub.docker.com/r/apache/hive/tags). @@ -436,6 +454,7 @@ This step (`git push origin rel/release-X.Y.Z`) will trigger the Hive Docker ima svn mv https://dist.apache.org/repos/dist/dev/hive/hive-X.Y.Z https://dist.apache.org/repos/dist/release/hive/hive-X.Y.Z -m "Move hive-X.Y.Z release from dev to release" svn mv https://dist.apache.org/repos/dist/dev/hive/hive-standalone-metastore-X.Y.Z https://dist.apache.org/repos/dist/release/hive/hive-standalone-metastore-X.Y.Z -m "Move hive-standalone-metastore-X.Y.Z release from dev to release" ``` + 3. Wait till the release propagates to the mirrors and appears under: 4. In your base hive source directory, generate javadocs as follows: @@ -450,6 +469,7 @@ After you run this, you should have javadocs present in your /t ``` svn co --depth empty https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs ``` + 6. Copy the generated javadocs from the source repository to the javadocs repository, add and commit: ``` @@ -467,6 +487,7 @@ If this is a bugfix release, svn rm the obsoleted version. (For eg., when commit ``` git clone https://github.com/apache/hive-site.git ``` + 8. Edit files content/downloads.mdtext and javadoc.mdtext to appropriately add entries for the new release in the appropriate location. 
For example, for 1.2.0, the entries made were as follows: ``` @@ -481,9 +502,9 @@ As you can see, you will need a release note link for this release as created pr 9. Push your changes to the branch, and you can preview the results at . If everything is ok, then you can push your changes to branch and see the results at site. 10. Update JIRA - 1. Ensure that only issues in the "Fixed" state have a "Fix Version" set to release X.Y.Z. - 2. Release the version. Visit the [releases page](https://issues.apache.org/jira/projects/HIVE?selectedItem=com.atlassian.jira.jira-projects-plugin%3Arelease-page&status=unreleased).  Select the version number you are releasing, and hit the release button. You need to have the "Admin" role in Hive's Jira for this step and the next. - 3. Close issues resolved in the release. Disable mail notifications for this bulk change. + 1. Ensure that only issues in the "Fixed" state have a "Fix Version" set to release X.Y.Z. + 2. Release the version. Visit the [releases page](https://issues.apache.org/jira/projects/HIVE?selectedItem=com.atlassian.jira.jira-projects-plugin%3Arelease-page&status=unreleased).  Select the version number you are releasing, and hit the release button. You need to have the "Admin" role in Hive's Jira for this step and the next. + 3. Close issues resolved in the release. Disable mail notifications for this bulk change. 11. Login to the [Apache Nexus server](https://repository.apache.org/index.html#stagingRepositories) and mark the release candidate artifacts as released. 12. Add the release in [Apache Committee Report Helper](https://reporter.apache.org/addrelease.html?hive) for the next board report to pick that up automatically. You may need PMC privileges to do this step – if you do not have such privileges, please ping a [PMC member](http://hive.apache.org/people.html) to do this for you. 13. Check whether the [Docker image](https://hub.docker.com/r/apache/hive) for the release is present or not. 
@@ -547,16 +568,19 @@ git clone https://git-wip-us.apache.org/repos/asf/hive.git/ cd git checkout branch-X.Y ``` + 2. Increment the `version` property value in all pom.xml files and add the `SNAPSHOT` suffix. For example, if the released version was `0.7.0`, the new value should be `0.7.1-SNAPSHOT`. Please note that the `SNAPSHOT` suffix is required in order to indicate that this is an unreleased development branch. Use [Maven's Versions plugin](http://mojo.codehaus.org/versions-maven-plugin/set-mojo.html) to do this as follows: ``` mvn versions:set -DnewVersion=0.7.1-SNAPSHOT -DgenerateBackupPoms=false ``` + 3. If the release number you are preparing moves the major (first) or minor (second) number, update the Hive version name in the poms.  In both `pom.xml` and `standalone-metastore/pom.xml` search for the property `hive.version.shortname`.  This should match the new version number.   -For example, if you are working on branch-3 and have just released Hive 3.2 and are preparing the branch for Hive 3.3 development, you need to update both poms to have `3.3.0`.  If however you are working on branch-3.1 and have just released Hive 3.1.2 and are preparing the branch for 3.1.3 development, this is not necessary. + For example, if you are working on branch-3 and have just released Hive 3.2 and are preparing the branch for Hive 3.3 development, you need to update both poms to have `3.3.0`.  If however you are working on branch-3.1 and have just released Hive 3.1.2 and are preparing the branch for 3.1.3 development, this is not necessary. 4. Verify that the build is working with changes. 5. Commit these changes with a comment "Preparing for X.Y.Z+1 development". 
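The new development version in step 2 is always the released version with the last digit incremented and a `-SNAPSHOT` suffix appended. Deriving it can be scripted; this is a sketch (the `awk` incantation is illustrative, not part of the release tooling):

```shell
# Released version -> next development version (increment last digit, add -SNAPSHOT).
released="0.7.0"
next=$(echo "$released" | awk -F. '{printf "%d.%d.%d-SNAPSHOT", $1, $2, $3 + 1}')
echo "$next"   # prints: 0.7.1-SNAPSHOT

# The derived value is what gets passed to Maven in the next step:
# mvn versions:set -DnewVersion="$next" -DgenerateBackupPoms=false
```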
## See Also * [Apache Releases FAQ](http://www.apache.org/dev/release.html) + diff --git a/content/community/resources/metastore-api-tests.md b/content/community/resources/metastore-api-tests.md index 405879bb..35b5c9cf 100644 --- a/content/community/resources/metastore-api-tests.md +++ b/content/community/resources/metastore-api-tests.md @@ -37,7 +37,3 @@ The following test files will check the API functionalities:   - - - - diff --git a/content/community/resources/performance.md b/content/community/resources/performance.md index bda84527..8efdd8af 100644 --- a/content/community/resources/performance.md +++ b/content/community/resources/performance.md @@ -24,7 +24,3 @@ Here are some JIRA issues about benchmarks for Hive:   - - - - diff --git a/content/community/resources/plugindeveloperkit.md b/content/community/resources/plugindeveloperkit.md index a4ac220e..6e5ad1b8 100644 --- a/content/community/resources/plugindeveloperkit.md +++ b/content/community/resources/plugindeveloperkit.md @@ -69,13 +69,13 @@ All this buildfile does is define some variable settings and then import a build The imported PDK buildfile assumes a few things about the structure of your plugin source structure: * your-plugin-root - + build.xml - + src - - Java source files - + test - - setup.sql - - cleanup.sql - - any datafiles needed by your tests + + build.xml + + src + - Java source files + + test + - setup.sql + - cleanup.sql + - any datafiles needed by your tests For the example plugin, a datafile onerow.txt contains a single row of data; setup.sql creates a table named onerow and loads the datafile, whereas cleanup.sql drops the onerow table. The onerow table is convenient for testing UDF's. @@ -156,10 +156,10 @@ The PDK executes tests as follows: 1. Run top-level cleanup.sql (in case a previous test failed in the middle) 2. Run top-level setup.sql 3. For each class with @HivePdkUnitTests annotation - 1. Run class cleanup (if any) - 2. Run class setup (if any) - 3. 
For each @HivePdkUnitTest annotation, run query and verify that actual result matches expected result - 4. Run class cleanup (if any) + 1. Run class cleanup (if any) + 2. Run class setup (if any) + 3. For each @HivePdkUnitTest annotation, run query and verify that actual result matches expected result + 4. Run class cleanup (if any) 4. Run top-level cleanup.sql If you encounter problems during test execution, look in the file `TEST-org.apache.hive.pdk.PluginTest.txt` for details. @@ -172,7 +172,3 @@ If you encounter problems during test execution, look in the file `TEST-org.apac * move Hive builtins to use PDK for more convenient testing ([HIVE-2523](https://issues.apache.org/jira/browse/HIVE-2523)) * command-line option for invoking a single testcase - - - - diff --git a/content/community/resources/presentations.md b/content/community/resources/presentations.md index a071c8fd..c05d1847 100644 --- a/content/community/resources/presentations.md +++ b/content/community/resources/presentations.md @@ -124,157 +124,115 @@ date: 2024-12-12 ![](images/icons/bullet_blue.gif) [attachments/27362054/28016657.pdf](/attachments/27362054/28016657.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/28016834.pdf](/attachments/27362054/28016834.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/28016835.pdf](/attachments/27362054/28016835.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/28016838.pdf](/attachments/27362054/28016838.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/28017801.pdf](/attachments/27362054/28017801.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/28017802.pdf](/attachments/27362054/28017802.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/28017803.pdf](/attachments/27362054/28017803.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) 
[attachments/27362054/28017805.pdf](/attachments/27362054/28017805.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/28017807.pdf](/attachments/27362054/28017807.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/30966691.pdf](/attachments/27362054/30966691.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/30966693.pdf](/attachments/27362054/30966693.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/30966692.pdf](/attachments/27362054/30966692.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/30966880.pdf](/attachments/27362054/30966880.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/35193149-pptx](/attachments/27362054/35193149.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/35193150-pptx](/attachments/27362054/35193150.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/35193151-pptx](/attachments/27362054/35193151.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/HiveContrib-Nov13-groovy_plus_hive.pptx](/attachments/27362054/HiveContrib-Nov13-groovy_plus_hive.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/35193153-pptx](/attachments/27362054/35193153.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/35193154-pptx](/attachments/27362054/35193154.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/52036008-pptx](/attachments/27362054/52036008.pptx) 
(application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/52036063.pdf](/attachments/27362054/52036063.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) [attachments/27362054/55476524-key](/attachments/27362054/55476524-key) (application/x-iwork-keynote-sffkey) - ![](images/icons/bullet_blue.gif) [attachments/27362054/55476525-pptx](/attachments/27362054/55476525.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/55476526-pptx](/attachments/27362054/55476526.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/56131586.pdf](/attachments/27362054/56131586.pdf) (application/download) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61329032-pptx](/attachments/27362054/61329032.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61329033-pptx](/attachments/27362054/61329033.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61329034-pptx](/attachments/27362054/61329034.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61329036-pptx](/attachments/27362054/61329036.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61329038-pptx](/attachments/27362054/61329038.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61329039.pdf](/attachments/27362054/61329039.pdf) (application/pdf) - ![](images/icons/bullet_blue.gif) 
[attachments/27362054/61329040-pptx](/attachments/27362054/61329040.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/LLAP-Meetup-Nov.ppsx](/attachments/27362054/LLAP-Meetup-Nov.ppsx) (application/vnd.openxmlformats-officedocument.presentationml.slideshow) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61337098-pptx](/attachments/27362054/61337098.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61337312-ppsx](/attachments/27362054/61337312.ppsx) (application/vnd.openxmlformats-officedocument.presentationml.slideshow) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61337313-pptx](/attachments/27362054/61337313.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61337398-ppsx](/attachments/27362054/61337398.ppsx) (application/vnd.openxmlformats-officedocument.presentationml.slideshow) - ![](images/icons/bullet_blue.gif) [attachments/27362054/61337443.pdf](/attachments/27362054/61337443.pdf) (application/pdf) - - - - - diff --git a/content/community/resources/relatedprojects.md b/content/community/resources/relatedprojects.md index 5ae80a50..734ccd05 100644 --- a/content/community/resources/relatedprojects.md +++ b/content/community/resources/relatedprojects.md @@ -17,7 +17,3 @@ Apache Hivemall is a scalable machine learning library for Apache Hive, Apache S [Sentry](https://sentry.incubator.apache.org) is a role-based authorization system for Apache Hive. 
- - - - diff --git a/content/community/resources/running-yetus.md b/content/community/resources/running-yetus.md index d5fb6219..be839956 100644 --- a/content/community/resources/running-yetus.md +++ b/content/community/resources/running-yetus.md @@ -42,7 +42,3 @@ Then run the checks with the following command: ./dev-support/test-patch.sh ~/Downloads/HIVE-16345.2.patch ``` - - - - diff --git a/content/community/resources/testingdocs.md b/content/community/resources/testingdocs.md index 448b480a..d5a3708a 100644 --- a/content/community/resources/testingdocs.md +++ b/content/community/resources/testingdocs.md @@ -15,3 +15,4 @@ The following documents describe aspects of testing for Hive: * [Running Yetus]({{< ref "running-yetus" >}}) * [MetaStore API Tests]({{< ref "metastore-api-tests" >}}) * [Query File Test(qtest)](/development/qtest/) + diff --git a/content/docs/javadocs.md b/content/docs/javadocs.md index 61daab87..3648f705 100644 --- a/content/docs/javadocs.md +++ b/content/docs/javadocs.md @@ -6,26 +6,29 @@ aliases: [/javadoc.html] --- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
--> ## Recent versions: + --- + #### javadoc and sources jars for use in an IDE are also available via [Nexus](https://repository.apache.org/index.html#nexus-search;gav~org.apache.hive~~~~) + * [Hive 4.2.0 Javadocs](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r4.2.0/api/index.html) * [Hive 4.1.0 Javadocs](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r4.1.0/api/index.html) * [Hive 4.0.1 Javadocs](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r4.0.1/api/index.html) diff --git a/content/docs/latest/_index.md b/content/docs/latest/_index.md index f4ba6e04..02564a8c 100644 --- a/content/docs/latest/_index.md +++ b/content/docs/latest/_index.md @@ -2,3 +2,4 @@ title: "Documentation" date: 2025-07-24 --- + diff --git a/content/docs/latest/admin/_index.md b/content/docs/latest/admin/_index.md index 04dd41a3..02f8a661 100644 --- a/content/docs/latest/admin/_index.md +++ b/content/docs/latest/admin/_index.md @@ -2,3 +2,4 @@ title: "Administration Manual" date: 2025-07-24 --- + diff --git a/content/docs/latest/admin/adminmanual-configuration.md b/content/docs/latest/admin/adminmanual-configuration.md index b7dd3a55..f108042e 100644 --- a/content/docs/latest/admin/adminmanual-configuration.md +++ b/content/docs/latest/admin/adminmanual-configuration.md @@ -15,12 +15,14 @@ A number of configuration variables in Hive can be used by the administrator to set hive.exec.scratchdir=/tmp/mydir; ``` + * Using the **`--hiveconf`** option of the `[hive]({{< ref "#hive" >}})` command (in the CLI) or `[beeline]({{< ref "#beeline" >}})` command for the entire session. For example: ``` bin/hive --hiveconf hive.exec.scratchdir=/tmp/mydir ``` + * In **`hive-site.xml`**. This is used for setting values for the entire Hive configuration (see [hive-site.xml and hive-default.xml.template]({{< ref "#hive-sitexml-and-hive-defaultxmltemplate" >}}) below). 
For example: ``` @@ -31,19 +33,19 @@ A number of configuration variables in Hive can be used by the administrator to ``` + * In **server-specific configuration files** (supported starting [Hive 0.14](https://issues.apache.org/jira/browse/HIVE-7342)). You can set metastore-specific configuration values in **hivemetastore-site.xml**, and HiveServer2-specific configuration values in **hiveserver2-site.xml**. -The server-specific configuration file is useful in two situations: + The server-specific configuration file is useful in two situations: + 1. 1. You want a different configuration for one type of server (for example – enabling authorization only in HiveServer2 and not CLI). - 2. You want to set a configuration value only in a server-specific configuration file (for example – setting the metastore database password only in the metastore server configuration file). - HiveMetastore server reads hive-site.xml as well as hivemetastore-site.xml configuration files that are available in the $HIVE_CONF_DIR or in the classpath. If the metastore is being used in embedded mode (i.e., hive.metastore.uris is not set or empty) in `hive` commandline or HiveServer2, the hivemetastore-site.xml gets loaded by the parent process as well. - The value of hive.metastore.uris is examined to determine this, and the value should be set appropriately in hive-site.xml . - Certain [metastore configuration parameters]({{< ref "#metastore-configuration-parameters" >}}) like hive.metastore.sasl.enabled, hive.metastore.kerberos.principal, hive.metastore.execute.setugi, and hive.metastore.thrift.framed.transport.enabled are used by the metastore client as well as server. For such common parameters it is better to set the values in hive-site.xml, that will help in keeping them consistent. - - HiveServer2 reads hive-site.xml as well as hiveserver2-site.xml that are available in the $HIVE_CONF_DIR or in the classpath. 
- If HiveServer2 is using the metastore in embedded mode, hivemetastore-site.xml also is loaded. - - The order of precedence of the config files is as follows (later one has higher precedence) – - hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '`-hiveconf`' commandline parameters. + 2. You want to set a configuration value only in a server-specific configuration file (for example – setting the metastore database password only in the metastore server configuration file). + HiveMetastore server reads hive-site.xml as well as hivemetastore-site.xml configuration files that are available in the $HIVE_CONF_DIR or in the classpath. If the metastore is being used in embedded mode (i.e., hive.metastore.uris is not set or empty) in `hive` commandline or HiveServer2, the hivemetastore-site.xml gets loaded by the parent process as well. + The value of hive.metastore.uris is examined to determine this, and the value should be set appropriately in hive-site.xml . + Certain [metastore configuration parameters]({{< ref "#metastore-configuration-parameters" >}}) like hive.metastore.sasl.enabled, hive.metastore.kerberos.principal, hive.metastore.execute.setugi, and hive.metastore.thrift.framed.transport.enabled are used by the metastore client as well as server. For such common parameters it is better to set the values in hive-site.xml, that will help in keeping them consistent. + HiveServer2 reads hive-site.xml as well as hiveserver2-site.xml that are available in the $HIVE_CONF_DIR or in the classpath. + If HiveServer2 is using the metastore in embedded mode, hivemetastore-site.xml also is loaded. + The order of precedence of the config files is as follows (later one has higher precedence) – + hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '`-hiveconf`' commandline parameters. 
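The "later one has higher precedence" rule above behaves like a last-writer-wins assignment over the same property. A minimal sketch with made-up values (this models the precedence order only, not actual Hive configuration-loading code):

```shell
# Same property supplied at each level; each later source overrides the earlier one.
scratchdir="/tmp/from-hive-site"            # hive-site.xml
scratchdir="/tmp/from-hivemetastore-site"   # hivemetastore-site.xml
scratchdir="/tmp/from-hiveserver2-site"     # hiveserver2-site.xml
scratchdir="/tmp/from-hiveconf"             # --hiveconf on the command line

echo "$scratchdir"   # prints: /tmp/from-hiveconf
```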
### hive-site.xml and hive-default.xml.template @@ -93,42 +95,42 @@ Version information: Metrics #### Hive Configuration Variables -| Variable Name | Description | Default Value | -| --- | --- | --- | -| hive.ddl.output.format | The data format to use for DDL output (e.g. `DESCRIBE table`). One of "text" (for human readable text) or "json" (for a json object). (As of Hive [0.9.0](https://issues.apache.org/jira/browse/HIVE-2822).) | text | -| hive.exec.script.wrapper | Wrapper around any invocations to script operator e.g. if this is set to python, the script passed to the script operator will be invoked as `python