From bb0fcd53c620c97cf2f0a905c0ed5a94a07a95d6 Mon Sep 17 00:00:00 2001 From: Thomas Rebele Date: Sun, 1 Mar 2026 23:58:27 +0100 Subject: [PATCH] Fix some "Raw HTML omitted" warnings and formatting issues (part 2) --- .../hive-across-multiple-data-centers.md | 14 ++++----- .../hive-metadata-caching-proposal.md | 31 ++++++------------- .../hivereplicationv2development.md | 26 ++++++++-------- content/Development/desingdocs/indexdev.md | 2 +- .../desingdocs/subqueries-in-select.md | 2 +- .../support-saml-2-0-authentication-mode.md | 12 +++++++ .../desingdocs/type-qualifiers-in-hive.md | 6 ++-- content/Development/gettingstarted-latest.md | 2 +- .../latest/admin/adminmanual-configuration.md | 14 ++++----- ...dminmanual-metastore-3-0-administration.md | 2 +- .../adminmanual-metastore-administration.md | 16 +++++----- .../admin/hive-on-spark-getting-started.md | 6 ++-- .../latest/admin/setting-up-hiveserver2.md | 4 +-- 13 files changed, 67 insertions(+), 70 deletions(-) diff --git a/content/Development/desingdocs/hive-across-multiple-data-centers.md b/content/Development/desingdocs/hive-across-multiple-data-centers.md index 47ba7f8f..9354660e 100644 --- a/content/Development/desingdocs/hive-across-multiple-data-centers.md +++ b/content/Development/desingdocs/hive-across-multiple-data-centers.md @@ -84,10 +84,10 @@ been imposed to simplify the problem: The same idea can be extended for partitioned tables. * The user can also decide to run in a particular cluster. - + Use cluster + + Use cluster `` * The system will not make an attempt to choose the cluster for the user, but only try to figure out if the query can be run - in the . If the query can run in this cluster, it will succeed. Otherwise, it will fail. + in the ``. If the query can run in this cluster, it will succeed. Otherwise, it will fail. * The user can go back to the behavior to use the default cluster. + Use cluster @@ -101,7 +101,7 @@ The same idea can be extended for partitioned tables. 
PrimaryCluster - ClusterStorageDescriptor - and SecondaryClusters - Set + and SecondaryClusters - Set<ClusterStorageDescriptor>> The ClusterStorageDescriptor contains the following: @@ -128,12 +128,12 @@ The existing thrift API's will continue to work as if the user is trying to acce New APIs will be added which take the cluster as a new parameter. Almost all the existing APIs will be -enhanced to support this. The behavior will be the same as if the user issued the command 'USE CLUSTER ` * A new parameter will be added to keep the filesystem and jobtrackers for a cluster - + hive.cluster.properties: This will be json - ClusterName -> - + use cluster will fail if is not present hive.cluster.properties - + The other option was to support create cluster <> etc. but that would have required storing the cluster information in the + + hive.cluster.properties: This will be json - ClusterName -> <FileSystem, JobTracker> + + use cluster `` will fail if `` is not present hive.cluster.properties + + The other option was to support create cluster `<>` etc. but that would have required storing the cluster information in the metastore including jobtracker etc. which would be difficult to change per session. 
diff --git a/content/Development/desingdocs/hive-metadata-caching-proposal.md b/content/Development/desingdocs/hive-metadata-caching-proposal.md index 712757e3..3deaf115 100644 --- a/content/Development/desingdocs/hive-metadata-caching-proposal.md +++ b/content/Development/desingdocs/hive-metadata-caching-proposal.md @@ -63,57 +63,44 @@ Presto has the following cache: + userTablePrivileges * Range scan cache -+ databaseNamesCache: regex -> database names, facilitates database search ++ databaseNamesCache: regex -> database names, facilitates database search + tableNamesCache + viewNamesCache -+ partitionNamesCache: table name -> partition names ++ partitionNamesCache: table name -> partition names * Other -+ partitionFilterCache: PS -> partition names, facilitates partition pruning ++ partitionFilterCache: PS -> partition names, facilitates partition pruning For every partition filter condition, Presto breaks it down into tupleDomain and remainder: +``` AddExchanges.planTableScan: -            DomainTranslator.ExtractionResult decomposedPredicate = DomainTranslator.fromPredicate( -                    metadata, -                    session, -                    deterministicPredicate, -                    types); -    public static class ExtractionResult -    { -        private final TupleDomain tupleDomain; -        private final Expression remainingExpression; -    } +``` -tupleDomain is a mapping of column -> range or exact value. When converting to PS, any range will be converted into wildcard and only exact value will be considered: +tupleDomain is a mapping of column -> range or exact value. 
When converting to PS, any range will be converted into wildcard and only exact value will be considered: +``` HivePartitionManager.getFilteredPartitionNames: -        for (HiveColumnHandle partitionKey : partitionKeys) { -            if (domain != null && domain.isNullableSingleValue()) { -                    filter.add(((Slice) value).toStringUtf8()); -            else { -                filter.add(PARTITION_VALUE_WILDCARD); -            } -        } +``` -For example, the expression “state = CA and date between ‘201612’ and ‘201701’ will be broken down to PS (state = CA) and remainder date between ‘201612’ and ‘201701’. Presto will retrieve the partitions with state = CA from the PS -> partition name cache and partition object cache, and evaluates “date between ‘201612’ and ‘201701’ for every partitions returned. This is a good balance compare to caching partition names for every expression. +For example, the expression “state = CA and date between ‘201612’ and ‘201701’” will be broken down to PS (state = CA) and remainder “date between ‘201612’ and ‘201701’”. Presto will retrieve the partitions with state = CA from the PS -> partition name cache and partition object cache, and evaluates “date between ‘201612’ and ‘201701’” for every partition returned. This is a good balance compared to caching partition names for every expression. 
## Our Approach diff --git a/content/Development/desingdocs/hivereplicationv2development.md index 936198a9..9380e6ac 100644 --- a/content/Development/desingdocs/hivereplicationv2development.md +++ b/content/Development/desingdocs/hivereplicationv2development.md @@ -168,7 +168,7 @@ Event 100: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION ; Event 110: ALTER TABLE tbl DROP PARTITION (p=1); Event 120: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION ; ``` -When loading the dump on the destination side (at a much later point), when the event 100 is replayed, the load task on the destination will try to pull the files from the (the _files contains the path of ), which may contain new or different data. To replicate the exact state of the source at the time event 100 occurred at the source, we do the following: +When the dump is loaded on the destination side (at a much later point) and event 100 is replayed, the load task on the destination will try to pull the files from the `` (the `_files` contains the path of ``), which may contain new or different data. To replicate the exact state at the time event 100 occurred at the source, we do the following: 1. When Event 100 occurs at the source, in the notification event, we store the checksum of the file(s) in the newly added partition along with the file path(s). 2. When Event 110 occurs at the source, we move the files of the dropped partition to $cmroot/database/tbl/p=1 instead of purging them. @@ -212,7 +212,9 @@ The current implementation of replication is built upon existing commands EXPORT This is better described via various examples of each of the pieces of the command syntax, as follows: -(a) REPL DUMP sales;       REPL DUMP sales.['.*?']Replicates out sales database for bootstrap, from =0 (bootstrap case) to = with a batch size of 0, i.e. no batching. 
+(a) REPL DUMP sales;       REPL DUMP sales.['.*?'] + +Replicates out sales database for bootstrap, from `=0` (bootstrap case) to `=` with a batch size of 0, i.e. no batching. (b) REPL DUMP sales.['T3', '[a-z]+']; @@ -228,15 +230,15 @@ This sets up db-level replication that excludes all the tables/views but include (e) REPL DUMP sales FROM 200 TO 1400; -The presence of a FROM tag makes this dump not a bootstrap, but a dump which looks at the event log to produce a delta dump. FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db. +The presence of a FROM `` tag makes this dump not a bootstrap, but a dump which looks at the event log to produce a delta dump. FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db. (f) REPL DUMP sales FROM 200; -Similar to above, but with an implicit assumed as being the current event id at the time the command is run. +Similar to above, but with an implicit `` assumed to be the current event id at the time the command is run. (g) REPL DUMP sales FROM 200 to 1400 LIMIT 100;REPL DUMP sales FROM 200 LIMIT 100; -Similar to cases (d) & (e), with the addition of a batch size of =100. This causes us to stop processing if we reach 100 events, and return at that point. Note that this does not mean that we stop processing at event id = 300, since we began at 200 - it means that we will stop processing events when we have processed 100 events in the event stream (that has unrelated events) belonging to this replication-definition, i.e. of a relevant db or db.table, then we stop. +Similar to cases (d) & (e), with the addition of a batch size of `=100`. This causes us to stop processing if we reach 100 events, and return at that point. 
Note that this does not mean that we stop processing at event id = 300, since we began at 200 - it means that we stop once we have processed 100 events belonging to this replication-definition (i.e. events of the relevant db or db.table) in the event stream, which may also contain unrelated events. (h) REPL DUMP sales.['[a-z]+'] REPLACE sales FROM 200; @@ -258,8 +260,8 @@ The REPL DUMP command has an optional WITH clause to set command-specific confi 1. Error codes returned as return error codes (and over jdbc if with HS2) 2. Returns 2 columns in the ResultSet: - 1. - the directory to which it has dumped info. - 2. - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be. + 1. `` - the directory to which it has dumped info. + 2. `` - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be. #### Note: @@ -275,20 +277,18 @@ When bootstrap dump is in progress, it blocks rename table/partition operations Look up the HiveServer logs for below pair of log messages. -> REPL DUMP:: Set property for Database: , Property: , Value: ACTIVE -> -> REPL DUMP:: Reset property for Database: , Property: -> +> REPL DUMP:: Set property for Database: ``, Property: ``, Value: ACTIVE > +> REPL DUMP:: Reset property for Database: ``, Property: `` -If Reset property log is not found for the corresponding Set property log, then user need to manually reset the database property with value as "IDLE" using ALTER DATABASE command. +If the Reset property log is not found for the corresponding Set property log, then the user needs to manually reset the database property `` to "IDLE" using the ALTER DATABASE command. ## REPL LOAD `REPL LOAD {} FROM {WITH ('key1'='value1', 'key2'='value2')};` -This causes a REPL DUMP present in (which is to be a fully qualified HDFS URL) to be pulled and loaded. 
If is specified, and the original dump was a database-level dump, this allows Hive to do db-rename-mapping on import. If dbname is not specified, the original dbname as recorded in the dump would be used.The REPL LOAD command has an optional WITH clause to set command-specific configurations to be used when trying to copy from the source cluster. These configurations are only used by the corresponding REPL LOAD command and won't be used for other queries running in the same session. +This causes a REPL DUMP present in `` (which is to be a fully qualified HDFS URL) to be pulled and loaded. If `` is specified, and the original dump was a database-level dump, this allows Hive to do db-rename-mapping on import. If dbname is not specified, the original dbname as recorded in the dump would be used. The REPL LOAD command has an optional WITH clause to set command-specific configurations to be used when trying to copy from the source cluster. These configurations are only used by the corresponding REPL LOAD command and won't be used for other queries running in the same session. #### Return values: diff --git a/content/Development/desingdocs/indexdev.md index a24afa97..5379e13c 100644 --- a/content/Development/desingdocs/indexdev.md +++ b/content/Development/desingdocs/indexdev.md @@ -281,7 +281,7 @@ TBD: we will be adding methods for calling the handler when an index is dropped The reference implementation creates what is referred to as a "compact" index. This means that rather than storing the HDFS location of each occurrence of a particular value, it only stores the addresses of HDFS blocks containing that value. This is optimized for point-lookups in the case where a value typically occurs more than once in nearby rows; the index size is kept small since there are many fewer blocks than rows. The tradeoff is that extra work is required during queries in order to filter out the other rows from the indexed blocks. 
-The compact index is stored in an index table. The index table columns consist of the indexed columns from the base table followed by a _bucketname string column (indicating the name of the file containing the indexed block) followed by an _offsets array column (indicating the block offsets within the corresponding file). The index table is stored as sorted on the indexed columns (but not on the generated columns). +The compact index is stored in an index table. The index table columns consist of the indexed columns from the base table followed by a `_bucketname` string column (indicating the name of the file containing the indexed block) followed by an `_offsets` array column (indicating the block offsets within the corresponding file). The index table is stored as sorted on the indexed columns (but not on the generated columns). The reference implementation can be plugged in with diff --git a/content/Development/desingdocs/subqueries-in-select.md index 9c7d50ab..d1cee53e 100644 --- a/content/Development/desingdocs/subqueries-in-select.md +++ b/content/Development/desingdocs/subqueries-in-select.md @@ -79,7 +79,7 @@ SELECT customer.customer_num, ) AS total_ship_chg FROM customer ``` -* Subqueries with DISTINCT are not allowed. Since DISTINCT will be evaluated as GROUP BY , subqueries with DISTINCT are disallowed for now. +* Subqueries with DISTINCT are not allowed. Since `DISTINCT ` will be evaluated as `GROUP BY `, subqueries with `DISTINCT` are disallowed for now. 
# Design diff --git a/content/Development/desingdocs/support-saml-2-0-authentication-mode.md b/content/Development/desingdocs/support-saml-2-0-authentication-mode.md index 5bd9c3ea..29b23f33 100644 --- a/content/Development/desingdocs/support-saml-2-0-authentication-mode.md +++ b/content/Development/desingdocs/support-saml-2-0-authentication-mode.md @@ -50,45 +50,57 @@ In order to make sure that the SAML assertions received by HiveServer2 are valid Following new configurations will be added to the hive-site.xml which would need to be configured by the clients. +```   hive.server2.authentication   SAML +``` This configuration will be set to SAML to indicate that the server will use SAML 2.0 protocol to authenticate the user.  +```   hive.server2.saml2.idp.metadata   path_to_idp_metadata.xml +``` This configuration will provide a path to the IDP metadata xml file. +```   hive.server2.saml2.sp.entity.id   test_sp_entity_id +``` This configuration should be same the service provider entity id as configured in the IDP. Some identity providers require this to be same as the ACS URL. +```   hive.server2.saml2.group.attribute.name   group_attribute_name +``` This configuration will be used to map the SAML attribute in the response to the groups of the user. This should be configured in the identity provider as the attribute name for the group information. +```   hive.server2.saml2.group.filter   comma_separated_group_names +``` This configuration will be used to configure the allowed group names. +```   hive.server2.saml2.sp.callback.url   callback_url_of_hiveserver2 +``` The http URL endpoint where the SAML assertion is posted back by the IDP. Currently this must be on the same port as HiveServer2’s http endpoint and must be TLS enabled (https) on secure setups. 
diff --git a/content/Development/desingdocs/type-qualifiers-in-hive.md b/content/Development/desingdocs/type-qualifiers-in-hive.md index eed01d3a..ddd0cdd3 100644 --- a/content/Development/desingdocs/type-qualifiers-in-hive.md +++ b/content/Development/desingdocs/type-qualifiers-in-hive.md @@ -39,16 +39,14 @@ The type qualifiers could simply be stored as part of the type string for a colu This approach would be similar to the attributes in the INFORMATION_SCHEMA.COLUMNS that some DBMS catalog tables have, such as those listed below: -
-
+```
 |  CHARACTER_MAXIMUM_LENGTH  |  bigint(21) unsigned  |  YES  |   |  NULL  |   |
 |  CHARACTER_OCTET_LENGTH  |  bigint(21) unsigned  |  YES  |   |  NULL  |   |
 |  NUMERIC_PRECISION  |  bigint(21) unsigned  |  YES  |   |  NULL  |   |
 |  NUMERIC_SCALE  |  bigint(21) unsigned  |  YES  |   |  NULL  |   |
 |  CHARACTER_SET_NAME  |  varchar(32)  |  YES  |   |  NULL  |   |
 |  COLLATION_NAME  |  varchar(32)  |  YES  |   |  NULL  |   |
-
-
+``` We could add new columns to the COLUMNS_V2 table for any type qualifiers we are trying to support (initially looks like CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE). Advantages to this would be that it is easier to query these parameters than the first approach, though types with no parameters would still have these columns (set to null). diff --git a/content/Development/gettingstarted-latest.md b/content/Development/gettingstarted-latest.md index 72b2184a..e0712ef4 100644 --- a/content/Development/gettingstarted-latest.md +++ b/content/Development/gettingstarted-latest.md @@ -77,7 +77,7 @@ To build the current Hive code from the master branch: Here, {version} refers to the current Hive version. -If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as for the rest of the page. +If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as `` for the rest of the page. #### Compile Hive on branch-1 diff --git a/content/docs/latest/admin/adminmanual-configuration.md b/content/docs/latest/admin/adminmanual-configuration.md index b7dd3a55..732a8816 100644 --- a/content/docs/latest/admin/adminmanual-configuration.md +++ b/content/docs/latest/admin/adminmanual-configuration.md @@ -43,7 +43,7 @@ The server-specific configuration file is useful in two situations: If HiveServer2 is using the metastore in embedded mode, hivemetastore-site.xml also is loaded. The order of precedence of the config files is as follows (later one has higher precedence) – - hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '`-hiveconf`' commandline parameters. + hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '`-hiveconf`' commandline parameters. 
### hive-site.xml and hive-default.xml.template @@ -61,8 +61,8 @@ The administrative configuration variables are listed [below]({{< ref "#below" > Hive uses temporary folders both on the machine running the Hive client and the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the hive client when the query is finished. However, in cases of abnormal hive client termination, some data may be left behind. The configuration details are as follows: -* On the HDFS cluster this is set to */tmp/hive-* by default and is controlled by the configuration variable *hive.exec.scratchdir* -* On the client machine, this is hardcoded to */tmp/* +* On the HDFS cluster this is set to `/tmp/hive-` by default and is controlled by the configuration variable *hive.exec.scratchdir* +* On the client machine, this is hardcoded to `/tmp/` Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS. @@ -98,9 +98,9 @@ Version information: Metrics | hive.ddl.output.format | The data format to use for DDL output (e.g. `DESCRIBE table`). One of "text" (for human readable text) or "json" (for a json object). (As of Hive [0.9.0](https://issues.apache.org/jira/browse/HIVE-2822).) | text | | hive.exec.script.wrapper | Wrapper around any invocations to script operator e.g. if this is set to python, the script passed to the script operator will be invoked as `python