@@ -84,10 +84,10 @@ been imposed to simplify the problem:
The same idea can be extended for partitioned tables.

* The user can also decide to run in a particular cluster.
-+ Use cluster <ClusterName>
++ Use cluster `<ClusterName>`
* The system will not make an attempt to choose the cluster for the user, but only try to figure out if the query can be run

-in the <clusterName>. If the query can run in this cluster, it will succeed. Otherwise, it will fail.
+in the `<clusterName>`. If the query can run in this cluster, it will succeed. Otherwise, it will fail.
* The user can go back to using the default cluster.
+ Use cluster
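Taken together, the proposed session-level cluster selection could look like the following sketch (the cluster name and table are hypothetical, and this is proposal syntax, not shipped Hive behavior):

```sql
-- Pin the session to a specific cluster (hypothetical name).
USE CLUSTER dc2cluster;

-- Queries now run in dc2cluster if they can; otherwise they fail.
SELECT COUNT(*) FROM sales_table;

-- Revert to letting the system use the default cluster.
USE CLUSTER;
```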

@@ -101,7 +101,7 @@ The same idea can be extended for partitioned tables.

PrimaryCluster - ClusterStorageDescriptor

-and SecondaryClusters - Set<ClusterStorageDescriptor>
+and SecondaryClusters - `Set<ClusterStorageDescriptor>`

The ClusterStorageDescriptor contains the following:

@@ -128,12 +128,12 @@ The existing thrift API's will continue to work as if the user is trying to acce

New APIs will be added which take the cluster as a new parameter. Almost all the existing APIs will be

-enhanced to support this. The behavior will be the same as if, the user issued the command 'USE CLUSTER <CLUSTERNAME>
+enhanced to support this. The behavior will be the same as if the user issued the command `USE CLUSTER <CLUSTERNAME>`.

* A new parameter will be added to keep the filesystem and jobtrackers for a cluster
-+ hive.cluster.properties: This will be json - ClusterName -> <FileSystem, JobTracker>
-+ use cluster <cluster name> will fail if <cluster name> is not present hive.cluster.properties
-+ The other option was to support create cluster <> etc. but that would have required storing the cluster information in the
++ hive.cluster.properties: This will be json - ClusterName -&gt; &lt;FileSystem, JobTracker&gt;
++ use cluster `<cluster name>` will fail if `<cluster name>` is not present in hive.cluster.properties
++ The other option was to support create cluster `<>` etc. but that would have required storing the cluster information in the

metastore including jobtracker etc. which would be difficult to change per session.
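A sketch of what the property and the failure mode might look like; the JSON schema and cluster names here are assumptions, since the proposal does not fix an exact format:

```sql
-- Hypothetical per-session cluster definitions as JSON
-- (the key names fs/jt are illustrative, not from the proposal).
SET hive.cluster.properties=
  '{"dc1cluster": {"fs": "hdfs://dc1-nn:8020", "jt": "dc1-jt:8021"},
    "dc2cluster": {"fs": "hdfs://dc2-nn:8020", "jt": "dc2-jt:8021"}}';

-- Succeeds only because dc2cluster appears in hive.cluster.properties;
-- USE CLUSTER dc3cluster would fail.
USE CLUSTER dc2cluster;
```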

31 changes: 9 additions & 22 deletions content/Development/desingdocs/hive-metadata-caching-proposal.md
@@ -63,57 +63,44 @@ Presto has the following cache:
+ userTablePrivileges

* Range scan cache
-+ databaseNamesCache: regex -> database names, facilitates database search
++ databaseNamesCache: regex -&gt; database names, facilitates database search
+ tableNamesCache
+ viewNamesCache
-+ partitionNamesCache: table name -> partition names
++ partitionNamesCache: table name -&gt; partition names

* Other
-+ partitionFilterCache: PS -> partition names, facilitates partition pruning
++ partitionFilterCache: PS -&gt; partition names, facilitates partition pruning

For every partition filter condition, Presto breaks it down into tupleDomain and remainder:

```
AddExchanges.planTableScan:

    DomainTranslator.ExtractionResult decomposedPredicate = DomainTranslator.fromPredicate(
            metadata,
            session,
            deterministicPredicate,
            types);

public static class ExtractionResult
{
    private final TupleDomain<Symbol> tupleDomain;
    private final Expression remainingExpression;
}
```

-tupleDomain is a mapping of column -> range or exact value. When converting to PS, any range will be converted into wildcard and only exact value will be considered:
+tupleDomain is a mapping of column -&gt; range or exact value. When converting to PS, any range will be converted into a wildcard and only exact values will be considered:

```
HivePartitionManager.getFilteredPartitionNames:

    for (HiveColumnHandle partitionKey : partitionKeys) {
        if (domain != null && domain.isNullableSingleValue()) {
            filter.add(((Slice) value).toStringUtf8());
        }
        else {
            filter.add(PARTITION_VALUE_WILDCARD);
        }
    }
```

-For example, the expression “state = CA and date between ‘201612’ and ‘201701’ will be broken down to PS (state = CA) and remainder date between ‘201612’ and ‘201701’. Presto will retrieve the partitions with state = CA from the PS -> partition name cache and partition object cache, and evaluates “date between ‘201612’ and ‘201701’ for every partitions returned. This is a good balance compare to caching partition names for every expression.
+For example, the expression “state = CA and date between ‘201612’ and ‘201701’” will be broken down into the PS (state = CA) and the remainder “date between ‘201612’ and ‘201701’”. Presto will retrieve the partitions with state = CA from the PS -&gt; partition name cache and the partition object cache, and evaluate “date between ‘201612’ and ‘201701’” for every partition returned. This is a good balance compared to caching partition names for every expression.
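The decomposition above can be sketched as a query; the table and column names are assumed for illustration:

```sql
-- Assumed table `events`, partitioned by (state, `date`).
SELECT *
FROM events
WHERE state = 'CA'                          -- exact value: goes into the PS as (state=CA, date=*)
  AND `date` BETWEEN '201612' AND '201701'; -- range: wildcard in the PS, kept as the remainder
-- The PS (state=CA) is answered from the caches; the BETWEEN remainder is
-- then evaluated against each partition returned.
```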

## Our Approach

26 changes: 13 additions & 13 deletions content/Development/desingdocs/hivereplicationv2development.md
@@ -168,7 +168,7 @@ Event 100: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION <location>;
Event 110: ALTER TABLE tbl DROP PARTITION (p=1);
Event 120: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION <location>;
```
-When loading the dump on the destination side (at a much later point), when the event 100 is replayed, the load task on the destination will try to pull the files from the <location> (the _files contains the path of <location>), which may contain new or different data. To replicate the exact state of the source at the time event 100 occurred at the source, we do the following:
+When loading the dump on the destination side (at a much later point), when the event 100 is replayed, the load task on the destination will try to pull the files from the `<location>` (the _files contains the path of `<location>`), which may contain new or different data. To replicate the exact state of the source at the time event 100 occurred at the source, we do the following:

1. When Event 100 occurs at the source, in the notification event, we store the checksum of the file(s) in the newly added partition along with the file path(s).
2. When Event 110 occurs at the source, we move the files of the dropped partition to $cmroot/database/tbl/p=1 instead of purging them.
@@ -212,7 +212,9 @@ The current implementation of replication is built upon existing commands EXPORT
This is better described via various examples of each of the pieces of the command syntax, as follows:


-(a) REPL DUMP sales;       REPL DUMP sales.['.*?']Replicates out sales database for bootstrap, from <init-evid>=0 (bootstrap case) to <end-evid>=<CURR-EVID> with a batch size of 0, i.e. no batching.
+(a) REPL DUMP sales;       REPL DUMP sales.['.*?']
+
+Replicates out the sales database for bootstrap, from `<init-evid>=0` (bootstrap case) to `<end-evid>=<CURR-EVID>` with a batch size of 0, i.e. no batching.

(b) REPL DUMP sales.['T3', '[a-z]+'];

@@ -228,15 +230,15 @@ This sets up db-level replication that excludes all the tables/views but include

(e) REPL DUMP sales FROM 200 TO 1400;

-The presence of a FROM <init-evid> tag makes this dump not a bootstrap, but a dump which looks at the event log to produce a delta dump. FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db.
+The presence of a FROM `<init-evid>` tag makes this dump not a bootstrap, but a dump which looks at the event log to produce a delta dump. FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db.

(f) REPL DUMP sales FROM 200;

-Similar to above, but with an implicit assumed <end-evid> as being the current event id at the time the command is run.
+Similar to the above, but with `<end-evid>` implicitly assumed to be the current event id at the time the command is run.

(g) REPL DUMP sales FROM 200 to 1400 LIMIT 100;REPL DUMP sales FROM 200 LIMIT 100;

-Similar to cases (d) & (e), with the addition of a batch size of <num-evids>=100. This causes us to stop processing if we reach 100 events, and return at that point. Note that this does not mean that we stop processing at event id = 300, since we began at 200 - it means that we will stop processing events when we have processed 100 events in the event stream (that has unrelated events) belonging to this replication-definition, i.e. of a relevant db or db.table, then we stop.
+Similar to cases (d) & (e), with the addition of a batch size of `<num-evids>=100`. This causes us to stop processing once we reach 100 events, and return at that point. Note that this does not mean we stop at event id = 300 just because we began at 200 - the event stream may contain unrelated events, and we stop only after processing 100 events that belong to this replication-definition, i.e. to the relevant db or db.table.

(h) REPL DUMP sales.['[a-z]+'] REPLACE sales FROM 200;

@@ -258,8 +260,8 @@ The REPL DUMP command has an optional WITH clause to set command-specific confi

1. Errors are returned as return codes (and over JDBC when using HS2)
2. Returns 2 columns in the ResultSet:
-1. <dir-name> - the directory to which it has dumped info.
-2. <last-evid> - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.
+1. `<dir-name>` - the directory to which it has dumped info.
+2. `<last-evid>` - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.

#### Note:

@@ -275,20 +277,18 @@ When bootstrap dump is in progress, it blocks rename table/partition operations

Look up the HiveServer logs for below pair of log messages.

-> REPL DUMP:: Set property for Database: <db_name>, Property: <bootstrap.dump.state.xxxx>, Value: ACTIVE
->
-> REPL DUMP:: Reset property for Database: <db_name>, Property: <bootstrap.dump.state.xxxx>
->
+> REPL DUMP:: Set property for Database: `<db_name>`, Property: `<bootstrap.dump.state.xxxx>`, Value: ACTIVE
+>
+> REPL DUMP:: Reset property for Database: `<db_name>`, Property: `<bootstrap.dump.state.xxxx>`

-If Reset property log is not found for the corresponding Set property log, then user need to manually reset the database property <bootstrap.dump.state.xxxx> with value as "IDLE" using ALTER DATABASE command.
+If the Reset property log is not found for the corresponding Set property log, then the user needs to manually reset the database property `<bootstrap.dump.state.xxxx>` to "IDLE" using the ALTER DATABASE command.
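A sketch of the manual reset; the database name is a placeholder, and the exact property name (including the `xxxx` suffix) must be copied from the Set property log message:

```sql
-- mydb and bootstrap.dump.state.xxxx are placeholders taken from the log.
ALTER DATABASE mydb SET DBPROPERTIES ('bootstrap.dump.state.xxxx' = 'IDLE');
```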

## REPL LOAD

`REPL LOAD {<dbname>} FROM <dirname> {WITH ('key1'='value1', 'key2'='value2')};`


-This causes a REPL DUMP present in <dirname> (which is to be a fully qualified HDFS URL) to be pulled and loaded. If <dbname> is specified, and the original dump was a database-level dump, this allows Hive to do db-rename-mapping on import. If dbname is not specified, the original dbname as recorded in the dump would be used.The REPL LOAD command has an optional WITH clause to set command-specific configurations to be used when trying to copy from the source cluster. These configurations are only used by the corresponding REPL LOAD command and won't be used for other queries running in the same session.
+This causes a REPL DUMP present in `<dirname>` (which is to be a fully qualified HDFS URL) to be pulled and loaded. If `<dbname>` is specified, and the original dump was a database-level dump, this allows Hive to do db-rename-mapping on import. If dbname is not specified, the original dbname as recorded in the dump is used. The REPL LOAD command has an optional WITH clause to set command-specific configurations to be used when trying to copy from the source cluster. These configurations are only used by the corresponding REPL LOAD command and won't be used for other queries running in the same session.
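Two hedged usage sketches; the paths, database names, and the WITH key shown are illustrative, not prescribed by the syntax above:

```sql
-- Load into the original database name recorded in the dump:
REPL LOAD FROM 'hdfs://nn:8020/user/hive/repl/next_dump_dir';

-- Load the same dump but rename the database on import:
REPL LOAD sales_replica FROM 'hdfs://nn:8020/user/hive/repl/next_dump_dir'
WITH ('hive.exec.parallel' = 'true');  -- command-scoped config, per the WITH clause
```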

#### Return values:

2 changes: 1 addition & 1 deletion content/Development/desingdocs/indexdev.md
@@ -281,7 +281,7 @@ TBD: we will be adding methods for calling the handler when an index is dropped

The reference implementation creates what is referred to as a "compact" index. This means that rather than storing the HDFS location of each occurrence of a particular value, it only stores the addresses of HDFS blocks containing that value. This is optimized for point-lookups in the case where a value typically occurs more than once in nearby rows; the index size is kept small since there are many fewer blocks than rows. The tradeoff is that extra work is required during queries in order to filter out the other rows from the indexed blocks.

-The compact index is stored in an index table. The index table columns consist of the indexed columns from the base table followed by a _bucketname string column (indicating the name of the file containing the indexed block) followed by an _offsets array<string> column (indicating the block offsets within the corresponding file). The index table is stored as sorted on the indexed columns (but not on the generated columns).
+The compact index is stored in an index table. The index table columns consist of the indexed columns from the base table followed by a _bucketname string column (indicating the name of the file containing the indexed block) followed by an `_offsets array<string>` column (indicating the block offsets within the corresponding file). The index table is stored as sorted on the indexed columns (but not on the generated columns).
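As a sketch, a compact index could be created with the historical Hive indexing syntax (table and index names are assumed; note that this indexing feature was removed in Hive 3.0):

```sql
-- Build a compact index on sales(state); the index table stores
-- (state, _bucketname, _offsets) rather than per-row locations.
CREATE INDEX sales_state_idx
ON TABLE sales (state)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate the index table.
ALTER INDEX sales_state_idx ON sales REBUILD;
```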

The reference implementation can be plugged in with

2 changes: 1 addition & 1 deletion content/Development/desingdocs/subqueries-in-select.md
@@ -79,7 +79,7 @@ SELECT customer.customer_num,
) AS total_ship_chg
FROM customer
```
-* Subqueries with DISTINCT are not allowed. Since DISTINCT <expression> will be evaluated as GROUP BY <expression>, subqueries with DISTINCT are disallowed for now.
+* Subqueries with DISTINCT are not allowed. Since `DISTINCT <expression>` will be evaluated as `GROUP BY <expression>`, subqueries with `DISTINCT` are disallowed for now.
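The rewrite that motivates this restriction can be sketched as follows; the table and column names are assumed:

```sql
-- A DISTINCT subquery in SELECT would be rejected:
--   SELECT (SELECT DISTINCT state FROM customer) FROM orders;

-- because the planner evaluates DISTINCT <expression> as:
SELECT state
FROM customer
GROUP BY state;
```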

# Design

@@ -50,45 +50,57 @@ In order to make sure that the SAML assertions received by HiveServer2 are valid

The following new configurations will be added to hive-site.xml and will need to be configured by clients.

```
<property>
  <name>hive.server2.authentication</name>
  <value>SAML</value>
</property>
```

This configuration will be set to SAML to indicate that the server will use the SAML 2.0 protocol to authenticate users.

```
<property>
  <name>hive.server2.saml2.idp.metadata</name>
  <value>path_to_idp_metadata.xml</value>
</property>
```

This configuration will provide a path to the IDP metadata xml file.

```
<property>
  <name>hive.server2.saml2.sp.entity.id</name>
  <value>test_sp_entity_id</value>
</property>
```

This configuration should be the same as the service provider entity id configured in the IDP. Some identity providers require this to be the same as the ACS URL.

```
<property>
  <name>hive.server2.saml2.group.attribute.name</name>
  <value>group_attribute_name</value>
</property>
```

This configuration will be used to map the SAML attribute in the response to the groups of the user. This should be configured in the identity provider as the attribute name for the group information.

```
<property>
  <name>hive.server2.saml2.group.filter</name>
  <value>comma_separated_group_names</value>
</property>
```

This configuration will be used to configure the allowed group names.

```
<property>
  <name>hive.server2.saml2.sp.callback.url</name>
  <value>callback_url_of_hiveserver2</value>
</property>
```

The http URL endpoint where the SAML assertion is posted back by the IDP. Currently this must be on the same port as HiveServer2’s http endpoint and must be TLS enabled (https) on secure setups.

6 changes: 2 additions & 4 deletions content/Development/desingdocs/type-qualifiers-in-hive.md
@@ -39,16 +39,14 @@ The type qualifiers could simply be stored as part of the type string for a colu

This approach would be similar to the attributes in the INFORMATION_SCHEMA.COLUMNS that some DBMS catalog tables have, such as those listed below:

-<pre>
-
+```
| CHARACTER_MAXIMUM_LENGTH | bigint(21) unsigned | YES |   | NULL |   |
| CHARACTER_OCTET_LENGTH | bigint(21) unsigned | YES |   | NULL |   |
| NUMERIC_PRECISION | bigint(21) unsigned | YES |   | NULL |   |
| NUMERIC_SCALE | bigint(21) unsigned | YES |   | NULL |   |
| CHARACTER_SET_NAME | varchar(32) | YES |   | NULL |   |
| COLLATION_NAME | varchar(32) | YES |   | NULL |   |

-</pre>
+```

We could add new columns to the COLUMNS_V2 table for any type qualifiers we are trying to support (initially CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE). An advantage of this would be that these parameters are easier to query than with the first approach, though types with no parameters would still have these columns (set to null).
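For concreteness, the qualifiers in question correspond to column declarations like the following; the table and column names are assumed:

```sql
CREATE TABLE t (
  name   VARCHAR(50),    -- CHARACTER_MAXIMUM_LENGTH = 50
  amount DECIMAL(10, 2)  -- NUMERIC_PRECISION = 10, NUMERIC_SCALE = 2
);
```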

2 changes: 1 addition & 1 deletion content/Development/gettingstarted-latest.md
@@ -77,7 +77,7 @@ To build the current Hive code from the master branch:

Here, {version} refers to the current Hive version.

-If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as <install-dir> for the rest of the page.
+If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as `<install-dir>` for the rest of the page.

#### Compile Hive on branch-1
