Merged
7 changes: 6 additions & 1 deletion content/Development/desingdocs/column-statistics-in-hive.md
@@ -34,6 +34,7 @@ describe formatted [table_name] [column_name];

To persist column level statistics, we propose to add the following new tables,

```
CREATE TABLE TAB_COL_STATS
(
CS_ID NUMBER NOT NULL,
@@ -87,11 +88,13 @@ AVG_COL_LEN DOUBLE,
ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_PK PRIMARY KEY (CS_ID);

ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_FK1 FOREIGN KEY (PART_ID) REFERENCES PARTITIONS (PART_ID) INITIALLY DEFERRED;
```

### **Metastore Thrift API**

We propose to add the following Thrift structs to transport column statistics:

```
struct BooleanColumnStatsData {
1: required i64 numTrues,
2: required i64 numFalses,
@@ -185,9 +188,11 @@ struct ColumnStatistics {
1: required ColumnStatisticsDesc statsDesc,
2: required list<ColumnStatisticsObj> statsObj;
}
```

We propose to add the following Thrift APIs to persist, retrieve and delete column statistics:

```
bool update_table_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1,
2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4)
bool update_partition_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1,
@@ -205,8 +210,8 @@ bool delete_partition_column_statistics(1:string db_name, 2:string tbl_name, 3:s
bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name) throws
(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
4:InvalidInputException o4)
```

Note that delete_column_statistics is needed to remove the entries from the metastore when a table is dropped. Also note that currently Hive doesn’t support drop column.

Note that in V1 of the project, we will support only scalar statistics. Furthermore, we will support only static partitions, i.e., both the partition key and partition value should be specified in the analyze command. In a following version, we will add support for height balanced histograms as well as support for dynamic partitions in the analyze command for column level statistics.
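As a hedged illustration of the static-partition restriction described above, the analyze command for column statistics might look as follows (the table, partition, and column names are hypothetical):

```
-- Both the partition key (ds) and partition value must be specified in V1.
ANALYZE TABLE page_views PARTITION (ds='2008-09-01')
COMPUTE STATISTICS FOR COLUMNS page_url, ip;

-- The computed column statistics can then be inspected with:
describe formatted page_views page_url;
```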

6 changes: 3 additions & 3 deletions content/Development/desingdocs/default-constraint.md
@@ -27,18 +27,18 @@ DEFAULT will be a fifth addition to this list. Note that unlike existing constra
CREATE TABLE will be updated to let user specify DEFAULT as follows:

* With column definition
+ CREATE TABLE <tableName> (<columnName> <dataType> DEFAULT <defaultValue>)
* `CREATE TABLE <tableName> (<columnName> <dataType> DEFAULT <defaultValue>)`

* ~~With constraint specification~~
+ ~~CREATE TABLE <tableName> (<columnName> <dataType>, …, CONSTRAINT <constraintName> DEFAULT <defaultValue> (<columnName>)~~
* ~~`CREATE TABLE <tableName> (<columnName> <dataType>, …, CONSTRAINT <constraintName> DEFAULT <defaultValue> (<columnName>)`~~

To be compliant with SQL standards, Hive will only permit default values which fall in one of the following categories:

* LITERAL
* DATE TIME VALUE FUNCTION, that is, CURRENT_TIME, CURRENT_DATE
* CURRENT_USER()
* NULL
* CAST (<expression in above category> as PRIMITIVE TYPE)
* CAST (&lt;expression in above category&gt; as PRIMITIVE TYPE)
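Taken together, a hedged sketch of a CREATE TABLE that draws a default value from each permitted category (table and column names are illustrative, not from the proposal):

```
CREATE TABLE orders (
  id INT,
  status STRING DEFAULT 'NEW',                 -- LITERAL
  created DATE DEFAULT CURRENT_DATE,           -- date time value function
  created_by STRING DEFAULT CURRENT_USER(),    -- CURRENT_USER()
  note STRING DEFAULT NULL,                    -- NULL
  priority TINYINT DEFAULT CAST(1 AS TINYINT)  -- CAST of an expression in an above category
);
```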

## INSERT

2 changes: 1 addition & 1 deletion content/Development/desingdocs/design.md
@@ -27,7 +27,7 @@ Figure 1 also shows how a typical query flows through the system. The UI calls t
Data in Hive is organized into:

* Tables – These are analogous to Tables in Relational Databases. Tables can be filtered, projected, joined and unioned. Additionally, all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables, wherein a table can be created on preexisting files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns similar to Relational Databases.
* Partitions – Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds has files with data for a particular date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in the <table location>/ds=2008-09-01/ directory in HDFS.
* Partitions – Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds has files with data for a particular date stored in the `<table location>/ds=<date>` directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in the `<table location>/ds=2008-09-01/` directory in HDFS.
* Buckets – Data in each partition may in turn be divided into Buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).

Apart from primitive column types (integers, floating point numbers, generic strings, dates and booleans), Hive also supports arrays and maps. Additionally, users can compose their own types programmatically from any of the primitives, collections or other user-defined types. The typing system is closely tied to the SerDe (Serialization/Deserialization) and object inspector interfaces. Users can create their own types by implementing their own object inspectors, and using these object inspectors they can create their own SerDes to serialize and deserialize their data into HDFS files. These two interfaces provide the necessary hooks to extend the capabilities of Hive when it comes to understanding other data formats and richer types. Builtin object inspectors like ListObjectInspector, StructObjectInspector and MapObjectInspector provide the necessary primitives to compose richer types in an extensible manner. For maps (associative arrays) and arrays, useful builtin functions like size and index operators are provided. The dotted notation is used to navigate nested types, for example a.b.c = 1 looks at field c of field b of type a and compares that with 1.
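The collection primitives and dotted navigation described above can be sketched in a query like the following (a hypothetical table with array, map and struct columns, given only as an illustration):

```
SELECT size(tags),           -- builtin size function on an array column
       tags[0],              -- array index operator
       properties['color'],  -- map index operator (lookup by key)
       address.city          -- dotted notation into a struct field (a.b.c style)
FROM   user_profiles
WHERE  address.country = 'US';
```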
@@ -129,6 +129,7 @@ Examples of conversion of query plan to hbase api calls


| Filter expression | HBase calls |
|-|-|
| p1 > 10 and p1 < 20 | Scan(X10+, X20) |
| p1 = 10 (if single partition column) | Scan(X10, X10+). Optimized? : Get(X10) |
| Similar case as above, if all partition columns are specified | |
@@ -162,77 +163,44 @@ ExpressionTree (existing) - TreeNodes for AND/OR expressions. Leaf Node for leaf

Output:

```
public static abstract class FilterPlan {

  abstract FilterPlan and(FilterPlan other);

  abstract FilterPlan or(FilterPlan other);

  abstract List<ScanPlan> getPlans();
}

// represents a union of multiple ScanPlans
MultiScanPlan extends FilterPlan

ScanPlan extends FilterPlan

  // represents the Scan start
  private ScanMarker startMarker;

  // represents the Scan end
  private ScanMarker endMarker;

  private ScanFilter filter;

  public FilterPlan and(FilterPlan other) {
    // calls this.and(otherScanPlan) on each ScanPlan in other
  }

  private ScanPlan and(ScanPlan other) {
    // combines the start markers, end markers, and filters of this and other
  }

  public FilterPlan or(FilterPlan other) {
    // just creates a new FilterPlan from other, with this additional plan
  }

PartitionFilterGenerator -

  /**
   * Visitor for ExpressionTree.
   * It first generates the ScanPlan for the leaf nodes. The higher-level nodes are
   * either AND or OR operations. It then calls FilterPlan.and and FilterPlan.or with
   * the child nodes to generate the plans for higher-level nodes.
   */
```

Initial implementation: convert from ExpressionTree to an HBase filter, thereby implementing both getPartitionsByFilter and getPartitionsByExpr.
