diff --git a/content/Development/desingdocs/column-statistics-in-hive.md b/content/Development/desingdocs/column-statistics-in-hive.md index e6a9c3f1..50230987 100644 --- a/content/Development/desingdocs/column-statistics-in-hive.md +++ b/content/Development/desingdocs/column-statistics-in-hive.md @@ -34,6 +34,7 @@ describe formatted [table_name] [column_name]; To persist column level statistics, we propose to add the following new tables, +``` CREATE TABLE TAB_COL_STATS ( CS_ID NUMBER NOT NULL, @@ -87,11 +88,13 @@ AVG_COL_LEN DOUBLE, ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_PK PRIMARY KEY (CS_ID); ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_FK1 FOREIGN KEY (PART_ID) REFERENCES PARTITIONS (PART_ID) INITIALLY DEFERRED; +``` ### **Metastore Thrift API** We propose to add the following Thrift structs to transport column statistics: +``` struct BooleanColumnStatsData { 1: required i64 numTrues, 2: required i64 numFalses, @@ -185,9 +188,11 @@ struct ColumnStatistics { 1: required ColumnStatisticsDesc statsDesc, 2: required list statsObj; } +``` We propose to add the following Thrift APIs to persist, retrieve and delete column statistics: +``` bool update_table_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1, 2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4) bool update_partition_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1, @@ -205,8 +210,8 @@ bool delete_partition_column_statistics(1:string db_name, 2:string tbl_name, 3:s bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name) throws (1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3, 4:InvalidInputException o4) +``` Note that delete_column_statistics is needed to remove the entries from the metastore when a table is dropped. Also note that currently Hive doesn’t support drop column. 
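To make the proposed struct fields concrete, here is a small illustrative sketch (not part of the proposal; the function name is made up) of how the scalar statistics carried by a string-column stats struct — null count, number of distinct values, and max/average column length — could be computed for one column. Hive itself may estimate NDV rather than compute it exactly.

```python
# Hypothetical sketch: computing the scalar statistics that a
# StringColumnStatsData-style struct would carry for one string column.
# Dictionary keys mirror the proposed TAB_COL_STATS schema columns.
def compute_string_col_stats(values):
    non_null = [v for v in values if v is not None]
    num_nulls = len(values) - len(non_null)
    num_dvs = len(set(non_null))  # exact NDV; a real system may estimate this
    max_col_len = max((len(v) for v in non_null), default=0)
    avg_col_len = (sum(len(v) for v in non_null) / len(non_null)) if non_null else 0.0
    return {
        "NUM_NULLS": num_nulls,
        "NUM_DISTINCTS": num_dvs,
        "MAX_COL_LEN": max_col_len,
        "AVG_COL_LEN": avg_col_len,
    }
```

A caller would compute these per column (and per partition, for partitioned tables) and ship them to the metastore via the update APIs above.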
Note that in V1 of the project, we will support only scalar statistics. Furthermore, we will support only static partitions, i.e., both the partition key and partition value should be specified in the analyze command. In a following version, we will add support for height balanced histograms as well as support for dynamic partitions in the analyze command for column level statistics. - diff --git a/content/Development/desingdocs/default-constraint.md b/content/Development/desingdocs/default-constraint.md index 54ae4790..ea0aae51 100644 --- a/content/Development/desingdocs/default-constraint.md +++ b/content/Development/desingdocs/default-constraint.md @@ -27,10 +27,10 @@ DEFAULT will be a fifth addition to this list. Note that unlike existing constra CREATE TABLE will be updated to let user specify DEFAULT as follows: * With column definition -+ CREATE TABLE ( DEFAULT ) + * `CREATE TABLE ( DEFAULT )` * ~~With constraint specification~~ -+ ~~CREATE TABLE ( , …, CONSTRAINT DEFAULT ()~~ + * ~~`CREATE TABLE ( , …, CONSTRAINT DEFAULT ()`~~ To be compliant with SQL standards, Hive will only permit default values which fall in one of the following categories: @@ -38,7 +38,7 @@ To be compliant with SQL standards, Hive will only permit default values which f * DATE TIME VALUE FUNCTION, that is, CURRENT_TIME, CURRENT_DATE * CURRENT_USER() * NULL -* CAST ( as PRIMITIVE TYPE) +* CAST (<expression in above category> as PRIMITIVE TYPE) ## INSERT diff --git a/content/Development/desingdocs/design.md b/content/Development/desingdocs/design.md index fdb3d841..8b4ca2e1 100644 --- a/content/Development/desingdocs/design.md +++ b/content/Development/desingdocs/design.md @@ -27,7 +27,7 @@ Figure 1 also shows how a typical query flows through the system. The UI calls t Data in Hive is organized into: * Tables – These are analogous to Tables in Relational Databases. Tables can be filtered, projected, joined and unioned. 
Additionally, all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables wherein a table can be created on preexisting files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns similar to Relational Databases. -* Partitions – Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds had files with data for a particular date stored in the /ds= directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in
/ds=2008-09-01/ directory in HDFS. +* Partitions – Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds has files with data for a particular date stored in the `
/ds=` directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in `
/ds=2008-09-01/` directory in HDFS. * Buckets – Data in each partition may in turn be divided into Buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table). Apart from primitive column types (integers, floating point numbers, generic strings, dates and booleans), Hive also supports arrays and maps. Additionally, users can compose their own types programmatically from any of the primitives, collections or other user-defined types. The typing system is closely tied to the SerDe (Serialization/Deserialization) and object inspector interfaces. Users can create their own types by implementing their own object inspectors, and using these object inspectors they can create their own SerDes to serialize and deserialize their data into HDFS files. These two interfaces provide the necessary hooks to extend the capabilities of Hive when it comes to understanding other data formats and richer types. Builtin object inspectors like ListObjectInspector, StructObjectInspector and MapObjectInspector provide the necessary primitives to compose richer types in an extensible manner. For maps (associative arrays) and arrays, useful builtin functions like size and index operators are provided. The dotted notation is used to navigate nested types, for example a.b.c = 1 looks at field c of field b of type a and compares that with 1.
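The partition-pruning and bucketing behavior described above can be sketched as follows. This is an illustrative model only — the paths, function names, and hash scheme are made up for the example; Hive's actual bucketing uses its own column hash functions.

```python
# Illustrative sketch of partition pruning and bucket assignment.
# A predicate on the partition key selects whole directories; within a
# partition, a row's bucket file is chosen by hashing the bucketing column.
def prune_partitions(partition_dirs, key, value):
    """Keep only the directories whose partition spec matches key=value."""
    wanted = f"{key}={value}"
    return [d for d in partition_dirs if wanted in d.split("/")]

def bucket_for(column_value, num_buckets):
    """Bucket index for a row, using a Java-style string hash for stability."""
    h = 0
    for ch in str(column_value):
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h % num_buckets

# The query predicate T.ds = '2008-09-01' prunes to a single directory:
dirs = ["/wh/T/ds=2008-08-31", "/wh/T/ds=2008-09-01", "/wh/T/ds=2008-09-02"]
survivors = prune_partitions(dirs, "ds", "2008-09-01")
```

Because pruning happens on directory names, no file in the non-matching partitions is ever opened, which is the efficiency argument the text makes.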
diff --git a/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md b/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md index 9cef21f3..8aad01c7 100644 --- a/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md +++ b/content/Development/desingdocs/hbase-execution-plans-for-rawstore-partition-filter-condition.md @@ -129,6 +129,7 @@ Examples of conversion of query plan to hbase api calls   | Filter expression | HBase calls | +|-|-| | p1 > 10 and p1 < 20 | Scan(X10+, X20) | | p1 = 10 (if single partition column) | Scan(X10, X10+). Optimized? : Get(X10) | | Similar case as above, if all partition columns are specified | | @@ -162,77 +163,44 @@ ExpressionTree (existing) - TreeNodes for AND/OR expressions. Leaf Node for leaf Output: +```  public static abstract class FilterPlan { -    abstract FilterPlan and(FilterPlan other); -    abstract FilterPlan or(FilterPlan other); -    abstract List getPlans(); -  } - - // represents a union of multiple ScanPlan - MultiScanPlan extends FilterPlan - ScanPlan extends FilterPlan -    // represent Scan start -    private ScanMarker startMarker ; -    // represent Scan end -    private ScanMarker endMarker ; -    private ScanFilter filter; - - public FilterPlan and(FilterPlan other) { - // calls this.and(otherScanPlan) on each scan plan in other - } - private ScanPlan and(ScanPlan other) { -   // combines start marker and end marker and filters of this and other - } - public FilterPlan or(FilterPlan other) { -   // just create a new FilterPlan from other, with this additional plan - } - - PartitionFilterGenerator - -  /** -   * Visitor for ExpressionTree. -   * It first generates the ScanPlan for the leaf nodes. The higher level nodes are -   * either AND or OR operations. It then calls FilterPlan.and and FilterPlan.or with -   * the child nodes to generate the plans for higher level nodes. 
-   */ - - - - +``` Initial implementation: Convert from ExpressionTree to HBase filter, thereby implementing both getPartitionsByFilter and getPartitionsByExpr
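The FilterPlan combination rules sketched in the hunk above can be modeled compactly: ANDing two single-range scans intersects their start/end markers (later start, earlier end), while ORing unions the plans into a MultiScanPlan. The following is a hedged Python sketch under my own naming — markers are plain comparable strings standing in for row-key prefixes like `X10` from the examples table.

```python
# Sketch of the described FilterPlan semantics: AND intersects the
# [start, end) scan markers of two ScanPlans; OR unions the plan lists.
class ScanPlan:
    def __init__(self, start, end):
        self.start, self.end = start, end

    def and_(self, other):
        # Intersection of two key ranges: take the later start, earlier end.
        return ScanPlan(max(self.start, other.start), min(self.end, other.end))

class MultiScanPlan:
    """Union of several ScanPlans (the OR case)."""
    def __init__(self, plans):
        self.plans = plans

def or_(a, b):
    return MultiScanPlan([a, b])

# p1 > 10 and p1 < 20  →  Scan(X10+, X20), matching the examples table:
left = ScanPlan("X10+", "X99")   # p1 > 10 (upper bound open-ended here)
right = ScanPlan("X00", "X20")   # p1 < 20
combined = left.and_(right)
```

A visitor over the ExpressionTree would build leaf ScanPlans this way and fold them upward with `and_`/`or_`, which is exactly the role PartitionFilterGenerator plays in the design.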