From 1a0bd31877462d7178e4b6ce23b6f98975925bfb Mon Sep 17 00:00:00 2001 From: Michele Mor Date: Wed, 29 Nov 2017 13:13:46 +0000 Subject: [PATCH 1/2] Updated blog on Hadoop and Hive --- _posts/2015-06-06-PDI-Hadoop-Dev-Env.md | 89 ++++++++++++++++--------- 1 file changed, 59 insertions(+), 30 deletions(-) diff --git a/_posts/2015-06-06-PDI-Hadoop-Dev-Env.md b/_posts/2015-06-06-PDI-Hadoop-Dev-Env.md index b4ceb0c..5cdc12c 100644 --- a/_posts/2015-06-06-PDI-Hadoop-Dev-Env.md +++ b/_posts/2015-06-06-PDI-Hadoop-Dev-Env.md @@ -16,21 +16,21 @@ In this article I'd like to explain how to set up **Vanilla Hadoop** and configu # Apache Hadoop and Hive Installation: Minimal Dev Environment -This article focuses on setting up a minimal Hadoop dev environment. There are certainly easier ways, e.g. using one of the VMs supplied by one of the major commercial Hadoop vendors: If you have a machine with **enough memory** this is certainly the most convenient way to go. However, you might not want to sacrifice a vast amount of **RAM** or not have such a high spec machine, so setting up **Hadoop** natively on your machine is the way to go. It might not be the easiest way, but you'll certainly learn a few interesting details on the way - that's what any great journey is about! +This article focuses on setting up a minimal Hadoop dev environment. There are certainly easier ways, e.g. using one of the VMs supplied by one of the major commercial Hadoop vendors; if you have a machine with **enough memory** this is certainly the most convenient way to go. However, you might not want to sacrifice a vast amount of **RAM** or not have such a high spec machine, so setting up **Hadoop** natively on your machine is the way to go. It might not be the easiest way, but you'll certainly learn a few interesting details on the way - that's what any great journey is about! -If you very interested in Hadoop, I can strongely recommend the excellent book **Hadoop - The Definitive Guide** by Tom White. 
+If you are very interested in Hadoop, I can strongly recommend the excellent book **Hadoop - The Definitive Guide** by Tom White. ## Installing HDFS ### Downloading the required files -You can download the files directly from the [Apache Hadoop](https://hadoop.apache.org/releases.html) website or they might be available via your packaging system (e.g. on Fedora it is available via yum or dnf). The instructions here will mainly follow my setup on **Mac OS X** - your milage may vary. +You can download the files directly from the [Apache Hadoop](https://hadoop.apache.org/releases.html) website or they might be available via your packaging system (e.g. on Fedora it is available via yum or dnf). The instructions here will mainly follow my setup on **Mac OS X** - your mileage may vary. On **Mac OS X** you can alternatively install Hadoop via **Homebrew**: `brew install hadoop`, in which case all the files will be located in `/usr/local/Cellar/hadoop/`. I assume you have **JDK** already installed: Check that you are using the correct [supported Java version](http://wiki.apache.org/hadoop/HadoopJavaVersions). -I added the hadoop files under `/Applications/Development/Hadoop`, you can choose any other suitable directory. +I added the Hadoop files under `/Applications/Development/Hadoop`, you can choose any other suitable directory. Add the following to `~/.bash_profile` (adjust if required): @@ -42,11 +42,15 @@ export PATH=$PATH:$HADOOP_HOME/sbin export PATH=$PATH:$HADOOP_HOME/bin ``` -`JAVA_HOME` must be set in `~/.bash_profile` already. +`JAVA_HOME` must be set in `~/.bash_profile` already: +``` +export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre +export PATH=$PATH:$JAVA_HOME/bin +``` Next we have to enable passwordless login: -In regards to **ssh** there is not much to do on **Mac OS X**: Just make sure that you enable **Remote Login** via the Mac OS X **System Settings** > **Sharing**. There is no need to install anything else. 
**Linux Users** might have to install ssh and openssh-server. +In regards to **ssh** there is not much to do on **Mac OS X**: Just make sure that you enable **Remote Login** via the Mac OS X **System Settings** > **Sharing**. There is no need to install anything else. **Linux Users** might have to install openssh-client and openssh-server. Generate a new **SSH key** with an **empty passphrase**: @@ -90,10 +94,10 @@ For **pseudo-distributed** node add this: > **Note**: The default port for HDFS (the namenode) will be `8020`. You can however set the port as well in the config above like so: `hdfs://localhost:9000`. -`hdfs-site.xml`: - +`hdfs-site.xml`: + **Option 1**: data gets stored in `/tmp` directory. - + ```xml @@ -103,7 +107,7 @@ For **pseudo-distributed** node add this: ``` -**Option 2**: Dedicated **permanent data directory**. You do not have to create the directories on the file system upfront - it will be created automatically for you. +**Option 2**: Dedicated **permanent data directory**. You do not have to create the directories on the file system upfront - they will be created automatically for you. ```xml @@ -122,7 +126,7 @@ For **pseudo-distributed** node add this: ``` -> **Note**: All possible **properties** and **default values** for `hdfs-site.xml` you can find [here](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml). +> **Note**: All possible **properties** and **default values** for `hdfs-site.xml` can be found [here](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml). Next: @@ -137,7 +141,7 @@ Next: ``` -> **Note**: All possible **properties** and **default values** for `mapred-site.xml` you can find [here](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml). 
+> **Note**: All possible **properties** and **default values** for `mapred-site.xml` can be found [here](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml). And finally: @@ -186,7 +190,7 @@ And finally: ``` -> **Note**: All possible **properties** and **default values** for `yarn-site.xml` you can find [here](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml). +> **Note**: All possible **properties** and **default values** for `yarn-site.xml` can be found [here](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml). Now we can format the **namenode**: @@ -197,9 +201,9 @@ hdfs namenode -format Pay attention to the last few lines of the log output, where you will find the root directory of your HDFS file system, e.g.: `/tmp/hadoop-diethardsteiner/dfs/name`. -To start **HDFS**, **YARN** and **MapReduce**, run the below commands (this can be run from any directory): +To start **HDFS**, **YARN** and **MapReduce**, run the below commands (they can be run from any directory): -> **IMPORTANT**: Before you start Hadoop, make sure that **HDFS** is formatted. Even after the original setup it might be necessary to do this upfront as the HDFS directory might be located in `/tmp` (and hence gets lost each time you restart your machine). +> **IMPORTANT**: Before you start Hadoop, make sure that **HDFS** is formatted. Even after the original setup it might be necessary to do this upfront if the HDFS directory is located in `/tmp` (and hence gets lost each time you restart your machine) as explained in [**Option 1**](#slink1). 
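Because a `/tmp`-based name directory vanishes on reboot, a start script can guard against launching Hadoop on an unformatted HDFS. This is a minimal sketch, not part of the original setup: the `NAME_DIR` default is an assumption and should match `dfs.namenode.name.dir` (or the path reported by the format log); it relies on HDFS writing a `current/VERSION` file into the name directory once it has been formatted.

```shell
# Sketch only: detect whether the namenode still needs formatting.
# NAME_DIR is an assumption -- align it with dfs.namenode.name.dir.
NAME_DIR="${NAME_DIR:-/tmp/hadoop-$USER/dfs/name}"

needs_format() {
  # A formatted namenode leaves a current/VERSION file behind; if it is
  # missing (e.g. /tmp was wiped on reboot), formatting is required again.
  [ ! -f "$NAME_DIR/current/VERSION" ]
}

if needs_format; then
  echo "Name dir not initialised - run: hdfs namenode -format"
fi
```

Running this check before `start-dfs.sh` avoids the confusing errors you otherwise get from an empty name directory.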
```
start-dfs.sh
@@ -213,6 +217,11 @@ Check that the following websites are accessible:
- [Resource Manager](http://localhost:8088/)
- [History Server](http://localhost:19888/)
+On Linux, it may be necessary to set `JAVA_HOME` in `/etc/environment` for the dfs and yarn scripts to work:
+```
+JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
+```
+
Create user directory:
```
@@ -227,7 +236,7 @@ hdfs dfs -ls /
Inside the root folder of your Hadoop installation try to run this map-reduce job to check everything is working (amend version number).
-> **Note**: The first command will put the file directly into the current user's **HDFS** directory (so make sure it exists). So although only `input` is mentioned, it will automatically be expande to `/user/$USER/input`.
+> **Note**: The first command will put the file directly into the current user's **HDFS** directory (so make sure it exists). So although only `input` is mentioned, it will automatically be expanded to `/user/$USER/input`.
```
hdfs dfs -put etc/hadoop input
@@ -259,7 +268,7 @@ hdfs dfs -ls /user/diethardsteiner/output
sudo netstat -lanp | grep 8088
```
-If this returns something another process is already using this port. You have two options now: Either stop the other process or change the port for the ResoureManager Web UI.
+If this returns something, another process is already using this port. You have two options now: either stop the other process or change the port for the ResourceManager Web UI.
#### MacOS: Java no such file or directory
@@ -306,7 +315,7 @@ Failing this attempt.Diagnostics: Container [pid=13437,containerID=container_151
Container killed on request. Exit code is 143
```
-**Solution**: As stated in [this post on stackoverflow](https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits): There is a check placed at Yarn level for Vertual and Physical memory usage ratio. Issue is not only that VM doesn't have sufficient pysical memory.
But it is because Virtual memory usage is more than expected for given physical memory. +**Solution**: As stated in [this post on stackoverflow](https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits): There is a check placed at Yarn level for Virtual and Physical memory usage ratio. Issue is not only that VM doesn't have sufficient physical memory. But it is because Virtual memory usage is more than expected for given physical memory. > **Note** : This is happening on Centos/RHEL 6/Fedora due to its aggressive allocation of virtual memory. @@ -340,7 +349,7 @@ stop-yarn.sh stop-dfs.sh ``` -Now that we know that everything is working, let's fine-tune our setup. Ideally we should keep the config files (all `*site.xml` files) separate from the install directory (so that it is e.g. easier to upgrade). I copied these files to a dedicated Dropbox folder and use the `HADOOP_CONFIG_DIR` environment variable to point to it. Another benefit is that now I can use the same config files for my dev environment on another machine. +Now that we know that everything is working, let's fine-tune our setup. Ideally we should keep the config files (all `*site.xml` files included in /Applications/Development/Hadoop/hadoop-2.6.0/etc/hadoop) separate from the install directory (so that it is easier to upgrade). I copied these files to a dedicated Dropbox folder and use the `HADOOP_CONFIG_DIR` environment variable to point to it. Another benefit is that now I can use the same config files for my dev environment on another machine. > **Note**: It seems like you have to copy all files in this conf directory, as I saw error messages that e.g. the slaves file could not be found. @@ -390,7 +399,10 @@ stop-yarn.sh stop-dfs.sh ``` -Then source the bash profile file. 
+Change the permissions on those files so that they are executable, then source the bash profile file:
+```
+source ~/.bash_profile
+```
### Command Line Utilities
@@ -452,9 +464,15 @@ CREATE DATABASE hive_metastore_db;
-- CREATE DATABASE hive_stats;
```
+It is suggested to add a dedicated user to the database:
+```sql
+GRANT ALL ON hive_metastore_db.* TO 'stats_user'@'%' IDENTIFIED BY 'stats_password';
+```
+These details will be used in a configuration file later on.
+
Configuration ([hive-site.xml](https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-ConfiguringHive))
-Find the [Metadstore Admin Manual here](https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin).
+Find the Metastore Admin Manual [here](https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin).
Find some info on how to configure the Hive Stats DB [here](http://www.cloudera.com/documentation/manager/5-0-x/Cloudera-Manager-Managing-Clusters/cm5mc_hive_table_stats.html).
@@ -622,6 +640,13 @@ Create the tables:
schematool --verbose -dbType postgres -initSchema
```
+or
+
+```bash
+schematool --verbose -dbType mysql -initSchema
+```
+
+
Some notes on certain properties:
`hive.server2.enable.doAs`: This is related to **HiveServer2**, which we will take a look at a bit later on. When connecting via a client to **HiveServer2**, this property specifies whether **HiveServer2** should impersonate the connected user or not. For our little and simple set up we want to disable this so that we can log on without username and password.
@@ -636,7 +661,7 @@ You also see a commented section for the **Hive Stats DB**:
> **Note**: Hive table statistics are not supported for PostgreSQL or Oracle - MySQL only!
-The reason why I commented/disabled them is that table and column stats show up with out this extra DB any ways.
You can check this be running e.g.: +The reason why I commented/disabled them is that table and column stats show up without this extra DB anyway. You can check this by running e.g.: ```sql CREATE TABLE test (foo STRING, bar STRING); @@ -669,6 +694,10 @@ Just as an interesting side node: This days Hive includes a [Hive Schema Tool](h ```bash schematool -dbType postgres -info ``` +or +```bash +schematool -dbType mysql -info +``` In my case this returned: @@ -691,6 +720,7 @@ schematool -dbType postgres -info # start hive metastore service hive --service metastore ``` +(in the commands above, replace _psql_ with _mysql_ and _postgres_ with _mysql_ if needed) So taking the learnings from this error, it is better if we use the `schematool` to create the schema instead of the `datanucleus.autoCreateSchema=true` setting in the `hive-site.xml`. @@ -821,7 +851,7 @@ This was related to `hive.server2.enable.doAs`. Setting it to `false` solved the 1. Make sure the hive database (e.g. MySQL) is up and running -2. Start the metastore service: `nohup hive --service metastore>/tmp/hive.out 2>/tmp/hive.log &` +2. Start the metastore service: `nohup hive --service metastore > /tmp/hive.out 2>/tmp/hive.log & tail -f /tmp/hive.log` 3. Test Hive: @@ -836,8 +866,7 @@ This was related to `hive.server2.enable.doAs`. Setting it to `false` solved the 4. Start HiveServer2: ``` - nohup hiveserver2 > /tmp/hiveserver2.out 2> /tmp/hiveserver2.log & - tail -f /tmp/hiveserver2.log + nohup hiveserver2 > /tmp/hiveserver2.out 2>/tmp/hiveserver2.log & tail -f /tmp/hiveserver2.log ``` 5. Open Beeline command line shell to interact with HiveServer2. 
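To illustrate step 5, the JDBC URL Beeline expects can be composed from a few settings. The defaults below are assumptions: HiveServer2 listens on port 10000 unless `hive.server2.thrift.port` says otherwise, and the passwordless login relies on the `hive.server2.enable.doAs=false` setup described earlier.

```shell
# Compose the HiveServer2 JDBC URL used by Beeline.
HS2_HOST="${HS2_HOST:-localhost}"
HS2_PORT="${HS2_PORT:-10000}"
HS2_DB="${HS2_DB:-default}"
HS2_URL="jdbc:hive2://${HS2_HOST}:${HS2_PORT}/${HS2_DB}"
echo "$HS2_URL"
# With a running HiveServer2 you would then connect via:
#   beeline -u "$HS2_URL"
```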
@@ -865,8 +894,8 @@ In a convenient folder outside the Hive install directory place these start and
```bash
mysql.server start
-nohup hive --service metastore>/tmp/hive.out 2>/tmp/hive.log &
-nohup hiveserver2 > /tmp/hiveserver2.out 2> /tmp/hiveserver2.log &
+nohup hive --service metastore >/tmp/hive.out 2>/tmp/hive.log &
+nohup hiveserver2 >/tmp/hiveserver2.out 2>/tmp/hiveserver2.log &
```
`stop-hive.sh`: Test this one on your OS - this is my custom version for Mac OS X.
@@ -901,7 +930,7 @@ $ jps
## Installing PDI
-It's extremely easy to install PDI: Just download the latest version from [Sourceforge](http://sourceforge.net/projects/pentaho/files/Data%20Integration/), unzip the folder and place it in a convientent location (e.g. Applications folder).
+It's extremely easy to install PDI: Just download the latest version from [Sourceforge](https://sourceforge.net/projects/pentaho/files/Data%20Integration/), unzip the folder and place it in a convenient location (e.g. Applications folder).
## Configurating PDI
@@ -932,7 +961,7 @@ Sources:
If you are using the vanilla **Apache Hadoop** installation (so not Cloudera or similar), then there is one important point to consider: PDI ships with an ancient Big Data shim for this **Apache Hadoop**. The default PDI shim is **hadoop-20** which refers to an Apache Hadoop 0.20.x distribution ([Source](http://stackoverflow.com/questions/25043374/unable-to-connect-to-hdfs-using-pdi-step)), so don't be fooled that the `20` suffix refers to the Hadoop 2.x.x release you just downloaded.
-Matt Burgess suggested using the Hortenworks (HDP) shim instead (for the reason that **Hortonworks** feeds most of the in-house improvements back into the open source projects). This works fine out-of-the-box if you are only using **HDFS** directly (so copying files etc).
To use MapReduce, we have to make a few amendments: +Matt Burgess suggested using the Hortonworks (HDP) shim instead (for the reason that **Hortonworks** feeds most of the in-house improvements back into the open source projects). This works fine out-of-the-box if you are only using **HDFS** directly (so copying files etc). To use MapReduce, we have to make a few amendments: **How to figure out which version of Hadoop a particular shim uses**: Go into the `/lib/client` folder and see which version number the `hadoop*` files have. This will help you understand if your vanilla Hadoop distro version is supported or not. In my case I had vanilla Apache Hadoop 2.6.0 installed and required a shim that supported just this version. PDI-CE-5.3 ships with the **hdp21** shim, which seems to support Hadoop 2.4. Luckily Pentaho had already a newer **Hortonworks** shim available: You can browse all the available [Shims on Github](https://github.com/pentaho/pentaho-hadoop-shims) and download a compiled version from [Shims on CI](http://ci.pentaho.org/view/Big%20Data/job/pentaho-hadoop-shims-5.3/) (note this is PDI version specific, see notes further down also for detailed download instructions). The shim that supported the same version was **hdp22**. @@ -980,7 +1009,7 @@ active.hadoop.configuration=hdp22 All the configuration details are set up now. -For more details on how to adjust the PDI Shin to specific distributions see [here](http://wiki.pentaho.com/display/BAD/Additional+Configuration+for+YARN+Shims). +For more details on how to adjust the PDI Shim to specific distributions see [here](http://wiki.pentaho.com/display/BAD/Additional+Configuration+for+YARN+Shims). 
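The manual check described above ("go into the `/lib/client` folder and see which version number the `hadoop*` files have") can be scripted. This is a sketch under assumptions: the `PDI_HOME` default and the shim path layout are illustrative, and it assumes the client jars follow the usual `hadoop-common-<version>.jar` naming.

```shell
# Extract the Hadoop version a PDI shim targets from its client jars.
shim_hadoop_version() {
  # e.g. hadoop-common-2.6.0.2.2.0.0-2041.jar -> 2.6.0
  ls "$1"/hadoop-common-*.jar 2>/dev/null \
    | sed -E 's/.*hadoop-common-([0-9]+\.[0-9]+\.[0-9]+).*/\1/' \
    | head -n 1
}

# Example call -- adjust PDI_HOME and the shim name (hdp22) to your install:
shim_hadoop_version "${PDI_HOME:-/Applications/pdi-ce}/plugins/pentaho-big-data-plugin/hadoop-configurations/hdp22/lib/client"
```

Comparing the printed version against your local `hadoop version` output tells you quickly whether the shim matches your cluster.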
Info about Shims: From dc3dc39f460e491453b5495844998135f454901b Mon Sep 17 00:00:00 2001 From: Michele Mor Date: Wed, 29 Nov 2017 15:13:03 +0000 Subject: [PATCH 2/2] Updated blog on Presto --- _posts/2016-02-14-Presto.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/_posts/2016-02-14-Presto.md b/_posts/2016-02-14-Presto.md index 80efd70..c94676b 100644 --- a/_posts/2016-02-14-Presto.md +++ b/_posts/2016-02-14-Presto.md @@ -9,13 +9,13 @@ published: true --- This is the second part of the **Setting up a minimal Big Data development environment on your local machine** series. The aim is to use as little resources as possible, hence we are not using a VM. The first part of this series is available [here](/big/data/2016/02/14/Presto.html) and is prerequisite for this blog post. -Previously we had a look at how to configure **HDFS** and **Hive** and we also set up **Pentaho Data Integrating** to connect to HDFS as well as Hive. This is a nice basic setup, however, quite often we also want a database with a short response time. Currently it's not that easy to install Impala standalone (or outside of the CDH stack), so that leaves us with two popular alternatives: [Apache Drill](https://drill.apache.org), [druid](http://druid.io) and [Presto](https://prestodb.io). In this blog post we will take a look at Facebook's Presto offering: +Previously we had a look at how to configure **HDFS** and **Hive** and we also set up **Pentaho Data Integration** to connect to HDFS as well as Hive. This is a nice basic setup, however, quite often we also want a database with a short response time. Currently it's not that easy to install Impala standalone (or outside of the CDH stack), so that leaves us with three popular alternatives: [Apache Drill](https://drill.apache.org), [druid](http://druid.io) and [Presto](https://prestodb.io). 
In this blog post we will take a look at Facebook's Presto offering: # Installing Presto Download the files from [here](https://prestodb.io/docs/current/installation/deployment.html). Once you extracted the files at a convenient location, let's start configuring **Presto**. The configuration is fairly straight forward and consists of creating a few properties files. -Navigate inside the extracted folder and issue the following commands: +Navigate inside the extracted folder and issue the following commands to create the first properties file, _node.properties_: ```bash $ mkdir etc @@ -31,7 +31,7 @@ node.id=ffffffff-ffff-ffff-ffff-ffffffffffff node.data-dir=/tmp/presto/data ``` -Save file and exit. Next define the **main configuration** in `config.properties`. Important settings are related to memory and ports: +Save file and exit. Next create _config.properties_ and define the **main configuration**. Important settings are related to memory and ports: ```bash $ vi config.properties @@ -44,10 +44,11 @@ discovery-server.enabled=true discovery.uri=http://localhost:8080 ``` -Save file and exit. This is a suitable configuration for a dev environment on a local machine. Next we define the properties for the **JVM**: +Save file and exit. This is a suitable configuration for a dev environment on a local machine. +Next we define the properties for the **JVM**: ```bash -$ vi jvm.properties +$ vi jvm.config -server -Xmx6G -XX:+UseG1GC @@ -65,7 +66,7 @@ $ vi log.properties com.facebook.presto=INFO ``` -Save file and exit. Let's also create a **catalog directory** so that we can soon store the configuration files in there to connect to various data stores: +Save file and exit. 
Let's also create a **catalog directory** where we will soon store the configuration files to connect to various data stores: ```bash $ mkdir catalog @@ -73,7 +74,7 @@ $ mkdir catalog # Basics -To start Presto: +To start Presto open a terminal in the installation directory and type: ``` bin/launcher start @@ -205,7 +206,7 @@ Splits: 2 total, 2 done (100.00%) # JDBC Driver -While it's fun for some time to write queries on the command line, more of then not we want to use a given database with some power tool like **Pentaho Data Integration**, in which case we need a **JDBC Driver** to connect the tool to the database. The **Presto JDBC Driver** can be downloaded from [here](https://prestodb.io/docs/current/installation/jdbc.html). +While it's fun for some time to write queries on the command line, more often than not we want to use a given database with some power tool like **Pentaho Data Integration**, in which case we need a **JDBC Driver** to connect the tool to the database. The **Presto JDBC Driver** can be downloaded from [here](https://prestodb.io/docs/current/installation/jdbc.html). The connection string looks like this: @@ -225,7 +226,7 @@ In regards to **username** and **password**: I haven't found any information yet # Web Interface -If you want to see a list of running and executed queries, **Presto** comes with a simple web interface which you can find under the URL you speficied in the `config.properties`. In my case this is `http://localhost:8080`. +If you want to see a list of running and executed queries, **Presto** comes with a simple web interface which you can find under the URL you specified in the `config.properties`. In my case this is `http://localhost:8080`. # SQL Considerations
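Looping back to the **catalog directory** created during setup: to actually query the Hive tables from part one, Presto needs a connector definition dropped in there. A minimal sketch of `etc/catalog/hive.properties` follows; the thrift URI is an assumption based on the Hive metastore service running locally on its default port 9083 (adjust host and port to your metastore setup).

```
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
```

After adding or changing a catalog file, restart Presto (`bin/launcher restart`) so the new catalog is picked up.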