diff --git a/test_data/README.md b/test_data/README.md
new file mode 100755
index 0000000..c0d6a94
--- /dev/null
+++ b/test_data/README.md
@@ -0,0 +1,95 @@
+# Apache Spark
+
+Spark is a fast and general cluster computing system for Big Data. It provides
+high-level APIs in Scala, Java, Python, and R, and an optimized engine that
+supports general computation graphs for data analysis. It also supports a
+rich set of higher-level tools including Spark SQL for SQL and DataFrames,
+MLlib for machine learning, GraphX for graph processing,
+and Spark Streaming for stream processing.
+
+
+
+## Online Documentation
+
+You can find the latest Spark documentation, including a programming
+guide, on the [project web page](http://spark.apache.org/documentation.html)
+and [project wiki](https://cwiki.apache.org/confluence/display/SPARK).
+This README file only contains basic setup instructions.
+
+## Building Spark
+
+Spark is built using [Apache Maven](http://maven.apache.org/).
+To build Spark and its example programs, run:
+
+    build/mvn -DskipTests clean package
+
+(You do not need to do this if you downloaded a pre-built package.)
+More detailed documentation is available from the project site, at
+["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
+
+## Interactive Scala Shell
+
+The easiest way to start using Spark is through the Scala shell:
+
+    ./bin/spark-shell
+
+Try the following command, which should return 1000:
+
+    scala> sc.parallelize(1 to 1000).count()
+
+## Interactive Python Shell
+
+Alternatively, if you prefer Python, you can use the Python shell:
+
+    ./bin/pyspark
+
+And run the following command, which should also return 1000:
+
+    >>> sc.parallelize(range(1000)).count()
+
+## Example Programs
+
+Spark also comes with several sample programs in the `examples` directory.
+To run one of them, use `./bin/run-example <class> [params]`. For example:
+
+    ./bin/run-example SparkPi
+
+will run the Pi example locally.
+
+You can set the MASTER environment variable when running examples to submit
+examples to a cluster. This can be a mesos:// or spark:// URL,
+"yarn" to run on YARN, and "local" to run
+locally with one thread, or "local[N]" to run locally with N threads. You
+can also use an abbreviated class name if the class is in the `examples`
+package. For instance:
+
+    MASTER=spark://host:7077 ./bin/run-example SparkPi
+
+Many of the example programs print usage help if no params are given.
+
+## Running Tests
+
+Testing first requires [building Spark](#building-spark). Once Spark is built, tests
+can be run using:
+
+    ./dev/run-tests
+
+Please see the guidance on how to
+[run tests for a module, or individual tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).
+
+## A Note About Hadoop Versions
+
+Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
+storage systems. Because the protocols have changed in different versions of
+Hadoop, you must build Spark against the same version that your cluster runs.
+
+Please refer to the build documentation at
+["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
+for detailed guidance on building for a particular distribution of Hadoop, including
+building for particular Hive and Hive Thriftserver distributions.
+
+## Configuration
+
+Please refer to the [Configuration Guide](http://spark.apache.org/docs/latest/configuration.html)
+in the online documentation for an overview on how to configure Spark.
diff --git a/test_pyspark_shell.sh b/test_pyspark_shell.sh
index 3d46660..6879c10 100755
--- a/test_pyspark_shell.sh
+++ b/test_pyspark_shell.sh
@@ -9,7 +9,7 @@ curr_dir=`cd $curr_dir; pwd`
 # Default SPARK_HOME location is already checked by init_spark.sh
 spark_home=${SPARK_HOME:='/opt/spark'}
 if [ ! -d "$spark_home" ] ; then
-    >&2 echo "fail - $spark_home does not exist, please check you Spark installation or SPARK_HOME env variable, exinting!"
+    >&2 echo "fail - $spark_home does not exist, please check your Spark installation or SPARK_HOME env variable, exiting!"
     exit -2
 else
     echo "ok - applying Spark home $spark_home"
@@ -43,7 +43,9 @@ fi
 pushd `pwd`
 cd $spark_home
 hdfs dfs -mkdir -p spark/test/
-hdfs dfs -put $spark_home/README.md spark/test/
+
+# Use the README.md from test_data to differentiate it from the sparkexample README.md
+hdfs dfs -put "$spark_test_dir/test_data/README.md" spark/test/
 
 # Leverage a simple use case here
 hdfs dfs -put "$spark_test_dir/src/main/resources/spam_sample.txt" spark/test/
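
The SPARK_HOME guard this patch touches is a reusable pattern: default the variable, then verify the directory exists before doing any work. A minimal standalone sketch — the `/opt/spark` default and the ok/fail messages mirror the script, while the function name `check_spark_home` and the `/tmp` example call are illustrative:

```shell
#!/bin/sh
# Guard pattern from test_pyspark_shell.sh: fall back to a default
# SPARK_HOME, then verify the directory exists before continuing.
check_spark_home() {
    spark_home=${SPARK_HOME:='/opt/spark'}
    if [ ! -d "$spark_home" ] ; then
        >&2 echo "fail - $spark_home does not exist, please check your Spark installation or SPARK_HOME env variable, exiting!"
        return 2   # positive status codes are portable; 'exit -2' is shell-dependent
    fi
    echo "ok - applying Spark home $spark_home"
}

# /tmp exists on any POSIX system, so the check passes.
SPARK_HOME=/tmp check_spark_home   # prints: ok - applying Spark home /tmp
```

Using `return` from a function (instead of `exit` as in the script) lets callers decide whether a missing Spark home is fatal, which is handy when the same check runs in several test scripts.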