Altiscale · anil-altiscale · May 5, 2019 · meni-altiscale · May 6, 2019 · alee-altiscale
diff --git a/test_data/README.md b/test_data/README.md
@@ -0,0 +1,95 @@
+# Apache Spark
+
+Spark is a fast and general cluster computing system for Big Data. It provides
+high-level APIs in Scala, Java, Python, and R, and an optimized engine that
+supports general computation graphs for data analysis. It also supports a
+rich set of higher-level tools including Spark SQL for SQL and DataFrames,
+MLlib for machine learning, GraphX for graph processing,
+and Spark Streaming for stream processing.
+
+<http://spark.apache.org/>
+
+
+## Online Documentation
+
+You can find the latest Spark documentation, including a programming
+guide, on the [project web page](http://spark.apache.org/documentation.html)
+and [project wiki](https://cwiki.apache.org/confluence/display/SPARK).
+This README file only contains basic setup instructions.
+
+## Building Spark
+
+Spark is built using [Apache Maven](http://maven.apache.org/).
+To build Spark and its example programs, run:
+
+    build/mvn -DskipTests clean package
+
+(You do not need to do this if you downloaded a pre-built package.)
+More detailed documentation is available from the project site, at
+["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
+
+## Interactive Scala Shell
+
+The easiest way to start using Spark is through the Scala shell:
+
+    ./bin/spark-shell
+
+Try the following command, which should return 1000:
+
+    scala> sc.parallelize(1 to 1000).count()
+
+## Interactive Python Shell
+
+Alternatively, if you prefer Python, you can use the Python shell:
+
+    ./bin/pyspark
+
+And run the following command, which should also return 1000:
+
+    >>> sc.parallelize(range(1000)).count()
+
+## Example Programs
+
+Spark also comes with several sample programs in the `examples` directory.
+To run one of them, use `./bin/run-example <class> [params]`. For example:
+
+    ./bin/run-example SparkPi
+
+will run the Pi example locally.
+
+You can set the MASTER environment variable when running examples to submit
+examples to a cluster. This can be a mesos:// or spark:// URL,
+"yarn" to run on YARN, "local" to run
+locally with one thread, or "local[N]" to run locally with N threads. You
+can also use an abbreviated class name if the class is in the `examples`
+package. For instance:
+
+    MASTER=spark://host:7077 ./bin/run-example SparkPi
+
+Many of the example programs print usage help if no params are given.
+
+## Running Tests
+
+Testing first requires [building Spark](#building-spark). Once Spark is built, tests
+can be run using:
+
+    ./dev/run-tests
+
+Please see the guidance on how to
+[run tests for a module, or individual tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).
+
+## A Note About Hadoop Versions
+
+Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
+storage systems. Because the protocols have changed in different versions of
+Hadoop, you must build Spark against the same version that your cluster runs.
+
+Please refer to the build documentation at
+["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
+for detailed guidance on building for a particular distribution of Hadoop, including
+building for particular Hive and Hive Thriftserver distributions.
+
+## Configuration
+
+Please refer to the [Configuration Guide](http://spark.apache.org/docs/latest/configuration.html)
+in the online documentation for an overview of how to configure Spark.
diff --git a/test_pyspark_shell.sh b/test_pyspark_shell.sh
@@ -9,7 +9,7 @@ curr_dir=`cd $curr_dir; pwd`
 # Default SPARK_HOME location is already checked by init_spark.sh
 spark_home=${SPARK_HOME:='/opt/spark'}
 if [ ! -d "$spark_home" ] ; then
-  >&2 echo "fail - $spark_home does not exist, please check you Spark installation or SPARK_HOME env variable, exinting!"
+  >&2 echo "fail - $spark_home does not exist, please check your Spark installation or SPARK_HOME env variable, exiting!"
   exit -2
 else
   echo "ok - applying Spark home $spark_home"
@@ -43,7 +43,9 @@ fi
 pushd `pwd`
 cd $spark_home
 hdfs dfs -mkdir -p spark/test/
-hdfs dfs -put $spark_home/README.md spark/test/
+
+# Upload the Spark README.md kept in test_data so it is not confused with the sparkexample README.md
+hdfs dfs -put "$spark_test_dir/test_data/README.md" spark/test/
 
 # Leverage a simple use case here
 hdfs dfs -put "$spark_test_dir/src/main/resources/spam_sample.txt" spark/test/
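The upload now reads from `$spark_test_dir/test_data/README.md` rather than `$spark_home/README.md`, so a checkout without that fixture would only fail once `hdfs dfs -put` runs. A minimal sketch of an early guard follows, mirroring the `ok -` / `fail -` message style of the script. `require_file` is a hypothetical helper, not something the script defines, and `spark_test_dir` is assumed to be set by the caller as in the lines above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): report whether a
# local test fixture exists before any `hdfs dfs -put` is attempted.
require_file() {
  local path="$1"
  if [ ! -f "$path" ]; then
    # Errors go to stderr, like the SPARK_HOME check in the script.
    >&2 echo "fail - test data $path does not exist, exiting!"
    return 1
  fi
  echo "ok - found test data $path"
}

# Possible use ahead of the upload (spark_test_dir is assumed to be set):
#   require_file "$spark_test_dir/test_data/README.md" || exit 2
```

Returning a status instead of calling `exit` inside the function leaves the choice of exit code to the calling script.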