BDSGOLD-301. Enable test_pyspark_shell test to pass on a fresh cluster with spark-2.3.2 as default. #14
base: branch-2.3.2-alti
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,95 @@ | ||
| # Apache Spark | ||
|
|
||
| Spark is a fast and general cluster computing system for Big Data. It provides | ||
| high-level APIs in Scala, Java, Python, and R, and an optimized engine that | ||
| supports general computation graphs for data analysis. It also supports a | ||
| rich set of higher-level tools including Spark SQL for SQL and DataFrames, | ||
| MLlib for machine learning, GraphX for graph processing, | ||
| and Spark Streaming for stream processing. | ||
|
|
||
| <http://spark.apache.org/> | ||
|
|
||
|
|
||
| ## Online Documentation | ||
|
|
||
| You can find the latest Spark documentation, including a programming | ||
| guide, on the [project web page](http://spark.apache.org/documentation.html) | ||
| and [project wiki](https://cwiki.apache.org/confluence/display/SPARK). | ||
| This README file only contains basic setup instructions. | ||
|
|
||
| ## Building Spark | ||
|
|
||
| Spark is built using [Apache Maven](http://maven.apache.org/). | ||
| To build Spark and its example programs, run: | ||
|
|
||
| build/mvn -DskipTests clean package | ||
|
|
||
| (You do not need to do this if you downloaded a pre-built package.) | ||
| More detailed documentation is available from the project site, at | ||
| ["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html). | ||
|
|
||
| ## Interactive Scala Shell | ||
|
|
||
| The easiest way to start using Spark is through the Scala shell: | ||
|
|
||
| ./bin/spark-shell | ||
|
|
||
| Try the following command, which should return 1000: | ||
|
|
||
| scala> sc.parallelize(1 to 1000).count() | ||
|
|
||
| ## Interactive Python Shell | ||
|
|
||
| Alternatively, if you prefer Python, you can use the Python shell: | ||
|
|
||
| ./bin/pyspark | ||
|
|
||
| And run the following command, which should also return 1000: | ||
|
|
||
| >>> sc.parallelize(range(1000)).count() | ||
|
|
||
| ## Example Programs | ||
|
|
||
| Spark also comes with several sample programs in the `examples` directory. | ||
| To run one of them, use `./bin/run-example <class> [params]`. For example: | ||
|
|
||
| ./bin/run-example SparkPi | ||
|
|
||
| will run the Pi example locally. | ||
|
|
||
| You can set the MASTER environment variable when running examples to submit | ||
| examples to a cluster. This can be a mesos:// or spark:// URL, | ||
| "yarn" to run on YARN, and "local" to run | ||
| locally with one thread, or "local[N]" to run locally with N threads. You | ||
| can also use an abbreviated class name if the class is in the `examples` | ||
| package. For instance: | ||
|
|
||
| MASTER=spark://host:7077 ./bin/run-example SparkPi | ||
|
|
||
| Many of the example programs print usage help if no params are given. | ||
|
|
||
| ## Running Tests | ||
|
|
||
| Testing first requires [building Spark](#building-spark). Once Spark is built, tests | ||
| can be run using: | ||
|
|
||
| ./dev/run-tests | ||
|
|
||
| Please see the guidance on how to | ||
| [run tests for a module, or individual tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools). | ||
|
|
||
| ## A Note About Hadoop Versions | ||
|
|
||
| Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported | ||
| storage systems. Because the protocols have changed in different versions of | ||
| Hadoop, you must build Spark against the same version that your cluster runs. | ||
|
|
||
| Please refer to the build documentation at | ||
| ["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) | ||
| for detailed guidance on building for a particular distribution of Hadoop, including | ||
| building for particular Hive and Hive Thriftserver distributions. | ||
|
|
||
| ## Configuration | ||
|
|
||
| Please refer to the [Configuration Guide](http://spark.apache.org/docs/latest/configuration.html) | ||
| in the online documentation for an overview on how to configure Spark. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -9,7 +9,7 @@ curr_dir=`cd $curr_dir; pwd` | |
| # Default SPARK_HOME location is already checked by init_spark.sh | ||
| spark_home=${SPARK_HOME:='/opt/spark'} | ||
| if [ ! -d "$spark_home" ] ; then | ||
| >&2 echo "fail - $spark_home does not exist, please check you Spark installation or SPARK_HOME env variable, exinting!" | ||
| >&2 echo "fail - $spark_home does not exist, please check you Spark installation or SPARK_HOME env variable, exiting!" | ||
| exit -2 | ||
| else | ||
| echo "ok - applying Spark home $spark_home" | ||
|
|
@@ -43,7 +43,9 @@ fi | |
| pushd `pwd` | ||
| cd $spark_home | ||
| hdfs dfs -mkdir -p spark/test/ | ||
| hdfs dfs -put $spark_home/README.md spark/test/ | ||
|
|
||
| # Including spark README.md in test_data to differentiate from sparkexample README.md | ||
| hdfs dfs -put "$spark_test_dir/test_data/README.md" spark/test/ | ||
|
Contributor
I notice that README.md exists in both https://github.com/Altiscale/sparkexample/blob/branch-2.3.2-alti/README.md and https://github.com/Altiscale/spark/blob/branch-2.3.2-alti/README.md, although the one in sparkexample is nearly empty, with a single line of content. The subject is also misleading: is this really enabling the test? I thought this test case always runs; it is enabled all the time: https://github.com/Altiscale/sparkexample/blob/branch-2.3.2-alti/run_all_test.kerberos.sh#L11
Author
Yep. The README.md file is present in both repositories. This has more to do with the
||
|
|
||
| # Leverage a simple use case here | ||
| hdfs dfs -put "$spark_test_dir/src/main/resources/spam_sample.txt" spark/test/ | ||
|
|
||
Will removing this break the test suite on existing clusters?
As far as I know, test suites are often run after maintenance to verify the status of a cluster.
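If the concern is the replaced `hdfs dfs -put $spark_home/README.md spark/test/` line, one possible way to keep existing clusters working is to fall back to the old location when the bundled copy is missing. This is only a sketch: `pick_readme` is a hypothetical helper that does not exist in this PR, and it assumes `$spark_test_dir` and `$spark_home` are set as in the script above.

```shell
# Hypothetical helper, not part of this PR: choose which README.md to upload.
# $1 = test suite directory (assumed to be $spark_test_dir)
# $2 = Spark home directory (assumed to be $spark_home)
# Prefers the copy bundled under test_data and falls back to the one in
# Spark home, so layouts without test_data/README.md keep working.
pick_readme() {
  if [ -f "$1/test_data/README.md" ]; then
    printf '%s\n' "$1/test_data/README.md"
  elif [ -f "$2/README.md" ]; then
    printf '%s\n' "$2/README.md"
  else
    >&2 echo "fail - no README.md found under $1/test_data or $2"
    return 1
  fi
}

# Intended use in the test script (commented out because it needs a live cluster):
# readme=$(pick_readme "$spark_test_dir" "$spark_home") || exit 1
# hdfs dfs -put "$readme" spark/test/
```

Whether the fallback is wanted depends on whether the test is supposed to read the Spark README specifically; if it is, failing loudly when `test_data/README.md` is missing may be the better choice.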