This repo implements the same MapReduce ETL (Extract-Transform-Load) task in multiple languages in an effort to compare language productivity, terseness, and readability. The performance comparisons should not be taken seriously. If anything, they are a bigger indication of my skill in each language than of the languages' performance capabilities. Nonetheless, they are here and reflect what I would realistically face.
Count the number of tweets that mention 'knicks' in their message and bucket based on the neighborhood of origin. The ~1GB dataset for this task, sampled below, contains a tweet's message and its NYC neighborhood. It can be downloaded here.
91 west-brighton Brooklyn Uhhh
121 turtle-bay-east-midtown Manhattan Say anything
175 morningside-heights Manhattan It feels half-cheating half-fulfilling to cite myself.
- These tasks are not run on Hadoop but do run concurrently. Performance numbers are moot since the CPU mostly sits idle waiting on Disk IO.
- **UPDATE:** Boy, was the IO-bound assumption wrong.
- Ruby 2.1.0 with Celluloid - Exposes the GIL limitation in pure ruby and shows the multicore advantage of JRuby.
- Ruby 2.1.0 and GNU Parallel - Uses GNU parallel to run ruby processes on multiple cores.
- Golang 1.2 - Imperative
- Scala 2.10.4 - Both Imperative and Functional
- Elixir 1.0 - Functional
$ ./run_ruby
- Celluloid Actor Pool
- Performance is very respectable considering the GIL: 1m15.243s
- Performance is great when run on JRuby, which uses all available cores: 0m41.268s
$ ./run_ruby_parallel
This is effectively:
$ parallel -j 90% -a commands.txt && ruby reducer.rb
- GNU Parallel to get around the GIL and more accurately mirror a real-world scenario: many single-core workers, each running one Ruby process (e.g. Heroku dynos)
- Performance is excellent, with all cores on full blast: 40s.
- This implementation is cheating in some areas but serves as a good baseline for other comparisons.
- Separate processes can be a maintenance nightmare. They lead to memory bloat, make failed processes difficult to coordinate, and can be difficult to deploy and scale. There is simplicity in being able to deploy one process that is capable of using all cores.
- From experience, Ruby's real weakness is its poor performance handling long-running jobs. Memory leaks run rampant. Twitter shared this opinion.
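The reduce step that runs after the parallel workers finish can be pictured as a merge of partial counts. The sketch below is written in Go only to keep the added examples in one language; it is not a translation of `reducer.rb`, and it assumes each worker produces a map of neighborhood to count. `mergeCounts` and the sample values are hypothetical.

```go
package main

import "fmt"

// mergeCounts folds per-worker partial counts into one total per neighborhood.
func mergeCounts(partials ...map[string]int) map[string]int {
	total := make(map[string]int)
	for _, p := range partials {
		for neighborhood, n := range p {
			total[neighborhood] += n
		}
	}
	return total
}

func main() {
	// Hypothetical partial results from two workers.
	a := map[string]int{"chelsea": 2, "harlem": 1}
	b := map[string]int{"chelsea": 3}
	fmt.Println(mergeCounts(a, b))
}
```

Because addition is associative and commutative, the order in which workers finish does not affect the totals, which is what lets the map step run as independent processes.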
$ ./run_go
- goroutines
- channels
- selects
- Performance after first write with no optimizations: 3m23.165s. Was only using one core!
- Performance average after using all cores by manually setting `GOMAXPROCS`: 1m03.593s
- Had to research why all cores weren't used here.
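- At the time, the expectation was that `GOMAXPROCS` would eventually go away and scheduling would automatically make use of all cores. Since Go 1.5 the setting still exists but defaults to the number of available CPUs.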
- Golang's libraries are fantastic but don't have the mature optimizations of other languages (yet).
- Ended up being the fewest lines of code across all languages, by a lot.
- Golang is not functional, so don't force functional programming concepts, like map and reduce. For loops for days...
- Handling goroutines with channels and `select`.

      for _ = range inputFiles {
          select {
          case <-channel:
              fmt.Println("Finished mapping.")
          }
      }

- Iterating over a `map` with `range`.
- Using `defer` for cleanup of file resources.
- Command-line debugger (but I didn't need it).
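A fuller picture of the goroutine-and-channel pattern is sketched below: one goroutine per input file sends its result on a shared channel, and the caller receives once per file. This is an illustration under assumptions, not the repo's code: `mapFile` is a stand-in for the real per-file work, and `mapAll` is a hypothetical name. A `select` with a single case behaves like a plain receive, so the sketch uses the plain form.

```go
package main

import (
	"fmt"
	"strings"
)

// mapFile stands in for the real per-file map step (hypothetical helper).
func mapFile(name string) string {
	return strings.ToUpper(name)
}

// mapAll starts one goroutine per input and waits for every result.
// Results arrive in completion order, not input order.
func mapAll(inputFiles []string) []string {
	done := make(chan string)
	for _, f := range inputFiles {
		// Pass f as an argument so each goroutine gets its own copy.
		go func(f string) { done <- mapFile(f) }(f)
	}
	results := make([]string, 0, len(inputFiles))
	for range inputFiles {
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(len(mapAll([]string{"a.txt", "b.txt", "c.txt"})), "files mapped")
}
```

Receiving exactly as many times as goroutines were started is what guarantees the function does not return before all of them finish.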
- Verbose error handling. There are design patterns to better manage errors, but were skipped for this demo.

      files, err := ioutil.ReadDir(inputDir)
      if err != nil {
          panic(err)
      }

- Having to explicitly set the number of cores to use via `GOMAXPROCS` because of immature scheduling.
- Lack of collection helpers like `map` and `reduce`.
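On the last point: the Go 1.2 used here had no way to write type-safe collection helpers, but generics (added in Go 1.18) make it possible to roll your own. The sketch below is not part of this repo and needs Go 1.18 or later; `Map` and `Reduce` are hand-written helpers, not standard library functions.

```go
package main

import "fmt"

// Map applies f to every element of xs and returns the results in order.
func Map[T, U any](xs []T, f func(T) U) []U {
	out := make([]U, 0, len(xs))
	for _, x := range xs {
		out = append(out, f(x))
	}
	return out
}

// Reduce folds xs into a single value, starting from init.
func Reduce[T, A any](xs []T, init A, f func(A, T) A) A {
	acc := init
	for _, x := range xs {
		acc = f(acc, x)
	}
	return acc
}

func main() {
	lengths := Map([]string{"go", "scala", "elixir"}, func(s string) int { return len(s) })
	total := Reduce(lengths, 0, func(acc, n int) int { return acc + n })
	fmt.Println(lengths, total)
}
```

Even with helpers like these available, a plain `for` loop remains the idiomatic choice in most Go code.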
$ ./run_scala
- Akka (Supervisors and Actors)
- Performance after first write on first run: 50s
- Performance on subsequent runs: 27s. The JVM is probably doing something fancy.
- All cores used.
- Not as IO bound as originally thought. Attributed to the optimizations in the BufferedSource/BufferedWriter classes.
- Witnessing the speed after the first write.
- Seeing BDD style testing as default for ScalaTest.
- Using `!`, `?`, and `receive` to handle messages in the Actor system.

      def map(inputDir: String, outputDir: String) = {
        val system = ActorSystem("MapSystem")
        val mapSupervisor = system.actorOf(Props[MapSupervisor], "mapsupervisor")
        val future = mapSupervisor ? ProcessDirectoryMessage(inputDir, outputDir)
        Await.result(future, Duration.Inf)
        system.shutdown
      }

- `sbt run` and `sbt test` work well, especially for fetching dependencies.
- Realizing the power of Akka and Akka Cluster.
- Inability to debug via the command line.
- Having to set implicit variables: `implicit val timeout = Timeout(5 minutes)`.
- Having to use Java libraries for File IO.
$ ./run_elixir
- Streams
- pipeline operators
- PIDs
- All the Erlang and Elixir goodness
- Performance average after first write with `:delayed_write`: 55.964s.
- This number says less about Elixir's performance and more about how much I suck at writing Elixir code. Ease of writing performant code though is a valid factor.
- Extremely productive language once one knows the class methods.
- Clearly designed for use with a text editor and the command-line (It's great).
- The Elixir docs are usually the sole source of information; thankfully, they are pretty good.
- Using Interactive Elixir (`iex`) and Mix is fantastic. Preferable to `sbt console`.
- Matching on assignment: `{:ok, result} = {:ok, 5}`.
- Functional style coupled with pipeline operators and anonymous methods makes for some beautiful code.
- `Stream.into` allows manipulation of infinite collections in a terse manner.

      output = File.stream!(output_file, [:delayed_write])
      stream = File.stream!(input_file)
        |> Stream.into(output, fn line -> map_line(line) end)
      Stream.run(stream)
- The lack of objects is initially infuriating. Hard to encapsulate logic, and structs don't seem like a substitute. It effectively means that most if not all built-in methods only return primitive types as opposed to objects.
- Lack of online resources because of small community. Few Stack Overflow posts, etc.
- Discoverability is tricky since methods are all class methods on primitive types.
- Only after returning to a functional language like Elixir do I realize the convenience of Object Oriented meets Functional in Scala.
- The ability to return an object with relevant methods while still being immutable adds the power of discoverability, an advantage over the manipulation of maps and other primitives with Class methods.
- The big surprise was JRuby's performance and the impact of being able to use all cores. Running Puma on JRuby is very compelling when using a system with multiple cores.
- Golang's simplicity is very refreshing, and its built-in profiling contributes to a philosophy of hand-tuning code for the best performance.
- Scala, on the other hand, has the user well removed from the low level, but the JVM handles a lot of optimizations for the programmer, and it shows. If only I didn't need an IDE...
For ETL operations, it would be remiss to ignore the Hadoop and Java ecosystem. Scala provides an incredible toolset for all ETL operations.