Replay Analysis tools for HSReplay.net
Tools for doing large-scale analysis on hsreplay.net data.
Replay analysis jobs are written using Yelp's MRJob library to process replays at scale via MapReduce on EMR. Data scientists can easily develop jobs locally and then submit a request to a member of the HearthSim team to have them run at scale on a production MapReduce cluster.
Check out chess_brawl.py for an example of how to write a job that uses a `hearthstone.hslog.export.EntityTreeExporter` subclass to run an analysis against the replay XML files.
Jobs that follow this template have several things in common (see the sketch after this list):
- They use the `mapred.protocols.BaseS3Protocol` base class to abstract away the raw storage details and implement directly against the `hsreplay.document.HSReplayDocument` class.
- They implement a subclass of `EntityTreeExporter` and use the exposed hooks to capture whatever event data the job is focused on analyzing.
- They usually emit their final output as aggregates in a CSV-like format, so that final chart generation and analysis can be done in interactive visual tools like Excel.
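Below is a minimal, illustrative sketch of such a job. The job and exporter names are invented for the example; it assumes `BaseS3Protocol` hands the mapper one `HSReplayDocument` per input line, and the specific hook and parsing calls (`handle_tag_change`, `to_packet_tree()`) should be verified against the library versions pinned in requirements.txt:

```python
from mrjob.job import MRJob

from hearthstone.enums import GameTag
from hearthstone.hslog.export import EntityTreeExporter
from mapred.protocols import BaseS3Protocol


class TurnCountExporter(EntityTreeExporter):
    """Hypothetical exporter that records the highest TURN value seen."""

    def __init__(self, packet_tree):
        super().__init__(packet_tree)
        self.turns = 0

    def handle_tag_change(self, packet):
        # Let the base class keep the entity tree up to date, then
        # capture the event data this job cares about.
        super().handle_tag_change(packet)
        if packet.tag == GameTag.TURN:
            self.turns = max(self.turns, packet.value)


class GameLengthJob(MRJob):
    # BaseS3Protocol hides whether the replay came from local disk or S3.
    INPUT_PROTOCOL = BaseS3Protocol

    def mapper(self, key, replay):
        # A single HSReplayDocument may contain several games, each
        # represented by its own packet tree.
        for packet_tree in replay.to_packet_tree():
            exporter = TurnCountExporter(packet_tree)
            exporter.export()
            yield exporter.turns, 1

    def reducer(self, turns, counts):
        # Aggregate into (turn_count, games) pairs -- trivial to dump
        # to CSV for charting in Excel.
        yield turns, sum(counts)


if __name__ == "__main__":
    GameLengthJob.run()
```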
To run a job you must first make sure you have the libraries listed in requirements.txt installed:
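$ pip install -r requirements.txt

Then the command to invoke a job is: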
$ python <JOB_NAME>.py <INPUT_FILE.TXT>
The INPUT_FILE.TXT must be the path to a text file on the local file system containing newline-delimited entries, where each line follows the format <STORAGE_LOCATION>:<FILE_PATH>. If STORAGE_LOCATION is the string `local`, then the job will look for the file on the local file system. If it is any other value, like `hsreplaynet-replays`, then it assumes that the file is stored in an S3 bucket with that name.
Let's assume that your job script is named `my_job.py` and your input file is named `inputs.txt` and looks as follows:
local:uploads/2016/09/ex1_replay.xml
local:uploads/2016/09/ex2_replay.xml
local:uploads/2016/09/ex3_replay.xml
The `BaseS3Protocol` will then look for those files in the `./uploads` directory, which it expects to find relative to the directory from which you invoked the script.
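For the example above, your working directory would therefore look like this:

```
my_job.py
inputs.txt
uploads/
    2016/
        09/
            ex1_replay.xml
            ex2_replay.xml
            ex3_replay.xml
```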
Once the test data is prepared, the job can be run by invoking:
$ python my_job.py inputs.txt
This will run the job entirely in a single process, which makes it easy to attach a debugger or employ any other traditional development practice. In addition, one of the benefits of using MapReduce is that the isolated nature of map() and reduce() functions makes them easy to unit test.
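For example, the reducer of the hypothetical GameLengthJob sketched earlier can be exercised in isolation, with no cluster or replay data involved (this assumes the sketch was saved as my_job.py):

```python
import unittest

from my_job import GameLengthJob


class GameLengthJobTest(unittest.TestCase):
    def test_reducer_sums_game_counts(self):
        job = GameLengthJob()
        # Three games that each lasted 12 turns aggregate to (12, 3).
        results = list(job.reducer(12, iter([1, 1, 1])))
        self.assertEqual(results, [(12, 3)])


if __name__ == "__main__":
    unittest.main()
```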
When your job is ready, have a member of the HearthSim team run it on the production data set. There are a few small changes necessary to make the job run on EMR.
- You must replace the `<STORAGE_LOCATION>` in `inputs.txt` with the name of the raw log data bucket, usually `hsreplaynet-replays`, so that it looks like:
hsreplaynet-replays:uploads/2016/09/ex1_replay.xml
hsreplaynet-replays:uploads/2016/09/ex2_replay.xml
hsreplaynet-replays:uploads/2016/09/ex3_replay.xml
Since you likely want to run it on a larger set of inputs, you can ask a member of the HearthSim team to help you generate a larger input file by telling them the type of replays that you'd like to run the job over.
- You must run `./package_libraries.sh` to generate a zip of the libraries in this repo so that they get shipped up to the MapReduce cluster.
- When the HearthSim team member invokes the job, they will do so from a machine in the data center that is configured with the correct AWS credentials in the environment. They will also use the `-r emr` option to tell MRJob to use EMR, e.g.:
$ python my_job.py -r emr inputs.txt
And that's it! MRJob will automatically provision an Elastic MapReduce cluster, whose size can be tuned by a HearthSim team member by editing mrjob.conf prior to launching the job. When the job is done, MRJob will either stream the results back to the console or save them in S3, and then tear down the EMR cluster.
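For reference, cluster sizing lives under the emr runner section of mrjob.conf. A hypothetical snippet is below; the option names shown are from recent mrjob releases and may differ in older ones, so check the mrjob documentation matching the version in requirements.txt:

```yaml
runners:
  emr:
    region: us-east-1          # AWS region the cluster launches in
    instance_type: m5.xlarge   # EC2 instance type for the nodes
    num_core_instances: 8      # scale this up for larger input sets
```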
Happy Questing, Adventurer!
When working on the data processing infrastructure, it is possible to pay the cost of bootstrapping the cluster only once by first running this command:
$ mrjob create-cluster --conf-path mrjob.conf
This will create a cluster that remains active until it has been idle for a full hour, at which point it shuts itself down. The command will return a cluster ID that looks like `j-1CSVCLY28T3EY`.
Then, when invoking subsequent jobs, the additional `--cluster-id <ID>` option can be used to have the job run on the already provisioned cluster, e.g.:
$ python my_job.py -r emr --conf-path mrjob.conf --cluster-id j-1CSVCLY28T3EY inputs.txt
Copyright © HearthSim - All Rights Reserved
With the exception of the job scripts under /contrib, which are licensed under the MIT license. The full license text
is available in the /contrib/LICENSE file.
This is a HearthSim project. All development
happens on our IRC channel #hearthsim on Freenode.