performance

The following was run on a MacBook with 2 cores and 8 GB of memory with all applications closed.

```
(benchmarks)benchmarks master > ./run-benchmarks.sh 
sys:1: DtypeWarning: Columns (0,19) have mixed types. Specify dtype option on import or set low_memory=False.
pandas read csv: 11.1477160454s
pandas apply transforms: 0.938997983932s
2016-04-15 13:10:16,467 [INFO] sframe.cython.cy_server, 172: SFrame v1.8.5 started. Logging /tmp/sframe_server_1460722216.log
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,float,float,float,float,str,str,float,str,str,str,str,str,float,str,str,str,str,str,str,str,str,str,str,float,float,str,float,float,float,float,float,float,str,float,str,float,float,float,float,float,float,float,float,float,str,float,str,str,float,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Unable to parse line "Loans that do not meet the credit policy,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"
Read 89872 lines. Lines per second: 35505.4
Read 426460 lines. Lines per second: 56880.2
1 lines failed to parse correctly
Finished parsing file /Users/samuelhopkins/cp/benchmarks/data/lc_big.csv
Parsing completed. Parsed 756878 lines in 12.2473 secs.
sframe read csv: 17.1927471161s
sframe apply transforms: 16.9669880867s
node apply transforms: 466.212ms
```

As you can see, applying the operations takes about 1s in pandas, about 17s in sframe, and about 0.5s in node.js.

The performance difference between pandas and sframe is probably because, in pandas, I can use the native functions `isin` and `map`, which I am guessing are highly optimized. With sframe I am simply using `apply`, which is being supplied with pure Python functions.
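To illustrate the two styles, here is a minimal pandas-only sketch. The column names and values are hypothetical and are not taken from the benchmark code; it only shows that `isin` and `map` give the same results as an element-wise `apply` with a pure Python function, which is the pattern I am stuck with on the sframe side.

```python
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "grade": ["A", "B", "C", "A", "D"],
    "term": ["36 months", "60 months", "36 months", "36 months", "60 months"],
})

good_grades = ["A", "B"]
term_months = {"36 months": 36, "60 months": 60}

# Native pandas functions: the membership test and the lookup
# are handled inside pandas rather than by a Python callback.
is_good_fast = df["grade"].isin(good_grades)
months_fast = df["term"].map(term_months)

# Element-wise apply: one Python-level function call per row.
is_good_slow = df["grade"].apply(lambda g: g in good_grades)
months_slow = df["term"].apply(lambda t: term_months[t])

# Both styles produce identical results; only the cost differs.
assert is_good_fast.equals(is_good_slow)
assert months_fast.equals(months_slow)
```

On five rows the difference is not measurable; the gap would only be expected to show up at the scale of the benchmark file.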

However, I can confirm that sframe is using all cores, which leads me to believe that if I can perform the individual operations more efficiently, I should see better results.

Node.js is the winner so far, but its scalability and predictability are a bit limited, so we are willing to take a hit of a few milliseconds if we can get something robust that uses all cores.

I guess the question here is: 1) Did I miss something in the sframe documentation that provides functionality equivalent to pandas `map` and `isin`, and 2) if not, how can I optimize the given operations?

