performance #1

@hershaw

The following was run on a MacBook with 2 cores and 8 GB of memory, with all other applications closed.

(benchmarks)benchmarks master > ./run-benchmarks.sh 
sys:1: DtypeWarning: Columns (0,19) have mixed types. Specify dtype option on import or set low_memory=False.
pandas read csv: 11.1477160454s
pandas apply transforms: 0.938997983932s
2016-04-15 13:10:16,467 [INFO] sframe.cython.cy_server, 172: SFrame v1.8.5 started. Logging /tmp/sframe_server_1460722216.log
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,float,float,float,float,str,str,float,str,str,str,str,str,float,str,str,str,str,str,str,str,str,str,str,float,float,str,float,float,float,float,float,float,str,float,str,float,float,float,float,float,float,float,float,float,str,float,str,str,float,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Unable to parse line "Loans that do not meet the credit policy,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"
Read 89872 lines. Lines per second: 35505.4
Read 426460 lines. Lines per second: 56880.2
1 lines failed to parse correctly
Finished parsing file /Users/samuelhopkins/cp/benchmarks/data/lc_big.csv
Parsing completed. Parsed 756878 lines in 12.2473 secs.
sframe read csv: 17.1927471161s
sframe apply transforms: 16.9669880867s
node apply transforms: 466.212ms

As you can see, applying the operations takes about 1s in pandas, about 17s in sframe, and about 0.5s in node.js.

The performance difference between pandas and sframe is probably because in pandas I can use the native functions isin and map, which I am guessing are highly optimized, while with sframe I am simply using apply, which is supplied with pure Python functions.
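A minimal sketch of the two styles being compared, for readers who have not seen the benchmark code. The column names, values, and lookup tables below are hypothetical and are not taken from lc_big.csv or from the benchmark's actual transforms; only `isin`, `map`, and `apply` are real pandas calls.

```python
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "grade": ["A", "B", "C", "A", "D"],
    "term": ["36 months", "60 months", "36 months", "36 months", "60 months"],
})

keep_grades = ["A", "B"]                              # hypothetical filter values
term_to_months = {"36 months": 36, "60 months": 60}   # hypothetical mapping

# Built-in pandas operations: the per-element work is done inside pandas.
mask_builtin = df["grade"].isin(keep_grades)
months_builtin = df["term"].map(term_to_months)

# The same logic as pure Python callables invoked once per element,
# which is the style the issue describes using with sframe's apply.
mask_apply = df["grade"].apply(lambda g: g in keep_grades)
months_apply = df["term"].apply(lambda t: term_to_months[t])
```

Both styles give the same result; the difference is only in where the per-element loop runs, which is the suspected source of the gap in the timings above.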

However, I can confirm that sframe is using all cores, which leads me to believe that if I can perform the individual operations more efficiently, I should see better results.

Node.js is the winner so far, but its scalability and predictability are a bit limited, so we are willing to take a hit of a few milliseconds if we can get something robust that uses all cores.

I guess the question here is: 1) Did I miss something in the documentation for sframe that provides functionality equivalent to pandas map and isin, and 2) If not, how can I optimize the given operations?
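As one possible direction for question 2, here is a sketch of a generic way to cut per-element overhead in pure Python callables: build the lookup structures once, outside the function, and pass the containers' own bound methods instead of a lambda. The data and names are hypothetical, the built-in `map` stands in for a per-element apply, and whether sframe's apply accepts such callables unchanged is an assumption that has not been checked here.

```python
# Hypothetical lookup tables, built once rather than inside the per-row function.
KEEP_GRADES = frozenset(["A", "B"])
TERM_TO_MONTHS = {"36 months": 36, "60 months": 60}

# Bound methods of built-in containers are implemented in C, so calling them
# directly skips one Python-level function call per element.
is_kept = KEEP_GRADES.__contains__
to_months = TERM_TO_MONTHS.get  # returns None for keys that are not in the dict

grades = ["A", "B", "C", "A", "D"]
terms = ["36 months", "60 months", "36 months", "unknown"]

# Stand-in for a per-element apply: each callable is invoked once per value.
kept = list(map(is_kept, grades))
months = list(map(to_months, terms))

print(kept)    # [True, True, False, True, False]
print(months)  # [36, 60, 36, None]
```

This only trims the cost of each call; it does not remove the per-element round trip through Python, so it is unlikely to close the whole gap with the pandas built-ins.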
