This repository was archived by the owner on Jun 19, 2025. It is now read-only.
After missing groupby support in bcolz for some time, I was excited to find this interesting project. To get started, I looked at the `unique` method.
I found some interesting timing results with a 12-character string column in my database:
```python
import blaze
import bquery
import bcolz
from multiprocessing import Pool

p = Pool()

# bquery's built-in unique
db = bquery.open(...)
%%timeit
db.unique('my_col')
# --> 5.5 s (uses only one core)

# blaze on top of bcolz
db = bcolz.open(...)
d = blaze.Data(db)
%%timeit
blaze.compute(d['my_col'].distinct(), map=p.map)
# --> 3.32 s (using 2 cores on my dual-core machine)

%%timeit
blaze.compute(d['my_col'].distinct())
# --> 7.69 s (using only one core)
```
Even accounting for blaze's inherent overhead, parallel processing gives a fairly significant speedup on my database. Would it be feasible to parallelise the bquery code in a similar way?
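For what it's worth, `unique` looks like a natural fit for a map-reduce over the column's chunks: compute the distinct values of each chunk independently (in parallel), then union the partial results. Below is a minimal sketch of that pattern using plain Python lists as stand-ins for bcolz carray chunks; `chunk_unique` and `parallel_unique` are hypothetical names for illustration, not bquery API:

```python
from multiprocessing import Pool

def chunk_unique(chunk):
    # Per-chunk distinct values: the embarrassingly parallel "map" step.
    return set(chunk)

def parallel_unique(chunks, map_fn=map):
    # "Reduce" step: union the partial sets. map_fn can be the builtin
    # map (serial) or a Pool.map (parallel).
    partials = list(map_fn(chunk_unique, chunks))
    return set().union(*partials)

if __name__ == "__main__":
    # Stand-in for a chunked string column; a real bcolz carray is
    # stored in compressed chunks, so the same pattern should apply.
    chunks = [["a", "b", "a"], ["b", "c"], ["c", "a", "d"]]
    with Pool(2) as pool:
        print(sorted(parallel_unique(chunks, pool.map)))  # -> ['a', 'b', 'c', 'd']
```

The per-chunk step needs no coordination, so the speedup should scale with cores until the final union (which is proportional to the number of distinct values, typically small) dominates.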