This repository was archived by the owner on Jun 19, 2025. It is now read-only.

Multi-core support for bquery #17

@ARF1

Description


After missing a groupby for bcolz for some time, I was excited to find this interesting project. To get started, I looked at the unique method.

I found some interesting timing results with a 12-character string column in my database:

import blaze
import bquery
import bcolz
from multiprocessing import Pool
p = Pool()

# bquery's built-in unique()
db = bquery.open(...)

%%timeit
db.unique('my_col')
--> 5.5 sec (uses only one core)

# blaze on top of the same bcolz data
db = bcolz.open(...)
d = blaze.Data(db)

%%timeit
blaze.compute(d['my_col'].distinct(), map=p.map)
--> 3.32 sec (using 2 cores on my dual-core machine)

%%timeit
blaze.compute(d['my_col'].distinct())
--> 7.69 sec (using only one core)

It appears that parallel processing with blaze provides a fairly significant speedup on my data, even given blaze's inherent overhead. Would it be conceivable to parallelise the bquery code?
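For illustration, here is a minimal sketch of the kind of chunk-wise parallelism I have in mind, built directly on bcolz and multiprocessing rather than on bquery's internals. ROOTDIR, COLUMN, N_WORKERS and the helper functions are placeholders, not part of the bquery API: each worker re-opens the on-disk column read-only, computes the distinct values of its slice, and the parent unions the partial results.

import numpy as np
import bcolz
from multiprocessing import Pool

ROOTDIR = '/path/to/my_table'   # hypothetical on-disk ctable location
COLUMN = 'my_col'
N_WORKERS = 2

def unique_slice(bounds):
    # Each worker re-opens the on-disk column read-only and computes
    # the distinct values of its slice of rows.
    start, stop = bounds
    col = bcolz.open(ROOTDIR, mode='r')[COLUMN]
    return np.unique(col[start:stop])

def parallel_unique():
    col = bcolz.open(ROOTDIR, mode='r')[COLUMN]
    n = len(col)
    step = (n + N_WORKERS - 1) // N_WORKERS
    bounds = [(i, min(i + step, n)) for i in range(0, n, step)]
    with Pool(N_WORKERS) as pool:
        partials = pool.map(unique_slice, bounds)
    # The union of the per-slice results is the global set of unique values.
    return np.unique(np.concatenate(partials))

if __name__ == '__main__':
    print(parallel_unique())

Slicing a bcolz carray returns a plain numpy array, so the per-slice work is just np.unique; whether this scales in practice would of course depend on chunk sizes and the cost of decompressing in each worker.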
