Query transformer infrastructure & example query transformer implementations #29

ARF1 · 2015-03-21T10:27:42Z

Based on #27 but can be rebased on master.

This introduces the infrastructure for plug-in query transformers. Included are three sample query transformers:

InOperatorTransformer: my_col in ['ABC', 'DEF'] is transformed into (my_col == 'ABC') | (my_col == 'DEF'). The operation not in is similarly transformed.
TrivialBooleanExpressionsOptimizer: (my_col == 'ABC') & (False) is transformed into False (limited usefulness without an intelligent query optimizer)
CachedFactorOptimizer: converts comparisons containing columns with cached factors into comparisons using the factor instead. (Naive implementation, currently only useful for edge-cases.)
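
As an illustration of the kind of rewrite InOperatorTransformer performs, here is a minimal sketch built on Python's ast module. It is not the code from this PR: the class name InToOrChain and the helper expand_in are hypothetical, the handling of not in (a chain of != joined by &) is an assumption based on De Morgan's law, and ast.unparse requires Python 3.9 or later.

```python
import ast


class InToOrChain(ast.NodeTransformer):
    """Hypothetical sketch: rewrite `col in [a, b]` as `(col == a) | (col == b)`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        # Only handle a single `in` / `not in` against a non-empty literal list or tuple.
        if len(node.ops) != 1 or not isinstance(node.ops[0], (ast.In, ast.NotIn)):
            return node
        container = node.comparators[0]
        if not isinstance(container, (ast.List, ast.Tuple)) or not container.elts:
            return node
        negated = isinstance(node.ops[0], ast.NotIn)
        # Assumption: `not in` becomes `!=` terms joined by `&`.
        cmp_op = ast.NotEq if negated else ast.Eq
        join_op = ast.BitAnd if negated else ast.BitOr
        terms = [
            ast.Compare(left=node.left, ops=[cmp_op()], comparators=[elt])
            for elt in container.elts
        ]
        # Fold the terms left to right into one chained expression.
        result = terms[0]
        for term in terms[1:]:
            result = ast.BinOp(left=result, op=join_op(), right=term)
        return result


def expand_in(query):
    """Return the query string with `in` / `not in` expanded (hypothetical helper)."""
    tree = ast.parse(query, mode="eval")
    tree = ast.fix_missing_locations(InToOrChain().visit(tree))
    return ast.unparse(tree)


if __name__ == "__main__":
    print(expand_in("my_col in ['ABC', 'DEF']"))
    print(expand_in("my_col not in ['ABC', 'DEF']"))
```

Note that the number of generated terms grows linearly with the length of the list, so very long in lists produce correspondingly long expressions.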

By default this PR does not change the behaviour or dependencies of bquery. Query transformers have to be explicitly enabled by configuring them, e.g.:

from transformers import InOperatorTransformer, TrivialBooleanExpressionsOptimizer
b.transformers = [InOperatorTransformer(), TrivialBooleanExpressionsOptimizer()]

For convenience, transformers.standard_transformers provides a shortcut for these (currently) most useful transformers:

from transformers import standard_transformers
b.transformers = standard_transformers
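
To illustrate how an ordered list of transformers might be applied to a query string before it is evaluated, here is a hedged sketch. The transform method, the apply_transformers helper and the FoldTrivialBooleans class are all hypothetical names chosen for this example and are not taken from bquery. The folding rules are the usual boolean identities (x & False is False, x | False is x, x & True is x, x | True is True) and assume the other operand is boolean-like.

```python
import ast


class FoldTrivialBooleans(ast.NodeTransformer):
    """Hypothetical sketch: fold `&` / `|` expressions with a literal True/False operand."""

    @staticmethod
    def _bool_value(node):
        # Return the literal boolean value of a node, or None if it is not one.
        if isinstance(node, ast.Constant) and isinstance(node.value, bool):
            return node.value
        return None

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not isinstance(node.op, (ast.BitAnd, ast.BitOr)):
            return node
        for const, other in ((node.left, node.right), (node.right, node.left)):
            value = self._bool_value(const)
            if value is None:
                continue
            # True absorbs `|` and False absorbs `&`; otherwise the literal is neutral.
            absorbing = isinstance(node.op, ast.BitOr)
            return const if value == absorbing else other
        return node

    def transform(self, query):
        # Assumed interface: take a query string, return a rewritten query string.
        tree = ast.parse(query, mode="eval")
        return ast.unparse(ast.fix_missing_locations(self.visit(tree)))


def apply_transformers(query, transformers):
    """Apply each transformer in list order (hypothetical helper)."""
    for transformer in transformers:
        query = transformer.transform(query)
    return query


if __name__ == "__main__":
    print(apply_transformers("(my_col == 'ABC') & (False)", [FoldTrivialBooleans()]))
```

With an empty list the query passes through unchanged, which matches the statement above that nothing changes unless transformers are explicitly configured.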

The overhead for queries is negligible for reasonably sized databases: for the query db["my_col=='AB1234567890'"], bquery requires 362 ms without query transformers and 367 ms with all query transformers configured (including CachedFactorOptimizer).

With a non-compressed database the CachedFactorOptimizer shows some positive effect: 547 ms without it vs. 296 ms with it.

CarstVaartjes · 2015-09-26T22:36:04Z

Hi @ARF1

Sorry no one ever got back to you before! :(
We used to work like the InOperatorTransformer before, but with larger in statements it broke numexpr (too many ors), so we had to implement this workaround. In a short mail discussion, Francesc Alted suggested that the best thing to do was to add in/not in functionality to numexpr. But that needs some heavy C coding (not my personal forte, and my programmers are also quite overloaded at the moment). Still, it's on the to-do list, as it will greatly improve filtering (you would be able to push everything directly to numexpr).
The factorization part is a very good idea; I'll see how to automate that from the filter behaviour.

ARF added 2 commits March 19, 2015 10:21

pass kwargs to cache_factor() through to bcolz.carray()

453ec25

introduce query transformer infrastructure & sample implementations

2c68b6e

ARF1 force-pushed the query_transformations branch from 9d97104 to 2c68b6e on May 10, 2015 09:06
