This repository was archived by the owner on Jun 19, 2025. It is now read-only.

@ARF1 ARF1 commented Mar 21, 2015

Based on #27 but can be rebased on master.

This introduces the infrastructure for plug-in query transformers. Included are three sample query transformers:

  • InOperatorTransformer: my_col in ['ABC', 'DEF'] is transformed into (my_col == 'ABC') | (my_col == 'DEF'). The operation not in is similarly transformed.
  • TrivialBooleanExpressionsOptimizer: (my_col == 'ABC') & (False) is transformed into False (limited usefulness without an intelligent query optimizer)
  • CachedFactorOptimizer: converts comparisons containing columns with cached factors into comparisons using the factor instead. (Naive implementation, currently only useful for edge-cases.)
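
To make the first rewrite concrete, here is a minimal sketch built on Python's standard ast module (ast.unparse needs Python 3.9+). It is not the code from this PR: the class name InToOrChain and the helper rewrite_in_operator are hypothetical, and expanding not in into != terms joined by & is an assumption (De Morgan), since the description only says it is "similarly transformed".

```python
import ast


class InToOrChain(ast.NodeTransformer):
    """Hypothetical stand-in for InOperatorTransformer (not the PR's code)."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        # Only rewrite a single `in` / `not in` against a non-empty literal list or tuple.
        if len(node.ops) != 1 or not isinstance(node.ops[0], (ast.In, ast.NotIn)):
            return node
        container = node.comparators[0]
        if not isinstance(container, (ast.List, ast.Tuple)) or not container.elts:
            return node
        negated = isinstance(node.ops[0], ast.NotIn)
        # Assumption: `not in` becomes (col != a) & (col != b).
        cmp_op = ast.NotEq if negated else ast.Eq
        join_op = ast.BitAnd if negated else ast.BitOr
        terms = [
            ast.Compare(left=node.left, ops=[cmp_op()], comparators=[elt])
            for elt in container.elts
        ]
        # Fold the per-value comparisons into one left-leaning chain.
        result = terms[0]
        for term in terms[1:]:
            result = ast.BinOp(left=result, op=join_op(), right=term)
        return result


def rewrite_in_operator(query):
    """Parse a query string, expand in/not in, and return the new query string."""
    tree = ast.parse(query, mode="eval")
    tree = ast.fix_missing_locations(InToOrChain().visit(tree))
    return ast.unparse(tree)


print(rewrite_in_operator("my_col in ['ABC', 'DEF']"))
# (my_col == 'ABC') | (my_col == 'DEF')
```

Note that the chain grows by one term per list element, so a long list produces an equally long expression.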

By default this PR does not change the behaviour or dependencies of bquery. Query transformers have to be explicitly enabled by configuring them, e.g.:

from transformers import InOperatorTransformer, TrivialBooleanExpressionsOptimizer
b.transformers = [InOperatorTransformer(), TrivialBooleanExpressionsOptimizer()]

For convenience, transformers.standard_transformers provides a shortcut for these (currently) most useful transformers:

from transformers import standard_transformers
b.transformers = standard_transformers
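
As a rough illustration of what a configured list of transformers could mean, the sketch below applies each transformer to the parsed query in list order. The interface is an assumption, not this PR's API: apply_transformers and FoldTrivialBooleans are hypothetical names, and each transformer is assumed to be an ast.NodeTransformer. An empty list leaves the query's meaning untouched, which mirrors the opt-in default.

```python
import ast


class FoldTrivialBooleans(ast.NodeTransformer):
    """Hypothetical sample transformer: simplifies `&` / `|` with a literal bool."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not isinstance(node.op, (ast.BitAnd, ast.BitOr)):
            return node
        for side, other in ((node.left, node.right), (node.right, node.left)):
            if isinstance(side, ast.Constant) and isinstance(side.value, bool):
                # True absorbs `|`, False absorbs `&`.
                absorbing = isinstance(node.op, ast.BitOr)
                if side.value is absorbing:
                    return side   # x | True -> True, x & False -> False
                return other      # x | False -> x, x & True -> x
        return node


def apply_transformers(query, transformers):
    """Run every transformer over the query, in the order given (assumed semantics)."""
    tree = ast.parse(query, mode="eval")
    for transformer in transformers:
        tree = transformer.visit(tree)
    return ast.unparse(ast.fix_missing_locations(tree))


print(apply_transformers("(my_col == 'ABC') & (False)", [FoldTrivialBooleans()]))
# False
```

Because the transformers run in sequence, order can matter: a rewrite that produces literal True/False terms would have to come before the folding step for the folding to see them.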

The overhead for queries is negligible for reasonably sized databases: for the query db["my_col=='AB1234567890'"], bquery requires 362 ms without query transformers and 367 ms with all query transformers configured (including CachedFactorOptimizer).

With a non-compressed database, the CachedFactorOptimizer shows some positive effect: 547 ms without it vs. 296 ms with it.

@ARF1 ARF1 force-pushed the query_transformations branch from 9d97104 to 2c68b6e Compare May 10, 2015 09:06
@CarstVaartjes
Member

Hi @ARF1

Sorry no one ever got back to you before! :(
We used to work the way InOperatorTransformer does, but with larger in statements it broke numexpr (too many ORs), so we had to implement the current workaround. In a short mail discussion, Francesc Alted suggested that the best thing to do would be to add in/not in functionality to numexpr. But that needs some heavy C coding (not my personal forte, and my programmers are also quite overloaded at the moment). Still, it's on the to-do list, as it would greatly improve filtering (you would be able to push everything directly to numexpr).
The factorization part is a very good idea; I'll see how to automate that based on the filter behaviour.
