Skip to content

Improve SPARQL query optimization by using selectivity #70

@jindrichmynarz

Description

@jindrichmynarz

SPARQL query optimization in Halyard is based on cardinalities pre-computed by Halyard Stats. A frequent case in SPARQL queries is to encounter triple patterns with similar cardinality, but drastically different selectivity (see e.g., here for a description of selectivity). Selectivity can be considered as the ratio of distinct predicate values to all values (i.e. (COUNT(DISTINCT ?val)/COUNT(?val)).

In some cases the optimizer can prioritize triple patterns that have low cardinality but are unselective. For example, in Wikidata the properties that link Wikidata entities with Wikipedia pages have similar cardinality. See the Halyard Stats for schema:about, schema:inLanguage, and schema:isPartOf:

PREFIX halyard: <http://merck.github.io/Halyard/ns#>
PREFIX schema:  <http://schema.org/>
PREFIX void:    <http://rdfs.org/ns/void#>

SELECT ?property ?cardinality
WHERE {
  GRAPH halyard:statsContext {
    VALUES ?property {
      schema:about
      schema:inLanguage
      schema:isPartOf
    }
    [] void:propertyPartition [
        void:property ?property ;
        void:triples ?cardinality
      ] .
  }
}
property cardinality
schema:about "126270886"^^xsd:long
schema:inLanguage "67986056"^^xsd:long
schema:isPartOf "67986056"^^xsd:long

schema:about has the highest cardinality, but it has vastly better object selectivity than schema:inLanguage or schema:isPartOf, which have only few distinct values. Given that the object of schema:about is known, the query optimizer would produce a better plan if it gave it a priority. While the object selectivity of schema:about is high, since most of its objects are unique (i.e. Wikidata entities), the object selectivity of schema:inLanguage is low, since it has very few unique objects (i.e. languages of Wikipedias).

Adding selectivity to Halyard Stats can mean adding 2 numbers to each partition, e.g., for property partitions it's selectivity with respect to objects and selectivity with respect to subjects. These can be then used by the query optimizer to produce better query plans.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions