-
Notifications
You must be signed in to change notification settings - Fork 19
Description
SPARQL query optimization in Halyard is based on cardinalities pre-computed by Halyard Stats. A frequent case in SPARQL queries is to encounter triple patterns with similar cardinality, but drastically different selectivity (see e.g., here for a description of selectivity). Selectivity can be considered as the ratio of distinct predicate values to all values (i.e. (COUNT(DISTINCT ?val)/COUNT(?val)).
In some cases the optimizer can prioritize triple patterns that have low cardinality but are unselective. For example, in Wikidata the properties that link Wikidata entities with Wikipedia pages have similar cardinality. See the Halyard Stats for schema:about, schema:inLanguage, and schema:isPartOf:
PREFIX halyard: <http://merck.github.io/Halyard/ns#>
PREFIX schema: <http://schema.org/>
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?property ?cardinality
WHERE {
GRAPH halyard:statsContext {
VALUES ?property {
schema:about
schema:inLanguage
schema:isPartOf
}
[] void:propertyPartition [
void:property ?property ;
void:triples ?cardinality
] .
}
}| property | cardinality |
|---|---|
| schema:about | "126270886"^^xsd:long |
| schema:inLanguage | "67986056"^^xsd:long |
| schema:isPartOf | "67986056"^^xsd:long |
schema:about has the highest cardinality, but it has vastly better object selectivity than schema:inLanguage or schema:isPartOf, which have only few distinct values. Given that the object of schema:about is known, the query optimizer would produce a better plan if it gave it a priority. While the object selectivity of schema:about is high, since most of its objects are unique (i.e. Wikidata entities), the object selectivity of schema:inLanguage is low, since it has very few unique objects (i.e. languages of Wikipedias).
Adding selectivity to Halyard Stats can mean adding 2 numbers to each partition, e.g., for property partitions it's selectivity with respect to objects and selectivity with respect to subjects. These can be then used by the query optimizer to produce better query plans.