Description
The current incarnation of `/autocomplete` is biased towards prefix matches: if a query is both a full match for one entry and a prefix of many others, the full match gets "crowded out" by the prefix entries. This means that queries like `teg` or `berg` will have Tegucigalpa and Bergen respectively in the top 5, but `rome` will find a bunch of cities that are not Rome, and one has to write e.g. `rome italy`, `rome lazio` or `rome roma` to get better hits. We can help with other heuristics of course, like population, but I think that before that is done, we should have a way of comparing the performance of queries, to quickly show that any improvement is not just biased towards whatever a manual tester was thinking of:
- We should always test against a significant sample of the db, e.g. a `select ... tablesample bernoulli(10)`.
- We should see how a given query fares for a given city given small/average partial matches, and full matches.
- A query "wins" if, for a large percentage of cities, partial/full searches show the city in the top 5 or top 10 (can't be 100% effective: cities with unfortunate prefixes like `san` or `santa` or `the` just can't have significant hits until the full name is provided; even `rome` is ill-fated due to it being present in so many other cities as a prefix/part-of-name).
- A query loses a bit of luster if it takes close to 100ms or more to return, so average execution time should also be noted.
- This should produce a textual report of sorts, so we can keep a small archive/knowledge base of what queries are better at what use cases.
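The criteria above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`make_queries`, `evaluate`, `format_report`), the prefix lengths used for "short" and "average" partial matches, and the `search(q, k)` callable are all hypothetical, and `toy_search` is an in-memory stand-in rather than a real database query.

```python
import time
from typing import Callable, Dict, List, Sequence

KINDS = ("short", "average", "full")


def make_queries(city: str) -> Dict[str, str]:
    """Derive a short prefix, a mid-length prefix and the full name.

    The cut-off lengths (3 chars, roughly half the name) are assumptions.
    """
    name = city.lower()
    return {
        "short": name[:3],
        "average": name[: max(3, len(name) // 2 + 1)],
        "full": name,
    }


def evaluate(search: Callable[[str, int], List[str]],
             cities: Sequence[str], k: int = 5) -> Dict[str, float]:
    """Return the top-k hit rate (in %) per query kind plus mean latency."""
    hits: Dict[str, List[bool]] = {kind: [] for kind in KINDS}
    timings_ms: List[float] = []
    for city in cities:
        for kind, q in make_queries(city).items():
            start = time.perf_counter()
            results = search(q, k)
            timings_ms.append((time.perf_counter() - start) * 1000.0)
            hits[kind].append(city in results[:k])
    stats = {
        kind: (100.0 * sum(v) / len(v)) if v else 0.0
        for kind, v in hits.items()
    }
    stats["avg_ms"] = sum(timings_ms) / len(timings_ms) if timings_ms else 0.0
    return stats


def format_report(name: str, stats: Dict[str, float], k: int) -> str:
    """Render one candidate's results as plain text for the archive."""
    lines = [f"query: {name} (top {k})"]
    for kind in KINDS:
        lines.append(f"  {kind:<8} hit rate: {stats[kind]:5.1f}%")
    lines.append(f"  avg time: {stats['avg_ms']:.2f} ms")
    return "\n".join(lines)


# Stand-in data and search, only so the sketch runs without a database.
SAMPLE = ["Rome", "Romeoville", "Romero", "Bergen", "Tegucigalpa"]


def toy_search(q: str, k: int) -> List[str]:
    """Pure prefix match, alphabetical order: mimics the 'crowding out'."""
    return sorted(c for c in SAMPLE if c.lower().startswith(q))[:k]


if __name__ == "__main__":
    print(format_report("toy_prefix", evaluate(toy_search, SAMPLE, k=2), k=2))
```

In a real run, `SAMPLE` would come from the sampled table and `search` would execute one of the candidate queries; keeping the search behind a plain callable means every candidate is scored by the same code.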
Here are some queries to enter into the contest, presented in order of how I have anecdotally seen them fare for some imperfect test queries (off the top of my head: `teg`, `flore`, `frank`, `rome`, `pekin`, `quee`):
-- look for partial and full matches, order by full and then partial rank:
with vars as (select 'rome' as q)
select geocode.city_autocomplete.* from geocode.city_autocomplete, vars
where (web_to_tsquery_prefix(vars.q) || websearch_to_tsquery(vars.q)) @@ autocomplete_doc
order by ts_rank(autocomplete_doc, websearch_to_tsquery(vars.q)) desc, ts_rank(autocomplete_doc, (web_to_tsquery_prefix(vars.q))) desc limit 10;
-- a variation of the above, but with a UNION and consequently worse performance
-- (note: the two branches compute different ranks, so a city matching both can appear twice):
with vars as (select 'rome' as q)
select * from (
select geocode.city_autocomplete.*, ts_rank(autocomplete_doc, (websearch_to_tsquery(vars.q))) * 10 as rank from geocode.city_autocomplete, vars
where (websearch_to_tsquery(vars.q)) @@ autocomplete_doc
union
select geocode.city_autocomplete.*, ts_rank(autocomplete_doc, (web_to_tsquery_prefix(vars.q))) as rank from geocode.city_autocomplete, vars
where (web_to_tsquery_prefix(vars.q)) @@ autocomplete_doc
) t order by rank desc limit 10;
-- OUR CURRENT QUERY: only look for partial matches; not great for `rome`, but great for most others from
-- the manual test, and performant:
with vars as (select 'rome' as q)
select * from geocode.city_autocomplete, vars
where web_to_tsquery_prefix(vars.q) @@ autocomplete_doc
order by ts_rank(autocomplete_doc, (web_to_tsquery_prefix(vars.q))) desc limit 10;
-- look for partial and full matches, order by a combined full/partial rank:
with vars as (select 'rome' as q)
select geocode.city_autocomplete.* from geocode.city_autocomplete, vars
where (web_to_tsquery_prefix(vars.q) || websearch_to_tsquery(vars.q)) @@ autocomplete_doc
order by ts_rank(autocomplete_doc, (web_to_tsquery_prefix(vars.q) || websearch_to_tsquery(vars.q))) desc limit 10;
-- look for partial and full matches, but sort by cover density (proximity of matching lexemes)
-- seems to perform almost identically to only looking at partial matches? (i.e. not great for "the rome case")
with vars as (select 'rome' as q)
select geocode.city_autocomplete.* from geocode.city_autocomplete, vars
where (web_to_tsquery_prefix(vars.q) || websearch_to_tsquery(vars.q)) @@ autocomplete_doc
order by ts_rank_cd(autocomplete_doc, (web_to_tsquery_prefix(vars.q) || websearch_to_tsquery(vars.q))) desc limit 10;
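To compare these candidates with the same code, each one could be stored as a named SQL string with the literal `'rome'` replaced by a parameter, then wrapped in a function of the form `search(q, k)`. The sketch below assumes a DB-API cursor that accepts named `%(q)s` placeholders (as psycopg does); the registry names, `make_search`, and the idea that the city name sits in column `name_index` of each row are assumptions for illustration, not something the schema above confirms.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry; the SQL bodies are two of the candidates above,
# with the `vars` CTE replaced by named parameters.
CANDIDATES: Dict[str, str] = {
    "prefix_only": (
        "select geocode.city_autocomplete.* from geocode.city_autocomplete "
        "where web_to_tsquery_prefix(%(q)s) @@ autocomplete_doc "
        "order by ts_rank(autocomplete_doc, web_to_tsquery_prefix(%(q)s)) desc "
        "limit %(k)s"
    ),
    "full_then_partial": (
        "select geocode.city_autocomplete.* from geocode.city_autocomplete "
        "where (web_to_tsquery_prefix(%(q)s) || websearch_to_tsquery(%(q)s)) "
        "@@ autocomplete_doc "
        "order by ts_rank(autocomplete_doc, websearch_to_tsquery(%(q)s)) desc, "
        "ts_rank(autocomplete_doc, web_to_tsquery_prefix(%(q)s)) desc "
        "limit %(k)s"
    ),
}


def make_search(cursor: Any, sql: str,
                name_index: int = 0) -> Callable[[str, int], List[str]]:
    """Wrap one candidate query as search(q, k) -> list of city names.

    `name_index` is an assumption about which column holds the city name.
    """
    def search(q: str, k: int) -> List[str]:
        cursor.execute(sql, {"q": q, "k": k})
        return [row[name_index] for row in cursor.fetchall()]
    return search
```

Passing the query text as a parameter rather than formatting it into the SQL string also keeps odd characters in sampled city names from breaking the statement.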