Skip to content

Conversation

@missinglink
Copy link
Member

@missinglink missinglink commented Dec 13, 2019

cherry-picked from #412 and based on #414.

This PR adds a normalizer which is the nearest thing to an analyzer for keyword fields.
more info here: elastic/elasticsearch#18064

This allows us to perform some basic normalization to fields such as layer, source and category, forcing them to be lowercased and doing some ICU normalization.

One notable change here is that those fields were previously case-sensitive and will now be case-insensitive, which I think is preferable despite there being a test which was covering this behaviour.

Note that not all keyword fields should have a normalizer specified, for instance, verbatim fields such as bounding_box and addendum are probably best left with the default null normalizer.

Normalizers are applied both at index-time and at query-time.

I would like to add some additional filters such as trim and unique but they are not available until version 6.4 of elasticsearch and so will come in a subsequent PR which can be merged independently of this.

@Joxit
Copy link
Member

Joxit commented Dec 13, 2019

Is this normalizer necessary? The API is already doing the lowercase transformation for layer and source (category too ?) and there are also some check for ids 🤔

The code looks good 😄

@missinglink
Copy link
Member Author

Good point, I was more thinking about trying to prevent bugs by ensuring the tokens were normalized.
This normalization needs to be done both at index-time and query-time and I could see bugs easily being introduced, especially in the pelias/api code.

It's not a big deal and I'd be fine with not merging this if it comes with a performance hit.

@missinglink
Copy link
Member Author

missinglink commented Dec 16, 2019

I found this old test case super confusing because it's asserting that the keyword field is case-sensitive even though it should never be the case 🤷‍♂

[admission of guilt] it was written by me 😝

@Joxit
Copy link
Member

Joxit commented Dec 16, 2019

Ha ha ha, it's ok, 4 years ago, the statute of limitations has passed :p

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants