A common requirement for searching over data sets containing accented characters is matching both the unaccented version and the accented form as it appears in the document. This seems to arise primarily in a US English context where accents are not perceived to convey useful information (e.g. many users will not see any meaningful difference between the words “facade” and “façade” and expect to find documents containing either variation when searching).
Fortunately, Elasticsearch supports this use case and it is relatively straightforward to set up. To do so, we need to reconfigure three components of the index’s settings and restore some of the out-of-the-box defaults where necessary. The key piece is the ASCII Folding Token Filter.
Configure Filters
The first step is to instruct Elasticsearch to intercept and generate additional tokens representing the de-accented versions of the terms in our documents. We define a custom filter in the analysis section of the index’s settings:
"settings" : {
  "analysis" : {
    "filter" : {
      "my_text_asciifolding_filter" : {
        "type" : "asciifolding",
        "preserve_original" : true
      }
    }
  }
}
Note that we are using the preserve_original property so that the characters as present in the document will be indexed alongside their ASCII-folded equivalents.
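To make the behavior concrete, here is a rough Python sketch of what folding with preserve_original does to a single token. This is not how Elasticsearch implements it (Lucene's ASCIIFoldingFilter handles many more character mappings than plain Unicode decomposition), just an illustration:

```python
import unicodedata

def ascii_fold(token: str) -> list[str]:
    """Return the folded token, plus the original when they differ."""
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    folded = "".join(ch for ch in decomposed
                     if not unicodedata.combining(ch))
    # preserve_original=true keeps both variants in the token stream
    return [folded, token] if folded != token else [token]

print(ascii_fold("façade"))  # ['facade', 'façade']
print(ascii_fold("facade"))  # ['facade']
```

With preserve_original disabled, only the folded form would be emitted and an accented query like “façade” could no longer match exactly.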
Custom Analyzers
Next we create two custom analyzers: one for the text fields themselves, the other for search terms. Both are required so that incoming queries can be transformed to match the values that are in the index.
"settings" : {
  "analysis" : {
    "analyzer" : {
      "my_text_field_analyzer" : {
        "tokenizer" : "standard",
        "filter" : [
          "standard",
          "lowercase",
          "my_text_asciifolding_filter"
        ]
      },
      "my_text_search_analyzer" : {
        "tokenizer" : "standard",
        "filter" : [
          "standard",
          "lowercase",
          "my_text_asciifolding_filter"
        ]
      }
    }
  }
}
Here we add the standard and lowercase filters to preserve the default behavior. (Note that the standard token filter was always a no-op and has been removed in Elasticsearch 7.0; on newer versions simply omit it.)
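The filter and analyzer snippets above live in a single settings object when creating the index. A sketch of assembling it in Python (dict only; how you send it to the cluster depends on your client):

```python
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_text_asciifolding_filter": {
                    "type": "asciifolding",
                    "preserve_original": True,
                },
            },
            "analyzer": {
                "my_text_field_analyzer": {
                    "tokenizer": "standard",
                    # "standard" mirrors the snippets above; it is a no-op
                    # and was removed in Elasticsearch 7.0
                    "filter": ["standard", "lowercase",
                               "my_text_asciifolding_filter"],
                },
                "my_text_search_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase",
                               "my_text_asciifolding_filter"],
                },
            },
        }
    }
}
```

Both analyzers reference the custom filter by the name it was registered under, so a typo there fails only at index-creation time, not earlier.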
Mappings
The final step is to apply the custom analyzers to the fields of our documents. To do so, we add the analyzer and search_analyzer properties to our mapping, referencing the custom analyzers created above:
"properties" : {
  "my_text_field" : {
    "type" : "text",
    "analyzer" : "my_text_field_analyzer",
    "search_analyzer" : "my_text_search_analyzer"
  }
}
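To see why both analyzers matter end to end, here is a toy Python model (not the Elasticsearch implementation): index-time analysis expands “façade” into both a folded and an original term, and search-time analysis of any query variant produces a term in that set, so the document matches:

```python
import unicodedata

def fold(token: str) -> str:
    # Strip combining accent marks via Unicode decomposition.
    nfd = unicodedata.normalize("NFD", token)
    return "".join(c for c in nfd if not unicodedata.combining(c))

def analyze(text: str) -> set[str]:
    """Toy analyzer: whitespace split stands in for the standard
    tokenizer, then lowercase + asciifolding with preserve_original."""
    terms = set()
    for tok in text.lower().split():
        terms.add(fold(tok))
        terms.add(tok)  # preserve_original keeps the accented form too
    return terms

indexed = analyze("A façade restoration")
for query in ("facade", "façade", "FAÇADE"):
    assert analyze(query) & indexed  # every variant matches the document
```

Without the search_analyzer applying the same folding, an accented query could produce terms absent from the index and silently miss documents.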