Elasticsearch ASCII Folding Filter

A common requirement for searching over data sets containing accented characters is matching both the “unaccented” version of a term and the accented form as it appears in the document. This arises primarily in a US English context, where accents are not perceived to convey useful information (e.g. many users see no meaningful difference between the words “facade” and “façade” and expect to find documents containing either variation when searching).

Fortunately, Elasticsearch comes with support for this use case, and it is relatively straightforward to set up. To do so, we need to reconfigure three components of the index’s settings, restoring some of the out-of-the-box defaults where necessary. This can be done with the ASCII Folding Token Filter.

Configure Filters

The first step is to instruct Elasticsearch to intercept and generate additional tokens representing the de-accented versions of the terms in our documents. We define a custom filter in the analysis section of the index’s settings:

"settings" : {
  "analysis" : {
    "filter" : {
	  "my_text_asciifolding_filter" : {
	    "type" : "asciifolding",
		"preserve_original" : true
	  }
	}
  }
}

Note that we set the preserve_original property so that the characters as they appear in the document are indexed alongside their ASCII-folded equivalents.
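
Before wiring the filter into an analyzer, its output can be inspected directly with the _analyze API, which accepts inline filter definitions. A quick sketch mirroring the my_text_asciifolding_filter definition above:

POST _analyze
{
  "tokenizer" : "standard",
  "filter" : [
    "lowercase",
    {
      "type" : "asciifolding",
      "preserve_original" : true
    }
  ],
  "text" : "façade"
}

The response lists two tokens at the same position, facade and façade, which is exactly what will be written to the index.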

Custom Analyzers

Next we create two custom analyzers: one for the text fields themselves, the other for search terms. Both are required so that incoming queries can be transformed to match the values that are in the index.

"settings" : {
  "analysis" : {
    "analyzer" : {
	  "my_text_field_analyzer" : {
	    "filter" : [
		  "standard",
		  "lowercase",
		  "my_text_asciifolding_filter"
		],
		"tokenizer" : "standard"
	  },
	  "my_text_search_analyzer" : {
	    "filter" : [
		  "standard",
		  "lowercase",
		  "text_asciifolding_filter"
		],
		"tokenizer" : "standard"
	  }
	}
  }
}

Here we add the standard and lowercase filters to preserve the default behavior of the standard analyzer.
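
Once an index has been created with these settings, the analyzers can be exercised by name to confirm the full chain behaves as expected. A sketch, assuming a hypothetical index named my_index:

GET my_index/_analyze
{
  "analyzer" : "my_text_field_analyzer",
  "text" : "Façade"
}

The lowercase filter first normalizes the term to façade, and the folding filter then emits both facade and façade.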

Mappings

The final step is to apply the custom analyzer to the fields of our documents. To do so, we add the analyzer and search_analyzer properties to our mapping, referencing the custom analyzers created above:

"properties" : {
  "my_text_field" : {
    "type" : "text",
    "analyzer" : "my_text_field_analyzer",
	"search_analyzer" : "my_text_search_analyzer"
  }
}
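
With the settings and mapping in place, the behavior can be verified end to end. A minimal sketch, again assuming a hypothetical index named my_index created with the configuration above (note that analysis settings can only be supplied at index creation or while the index is closed):

PUT my_index/_doc/1
{
  "my_text_field" : "The building has an elegant façade."
}

GET my_index/_search
{
  "query" : {
    "match" : {
      "my_text_field" : "facade"
    }
  }
}

The query term “facade” matches because the document was indexed with both token forms; a search for “façade” matches as well, since the search analyzer folds the query term the same way.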