Escape elasticsearch special characters in PHP
You can use preg_replace with a backreference, as stribizhev pointed out (the simplest way):
$string = "The next chars should be escaped: + - = && || > < ! ( ) { } [ ] ^ \" ~ * ? : \ / Did it work?";

function escapeElasticReservedChars($string) {
    $regex = "/[\\+\\-\\=\\&\\|\\!\\(\\)\\{\\}\\[\\]\\^\\\"\\~\\*\\<\\>\\?\\:\\\\\\/]/";
    // addslashes() turns \$0 into \\$0: a literal backslash followed by the whole match.
    return preg_replace($regex, addslashes('\\$0'), $string);
}

echo escapeElasticReservedChars($string);
Or use the preg_replace_callback function to achieve the same result. Thanks to the callback, you have access to the current match and can edit it.
A callback that will be called and passed an array of matched elements in the subject string. The callback should return the replacement string. This is the callback signature: handler(array $matches): string
Here it is in action:
<?php
$string = "The next chars should be escaped: + - = && || > < ! ( ) { } [ ] ^ \" ~ * ? : \ / Did it work?";

function escapeElasticSearchReservedChars($string) {
    $regex = "/[\\+\\-\\=\\&\\|\\!\\(\\)\\{\\}\\[\\]\\^\\\"\\~\\*\\<\\>\\?\\:\\\\\\/]/";
    $string = preg_replace_callback($regex, function ($matches) {
        return "\\" . $matches[0];
    }, $string);
    return $string;
}

echo escapeElasticSearchReservedChars($string);
Output: The next chars should be escaped\: \+ \- \= \&\& \|\| \> \< \! \( \) \{ \} \[ \] \^ \" \~ \* \? \: \\ \/ Did it work\?
If anyone's looking for a slightly verbose (but readable!) solution:
public function escapeElasticsearchValue($searchValue)
{
    $searchValue = str_replace('\\', '\\\\', $searchValue);
    $searchValue = str_replace('*', '\\*', $searchValue);
    $searchValue = str_replace('?', '\\?', $searchValue);
    $searchValue = str_replace('+', '\\+', $searchValue);
    $searchValue = str_replace('-', '\\-', $searchValue);
    $searchValue = str_replace('&&', '\\&&', $searchValue);
    $searchValue = str_replace('||', '\\||', $searchValue);
    $searchValue = str_replace('!', '\\!', $searchValue);
    $searchValue = str_replace('(', '\\(', $searchValue);
    $searchValue = str_replace(')', '\\)', $searchValue);
    $searchValue = str_replace('{', '\\{', $searchValue);
    $searchValue = str_replace('}', '\\}', $searchValue);
    $searchValue = str_replace('[', '\\[', $searchValue);
    $searchValue = str_replace(']', '\\]', $searchValue);
    $searchValue = str_replace('^', '\\^', $searchValue);
    $searchValue = str_replace('~', '\\~', $searchValue);
    $searchValue = str_replace(':', '\\:', $searchValue);
    $searchValue = str_replace('"', '\\"', $searchValue);
    $searchValue = str_replace('=', '\\=', $searchValue);
    $searchValue = str_replace('/', '\\/', $searchValue);

    // < and > can’t be escaped at all. The only way to prevent them from
    // attempting to create a range query is to remove them from the query
    // string entirely
    $searchValue = str_replace('<', '', $searchValue);
    $searchValue = str_replace('>', '', $searchValue);

    return $searchValue;
}
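One detail in the function above that is easy to miss: the backslash is replaced first. If it were replaced later, the backslashes added by the earlier str_replace calls would themselves be doubled. A minimal sketch of the difference, using two hypothetical helpers (backslashFirst and backslashLast are names made up for this illustration and only handle the backslash and *):

```php
<?php
// Hypothetical helper: escapes the backslash first, then the asterisk,
// in the same order as the function above.
function backslashFirst(string $value): string
{
    $value = str_replace('\\', '\\\\', $value);
    return str_replace('*', '\\*', $value);
}

// Hypothetical helper: same two replacements in the opposite order.
// The backslash inserted in front of * gets escaped again.
function backslashLast(string $value): string
{
    $value = str_replace('*', '\\*', $value);
    return str_replace('\\', '\\\\', $value);
}

echo backslashFirst('a*b'), PHP_EOL; // a\*b
echo backslashLast('a*b'), PHP_EOL;  // a\\*b
```

With the second ordering the asterisk ends up preceded by an escaped backslash rather than being escaped itself, so it would still act as a wildcard.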
Full disclosure: I have never used Elasticsearch, and this advice is neither from personal experience nor tested against Elasticsearch. It is based on my knowledge of regular expressions and string manipulation. If someone identifies a vulnerability, I'll be happy to receive your comment.
My snippet:
- first removes all occurrences of < and > in the string, then
- checks for a character in the list of single-occurrence reserved characters OR an ampersand or pipe which is immediately followed by the same character; all of these qualifying characters are escaped with a backslash.
Code: (Demo)
$string = "To be escaped: + - = && || > < ! ( ) { } [ ] ^ \" ~ * ? : \ / triple ||| and split '&<&'";

echo escapeElasticSearchReservedChars($string);

function escapeElasticSearchReservedChars(string $string): string
{
    return preg_replace(
        [
            '_[<>]+_',
            '_[-+=!(){}[\]^"~*?:\\/\\\\]|&(?=&)|\|(?=\|)_',
        ],
        [
            '',
            '\\\\$0',
        ],
        $string
    );
}
Output:
To be escaped\: \+ \- \= \&& \|| \! \( \) \{ \} \[ \] \^ \" \~ \* \? \: \\ \/ triple \|\|| and split '\&&'
The reason that < and > are removed first is so that someone cannot try to exploit the design of the replacement by passing in |>|, which otherwise would prevent the appropriate escaping of two consecutive pipes (after the > was removed).
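To make the ordering argument concrete, here is a small sketch that runs both orders on the input |>|. The two patterns are copied from the snippet above; the constant names and the helper names stripThenEscape and escapeThenStrip are hypothetical, introduced only for this comparison.

```php
<?php
// Patterns copied from the snippet above; the constant names are illustrative.
const STRIP_PATTERN  = '_[<>]+_';
const ESCAPE_PATTERN = '_[-+=!(){}[\]^"~*?:\\/\\\\]|&(?=&)|\|(?=\|)_';

// Hypothetical helper: the order used in the snippet (remove < and >, then escape).
function stripThenEscape(string $s): string
{
    $s = preg_replace(STRIP_PATTERN, '', $s);
    return preg_replace(ESCAPE_PATTERN, '\\\\$0', $s);
}

// Hypothetical helper: the reversed order (escape, then remove < and >).
function escapeThenStrip(string $s): string
{
    $s = preg_replace(ESCAPE_PATTERN, '\\\\$0', $s);
    return preg_replace(STRIP_PATTERN, '', $s);
}

echo stripThenEscape('|>|'), PHP_EOL; // \||
echo escapeThenStrip('|>|'), PHP_EOL; // ||
```

In the reversed order, neither pipe is followed by another pipe when the escaping runs, so nothing is escaped; removing > afterwards leaves an unescaped || behind.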