Can MongoDB store and manipulate strings of UTF-8 with code points outside the basic multilingual plane?


There are several issues here:

1) MongoDB stores all documents in the BSON format, and the BSON spec specifies a UTF-8 string encoding, not a UTF-16 encoding.

Ref: http://bsonspec.org/#/specification
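To make the UTF-8 versus UTF-16 distinction concrete, here is a minimal sketch using only the Python 3 standard library (no MongoDB involved). The sample character U+1F600 and the helper name `utf16_code_units` are my own choices for illustration. It shows that a code point outside the basic multilingual plane is a single 4-byte sequence in UTF-8, while in UTF-16 it becomes a pair of 16-bit code units.

```python
# U+1F600 lies outside the Basic Multilingual Plane (it is above U+FFFF).
s = "\U0001F600"

# UTF-8, the encoding the BSON spec calls for: one 4-byte sequence.
utf8 = s.encode("utf-8")


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print(utf8.hex())                                # f09f9880
print([hex(u) for u in utf16_code_units(s)])     # ['0xd83d', '0xde00']
```

The two UTF-16 values are the high and low halves of the pair. Neither half is a valid character by itself, which is what the points below are about.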

2) All of the drivers, including the JavaScript driver in the mongo shell, should properly handle strings that are encoded as UTF-8. (If they don't, then it's a bug!) Many of the drivers happen to handle UTF-16 properly as well, although as far as I know, UTF-16 isn't officially supported.

3) When I tested this with the Python driver, MongoDB could successfully load and return a string value that contained a broken UTF-16 surrogate pair (code pair). However, I couldn't load a broken pair using the mongo shell, nor could I store a string containing a broken pair into a JavaScript variable in the shell.
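For reference, here is a rough sketch of what a "broken" pair looks like from Python 3, again using only the standard library. I am assuming that broken means a lone half of a surrogate pair, and the helper names `contains_surrogate` and `encodes_as_strict_utf8` are made up for this example. It does not reproduce my driver test. It only shows that such a string cannot be encoded as strict UTF-8.

```python
def contains_surrogate(text):
    """Return True if text holds any code point in U+D800..U+DFFF.

    A Python 3 str stores a non-BMP character as one code point, so any
    surrogate code point found in a str is a stray half of a pair.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def encodes_as_strict_utf8(text):
    """Return True if text can be encoded with the strict UTF-8 codec."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


good = "\U0001F600"   # complete non-BMP character
broken = "\ud83d"     # high surrogate with no low surrogate after it

print(contains_surrogate(good), encodes_as_strict_utf8(good))      # False True
print(contains_surrogate(broken), encodes_as_strict_utf8(broken))  # True False

# The non-strict "surrogatepass" handler emits bytes anyway, but the
# result is not well-formed UTF-8.
print(broken.encode("utf-8", "surrogatepass"))                     # b'\xed\xa0\xbd'
```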

4) mapReduce() runs correctly on string data containing a correct UTF-16 surrogate pair, but it generates an error on string data containing a broken pair.

It appears that mapReduce() fails when MongoDB tries to convert the BSON to a JavaScript variable for use by the JavaScript engine.
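If you want to find the offending values before running mapReduce(), one option is to scan documents on the client side. The following is only a sketch. It works on plain Python dicts and lists, the function name `find_surrogate_fields` and the sample document are hypothetical, and it assumes the driver hands broken strings back as Python 3 `str` values containing surrogate code points, which I have not verified for every driver version.

```python
def find_surrogate_fields(value, path=""):
    """Return dotted paths of string values that contain surrogate code points."""
    hits = []
    if isinstance(value, str):
        if any(0xD800 <= ord(ch) <= 0xDFFF for ch in value):
            hits.append(path)
    elif isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            hits.extend(find_surrogate_fields(item, child))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            child = f"{path}.{index}" if path else str(index)
            hits.extend(find_surrogate_fields(item, child))
    return hits


# Hypothetical document: one clean non-BMP string and two broken ones.
doc = {
    "_id": 1,
    "title": "ok \U0001F600",
    "tags": ["fine", "bad \ud83d"],
    "meta": {"note": "\ude00 stray"},
}

print(find_surrogate_fields(doc))  # ['tags.1', 'meta.note']
```

This only checks string values, not field names, and it says nothing about what the server does with the data.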

5) I've filed Jira issue SERVER-6747 for this. Feel free to follow it and vote it up.