How do JSON parsers encode unicode characters not in the basic multilingual plane? How do JSON parsers encode unicode characters not in the basic multilingual plane? json json

How do JSON parsers encode unicode characters not in the basic multilingual plane?


Taken from the Wikipedia article linked by Remy Lebeau in the comments above (link):

To encode U+10437 (𐐷) to UTF-16:

Subtract 0x10000 from the code point, leaving 0x0437. For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801. For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37. To decode U+10437 (𐐷) from UTF-16:

Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400. Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37. Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.