How to convert UTF8 string to byte array?


The logic of encoding Unicode in UTF-8 is basically:

  • Up to 4 bytes per character can be used. The fewest number of bytes possible is used.
  • Characters up to U+007F are encoded with a single byte.
  • For multibyte sequences, the number of leading 1 bits in the first byte gives the number of bytes for the character. The rest of the bits of the first byte can be used to encode bits of the character.
  • The continuation bytes begin with 10, and the other 6 bits encode bits of the character.
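
To see these rules on real bytes, the sketch below prints each UTF-8 byte of a few characters in binary. It leans on the standard TextEncoder for the actual encoding; the helper name utf8Bits is made up for illustration.

```javascript
// Hypothetical helper: returns each UTF-8 byte of `str` as an 8-digit binary string.
function utf8Bits(str) {
  var bytes = new TextEncoder().encode(str);
  return Array.from(bytes, function (b) {
    return b.toString(2).padStart(8, "0");
  });
}

console.log(utf8Bits("A"));  // ["01000001"]                         single byte, leading 0
console.log(utf8Bits("é"));  // ["11000011", "10101001"]             110xxxxx + 1 continuation byte
console.log(utf8Bits("€"));  // ["11100010", "10000010", "10101100"] 1110xxxx + 2 continuation bytes
console.log(utf8Bits("😀")); // ["11110000", "10011111", "10011000", "10000000"]
```

In each multibyte result the first byte starts with as many 1 bits as there are bytes in the sequence, and every following byte starts with 10.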

Here's a function I wrote a while back for encoding a JavaScript UTF-16 string in UTF-8:

    function toUTF8Array(str) {
        var utf8 = [];
        for (var i = 0; i < str.length; i++) {
            var charcode = str.charCodeAt(i);
            if (charcode < 0x80) utf8.push(charcode);
            else if (charcode < 0x800) {
                utf8.push(0xc0 | (charcode >> 6),
                          0x80 | (charcode & 0x3f));
            }
            else if (charcode < 0xd800 || charcode >= 0xe000) {
                utf8.push(0xe0 | (charcode >> 12),
                          0x80 | ((charcode >> 6) & 0x3f),
                          0x80 | (charcode & 0x3f));
            }
            // surrogate pair
            else {
                i++;
                // UTF-16 encodes 0x10000-0x10FFFF by
                // subtracting 0x10000 and splitting the
                // 20 bits of 0x0-0xFFFFF into two halves
                charcode = 0x10000 + (((charcode & 0x3ff) << 10)
                          | (str.charCodeAt(i) & 0x3ff));
                utf8.push(0xf0 | (charcode >> 18),
                          0x80 | ((charcode >> 12) & 0x3f),
                          0x80 | ((charcode >> 6) & 0x3f),
                          0x80 | (charcode & 0x3f));
            }
        }
        return utf8;
    }
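
As a possible variation on the function above (not the original author's code), the same bit arithmetic can be driven by code points instead of UTF-16 code units. A for...of loop over a string yields whole code points, so the surrogate-pair branch goes away. The name toUTF8ArrayByCodePoint is hypothetical, and replacing a lone surrogate with U+FFFD is an assumption chosen to match what TextEncoder does.

```javascript
// Hypothetical code-point-based variant; requires ES2015 (for...of, codePointAt).
function toUTF8ArrayByCodePoint(str) {
  var utf8 = [];
  for (var ch of str) {                 // iterates by code point, so pairs arrive whole
    var cp = ch.codePointAt(0);
    // Assumption: an unpaired surrogate is replaced with U+FFFD.
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd;
    if (cp < 0x80) {
      utf8.push(cp);
    } else if (cp < 0x800) {
      utf8.push(0xc0 | (cp >> 6),
                0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      utf8.push(0xe0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3f),
                0x80 | (cp & 0x3f));
    } else {
      utf8.push(0xf0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3f),
                0x80 | ((cp >> 6) & 0x3f),
                0x80 | (cp & 0x3f));
    }
  }
  return utf8;
}

console.log(toUTF8ArrayByCodePoint("a€😀"));
// [97, 226, 130, 172, 240, 159, 152, 128]
```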


JavaScript Strings are stored in UTF-16. To get UTF-8, you'll have to convert the String yourself.

One way is to combine encodeURIComponent(), which outputs the UTF-8 bytes percent-encoded, with unescape(), as mentioned on ecmanaut. Note that unescape() is deprecated, although browsers still support it.

    var utf8 = unescape(encodeURIComponent(str));
    var arr = [];
    for (var i = 0; i < utf8.length; i++) {
        arr.push(utf8.charCodeAt(i));
    }
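
Since unescape() is deprecated, one alternative is to read the percent-escapes produced by encodeURIComponent() directly. This is only a sketch: the function name utf8BytesFromURIComponent is hypothetical, and it assumes the input contains no unpaired surrogates, because encodeURIComponent() throws a URIError on those.

```javascript
// Hypothetical helper: UTF-8 bytes of `str`, without using unescape().
function utf8BytesFromURIComponent(str) {
  var encoded = encodeURIComponent(str); // throws URIError on lone surrogates
  var bytes = [];
  for (var i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      // "%XX" is one UTF-8 byte written as two hex digits.
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Characters left unescaped are plain ASCII, i.e. one byte each.
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return bytes;
}

console.log(utf8BytesFromURIComponent("a é")); // [97, 32, 195, 169]
```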


The Encoding API lets you both encode and decode UTF-8 easily (using typed arrays):

    var encoded = new TextEncoder().encode("Γεια σου κόσμε");
    var decoded = new TextDecoder("utf-8").decode(encoded);

    console.log(encoded, decoded);
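
TextEncoder returns a Uint8Array rather than a plain Array. If a regular array of numbers is wanted, like the ones built by the hand-written approaches, Array.from() can copy it over. A minimal sketch, with utf8ByteArray as a made-up name:

```javascript
// Hypothetical wrapper: UTF-8 bytes of `str` as a plain Array of numbers.
function utf8ByteArray(str) {
  return Array.from(new TextEncoder().encode(str));
}

var bytes = utf8ByteArray("é€");
console.log(Array.isArray(bytes)); // true
console.log(bytes);                // [195, 169, 226, 130, 172]
```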

TextEncoder and TextDecoder are available in all current major browsers and in Node.js, and there's a polyfill that should work in IE11 and older versions of Edge.

While TextEncoder can only encode to UTF-8, TextDecoder supports other encodings. I used it to decode Japanese text (Shift-JIS) in this way:

    // Shift-JIS encoded text; must be a byte array due to values 129 and 130.
    var arr = [130, 108, 130, 102, 130, 80, 129,  64, 130, 102, 130,  96, 130, 108, 130, 100,
               129,  64, 130,  99, 130, 96, 130, 115, 130,  96, 129, 124, 130,  79, 130, 80];

    // Convert to byte array
    var data = new Uint8Array(arr);

    // Decode with TextDecoder
    var decoded = new TextDecoder("shift-jis").decode(data.buffer);
    console.log(decoded);