Truncating unicode so it fits a maximum size when encoded for wire transfer



def unicode_truncate(s, length, encoding='utf-8'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')

Here is an example for a unicode string where each character is encoded with 2 bytes in UTF-8:

>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
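For reference, the same function works unchanged on Python 3, where plain str literals are already unicode, and the result always re-encodes within the byte budget (a quick sketch):

>>> truncated = unicode_truncate('абвгд', 5)
>>> truncated
'аб'
>>> len(truncated.encode('utf-8'))  # fits within the 5-byte limit
4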


One of UTF-8's properties is that it is easy to resync, that is, to find the unicode character boundaries in the encoded bytestream. All you need to do is cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of, a multibyte character.
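To make that concrete, here is the naive walk-backwards as literal code (a sketch; as the next paragraph explains, it strips too much):

def naive_truncate(encoded, maxlen):
    cut = encoded[:maxlen]
    # Strip ALL trailing bytes > 127 -- too aggressive: this also eats
    # complete multibyte characters, back to the last ASCII byte.
    while cut and ord(cut[-1:]) > 127:
        cut = cut[:-1]
    return cut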

As written, though, this is too simple -- it will erase back to the last ASCII character, possibly the whole string. What we actually need is to check that the string does not end with a truncated two-byte (starting 110yyyxx), three-byte (1110yyyy) or four-byte (11110zzz) sequence.

Here is a Python 2.6 implementation in clear code. Optimization should not be an issue: regardless of the string's length, we only ever check the last 1-4 bytes.

# coding: UTF-8

def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True

def is_first_byte(byte):
    """return if the UTF-8 @byte is the first byte of an encoded character"""
    o = ord(byte)
    return ((0b10111111 & o) != o)

def truncate_utf8(bytestr, maxlen):
    u"""
    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")
    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18
    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = maxlen
    for x in xrange(1, 5):
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]
    return bytestr[:L]

if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()
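If you are on Python 3, roughly the same logic can be written without the setdefaultencoding hack (a sketch of the same idea, assuming valid UTF-8 input, not a drop-in port):

def truncate_utf8(data, maxlen):
    """Cut data (valid UTF-8 bytes) at maxlen without splitting a character."""
    cut = data[:maxlen]
    # Only the last 1-3 bytes can belong to a truncated sequence:
    # a lead byte looks like 11xxxxxx, a continuation byte like 10xxxxxx.
    for back in range(1, 4):
        if len(cut) < back:
            break
        byte = cut[-back]
        if byte & 0b11000000 == 0b11000000:   # lead byte of a multibyte char
            try:
                cut[-back:].decode("UTF-8")   # sequence is complete, keep it
            except UnicodeDecodeError:
                return cut[:-back]            # sequence is truncated, drop it
            break
        if byte & 0b10000000 == 0:            # ASCII byte, nothing dangling
            break
    return cut

>>> truncate_utf8("ウィキペディアにようこそ".encode("UTF-8"), 20).decode("UTF-8")
'ウィキペディ'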


This will do for UTF-8, if you like to do it with a regex.

>>> import re
>>> partial = "\xc2\x80\xc2\x80\xc2"
>>> re.sub("([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", "", partial)
'\xc2\x80\xc2\x80'

It covers partial sequences for everything from U+0080 (2 bytes) to U+10FFFF (4 bytes) in UTF-8.

It is really straightforward, following directly from the UTF-8 encoding algorithm:

From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx. That means if you see only a single byte of the form 110yyyxx (0b11000000 to 0b11011111) at the end -- that is [\xc0-\xdf] -- it is a partial character.

From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx. If you see only the first 1 or 2 of those bytes at the end, it is a partial character. It matches the pattern [\xe0-\xef][\x80-\xbf]{0,1}.

From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx. If you see only the first 1 to 3 of those bytes at the end, it is a partial character. It matches the pattern [\xf0-\xf7][\x80-\xbf]{0,2} (note the lead-byte range is \xf0-\xf7, since 11110zzz spans 0b11110000 to 0b11110111).
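To illustrate, here are a few truncated tails run through that substitution (a quick sketch, Python 2 byte strings assumed):

import re

PARTIAL = re.compile("([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$")

print repr(PARTIAL.sub("", "\xd0\xb0\xd0"))              # '\xd0\xb0' -- half a 2-byte char stripped
print repr(PARTIAL.sub("", "\xe3\x82\xa2\xe3\x82"))      # '\xe3\x82\xa2' -- 2 of 3 bytes stripped
print repr(PARTIAL.sub("", "\xf0\x9f\x98\x80\xf0\x9f"))  # '\xf0\x9f\x98\x80' -- 2 of 4 bytes stripped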

Update:

If you only need the Basic Multilingual Plane, you can drop the last pattern. This will do:

re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)

Let me know if there is any problem with that regex.