Detect character position in a UTF NSString from a byte offset (was SQLite offsets() and encoding problem)


See the SQLite FTS3 docs and you'll notice that the offsets and lengths are in bytes, not characters.

You must apply the offset and length to the raw bytes, before decoding them into a string of characters, in order to locate the correct match. The offset coming from SQLite counts every byte of a multibyte character, whereas you are using that offset to count characters.

Your indexed text probably has 3 or 4 characters that are two bytes. Hence the off-by-3-or-4 problem.
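For example, here's a quick Swift sketch with a made-up string (not your actual data) showing how a single two-byte character such as "é" pushes the byte offset ahead of the character offset:

import Foundation

// "é" takes two bytes in UTF-8, so the byte offset of "lait" (what SQLite's
// offsets() reports) is one larger than its character offset (what you'd use
// to index the decoded string).
let text = "café au lait"
let matchStart = text.range(of: "lait")!.lowerBound

let characterOffset = text.distance(from: text.startIndex, to: matchStart)  // 8
let byteOffset = text[..<matchStart].utf8.count                             // 9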


Per @metatation's answer, the offset is in bytes, not characters. The text in your database is probably UTF-8-encoded Unicode, in which case any single non-ASCII character is represented by multiple bytes. Examples of non-ASCII characters include those with accents (à, ö, etc.), smart quotes, characters from non-Latin character sets (Greek, Cyrillic, most Asian character sets, etc.), and so on.

If the bytes in the SQLite database are UTF-8-encoded Unicode strings, you can work out the true Unicode character offset for a given byte offset like so:

NSUInteger characterOffsetForByteOffsetInUTF8String(NSUInteger byteOffset, const char *string) {
    /*
     * UTF-8 represents ASCII characters in a single byte. Characters with a code
     * point from U+0080 upwards are represented as multiple bytes. The first byte
     * always has the two most significant bits set (i.e. 11xxxxxx). All subsequent
     * bytes have the most significant bit set, the next most significant bit unset
     * (i.e. 10xxxxxx).
     *
     * We use that here to determine character offsets. We step through the first
     * `byteOffset` bytes of `string`, incrementing the character offset result
     * every time we come across a byte that doesn't match 10xxxxxx, i.e. where
     * (byte & 11000000) != 10000000
     *
     * See also: http://en.wikipedia.org/wiki/UTF-8#Description
     */
    NSUInteger characterOffset = 0;
    for (NSUInteger i = 0; i < byteOffset; i++) {
        char c = string[i];
        if ((c & 0xc0) != 0x80) {
            characterOffset++;
        }
    }
    return characterOffset;
}

Caveat: If you're using the character offset to index into an NSString, bear in mind that NSString uses UTF-16 under the hood, so characters with a Unicode code point higher than U+FFFF are represented by a pair of 16-bit values. You generally won't bump up against this for text content, but if you care about particularly obscure character sets, or some of the non-text characters Unicode can represent such as emoji, then the above algorithm will require improvements to cater for those.

(The code snippet's from this project of mine - feel free to use it.)
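If you do need NSString-friendly (UTF-16) offsets rather than character offsets, one possible adjustment, sketched below in Swift as a hypothetical helper rather than anything from the project linked above, is to count the leading byte of each 4-byte UTF-8 sequence twice, since code points above U+FFFF become surrogate pairs in UTF-16:

// Rough sketch: converts a byte offset into an NSString-style UTF-16 offset.
// Leading bytes of 4-byte UTF-8 sequences (0b11110xxx) encode code points
// above U+FFFF, which need a surrogate pair in UTF-16, so they count as two
// code units. Call it with Array(someString.utf8) and the offsets() value.
func utf16OffsetForByteOffset(_ byteOffset: Int, inUTF8Bytes bytes: [UInt8]) -> Int {
    var utf16Offset = 0
    for i in 0..<min(byteOffset, bytes.count) {
        let byte = bytes[i]
        if (byte & 0xc0) != 0x80 {                       // leading byte of a code point
            utf16Offset += (byte & 0xf8) == 0xf0 ? 2 : 1 // 4-byte sequence -> surrogate pair
        }
    }
    return utf16Offset
}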


Inspired by this thread, and Simon's solution in particular, here's how I do it.

There might be a more "Swifty" way than returning an NSRange, but I need it to highlight an NSAttributedString.

import Foundation

extension String {
    func charRangeForByteRange(range: NSRange) -> NSRange {
        let bytes = [UInt8](utf8)
        var charOffset = 0

        // Count the leading UTF-8 bytes (i.e. non-continuation bytes) before the
        // start of the byte range to get the character offset of the match.
        for i in 0..<range.location {
            if (bytes[i] & 0xc0) != 0x80 { charOffset += 1 }
        }
        let location = charOffset

        // Do the same across the byte range itself to get its length in characters.
        for i in range.location..<(range.location + range.length) {
            if (bytes[i] & 0xc0) != 0x80 { charOffset += 1 }
        }
        let length = charOffset - location

        return NSMakeRange(location, length)
    }
}
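
Hypothetical usage (the byte range is hard-coded here; in practice it would come from the offsets() output, and this assumes no characters outside the BMP, per the caveat in the previous answer):

import UIKit

// "è", "û" and "é" each take two UTF-8 bytes, so "anyone" starts at byte 17
// but character 14.
let snippet = "Crème brûlée, anyone?"
let byteRange = NSRange(location: 17, length: 6)                 // bytes covering "anyone"

let charRange = snippet.charRangeForByteRange(range: byteRange)  // {14, 6}

let highlighted = NSMutableAttributedString(string: snippet)
highlighted.addAttribute(.backgroundColor, value: UIColor.yellow, range: charRange)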