How to count characters in a unicode string in C

c string unicode ascii

Count only the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).

That's because all bytes with the top two bits set to 10 are UTF-8 continuation bytes.

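See a description of the UTF-8 encoding for the details and for how strlen can work on a UTF-8 string (it returns the length in bytes, since no byte inside a multi-byte sequence is ever zero).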
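
As a rough illustration of the counting rule above, here is a minimal sketch that contrasts strlen with a code point count. The name utf8_count is made up for this example, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from any library): counts code points by
   skipping continuation bytes, i.e. bytes of the form 10xxxxxx.
   Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the string "h\xC3\xA9llo" ("héllo" written as explicit bytes), strlen reports 6 bytes while utf8_count reports 5 code points.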

For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
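
One place this rule helps is when cutting a string at a byte limit without splitting a code point. The following is only a sketch: utf8_safe_cut is a hypothetical name, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns the largest byte length <= max_bytes at
   which the NUL-terminated UTF-8 string s can be cut without splitting
   a code point. */
size_t utf8_safe_cut(const char *s, size_t max_bytes)
{
    size_t len = 0;

    /* Clamp the requested length to the string's actual byte length. */
    while (len < max_bytes && s[len] != '\0')
        ++len;

    /* If the first excluded byte is a continuation byte, the cut falls
       inside a code point; step back to that code point's start byte. */
    while (len > 0 && ((unsigned char)s[len] & 0xC0) == 0x80)
        --len;

    return len;
}
```

With the bytes "a\xC3\xA9" "b" (a, é, b), a limit of 2 would land in the middle of é, so the function backs up and returns 1; a limit of 3 is already on a boundary and is returned unchanged.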

Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:

utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid  (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);

to get, respectively:

the left sz UTF-8 bytes of a string.
the sz UTF-8 bytes of a string, starting at pos.
the rest of the UTF-8 bytes of a string, starting at pos.

This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
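
A possible shape for those three functions is sketched below. Several things here are assumptions rather than part of the description above: pos and sz are taken to count code points (the text is ambiguous on this), the return type is void, srcbuff is made const, utf8_offset is a made-up helper, and the caller is expected to supply a destbuff of at least strlen(srcbuff) + 1 bytes that does not overlap srcbuff.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of the pos'th code point (0-based)
   in s, or the byte length of s if it has fewer than pos + 1 code
   points. Assumes valid, NUL-terminated UTF-8. */
static size_t utf8_offset(const char *s, size_t pos)
{
    size_t i = 0;
    for (; s[i] != '\0'; ++i) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {  /* start byte */
            if (pos == 0)
                break;
            --pos;
        }
    }
    return i;
}

/* Copies up to sz code points of srcbuff, starting at code point pos. */
void utf8mid(char *destbuff, const char *srcbuff, size_t pos, size_t sz)
{
    size_t start = utf8_offset(srcbuff, pos);
    size_t nbytes = utf8_offset(srcbuff + start, sz);
    memcpy(destbuff, srcbuff + start, nbytes);
    destbuff[nbytes] = '\0';
}

/* Copies the first sz code points of srcbuff. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    utf8mid(destbuff, srcbuff, 0, sz);
}

/* Copies everything from code point pos to the end of srcbuff. */
void utf8rest(char *destbuff, const char *srcbuff, size_t pos)
{
    strcpy(destbuff, srcbuff + utf8_offset(srcbuff, pos));
}
```

Out-of-range positions are clamped to the end of the string in this sketch, so asking for more than is available yields a shorter (possibly empty) result instead of an error. Whether that is the right policy depends on your use case.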

Try this for size:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s, or NULL if there is no such codepoint
char *utf8index(char *s, size_t pos)
{
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
// (an index at or past the end of the string becomes -1)
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        if (p == NULL) {
            perror("realloc");
            abort();
        }
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }
    // NUL-terminate the input; the buffer always has room for this byte
    p[len] = '\0';

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    // the precision argument of %.*s must be an int
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}

Sample run:

matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā

Note that your example has an off-by-one error: theString[2] == "好".

The easiest way is to use a library like ICU.
