
Number of character cells used by string


From UTF-8 and Unicode FAQ for Unix/Linux:

The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
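As a rough illustration of the portable approach quoted above, here is a minimal sketch built on mbstowcs(NULL, s, 0). The function name count_chars_locale is made up for this example, and the result depends entirely on the locale the program has selected.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters in the multibyte string s,
   interpreted under the current LC_CTYPE locale.
   Returns -1 if s contains a sequence that is invalid in that locale. */
long count_chars_locale(const char *s) {
    assert(s != NULL);
    /* With a NULL destination, mbstowcs only counts the wide characters
       that a conversion would produce (terminator not included). */
    size_t n = mbstowcs(NULL, s, 0);
    return n == (size_t)-1 ? -1 : (long)n;
}
```

A caller would normally run setlocale(LC_ALL, "") once at startup so that the user's encoding (for example UTF-8) is in effect. In the default "C" locale only plain ASCII input can be relied on to convert.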


You may or may not have a UTF-8 compatible strlen(3) function available. However, there are some simple C functions readily available that do the job quickly.

The efficient C solutions examine the start of the character to skip continuation bytes. The simple code (referenced from the link above) is

int my_strlen_utf8_c(char *s) {
    int i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}

The faster version uses the same technique, but prefetches data and does multi-byte compares, resulting in a substantial speedup. The code is longer and more complex, however.
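To give a feel for the multi-byte compare idea, here is a simplified sketch. It is not the referenced faster version: it does no prefetching and it calls strlen first instead of detecting the terminator inside each word. The name utf8_strlen_words is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 characters by examining 8 bytes at a time.
   Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_strlen_words(const char *s) {
    assert(s != NULL);
    const uint64_t ones = 0x0101010101010101ULL;
    size_t nbytes = strlen(s);
    size_t continuation = 0;
    size_t i = 0;

    for (; i + 8 <= nbytes; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);  /* safe unaligned load */
        /* Low bit of each byte becomes 1 iff that byte looks like 10xxxxxx. */
        uint64_t m = (w >> 7) & (~w >> 6) & ones;
        /* The multiply sums the eight flag bytes into the top byte. */
        continuation += (size_t)((m * ones) >> 56);
    }
    /* Handle the remaining 0..7 bytes one at a time. */
    for (; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) == 0x80) continuation++;
    }
    return nbytes - continuation;
}
```

The result should match my_strlen_utf8_c for any valid UTF-8 input. Whether it is actually faster depends on the compiler and the data, so measure before relying on it.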


I'm shocked that no one mentioned this, so here it goes for the record:

If you want to align text in a terminal, you need to use the POSIX functions wcwidth and wcswidth. Here's a correct program to find the on-screen length of a string.

#define _XOPEN_SOURCE
#include <wchar.h>
#include <stdio.h>
#include <locale.h>
#include <stdlib.h>

int measure(char *string) {
    // find the wide-string length; (size_t)-1 means an invalid sequence
    size_t len = mbstowcs(NULL, string, 0);
    if (len == (size_t)-1) return -2;
    // allocate enough memory to hold the wide string and its terminator
    size_t needed = len + 1;
    wchar_t *wcstring = malloc(needed * sizeof *wcstring);
    if (!wcstring) return -1;
    // change encodings
    if (mbstowcs(wcstring, string, needed) == (size_t)-1) {
        free(wcstring);
        return -2;
    }
    // measure width
    int width = wcswidth(wcstring, needed);
    free(wcstring);
    return width;
}

int main(int argc, char **argv) {
    setlocale(LC_ALL, "");
    for (int i = 1; i < argc; i++) {
        printf("%s: %d\n", argv[i], measure(argv[i]));
    }
}

Here's an example of it running:

$ ./measure hello 莊子 cＡb
hello: 5
莊子: 4
cＡb: 4

Note how the two characters "莊子" and the three characters "cＡb" (note the double-width Ａ) are both 4 columns wide.

As utf8everywhere.org puts it,

The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine for this. Code points do not occupy one column even in monospace fonts and terminals. POSIX takes this into account.

Windows does not have any built-in wcwidth function for console output; if you want to support multi-column characters in the Windows console you need to find a portable implementation of wcwidth, or give up because the Windows console doesn’t support Unicode without crazy hacks.
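For what a portable replacement might look like, here is a deliberately abbreviated sketch. The names sketch_wcwidth and sketch_wcswidth are hypothetical, the input is assumed to be already decoded into Unicode code points, and the range table covers only a handful of blocks. A real implementation (Markus Kuhn's wcwidth.c is a commonly used one) carries much larger tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: column width of a single code point.
   The ranges below are a small illustrative subset, not a full table. */
static int sketch_wcwidth(uint32_t cp) {
    if (cp == 0) return 0;
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0)) return -1; /* control characters */
    if (cp >= 0x0300 && cp <= 0x036F) return 0;            /* combining diacritical marks */
    if ((cp >= 0x4E00 && cp <= 0x9FFF) ||                  /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||                  /* Hangul Syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))                    /* fullwidth forms */
        return 2;
    return 1;                                              /* assumed default */
}

/* Hypothetical helper: total width of n code points,
   or -1 if any of them is a control character. */
int sketch_wcswidth(const uint32_t *cps, size_t n) {
    assert(cps != NULL);
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int w = sketch_wcwidth(cps[i]);
        if (w < 0) return -1;
        total += w;
    }
    return total;
}
```

With this table, the code points for 莊子 (U+838A, U+5B50) and for cＡb (U+0063, U+FF21, U+0062) both come out at 4 columns, matching the wcswidth output shown earlier.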