Number of character cells used by string

From UTF-8 and Unicode FAQ for Unix/Linux:

The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
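As a minimal sketch of the portable approach quoted above (the function name `count_chars_locale` is made up for illustration and is not from the FAQ), the call can be wrapped so that an invalid string is reported rather than silently miscounted:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters in s under the current
 * LC_CTYPE locale, or -1 if s is not valid in that locale's encoding. */
long count_chars_locale(const char *s) {
    assert(s != NULL);
    /* With a NULL destination, mbstowcs only counts; nothing is stored. */
    size_t n = mbstowcs(NULL, s, 0);
    return n == (size_t)-1 ? -1 : (long)n;
}
```

The caller is assumed to have run `setlocale(LC_ALL, "")` (from `<locale.h>`) once at startup and to be using a UTF-8 locale. Without that the program stays in the "C" locale, where the handling of non-ASCII bytes depends on the implementation.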

c linux string utf-8

You may or may not have a UTF-8 compatible strlen(3) function available. However, there are some simple C functions readily available that do the job quickly.

The efficient C solutions test the top two bits of each byte and skip continuation bytes, which always have the form 10xxxxxx. The simple code (referenced from the link above) is:

    int my_strlen_utf8_c(char *s) {
        int i = 0, j = 0;
        while (s[i]) {
            /* count every byte that is not a continuation byte (10xxxxxx) */
            if ((s[i] & 0xc0) != 0x80) j++;
            i++;
        }
        return j;
    }

The faster version uses the same technique, but prefetches data and does multi-byte compares, resulting in a substantial speedup. The code is longer and more complex, however.
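To give a rough idea of what "multi-byte compares" can look like, here is a sketch that is not the code the answer refers to: it checks eight bytes per step, does no prefetching, and assumes the input is valid UTF-8. The name `utf8_count_wordwise` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts UTF-8 characters by subtracting the number
 * of continuation bytes (10xxxxxx) from the byte length. */
size_t utf8_count_wordwise(const char *s) {
    assert(s != NULL);
    const uint64_t ones = 0x0101010101010101ULL;
    size_t n = strlen(s);   /* byte length; also keeps reads in bounds */
    size_t cont = 0;        /* continuation bytes seen so far */
    size_t i = 0;

    /* Main loop: eight bytes at a time. */
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);   /* safe unaligned load */
        /* Low bit of each byte lane is set iff bit 7 is 1 and bit 6 is 0. */
        uint64_t hits = (w >> 7) & (~w >> 6) & ones;
        /* Sum the eight lane flags into the top byte (at most 8). */
        cont += (size_t)((hits * ones) >> 56);
    }

    /* Tail: remaining bytes, one at a time. */
    for (; i < n; i++) {
        if (((unsigned char)s[i] & 0xC0) == 0x80) cont++;
    }
    return n - cont;
}
```

Calling `strlen` first costs an extra pass over the string, but it means the word loads never read past the terminator. A version that finds the terminator in the same loop would be faster and also harder to get right.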


I'm shocked that no one mentioned this, so here it goes for the record:

If you want to align text in a terminal, you need to use the POSIX functions wcwidth and wcswidth. Here's a correct program to find the on-screen length of a string.

    #define _XOPEN_SOURCE
    #include <wchar.h>
    #include <stdio.h>
    #include <locale.h>
    #include <stdlib.h>

    int measure(char *string) {
        // allocate enough memory to hold the wide string
        size_t len = mbstowcs(NULL, string, 0);
        if (len == (size_t)-1) return -2;  // invalid multibyte sequence
        size_t needed = len + 1;
        wchar_t *wcstring = malloc(needed * sizeof *wcstring);
        if (!wcstring) return -1;

        // change encodings
        if (mbstowcs(wcstring, string, needed) == (size_t)-1) {
            free(wcstring);  // don't leak the buffer on failure
            return -2;
        }

        // measure width (-1 if the string has a non-printable character)
        int width = wcswidth(wcstring, needed);
        free(wcstring);
        return width;
    }

    int main(int argc, char **argv) {
        setlocale(LC_ALL, "");
        for (int i = 1; i < argc; i++) {
            printf("%s: %d\n", argv[i], measure(argv[i]));
        }
    }

Here's an example of it running:

    $ ./measure hello 莊子 cＡb
    hello: 5
    莊子: 4
    cＡb: 4

Note how the two characters "莊子" and the three characters "cＡb" (note the double-width Ａ) are both 4 columns wide.
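For the alignment use case itself, here is a sketch of how a column count might be turned into padding. It walks the string with `mbrtowc` and `wcwidth` instead of allocating a wide copy; the names `display_width` and `pad_for` are made up for illustration, and the caller is assumed to have called `setlocale(LC_ALL, "")` already.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: terminal columns occupied by s, or -1 if s is not
 * valid in the current locale or contains a non-printable character. */
int display_width(const char *s) {
    assert(s != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);      /* initial conversion state */
    size_t left = strlen(s);
    int total = 0;
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2) return -1;  /* invalid or truncated */
        if (n == 0) break;          /* NUL; not expected since left == strlen(s) */
        int w = wcwidth(wc);
        if (w < 0) return -1;       /* non-printable character */
        total += w;
        s += n;
        left -= n;
    }
    return total;
}

/* Hypothetical helper: spaces needed to pad s out to cols columns
 * (0 if s is already wider), or -1 if the width cannot be determined. */
int pad_for(const char *s, int cols) {
    int w = display_width(s);
    if (w < 0) return -1;
    return cols > w ? cols - w : 0;
}
```

A left-aligned cell can then be printed with something like `printf("%s%*s", s, pad_for(s, 12), "")`, after checking that `pad_for` did not return -1. In a UTF-8 locale, `display_width` should agree with `measure` above for valid, printable strings.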

As utf8everywhere.org puts it,

The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine for this. Code points do not occupy one column even in monospace fonts and terminals. POSIX takes this into account.

Windows does not have any built-in wcwidth function for console output; if you want to support multi-column characters in the Windows console ~~you need to find a portable implementation of wcwidth~~ give up because the Windows console doesn’t support Unicode without crazy hacks.
