Matching Unicode word boundaries in Python

1- RIGHT SINGLE QUOTATION MARK ’ (U+2019) appears to have simply been left out of the source file; only the ASCII apostrophe ' (U+0027) is checked:

    /* Break between apostrophe and vowels (French, Italian). */
    /* WB5a */
    if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
      is_unicode_vowel(char_at(state->text, text_pos)))
        return TRUE;
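To make the effect of that comparison concrete, here is a minimal Python sketch of the check as it reads in the snippet above. It is a simplified model, not the regex module's actual code: the name wb5a_break and the bounds handling are assumptions made for illustration, and the vowel set is the list quoted in this answer.

```python
# Simplified model of the pre-fix WB5a check; not the regex module's real code.
# The vowel set holds the code points listed in this answer
# (a, e, i, o, u plus their grave, acute and circumflex forms).
VOWELS = set(
    "a\xe0\xe1\xe2" "e\xe8\xe9\xea" "i\xec\xed\xee"
    "o\xf2\xf3\xf4" "u\xf9\xfa\xfb"
)


def wb5a_break(text, pos):
    """Return True if the apostrophe-vowel rule places a break at `pos`."""
    if pos - 1 < 0 or pos >= len(text):
        return False
    # Only the ASCII apostrophe U+0027 is compared, as in the C snippet.
    return text[pos - 1] == "'" and text[pos].lower() in VOWELS


print(wb5a_break("l'avion", 2))       # ASCII apostrophe before "a"
print(wb5a_break("l\u2019avion", 2))  # RIGHT SINGLE QUOTATION MARK before "a"
```

The first call returns True and the second returns False, which is the behaviour described in point 1: the typographic apostrophe never triggers the rule.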

2- Unicode vowels are determined by the is_unicode_vowel() function, which amounts to this list:

a, à, á, â, e, è, é, ê, i, ì, í, î, o, ò, ó, ô, u, ù, ú, û

So the LATIN SMALL LIGATURE OE character œ (U+0153) is not considered a Unicode vowel:

    Py_LOCAL_INLINE(BOOL) is_unicode_vowel(Py_UCS4 ch) {
    #if PY_VERSION_HEX >= 0x03030000
        switch (Py_UNICODE_TOLOWER(ch)) {
    #else
        switch (Py_UNICODE_TOLOWER((Py_UNICODE)ch)) {
    #endif
        case 'a': case 0xE0: case 0xE1: case 0xE2:
        case 'e': case 0xE8: case 0xE9: case 0xEA:
        case 'i': case 0xEC: case 0xED: case 0xEE:
        case 'o': case 0xF2: case 0xF3: case 0xF4:
        case 'u': case 0xF9: case 0xFA: case 0xFB:
            return TRUE;
        default:
            return FALSE;
        }
    }
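The same table can be mirrored in Python to see which characters it covers. This is a rough sketch under stated assumptions: is_listed_vowel is a hypothetical name, and str.lower() stands in for Py_UNICODE_TOLOWER, which may behave differently on a few rare characters.

```python
import unicodedata

# Code points copied from the switch statement in is_unicode_vowel().
VOWEL_CODE_POINTS = {
    ord("a"), 0xE0, 0xE1, 0xE2,
    ord("e"), 0xE8, 0xE9, 0xEA,
    ord("i"), 0xEC, 0xED, 0xEE,
    ord("o"), 0xF2, 0xF3, 0xF4,
    ord("u"), 0xF9, 0xFA, 0xFB,
}


def is_listed_vowel(ch):
    """Approximate the C function: lowercase, then look up the code point."""
    lowered = ch.lower()  # stand-in for Py_UNICODE_TOLOWER (an assumption)
    return len(lowered) == 1 and ord(lowered) in VOWEL_CODE_POINTS


# e, E with acute, oe ligature, a with diaeresis
for ch in ("e", "\u00c9", "\u0153", "\u00e4"):
    print(ch, unicodedata.name(ch), is_listed_vowel(ch))
```

In this sketch the first two characters are reported as vowels, while œ (U+0153) and ä (U+00E4) are not, since neither code point appears in the table.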

This bug has been fixed as of regex 2016.08.27, following a bug report. [_regex.c:#1668]
