Removing text containing non-english character


I would check out the related Stack Overflow post "Regular expression to match non-English characters?", which does the same thing in JavaScript.

To translate this into R, you could do (to match non-ASCII):

    res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
    res
    # A tibble: 1 × 2
    #        Name  Rank
    #       <chr> <dbl>
    #1 apple firm     1

And the same range written with Unicode escapes, per that same SO post:

    res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
    res
    # A tibble: 1 × 2
    #        Name  Rank
    #       <chr> <dbl>
    #1 apple firm     1

Note: we had to take out the NUL character for this to work, because R character strings cannot contain an embedded NUL. So instead of starting the range at \u0000 or \x00, we start at \u0001 or \x01.
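To try the filter end to end, here is a minimal reproducible sketch. The question's data is not shown in this thread, so the data frame below is a made-up stand-in (the two non-ASCII names are assumptions). Only the grepl pattern comes from the answer above.

```r
# Hypothetical stand-in for the question's data (the real rows are not shown here).
data <- data.frame(
  Name = c("apple firm", "\u82f9\u679c firm", "caf\u00e9 firm"),
  Rank = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# TRUE where Name contains at least one character outside the range \x01-\x7F.
has_non_ascii <- grepl("[^\x01-\x7F]", data$Name)

# Keep only the rows whose Name is pure ASCII.
res <- data[!has_non_ascii, ]
res$Name
```

Indexing with the negated logical vector directly gives the same rows as wrapping it in which(), as long as Name has no NA values.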


The stringi package has the convenience function stri_enc_isascii:

    library(stringi)
    stri_enc_isascii(data$Name)
    # [1]  TRUE FALSE FALSE

As the name suggests,

the function checks whether all bytes in a string are in the [ASCII] set 1,2,...,127 (from ?stri_enc_isascii).
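To make the byte-level check concrete, here is a rough base R sketch of the same idea. It is not stringi's implementation: is_ascii_bytes is a hypothetical helper, the sample names are made up, and it does not treat NA specially.

```r
# Base R approximation of the check: every byte of the string must be <= 127.
# is_ascii_bytes is a hypothetical helper, not part of stringi.
is_ascii_bytes <- function(x) {
  vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) <= 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Made-up sample names: one pure ASCII, two containing non-ASCII characters.
names_vec <- c("apple firm", "\u82f9\u679c firm", "caf\u00e9 firm")
is_ascii_bytes(names_vec)
```

The resulting logical vector can be used directly as a row index, for example data[is_ascii_bytes(data$Name), ].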


An alternative to regex would be to use iconv and then filter for non-NA entries:

    library(dplyr)
    data <- data %>%
      mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
      filter(!is.na(Name))

What happens in the mutate statement is that the strings are converted from latin1 (also known as ISO 8859-1) to ASCII. When a string contains a character that has no ASCII representation, it cannot be converted and becomes NA, so the filter step then drops that row.