Removing text containing non-English characters
I would check out this related Stack Overflow post for doing the same thing in JavaScript: "Regular expression to match non-English characters?"
To translate this into R, you could do (to match non-ASCII):
    res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)), ]
    res
    # A tibble: 1 × 2
    #   Name        Rank
    #   <chr>      <dbl>
    #1 apple firm     1
And the same range written with Unicode escapes, per that same SO post:
    res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)), ]
    res
    # A tibble: 1 × 2
    #   Name        Rank
    #   <chr>      <dbl>
    #1 apple firm     1
Note - we had to take out the NUL character for this to work, because R strings cannot contain an embedded NUL. So instead of starting at \u0000 or \x00 we start at \u0001 or \x01.
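For anyone who wants to try the regex approach end to end, here is a minimal sketch in base R. The original `data` is not shown in the post, so the three rows below are hypothetical sample values, and `has_non_ascii` is just an illustrative name. The `+` quantifier is dropped because a single non-ASCII character is enough for `grepl` to return TRUE.

```r
# Hypothetical sample data; the original `data` is not shown in the post.
data <- data.frame(
  Name = c("apple firm", "\u82f9\u679c firm", "caf\u00e9 firm"),
  Rank = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# TRUE for rows whose Name contains at least one character outside \x01-\x7F.
has_non_ascii <- grepl("[^\x01-\x7F]", data$Name)

# Keep only the rows whose Name is pure ASCII.
res <- data[!has_non_ascii, ]
print(res)
```

Logical indexing with `!has_non_ascii` does the same job as `which(!...)` here, since `grepl` never returns NA for non-missing input.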
The stringi package has the convenience function stri_enc_isascii:
    library(stringi)
    stri_enc_isascii(data$Name)
    # [1]  TRUE FALSE FALSE
As the name suggests, "the function checks whether all bytes in a string are in the [ASCII] set 1,2,...,127" (from ?stri_enc_isascii).
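Since stri_enc_isascii returns a logical vector, it can be used directly to subset the rows. A short sketch, assuming the stringi package is installed; the sample rows are hypothetical because the original `data` is not shown, and `ascii_only` is an illustrative name.

```r
library(stringi)

# Hypothetical sample data; the original `data` is not shown in the post.
data <- data.frame(
  Name = c("apple firm", "\u82f9\u679c firm", "caf\u00e9 firm"),
  Rank = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per element: TRUE when every byte is ASCII.
ascii_only <- stri_enc_isascii(data$Name)

# Keep only the rows flagged as ASCII.
res <- data[ascii_only, ]
print(res)
```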
An alternative to regex would be to use iconv and then filter for non-NA entries:
    library(dplyr)
    data <- data %>%
      mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
      filter(!is.na(Name))
What happens in the mutate statement is that the strings are converted from latin1 (also known as ISO 8859-1) to ASCII. When a string contains a character that has no ASCII representation, it cannot be converted and becomes NA, so the filter step then drops that row.
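To see the NA behaviour in isolation, here is a small base R sketch on a plain character vector. The two names are hypothetical, and the second is written with a `\x` escape so its bytes are latin1. It relies on the `sub` argument of iconv being left at its default of NA; the exact behaviour for non-convertible input can depend on the platform's iconv implementation, so treat this as an illustration rather than a guarantee.

```r
# Hypothetical names; "\xe9" is the latin1 byte for e-acute.
name <- c("apple firm", "caf\xe9 firm")

# With sub = NA (the default), elements that cannot be converted become NA.
converted <- iconv(name, from = "latin1", to = "ASCII")

# Keep only the elements that converted cleanly.
keep <- !is.na(converted)
print(converted[keep])
```

If the goal is to strip the offending characters rather than drop the whole string, iconv also accepts a `sub` replacement string, for example `sub = ""`.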