Removing text containing non-English characters

r text

I would check out this related Stack Overflow post on doing the same thing in JavaScript: "Regular expression to match non-English characters?"

To translate this into R, you could do (to match non-ASCII):

    res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)), ]
    res
    # A tibble: 1 × 2
    #        Name  Rank
    #       <chr> <dbl>
    #1 apple firm     1

And the same match written with Unicode escapes, per that same SO post:

    res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)), ]
    res
    # A tibble: 1 × 2
    #        Name  Rank
    #       <chr> <dbl>
    #1 apple firm     1

Note: we had to leave out the NUL character for this to work, because R strings cannot contain an embedded NUL. So instead of starting the range at \u0000 or \x00, we start at \u0001 or \x01.
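A self-contained sketch of the regex filter above may help. The sample rows are hypothetical (only the column names Name and Rank come from the post), and the non-ASCII names are written with \u escapes so the script itself stays ASCII:

```r
# Hypothetical sample data; only the column names Name and Rank come from the post.
data <- data.frame(
  Name = c("apple firm", "\u82f9\u679c firm", "caf\u00e9 firm"),
  Rank = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# TRUE for every row whose Name contains at least one character
# outside the range \x01-\x7F (i.e. a non-ASCII character).
has_non_ascii <- grepl("[^\x01-\x7F]", data$Name)

# Keep only the rows whose Name is made up entirely of ASCII characters.
res <- data[!has_non_ascii, ]
```

Indexing with the logical vector directly gives the same rows as wrapping it in which(), as long as Name has no NA values.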


The stringi package has the convenience function stri_enc_isascii:

    library(stringi)
    stri_enc_isascii(data$Name)
    # [1]  TRUE FALSE FALSE

As the name suggests, "the function checks whether all bytes in a string are in the [ASCII] set 1,2,...,127" (from ?stri_enc_isascii).
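To actually drop the offending rows, the logical vector returned by stri_enc_isascii can be used as a row index. This is a minimal sketch, assuming the stringi package is installed and using hypothetical sample rows:

```r
library(stringi)

# Hypothetical sample data; only the column names Name and Rank come from the post.
data <- data.frame(
  Name = c("apple firm", "\u82f9\u679c firm", "caf\u00e9 firm"),
  Rank = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# TRUE where every byte of the string is in the ASCII range.
ascii_only <- stri_enc_isascii(data$Name)

# Keep only the all-ASCII rows.
res <- data[ascii_only, ]
```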


An alternative to regex would be to use iconv and then filter for non-NA entries:

    library(dplyr)
    data <- data %>%
      mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
      filter(!is.na(Name))

What happens in the mutate statement is that the strings are converted from latin1 (aka ISO 8859-1) to ASCII. When a string contains a character that has no ASCII equivalent, it cannot be converted and becomes NA. The filter step then drops those rows.
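The same idea can be sketched in base R without dplyr. The sample names are hypothetical, and from = "UTF-8" is an assumption made here because strings written with \u escapes are UTF-8 encoded; use whatever encoding your data is really in. The sketch also assumes an iconv implementation that returns NA for unconvertible input, which is R's documented default but can differ on some platforms:

```r
# Hypothetical sample names; the second one contains a non-ASCII character.
name <- c("apple firm", "caf\u00e9 firm")

# Convert to ASCII; the source encoding is stated explicitly (assumed UTF-8 here).
converted <- iconv(name, from = "UTF-8", to = "ASCII")

# Strings that cannot be represented in ASCII come back as NA.
keep <- !is.na(converted)

# Keep only the names that survived the conversion.
ascii_names <- name[keep]
```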
