
Removing non-ASCII characters from data files


To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")  # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
iconv(x, "latin1", "ASCII", sub = "")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII")))
# [1] TRUE
## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))
# [1] 1 2 3
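Since the question is about data files, the same iconv() call can be wrapped in a readLines()/writeLines() round trip to clean a whole file. A minimal sketch, assuming the input file is latin1-encoded (adjust the encoding argument to whatever your files actually use; the tempfile() here just stands in for a real path):

```r
## Sketch: strip non-ASCII characters from a whole file.
## Assumes latin1 input; adjust "latin1" to match your data.
path <- tempfile(fileext = ".txt")
writeLines(c("Ekstr\xf8m", "plain ascii"), path, useBytes = TRUE)

lines   <- readLines(path, encoding = "latin1")
cleaned <- iconv(lines, "latin1", "ASCII", sub = "")
writeLines(cleaned, path)

readLines(path)
# [1] "Ekstrm"      "plain ascii"
```

Swapping sub = "" for sub = "byte" instead inserts hex escapes, which is handy when you want to inspect what was removed rather than silently drop it.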


These days, a slightly better approach is to use the stringi package, which provides stri_trans_general() for general Unicode transliteration. This allows you to preserve the original text as much as possible:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"

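stringi can also replace the placeholder trick from earlier for detection: stri_enc_isascii() reports, element by element, whether a string is pure ASCII, with no substitution step needed. A small sketch:

```r
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")

## Which elements are pure ASCII?
stringi::stri_enc_isascii(x)
#> [1] FALSE FALSE FALSE  TRUE

## Indices of elements containing non-ASCII characters
which(!stringi::stri_enc_isascii(x))
#> [1] 1 2 3
```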

To remove all elements containing non-ASCII characters (borrowing code from @Hadley), you can use the xfun package together with filter() from dplyr:

library(dplyr)

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")
x %>%
  tibble(name = .) %>%
  filter(xfun::is_ascii(name))
#> # A tibble: 1 × 1
#>   name 
#>   <chr>
#> 1 alex 
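If you'd rather not pull in tibble/dplyr for this, the same filtering works in base R by subsetting on a logical vector. A sketch using a regular expression for the printable-ASCII range (note this would also drop strings containing tabs or other control characters):

```r
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")

## TRUE where an element contains anything outside printable ASCII (space..tilde)
has_non_ascii <- grepl("[^ -~]", x)
x[!has_non_ascii]
#> [1] "alex"
```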