convert HTML Character Entity Encoding in R


Unescape XML/HTML values using the xml2 package:

    unescape_xml <- function(str){
      xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
    }

    unescape_html <- function(str){
      xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
    }

Examples:

    unescape_xml("3 &lt; x &amp; x &gt; 9")
    # [1] "3 < x & x > 9"

    unescape_html("&euro; 2.99")
    # [1] "€ 2.99"


Update: this answer is outdated. Please check the answer based on the newer xml2 package instead.


Try something along the lines of:

    # load XML package
    library(XML)

    # Convenience function to convert html codes
    html2txt <- function(str) {
      xpathApply(htmlParse(str, asText=TRUE),
                 "//body//text()",
                 xmlValue)[[1]]
    }

    # html encoded string
    ( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
    # [1] "isn&apos;t"

    # converted string
    html2txt(x)
    # [1] "isn't"

UPDATE: Edited the html2txt() function so it applies to more situations


While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised and is therefore slow when applied to a large number of strings. It only works with a character vector of length one, so one has to use sapply for a longer character vector.

To demonstrate this, I first create a large character vector:

    set.seed(123)
    strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
    many_strings <- sample(strings, 10000, replace = TRUE)

And apply the function:

    unescape_html <- function(str) {
      xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
    }

    system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
    ##    user  system elapsed
    ##   2.327   0.000   2.326

    head(res)
    ## [1] "& ' >" "€ <"   "& ' >" "€ <"   "€ <"   "abcd"

It is much faster if all the strings in the character vector are combined into a single, large string, such that read_html() and xml_text() need only be used once. The strings can then easily be separated again using strsplit():

    unescape_html2 <- function(str){
      html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
      parsed <- xml2::xml_text(xml2::read_html(html))
      strsplit(parsed, "#_|", fixed = TRUE)[[1]]
    }

    system.time(res2 <- unescape_html2(many_strings))
    ##    user  system elapsed
    ##   0.011   0.000   0.010

    identical(res, res2)
    ## [1] TRUE

Of course, you need to be careful that the separator you use to join the strings in str ("#_|" in my example) does not appear anywhere in str. Otherwise, the large string will be split in the wrong places at the end and the result will no longer line up with the input.
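One way to guard against that pitfall is to check for the separator before joining and to compare lengths after splitting. The following is only a minimal sketch: unescape_html_checked() and its sep argument are hypothetical names introduced here for illustration, not part of xml2, and the body otherwise follows the same join, parse, split idea as unescape_html2().

```r
# Hypothetical wrapper (not part of xml2): same approach as unescape_html2(),
# plus two sanity checks around the separator.
unescape_html_checked <- function(str, sep = "#_|") {
  # Refuse to run if the separator already occurs in any input string
  if (any(grepl(sep, str, fixed = TRUE))) {
    stop("separator '", sep, "' occurs in the input; choose another one")
  }

  # Join all strings, parse once, then split again
  html <- paste0("<x>", paste0(str, collapse = sep), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  res <- strsplit(parsed, sep, fixed = TRUE)[[1]]

  # A length mismatch means the split did not recover the input layout,
  # e.g. because an entity decoded to the separator or a trailing empty
  # string was dropped by strsplit()
  if (length(res) != length(str)) {
    stop("expected ", length(str), " strings but got ", length(res))
  }
  res
}

# Example call, guarded so the block also runs where xml2 is not installed
if (requireNamespace("xml2", quietly = TRUE)) {
  print(unescape_html_checked(c("&amp;", "abcd")))
}
```

The length check is deliberately conservative: strsplit() drops a trailing empty string, so an input whose last element is "" raises an error here instead of silently returning a shorter vector.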