Serious Memory Leak When Iteratively Parsing XML Files


From the XML package's webpage, it seems that the author, Duncan Temple Lang, has quite extensively described certain memory management issues. See this page: "Memory Management in the XML Package".

Honestly, I'm not proficient in the details of what's going on here with your code and the package, but I think you'll either find the answer in that page, specifically in the section called "Problems", or in direct communication with Duncan Temple Lang.


Update 1. An idea that might work is to use the multicore and foreach packages, i.e. listResults = foreach(ix = 1:N) %dopar% {your processing; return(listElement)}. I think that for Windows you'll need doSMP, or maybe doRedis; under Linux, I use doMC. In any case, by parallelizing the loading, you'll get faster throughput. The reason I think memory usage may also improve is that forking R could lead to different memory cleanup, since each spawned process is killed when it completes. This isn't guaranteed to work, but it could address both the memory and the speed issues.

Note, though: doSMP has its own idiosyncrasies (i.e. you may still have some memory issues with it). There have been other Q&As on SO that mentioned some issues, but I'd still give it a shot.
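To make the foreach idea above concrete, here is a minimal sketch using doMC (so Linux/macOS only). The temporary directory, the three small XML files, and the //value XPath are all made up for illustration; substitute your own files and processing. I haven't verified that this avoids the leak in every setup, so treat it as a starting point only.

```r
library(foreach)
library(doMC)   # fork-based backend; not available on Windows
library(XML)

# Two forked workers; adjust to your machine
registerDoMC(cores = 2)

# Hypothetical input: three small XML files written to a temp directory
xml_dir <- tempfile("xml_demo_")
dir.create(xml_dir)
for (k in 1:3) {
  writeLines(sprintf("<root><value>%d</value></root>", k),
             file.path(xml_dir, sprintf("file%d.xml", k)))
}
files <- sort(list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE))

# Each file is parsed inside a forked worker. Only plain character
# vectors are sent back to the parent, so the parsed documents live and
# die inside the child processes.
listResults <- foreach(f = files) %dopar% {
  doc <- xmlParse(f)
  xpathSApply(doc, "//value", xmlValue)
}
```

The point of returning only the extracted values (and not the parsed document or its nodes) is that whatever memory libxml2 allocated stays in the worker and goes away when the worker exits.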


I've experienced similar issues with the XML package. The amount of memory used by R kept ballooning, to the point where my computer would crash. This answer solved my problem: I just set addFinalizer = F.

Here's a minimal reproducible example:

    library(tidyverse)
    library(XML)

    url <- "https://en.wikipedia.org/wiki/Main_Page"
    httr::GET(url) %>% base::saveRDS("html.rds")

Memory usage before running anything else:

[Image: memory usage]


Memory usage after running the following:

    for (i in 1:10000) {
        base::readRDS(file = "html.rds") %>%
            XML::htmlParse(., asText = TRUE) %>%
            XML::xpathSApply(., path = "//h1", xmlValue, addFinalizer = F)
    }

[Image: memory usage]


Memory usage after removing addFinalizer = F (i.e. using the default):

    for (i in 1:10000) {
        base::readRDS(file = "html.rds") %>%
            XML::htmlParse(., asText = TRUE) %>%
            XML::xpathSApply(., path = "//h1", xmlValue)
    }

[Image: memory usage]


@Rappster My R doesn't crash when I first check that the XML doc exists and then call the C function for releasing the memory.

    for (i in 1:1000) {
        pXML <- xmlParse(file)
        if (exists("pXML")) {
            .Call("RS_XML_forceFreeDoc", pXML)
        }
    }
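As a variation on the loop above, the XML package also exports a free() function for releasing a parsed document explicitly, which avoids calling the C routine by name. The sketch below combines it with addFinalizer = FALSE so that neither the document nor the nodes get automatic finalizers and the document is released by hand. The temp file and the //item XPath are made up for illustration, and I can't promise this behaves the same across all versions of the package.

```r
library(XML)

# Hypothetical input file, created only for this example
xml_file <- tempfile(fileext = ".xml")
writeLines("<root><item>a</item><item>b</item></root>", xml_file)

values <- NULL
for (i in seq_len(100)) {
  # No automatic finalizer on the document: we take over cleanup ourselves
  doc <- xmlParse(xml_file, addFinalizer = FALSE)

  # Extract plain character values; no finalizers on the temporary nodes
  values <- xpathSApply(doc, "//item", xmlValue, addFinalizer = FALSE)

  # Release the C-level document once nothing from it is needed any more
  free(doc)
}
```

Only copy out plain R values (strings, numbers) before calling free(); any node objects still pointing into the document are no longer valid afterwards.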