Serious Memory Leak When Iteratively Parsing XML Files
From the XML package's webpage, it seems that the author, Duncan Temple Lang, has described certain memory management issues quite extensively. See this page: "Memory Management in the XML Package".
Honestly, I'm not familiar with the details of how your code interacts with the package, but I think you'll find the answer either on that page, specifically in the section called "Problems", or in direct communication with Duncan Temple Lang.
Update 1. An idea that might work is to use the multicore and foreach packages, i.e. listResults = foreach(ix = 1:N) %dopar% {your processing; return(listElement)}. I think that for Windows you'll need doSMP, or maybe doRedis; under Linux, I use doMC. In any case, by parallelizing the loading, you'll get faster throughput. The reason I think memory usage may also benefit is that forking R could lead to different memory cleanup, since each spawned process is killed when it completes. This isn't guaranteed to work, but it could address both memory and speed issues.
Note, though: doSMP has its own idiosyncrasies (i.e. you may still have some memory issues with it). There have been other Q&As on SO that mentioned some issues, but I'd still give it a shot.
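As a rough sketch of the forking idea above: this version uses parallel::mclapply from base R rather than foreach with doMC, so it is a substitute for the setup described, not the same code. The helper name parse_in_children, the parse_one argument, and the temporary file are all made up for illustration. The worker should return plain R values (character vectors, lists), since parsed XML documents are external pointers that won't survive the trip back from a child process.

```r
library(parallel)  # ships with base R

# Hypothetical helper: run parse_one() on each file in its own forked child.
parse_in_children <- function(files, parse_one, cores = 2L) {
  # Forking is not available on Windows, so fall back to a single core there.
  if (.Platform$OS.type == "windows") cores <- 1L
  # mc.preschedule = FALSE forks a fresh child per file; each child exits
  # once its result has been sent back to the parent session.
  mclapply(files, parse_one, mc.cores = cores, mc.preschedule = FALSE)
}

# Stand-in workload so the sketch runs without real XML files on disk.
tmp <- tempfile(fileext = ".xml")
writeLines(c("<root>", "<item>a</item>", "</root>"), tmp)
files <- rep(tmp, 4)

# Replace the anonymous function with your own parsing code.
listResults <- parse_in_children(files, function(f) length(readLines(f)))
str(listResults)
```

Whether this actually reduces memory growth in the parent session is not guaranteed; it only ensures that whatever the parser allocates lives in a short-lived child process.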
I've experienced similar issues with the XML package. The amount of memory used by R kept ballooning, to the point where my computer would crash. This answer solved my problem: I just set addFinalizer = F.
Here's a minimal reproducible example:
library(tidyverse)
library(XML)
url <- "https://en.wikipedia.org/wiki/Main_Page"
httr::GET(url) %>% base::saveRDS("html.rds")
Memory usage before running anything else:
Memory usage after running the following:
for (i in 1:10000) {
  base::readRDS(file = "html.rds") %>%
    XML::htmlParse(., asText = TRUE) %>%
    XML::xpathSApply(., path = "//h1", xmlValue, addFinalizer = F)
}
Memory usage after removing addFinalizer = F (i.e. falling back to the default):
for (i in 1:10000) {
  base::readRDS(file = "html.rds") %>%
    XML::htmlParse(., asText = TRUE) %>%
    XML::xpathSApply(., path = "//h1", xmlValue)
}
@Rappster My R doesn't crash when I first check that the XML doc exists and then call the C function for releasing the memory.
for (i in 1:1000) {
  pXML <- xmlParse(file)
  if (exists("pXML")) {
    .Call("RS_XML_forceFreeDoc", pXML)
  }
}
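A possible variation on the same idea, as a sketch only: the XML package also exports free(), which releases the C-level document without calling the internal routine by name. The helper extract_headings and the inline HTML string below are made up for illustration, and I haven't measured whether this behaves any differently from the .Call() version on large inputs.

```r
library(XML)

# Hypothetical helper: parse one HTML string, return plain character values,
# and release the C-level document before returning.
extract_headings <- function(html_text, path = "//h1") {
  doc <- htmlParse(html_text, asText = TRUE)
  # Release the document when the function exits, even if the XPath step fails.
  on.exit(free(doc), add = TRUE)
  # addFinalizer = FALSE: do not register finalizers on the matched nodes.
  xpathSApply(doc, path, xmlValue, addFinalizer = FALSE)
}

# Small inline document so the sketch runs without network access.
html_text <- "<html><body><h1>Main Page</h1><p>text</p></body></html>"

for (i in 1:1000) {
  headings <- extract_headings(html_text)
}
print(headings)
```

Only plain strings leave the function, so nothing still refers to the document after it has been freed. Don't keep node objects from a document you have released this way.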