
I have a number of XML files containing data I would like to analyse. Each XML file contains data in a format similar to this:

<?xml version='1.0' encoding='UTF-8'?>
<build>
 <actions>
 ...
 </actions>
 <queueId>1276</queueId>
 <timestamp>1447062398490</timestamp>
 <startTime>1447062398538</startTime>
 <result>ABORTED</result>
 <duration>539722</duration>
 <charset>UTF-8</charset>
 <keepLog>false</keepLog>
 <builtOn></builtOn>
 <workspace>/var/lib/jenkins/workspace/clean-caches</workspace>
 <hudsonVersion>1.624</hudsonVersion>
 <scm class="hudson.scm.NullChangeLogParser"/>
 <culprits class="com.google.common.collect.EmptyImmutableSortedSet"/>
</build>

These are build.xml files generated by the continuous integration server Jenkins. The files themselves lack some important data that I would like, such as the Jenkins job name and the number of the build that created the XML. The job and build IDs are instead encoded in the path of each file, like .\jenkins\jobs\${JOB_NAME}\builds\${BUILD_NUMBER}\build.xml.

I would like to create a data frame containing the job name, build number, duration, and result.

My code to achieve this is the following:

library(XML)

# Find every build.xml under the Jenkins home directory
filenames <- list.files("C:/Users/jenkins/", recursive = TRUE,
                        full.names = TRUE, pattern = "build.xml")

# The job name is the fourth path component from the end,
# the build number the second from the end
job <- unlist(lapply(filenames, function(f) {
  s <- unlist(strsplit(f, split = .Platform$file.sep))
  s[length(s) - 3]
}))
build <- unlist(lapply(filenames, function(f) {
  s <- unlist(strsplit(f, split = .Platform$file.sep))
  s[length(s) - 1]
}))

# Parse every file once per field to pull out duration and result
duration <- unlist(lapply(filenames, function(f) {
  xml <- xmlParse(f)
  xpathSApply(xml, "//duration", xmlValue)
}))
result <- unlist(lapply(filenames, function(f) {
  xml <- xmlParse(f)
  xpathSApply(xml, "//result", xmlValue)
}))

build.data <- data.frame(job, build, result, duration)

This gives me a data frame that looks like this:

           job build  result duration
1 clean-caches    37 SUCCESS   248701
2 clean-caches    38 FAILURE  1200049
3 clean-caches    39 FAILURE  1200060
4 clean-caches    40 FAILURE  1200123
5 clean-caches    41 SUCCESS   358024
6 clean-caches    42 SUCCESS   130462

This works, but I have serious concerns about it from both a style and a performance point of view. I'm completely new to R, so I don't know what would be a nicer way to do this.

My concerns:

  • Repeated code: the code blocks that generate the job and build vectors are identical, as are those for duration and result. If I decide to import more nodes from the XML, I'll end up repeating even more code.

  • Several iterations are made over my list of files. There are thousands of XML files, and this number will likely grow. As above, if I wish to extract more data from the XML, I must add more iterations.
asked Jan 11, 2016 at 16:46

1 Answer

With only a few XML files to read, you could address your concerns with something like the following, where each file is read only once but all of the documents are held in memory at the same time:

# Split each file's directory path and reverse the components,
# so the build number comes first and the job name third
subdirs <- strsplit(dirname(filenames), split = .Platform$file.sep)
subdirs <- lapply(subdirs, rev)
job   <- sapply(subdirs, `[[`, 3)
build <- sapply(subdirs, `[[`, 1)

# Parse every document once, then query each with XPath
xmls     <- lapply(filenames, xmlParse)
duration <- sapply(xmls, xpathSApply, "//duration", xmlValue)
result   <- sapply(xmls, xpathSApply, "//result", xmlValue)

build.data <- data.frame(job, build, result, duration)

With thousands of files, though, it makes more sense to process the files one by one, keeping only the useful information before moving on to the next file. It also makes sense to write a function that processes a single file. It could be:

build.info <- function(file, xml_fields = c("duration", "result")) {
  res <- list()
  # process the file path: after reversing, the build number is the
  # first directory component and the job name the third
  subdirs <- rev(unlist(strsplit(dirname(file),
                                 split = .Platform$file.sep)))
  res$job   <- subdirs[[3]]
  res$build <- subdirs[[1]]
  # process the XML data
  doc   <- xmlTreeParse(file)
  build <- doc$doc$children$build
  res[xml_fields] <- lapply(build[xml_fields], xmlValue)
  # return as a one-row data.frame
  as.data.frame(res)
}

Note how the function returns a one-row data.frame. You can then call the function on all files via lapply and bind all the outputs together:

build.data <- do.call(rbind, lapply(filenames, build.info))
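
As an aside (an addition beyond the original answer): with thousands of files, do.call(rbind, ...) on data frames can become a bottleneck, because each data.frame rbind does per-column checks and copies. A minimal sketch of an alternative binding step, assuming the data.table package is installed:

library(data.table)

# rbindlist binds a list of data.frames in a single pass;
# setDF converts the result back to a plain data.frame
build.data <- setDF(rbindlist(lapply(filenames, build.info)))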

With a few changes, you can write a more general function that takes one or more files and does the binding itself (much like file.info does):

build.info <- function(file, xml_fields = c("duration", "result")) {
  stopifnot(length(file) > 0L)
  if (length(file) == 1L) {
    res <- list()
    # process the file path
    subdirs <- rev(unlist(strsplit(dirname(file),
                                   split = .Platform$file.sep)))
    res$job   <- subdirs[[3]]
    res$build <- subdirs[[1]]
    # process the XML data
    doc   <- xmlTreeParse(file)
    build <- doc$doc$children$build
    res[xml_fields] <- lapply(build[xml_fields], xmlValue)
    # return a one-row data.frame
    as.data.frame(res)
  } else {
    # several files: recurse on each and bind the rows together
    do.call(rbind, lapply(file, build.info))
  }
}
build.data <- build.info(filenames)
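
One caveat (again an addition, not from the answer above): the columns extracted this way are all text, including duration, and depending on your R version's stringsAsFactors default they may arrive as factors. A minimal sketch of the conversion you will likely want before analysis:

# Coerce the text columns to numeric types; going through
# as.character first guards against factor columns
build.data$build    <- as.integer(as.character(build.data$build))
build.data$duration <- as.numeric(as.character(build.data$duration))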
answered Jan 12, 2016 at 2:38
  • That's great. It really helps me understand how to use the different methods of array access: [], [[]], and $. I assume that you mean to refer to xml_fields in the body of the function, not fields? – Commented Jan 12, 2016 at 11:02
