I have a number of XML files containing data I would like to analyse. Each XML contains data in a format similar to this:
<?xml version='1.0' encoding='UTF-8'?>
<build>
<actions>
...
</actions>
<queueId>1276</queueId>
<timestamp>1447062398490</timestamp>
<startTime>1447062398538</startTime>
<result>ABORTED</result>
<duration>539722</duration>
<charset>UTF-8</charset>
<keepLog>false</keepLog>
<builtOn></builtOn>
<workspace>/var/lib/jenkins/workspace/clean-caches</workspace>
<hudsonVersion>1.624</hudsonVersion>
<scm class="hudson.scm.NullChangeLogParser"/>
<culprits class="com.google.common.collect.EmptyImmutableSortedSet"/>
</build>
These are build.xml files generated by the continuous integration server, Jenkins. The files themselves lack some important data that I would like, such as the Jenkins job name or the build number that produced the XML. The job name and build number are encoded into the path of each file, like .\jenkins\jobs\${JOB_NAME}\builds\${BUILD_NUMBER}\build.xml
I would like to create a data frame containing job name, build number, duration, and result.
My code to achieve this is the following:
library(XML)
filenames <- list.files("C:/Users/jenkins/", recursive=TRUE, full.names=TRUE, pattern="build.xml")
job <- unlist(lapply(filenames, function(f) {
    s <- unlist(strsplit(f, split=.Platform$file.sep))
    s[length(s) - 3]
}))
build <- unlist(lapply(filenames, function(f) {
    s <- unlist(strsplit(f, split=.Platform$file.sep))
    s[length(s) - 1]
}))
duration <- unlist(lapply(filenames, function(f) {
    xml <- xmlParse(f)
    xpathSApply(xml, "//duration", xmlValue)
}))
result <- unlist(lapply(filenames, function(f) {
    xml <- xmlParse(f)
    x <- xpathSApply(xml, "//result", xmlValue)
    return(x)
}))
build.data <- data.frame(job, build, result, duration)
Which gives me a data frame that looks like this:
           job build  result duration
1 clean-caches    37 SUCCESS   248701
2 clean-caches    38 FAILURE  1200049
3 clean-caches    39 FAILURE  1200060
4 clean-caches    40 FAILURE  1200123
5 clean-caches    41 SUCCESS   358024
6 clean-caches    42 SUCCESS   130462
This works, but I have serious concerns about it from both a style and a performance point of view. I'm completely new to R, so I don't know what would be a nicer way to do this.
My concerns:
- Repeated code: the code blocks to generate the job and build vectors are identical. Same for duration and result. If I decide to import more nodes from the XML, I'll end up repeating even more code.
- Several iterations must be made over my list of files. There are thousands of XML files, and this number will likely grow. As above, if I wish to extract more data from the XML, I must add more iterations.
1 Answer
With only a few XML files to read, you could have done something like the following to address your concerns: each file is read only once, but they are all loaded into memory at the same time:
subdirs <- strsplit(dirname(filenames),
                    split = .Platform$file.sep)
subdirs <- lapply(subdirs, rev)
job <- sapply(subdirs, `[[`, 3)
build <- sapply(subdirs, `[[`, 1)
xmls <- lapply(filenames, xmlParse)
duration <- sapply(xmls, xpathSApply, "//duration", xmlValue)
result <- sapply(xmls, xpathSApply, "//result", xmlValue)
build.data <- data.frame(job, build, result, duration)
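To see what the reversed-path indexing above is doing, here is a minimal sketch using a made-up path that follows the Jenkins layout described in the question (not one of the actual files):

```r
# Hypothetical path in the layout jobs/${JOB_NAME}/builds/${BUILD_NUMBER}/build.xml
f <- "jenkins/jobs/clean-caches/builds/42/build.xml"

# Drop the file name, split the directory path, and reverse it so that
# the deepest directory comes first
parts <- rev(unlist(strsplit(dirname(f), split = "/")))
parts
# [1] "42"           "builds"       "clean-caches" "jobs"         "jenkins"

build <- parts[[1]]   # deepest directory: the build number
job   <- parts[[3]]   # two levels up: the job name
```

The split is done on "/" here because the example path uses forward slashes; the answer's code uses .Platform$file.sep to stay portable.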
With thousands of files though, it makes more sense to process the files one by one and only keep the useful information before moving from one file to the next. It also makes sense to write a function to process each file. It could be:
build.info <- function(file, xml_fields = c("duration", "result")) {
    res <- list()
    # process filepath
    subdirs <- rev(unlist(strsplit(dirname(file),
                                   split = .Platform$file.sep)))
    res$job <- subdirs[[3]]
    res$build <- subdirs[[1]]
    # process xml data
    doc <- xmlTreeParse(file)
    build <- doc$doc$children$build
    res[xml_fields] <- lapply(build[xml_fields], xmlValue)
    # return as a data.frame
    as.data.frame(res)
}
Note how the function returns a one-row data.frame. You can then call the function on all files via lapply and bind all the outputs together:
build.data <- do.call(rbind, lapply(filenames, build.info))
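The do.call(rbind, ...) idiom can be illustrated with toy data: lapply produces a list of one-row data frames, and do.call passes the whole list to rbind as separate arguments, stacking them into a single data frame.

```r
# Each element of the list is a one-row data.frame, mimicking what
# build.info() returns for a single file
rows <- lapply(1:3, function(i) data.frame(build = i, duration = i * 100))

# do.call(rbind, rows) is equivalent to rbind(rows[[1]], rows[[2]], rows[[3]])
combined <- do.call(rbind, rows)
combined
#   build duration
# 1     1      100
# 2     2      200
# 3     3      300
```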
With a few changes, you can write a more general function that takes one or more files and does the binding itself (as file.info does):
build.info <- function(file, xml_fields = c("duration", "result")) {
    stopifnot(length(file) > 0L)
    if (length(file) == 1L) {
        res <- list()
        # process filepath
        subdirs <- rev(unlist(strsplit(dirname(file),
                                       split = .Platform$file.sep)))
        res$job <- subdirs[[3]]
        res$build <- subdirs[[1]]
        # process xml data
        doc <- xmlTreeParse(file)
        build <- doc$doc$children$build
        res[xml_fields] <- lapply(build[xml_fields], xmlValue)
        # return data.frame
        as.data.frame(res)
    } else {
        do.call(rbind, lapply(file, build.info))
    }
}
build.data <- build.info(filenames)
That's great. It really helps me understand how to use the different methods of array access: [], [[]], and $. I assume that you mean to refer to xml_fields in the body of the function, not fields? – laffoyb, Jan 12, 2016 at 11:02