
I have a number of XML files containing data I would like to analyse. Each XML file contains data in a format similar to this:

<?xml version='1.0' encoding='UTF-8'?>
<build>
 <actions>
 ...
 </actions>
 <queueId>1276</queueId>
 <timestamp>1447062398490</timestamp>
 <startTime>1447062398538</startTime>
 <result>ABORTED</result>
 <duration>539722</duration>
 <charset>UTF-8</charset>
 <keepLog>false</keepLog>
 <builtOn></builtOn>
 <workspace>/var/lib/jenkins/workspace/clean-caches</workspace>
 <hudsonVersion>1.624</hudsonVersion>
 <scm class="hudson.scm.NullChangeLogParser"/>
 <culprits class="com.google.common.collect.EmptyImmutableSortedSet"/>
</build>

These are build.xml files generated by the continuous integration server Jenkins. The files themselves lack some important data that I would like, such as the Jenkins job name and the number of the build that created the XML. The job and build IDs are instead encoded in the path of each file, like .\jenkins\jobs\${JOB_NAME}\builds\${BUILD_NUMBER}\build.xml.

I would like to create a data frame containing the job name, build number, duration, and result.

My code to achieve this is the following:

library(XML)

# Find every build.xml under the Jenkins home directory
filenames <- list.files("C:/Users/jenkins/", recursive = TRUE,
                        full.names = TRUE, pattern = "build.xml")

# The job name is the fourth path component from the end,
# the build number the second from the end
job <- unlist(lapply(filenames, function(f) {
  s <- unlist(strsplit(f, split = .Platform$file.sep))
  s[length(s) - 3]
}))
build <- unlist(lapply(filenames, function(f) {
  s <- unlist(strsplit(f, split = .Platform$file.sep))
  s[length(s) - 1]
}))

# Parse every file once per field to pull out duration and result
duration <- unlist(lapply(filenames, function(f) {
  xml <- xmlParse(f)
  xpathSApply(xml, "//duration", xmlValue)
}))
result <- unlist(lapply(filenames, function(f) {
  xml <- xmlParse(f)
  xpathSApply(xml, "//result", xmlValue)
}))

build.data <- data.frame(job, build, result, duration)

This gives me a data frame that looks like this:

           job build  result duration
1 clean-caches    37 SUCCESS   248701
2 clean-caches    38 FAILURE  1200049
3 clean-caches    39 FAILURE  1200060
4 clean-caches    40 FAILURE  1200123
5 clean-caches    41 SUCCESS   358024
6 clean-caches    42 SUCCESS   130462

This works, but I have serious concerns about it from both a style and a performance point of view. I'm completely new to R, so I don't know what would be a nicer way to do this.

My concerns:

  • Repeated code: the code blocks that generate the job and build vectors are identical, as are those for duration and result. If I decide to import more nodes from the XML, I'll end up repeating even more code.

  • Several iterations are made over my list of files. There are thousands of XML files, and this number will likely grow. As above, if I wish to extract more data from the XML, I must add more iterations.
asked Jan 11, 2016 at 16:46

1 Answer

With only a few XML files to read, you could address your concerns with something like the following, where each file is read only once but all of the documents are held in memory at the same time:

# Split each file's directory path and reverse the components,
# so the build number comes first and the job name third
subdirs <- strsplit(dirname(filenames), split = .Platform$file.sep)
subdirs <- lapply(subdirs, rev)
job   <- sapply(subdirs, `[[`, 3)
build <- sapply(subdirs, `[[`, 1)

# Parse every document once, then query each with XPath
xmls     <- lapply(filenames, xmlParse)
duration <- sapply(xmls, xpathSApply, "//duration", xmlValue)
result   <- sapply(xmls, xpathSApply, "//result", xmlValue)

build.data <- data.frame(job, build, result, duration)

With thousands of files, though, it makes more sense to process the files one by one, keeping only the useful information before moving on to the next file. It also makes sense to write a function that processes a single file. It could be:

build.info <- function(file, xml_fields = c("duration", "result")) {
  res <- list()
  # process the file path: after reversing, the build number is the
  # first directory component and the job name the third
  subdirs <- rev(unlist(strsplit(dirname(file),
                                 split = .Platform$file.sep)))
  res$job   <- subdirs[[3]]
  res$build <- subdirs[[1]]
  # process the XML data
  doc   <- xmlTreeParse(file)
  build <- doc$doc$children$build
  res[xml_fields] <- lapply(build[xml_fields], xmlValue)
  # return as a one-row data.frame
  as.data.frame(res)
}

Note how the function returns a one-row data.frame. You can then call the function on all files via lapply and bind all the outputs together:

build.data <- do.call(rbind, lapply(filenames, build.info))
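
As an aside (an addition beyond the original answer): with thousands of files, do.call(rbind, ...) on data frames can become a bottleneck, because each data.frame rbind does per-column checks and copies. A minimal sketch of an alternative binding step, assuming the data.table package is installed:

library(data.table)

# rbindlist binds a list of data.frames in a single pass;
# setDF converts the result back to a plain data.frame
build.data <- setDF(rbindlist(lapply(filenames, build.info)))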

With a few changes, you can write a more general function that takes one or more files and does the binding itself (much like file.info does):

build.info <- function(file, xml_fields = c("duration", "result")) {
  stopifnot(length(file) > 0L)
  if (length(file) == 1L) {
    res <- list()
    # process the file path
    subdirs <- rev(unlist(strsplit(dirname(file),
                                   split = .Platform$file.sep)))
    res$job   <- subdirs[[3]]
    res$build <- subdirs[[1]]
    # process the XML data
    doc   <- xmlTreeParse(file)
    build <- doc$doc$children$build
    res[xml_fields] <- lapply(build[xml_fields], xmlValue)
    # return a one-row data.frame
    as.data.frame(res)
  } else {
    # several files: recurse on each and bind the rows together
    do.call(rbind, lapply(file, build.info))
  }
}
build.data <- build.info(filenames)
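
One caveat (again an addition, not from the answer above): the columns extracted this way are all text, including duration, and depending on your R version's stringsAsFactors default they may arrive as factors. A minimal sketch of the conversion you will likely want before analysis:

# Coerce the text columns to numeric types; going through
# as.character first guards against factor columns
build.data$build    <- as.integer(as.character(build.data$build))
build.data$duration <- as.numeric(as.character(build.data$duration))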
answered Jan 12, 2016 at 2:38
  • That's great. It really helps me understand how to use the different methods of array access: [], [[]], and $. I assume that you mean to refer to xml_fields in the body of the function, not fields? – Commented Jan 12, 2016 at 11:02
