Parse dblp XML and output sums of publications grouped by year and type

Question 1

The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
 <article mdate="2017-05-28" key="journals/acta/Saxena96">
 <author>Sanjeev Saxena</author>
 <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
 <pages>607-619</pages>
 <year>1996</year>
 <volume>33</volume>
 <journal>Acta Inf.</journal>
 <number>7</number>
 <url>db/journals/acta/acta33.html#Saxena96</url>
 <ee>https://doi.org/10.1007/BF03036466</ee>
 </article>
 <article mdate="2017-05-28" key="journals/acta/Simon83">
 <author>Hans Ulrich Simon</author>
 <title>Pattern Matching in Trees and Nets.</title>
 <pages>227-248</pages>
 <year>1983</year>
 <volume>20</volume>
 <journal>Acta Inf.</journal>
 <url>db/journals/acta/acta20.html#Simon83</url>
 <ee>https://doi.org/10.1007/BF01257084</ee>
 </article>
 <article mdate="2017-05-28" key="journals/acta/GoodmanS83">
 <author>Nathan Goodman</author>
 <author>Oded Shmueli</author>
 <title>NP-complete Problems Simplified on Tree Schemas.</title>
 <pages>171-178</pages>
 <year>1983</year>
 <volume>20</volume>
 <journal>Acta Inf.</journal>
 <url>db/journals/acta/acta20.html#GoodmanS83</url>
 <ee>https://doi.org/10.1007/BF00289414</ee>
 </article>
</dblp>

The XML has multiple publication types denoted by the title of the element (i.e. proceedings, book, phdthesis) and for each of which I have defined a separate struct in my program:

package main
import (
 "compress/gzip"
 "encoding/csv"
 "encoding/xml"
 "fmt"
 "io"
 "log"
 "os"
 "sort"
 "strconv"
 "time"
 "golang.org/x/text/encoding/charmap"
)
// Dblp contains the array of articles in the dblp xml file
type Dblp struct {
 XMLName xml.Name `xml:"dblp"`
 Dblp []Article
}
// Metadata contains the fields shared by all structs
type Metadata struct {
 Key string `xml:"key,attr"` // not currently in use
 Year string `xml:"year"`
 Author string `xml:"author"` // not currently in use
 Title string `xml:"title"` // not currently in use
}
// Article struct and the following structs contain the elements we want to parse and they "inherit" the metadata struct defined above
type Article struct {
 XMLName xml.Name `xml:"article"`
 Metadata
}
type InProceedings struct {
 XMLName xml.Name `xml:"inproceedings"`
 Metadata
}
type Proceedings struct {
 XMLName xml.Name `xml:"proceedings"`
 Metadata
}
type Book struct {
 XMLName xml.Name `xml:"book"`
 Metadata
}
type InCollection struct {
 XMLName xml.Name `xml:"incollection"`
 Metadata
}
type PhdThesis struct {
 XMLName xml.Name `xml:"phdthesis"`
 Metadata
}
type MastersThesis struct {
 XMLName xml.Name `xml:"mastersthesis"`
 Metadata
}
type WWW struct {
 XMLName xml.Name `xml:"www"`
 Metadata
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
 UID int
 ID int
 Type string
 Year string
}
// SumRecord is used to store the aggregated articles by year in srMap map
//(count is stored in the map's int which is used as key)
type SumRecord struct {
 Type string
 Year string
}

The program stores each publication in a map structure and finally exports two csv files:

results.csv which contains an id, publication type and year for each publication
sumresults.csv which contains the sum of each publication type per year

It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.

The rest of the code:

func main() {
 // Start counting time
 start := time.Now()
 // Initialize counter variables for each publication type
 var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
 InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
 var i = 1
 // Initialize hash map
 m := make(map[int]Record)
 //Open gzipped dblp xml
 xmlFile, err := os.Open("dblp.xml.gz")
 gz, err := gzip.NewReader(xmlFile)
 if err != nil {
 log.Fatal(err)
 }
 defer gz.Close()
 //Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
 //xmlFile, err := os.Open("dblp.xml")
 //xmlFile, err := os.Open("TestDblp.xml")
 if err != nil {
 fmt.Println(err)
 } else {
 log.Println("Successfully Opened Dblp XML file")
 }
 // defer the closing of XML file so that we can parse it later on
 defer xmlFile.Close()
 // Initialize main object from Dblp struct
 var articles Dblp
 // Create decoder element
 decoder := xml.NewDecoder(gz)
 // Suppress xml errors
 decoder.Strict = false
 decoder.CharsetReader = makeCharsetReader
 err = decoder.Decode(&articles.Dblp)
 if err != nil {
 fmt.Println(err)
 }
 for {
 // Read tokens from the XML document in a stream.
 t, err := decoder.Token()
 // If we reach the end of the file, we are done
 if err == io.EOF {
 log.Println("XML successfully parsed:", err)
 break
 } else if err != nil {
 log.Fatalf("Error decoding token: %t", err)
 } else if t == nil {
 break
 }
 // Here, we inspect the token
 switch se := t.(type) {
 // We have the start of an element and the token we created above in t:
 case xml.StartElement:
 switch se.Name.Local {
 case "dblp":
 case "article":
 var p Article
 decoder.DecodeElement(&p, &se)
 increment(&articleCounter)
 m[i] = Record{i, articleCounter, "article", p.Year}
 increment(&i)
 case "inproceedings":
 var p InProceedings
 decoder.DecodeElement(&p, &se)
 increment(&InProceedingsCounter)
 m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
 increment(&i)
 case "proceedings":
 var p Proceedings
 decoder.DecodeElement(&p, &se)
 increment(&ProceedingsCounter)
 m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
 increment(&i)
 case "book":
 var p Book
 decoder.DecodeElement(&p, &se)
 increment(&BookCounter)
 m[i] = Record{i, BookCounter, "proceedings", p.Year}
 increment(&i)
 case "incollection":
 var p InCollection
 decoder.DecodeElement(&p, &se)
 increment(&InCollectionCounter)
 m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
 increment(&i)
 case "phdthesis":
 var p PhdThesis
 decoder.DecodeElement(&p, &se)
 increment(&PhdThesisCounter)
 m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
 increment(&i)
 case "mastersthesis":
 var p MastersThesis
 decoder.DecodeElement(&p, &se)
 increment(&mastersThesisCounter)
 m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
 increment(&i)
 case "www":
 var p WWW
 decoder.DecodeElement(&p, &se)
 increment(&wwwCounter)
 m[i] = Record{i, wwwCounter, "www", p.Year}
 increment(&i)
 }
 }
 }
 log.Println("Element parsing completed in:", time.Since(start))
 // All parsed elements have been added to m := make(map[int]Record)
 // We can start processing the map.
 // First we create a map and count the number of occurences of each publication type for a given year.
 srMap := make(map[SumRecord]int)
 log.Println("Creating sums by article type per year")
 for key := range m {
 sr := SumRecord{
 Type: m[key].Type,
 Year: m[key].Year,
 }
 srMap[sr]++
 }
 //// Create sum csv
 log.Println("Creating sum results csv file")
 sumfile, err := os.Create("sumresult.csv")
 checkError("Cannot create file", err)
 defer sumfile.Close()
 sumwriter := csv.NewWriter(sumfile)
 defer sumwriter.Flush()
 // define column headers
 sumheaders := []string{
 "type",
 "year",
 "sum",
 }
 sumwriter.Write(sumheaders)
 var SumString string
 // Create sorted map by VALUE (integer)
 SortedSrMap := map[int]SumRecord{}
 SortedSrMapKeys := []int{}
 for key, val := range SortedSrMap {
 // SortedSrMap[val] = key
 // SortedSrMapKeys = append(SortedSrMapKeys, val)
 SumString = strconv.Itoa(key)
 fmt.Println("sumstring:", SumString, "value: ", val)
 }
 sort.Ints(SortedSrMapKeys)
 // END Create sorted map by VALUE (integer)
 // Export sum csv
 for key, val := range srMap {
 r := make([]string, 0, 1+len(sumheaders))
 SumString = strconv.Itoa(val)
 r = append(
 r,
 key.Type,
 key.Year,
 SumString,
 )
 sumwriter.Write(r)
 }
 sumwriter.Flush()
 // CREATE RESULTS CSV
 log.Println("Creating results csv file")
 file, err := os.Create("result.csv")
 checkError("Cannot create file", err)
 defer file.Close()
 writer := csv.NewWriter(file)
 defer writer.Flush()
 // define column headers
 headers := []string{
 "uid",
 "id",
 "type",
 "year",
 }
 // write column headers
 writer.Write(headers)
 var idString string
 var uidString string
 // Create sorted map
 var keys []int
 for k := range m {
 keys = append(keys, k)
 }
 sort.Ints(keys)
 for _, k := range keys {
 r := make([]string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing
 // convert the Record.ID and UID ints to string in order to pass into append()
 idString = strconv.Itoa(m[k].ID)
 uidString = strconv.Itoa(m[k].UID)
 r = append(
 r,
 uidString,
 idString,
 m[k].Type,
 m[k].Year,
 )
 writer.Write(r)
 }
 writer.Flush()
 // END CREATE RESULTS CSV
 // Finally report results - update below line with more counters as desired
 log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
 //log.Println("map:", m)
 //log.Println("map length:", len(m))
 //log.Println("sum map length:", len(srMap))
 //fmt.Println("sum map contents:", srMap)
 log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func increment(i *int) {
 *i = *i + 1
}
func checkError(message string, err error) {
 if err != nil {
 log.Fatal(message, err)
 }
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
 if charset == "ISO-8859-1" {
 // Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
 return charmap.Windows1252.NewDecoder().Reader(input), nil
 }
 return nil, fmt.Errorf("Unknown charset: %s", charset)
}

Main problems and issues I've identified:

The parsing is quite slow (it takes about 3:45 minutes) given the size of the file (474 Mb gzip). Can I improve something to make it faster?
Can the code be made less verbose but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types as well as the case / switch statements?

Question 2

The decoder.Decode call is unnecessary and in fact throws an error at the moment.

To your second point, yes, especially the case statements can all be compressed down to a single function most likely, since they all only have a few variables exchanged.

The indexing into a hash map map[int]Record is not ideal, in fact that's probably causing a slowdown too with the two million elements in that table, instead you can simply append the elements to a slice and then it's all sorted and fine for iteration later on, no sorting necessary at all.

And for increment(&i) ... just go ahead and increment the counters. If you make functions, okay, but like this it's not helping with readability (i += 1 is much clearer).

make([]string, 0, 1+len(headers) - well that's valid, but you can simply create the array with all elements instead, like []string{uidString, ..., m[k].Year etc. Might be even better if you can reuse that array for all loop iterations.

Well I can't see any other obvious things to change. There's the possibility that getting rid of DecodeElement and doing the whole decoding yourself might improve things, but I'm skeptical. If I, for example, remove the whole switch block, doing nothing but XML decoding essentially, this still takes three minutes for me, essentially just one minute less than with that block included! Meaning that with this library it's just not going to get much quicker overall.

Question 3

Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.

Question 4

From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.

Question 5

I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.

Main points:

Only two structs are now used:

type Metadata struct {
 Key string `xml:"key,attr"`
 Year string `xml:"year"`
 Author string `xml:"author"`
 Title string `xml:"title"`
}
type Record struct {
 UID int
 ID int
 Type string
 Year string
}

The publications are all processed with the following function:

func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
 m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}

The entire code looks now like this:

package main
import (
 "compress/gzip"
 "encoding/csv"
 "encoding/xml"
 "fmt"
 "io"
 "log"
 "os"
 "sort"
 "strconv"
 "time"
 "golang.org/x/text/encoding/charmap"
)
// Metadata contains the fields shared by all structs
type Metadata struct {
 Key string `xml:"key,attr"` // currently not in use
 Year string `xml:"year"`
 Author string `xml:"author"` // currently not in use
 Title string `xml:"title"` // currently not in use
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
 UID int
 ID int
 Type string
 Year string
}
type Count int
type Counter interface {
 Incr() int
 ReturnInt() int
}
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
 InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count
func main() {
 start := time.Now()
 //Open gzipped dblp xml
 //xmlFile, err := os.Open("TestDblp.xml.gz")
 // Uncomment below for actual xml
 xmlFile, err := os.Open("dblp.xml.gz")
 gz, err := gzip.NewReader(xmlFile)
 if err != nil {
 log.Fatal(err)
 } else {
 log.Println("Successfully Opened Dblp XML file")
 }
 defer gz.Close()
 // Create decoder element
 decoder := xml.NewDecoder(gz)
 // Suppress xml errors
 decoder.Strict = false
 decoder.CharsetReader = makeCharsetReader
 if err != nil {
 log.Fatal(err)
 }
 m := make(map[int]Record)
 var p Metadata
 for {
 // Read tokens from the XML document in a stream.
 t, err := decoder.Token()
 // If we reach the end of the file, we are done with parsing.
 if err == io.EOF {
 log.Println("XML successfully parsed:", err)
 break
 } else if err != nil {
 log.Fatalf("Error decoding token: %t", err)
 } else if t == nil {
 break
 }
 // Let's inspect the token
 switch se := t.(type) {
 // We have the start of an element and the token we created above in t:
 case xml.StartElement:
 switch se.Name.Local {
 case "article":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)
 case "inproceedings":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)
 case "proceedings":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)
 case "book":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)
 case "incollection":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)
 case "phdthesis":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)
 case "mastersthesis":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)
 case "www":
 decoder.DecodeElement(&p, &se)
 ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
 }
 }
 }
 log.Println("XML parsing done in:", time.Since(start))
 // All parsed elements have been added to m := make(map[int]Record)
 // We create srMap map object and count the number of occurences of each publication type for a given year.
 srMap := make(map[Record]int)
 log.Println("Creating sums by article type per year")
 for key := range m {
 sr := Record{
 Type: m[key].Type,
 Year: m[key].Year,
 }
 srMap[sr]++
 }
 // Create sumresult.csv
 log.Println("Creating sum results csv file")
 sumfile, err := os.Create("sumresult.csv")
 checkError("Cannot create file", err)
 defer sumfile.Close()
 sumwriter := csv.NewWriter(sumfile)
 defer sumwriter.Flush()
 sumheaders := []string{
 "publicationType",
 "year",
 "sum",
 }
 sumwriter.Write(sumheaders)
 // Export sumresult.csv
 for key, val := range srMap {
 r := make([]string, 0, 1+len(sumheaders))
 r = append(
 r,
 key.Type,
 key.Year,
 strconv.Itoa(val),
 )
 sumwriter.Write(r)
 }
 sumwriter.Flush()
 // Create result.csv
 log.Println("Creating result.csv")
 file, err := os.Create("result.csv")
 checkError("Cannot create file", err)
 defer file.Close()
 writer := csv.NewWriter(file)
 defer writer.Flush()
 headers := []string{
 "uid",
 "id",
 "type",
 "year",
 }
 writer.Write(headers)
 // Create sorted map
 var keys []int
 for k := range m {
 keys = append(keys, k)
 }
 sort.Ints(keys)
 for _, k := range keys {
 r := make([]string, 0, 1+len(headers))
 r = append(
 r,
 strconv.Itoa(m[k].UID),
 strconv.Itoa(m[k].ID),
 m[k].Type,
 m[k].Year,
 )
 writer.Write(r)
 }
 writer.Flush()
 // Finally report results
 log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
 ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
 PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
 log.Println("Distinct publication map length:", len(m))
 log.Println("Sum map length:", len(srMap))
 log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func checkError(message string, err error) {
 if err != nil {
 log.Fatal(message, err)
 }
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
 if charset == "ISO-8859-1" {
 // Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
 return charmap.Windows1252.NewDecoder().Reader(input), nil
 }
 return nil, fmt.Errorf("Unknown charset: %s", charset)
}
func (c *Count) Incr() int {
 *c = *c + 1
 return int(*c)
}
func (c *Count) ReturnInt() int {
 return int(*c)
}
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
 m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}

I feel that the csv generation parts can be further streamlined as they are still a bit messy.

ferada ferada 11.4k25 silver badges65 bronze badges · Accepted Answer · 2019-03-11 23:37:23Z

The decoder.Decode call is unnecessary and in fact throws an error at the moment.

To your second point, yes, especially the case statements can all be compressed down to a single function most likely, since they all only have a few variables exchanged.

The indexing into a hash map map[int]Record is not ideal, in fact that's probably causing a slowdown too with the two million elements in that table, instead you can simply append the elements to a slice and then it's all sorted and fine for iteration later on, no sorting necessary at all.

And for increment(&i) ... just go ahead and increment the counters. If you make functions, okay, but like this it's not helping with readability (i += 1 is much clearer).

make([]string, 0, 1+len(headers) - well that's valid, but you can simply create the array with all elements instead, like []string{uidString, ..., m[k].Year etc. Might be even better if you can reuse that array for all loop iterations.

Well I can't see any other obvious things to change. There's the possibility that getting rid of DecodeElement and doing the whole decoding yourself might improve things, but I'm skeptical. If I, for example, remove the whole switch block, doing nothing but XML decoding essentially, this still takes three minutes for me, essentially just one minute less than with that block included! Meaning that with this library it's just not going to get much quicker overall.

Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.

Stack Exchange Network

Parse dblp XML and output sums of publications grouped by year and type

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parse dblp XML and output sums of publications grouped by year and type

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions