In trying to answer this question on StackOverflow about using the gocarina/gocsv package to read a CSV with a header column name that has a comma, I got to thinking about how to preprocess the first record of a CSV as it's being read.
I came up with a reader/fixer that wraps a reader of CSV data: it reads the first (header) record, removes the specified strings from its fields, passes the fixed header along, and then forwards all subsequent bytes with the least amount of interference.
I was able to code this up:
package main

import (
	"bufio"
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strings"
)

// HeaderFixer removes certain strings from the field names in the header record of CSV data.
type HeaderFixer struct {
	rd      *bufio.Reader
	removes []string
	done    bool
}

func NewReader(r io.Reader, removes []string) *HeaderFixer {
	return &HeaderFixer{
		bufio.NewReader(r),
		removes,
		false,
	}
}

func (hf *HeaderFixer) Read(p []byte) (n int, err error) {
	if hf.done {
		n, err = hf.rd.Read(p)
		return
	}

	cr := csv.NewReader(hf.rd)
	header, err := cr.Read()
	if err != nil {
		return
	}

	for i, field := range header {
		for _, remove := range hf.removes {
			field = strings.Replace(field, remove, "", -1)
		}
		header[i] = field
	}

	var buf bytes.Buffer
	cw := csv.NewWriter(&buf)
	cw.Write(header)
	cw.Flush()

	copy(p, buf.Bytes())
	n = int(cr.InputOffset())
	hf.done = true
	return
}

var csvBlob = `"Col,1","Col
2"
a,b
c,d
e,f
g,h
`

func main() {
	sr := strings.NewReader(csvBlob)
	hr := NewReader(sr, []string{",", "\n"})
	cr := csv.NewReader(hr)
	for {
		record, err := cr.Read()
		if err != nil {
			if err == io.EOF {
				break
			}
			log.Fatal(err)
		}
		fmt.Println(record)
	}
}
It definitely removes the newline and comma from the header:
[Col1 Col2]
[a b]
[c d]
[e f]
[g h]
Using InputOffset() seems to be correct for reporting how far the header fixer read into the original CSV data. I'm not so sure about the "least amount of interference" in the guard clause that just wants to forward bytes along as efficiently as possible.
I also started this exploration by looking at golang.org/x/text/transform, but I could not figure out how to make that work for me... the only example is deprecated.
1 Answer
Complexity is more apparent to readers than writers. If you write a piece of code and it seems simple to you, but other people think it is complex, then it is complex.
"Obvious" is in the mind of the reader: it’s easier to notice that someone else’s code is nonobvious than to see problems with your own code. If someone reading your code says it’s not obvious, then it’s not obvious, no matter how clear it may seem to you.
A Philosophy of Software Design, John Ousterhout
I read your code. I revised your code to be simpler and more obvious. It provides a simple, obvious replacement for the csv.Reader Read method.
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strings"
)

// HeaderReader is a csv.Reader that removes strings
// from the field names in the header record.
type HeaderReader struct {
	cr      *csv.Reader
	removes []string
	header  bool
}

func NewHeaderReader(r io.Reader, removes []string) *HeaderReader {
	return &HeaderReader{
		cr:      csv.NewReader(r),
		removes: removes,
		header:  false,
	}
}

func (hr *HeaderReader) Read() (record []string, err error) {
	if hr.header {
		return hr.cr.Read()
	}
	hr.header = true

	header, err := hr.cr.Read()
	if err != nil {
		return nil, err
	}
	for i, field := range header {
		for _, remove := range hr.removes {
			field = strings.ReplaceAll(field, remove, "")
		}
		header[i] = field
	}
	return header, nil
}

var csvBlob = `"Col,1","Col
2"
a,b
c,d
e,f
g,h
`

func main() {
	sr := strings.NewReader(csvBlob)
	hr := NewHeaderReader(sr, []string{",", "\n"})
	for {
		record, err := hr.Read()
		if err != nil {
			if err == io.EOF {
				break
			}
			log.Fatal(err)
		}
		fmt.Println(record)
	}
}
https://go.dev/play/p/MW8KScSr6uF
[Col1 Col2]
[a b]
[c d]
[e f]
[g h]
If we want a complete replacement for csv.Reader then we can add the remaining csv.Reader methods as pass-through wrappers.
func (hr *HeaderReader) FieldPos(field int) (line, column int) {
	return hr.cr.FieldPos(field)
}

func (hr *HeaderReader) InputOffset() int64 {
	return hr.cr.InputOffset()
}

func (hr *HeaderReader) ReadAll() (records [][]string, err error) {
	return hr.cr.ReadAll()
}