Converting Markdown to HTML using Go

Question 1

I recently ported a blog of mine from Python to Go (to improve speed and performance) and while all is great so far, I'd like some help optimising the Markdown function to improve the general performance, maintenance and readability of the function.

I have this function because I write my blog articles in Markdown (.md) and then use ~~Python~~ Go to convert the raw Markdown to HTML for output, which saves me from having to write ridiculous amounts of HTML (tedious, to say the least).

The Markdown function takes one argument (raw) which is a string that contains the raw Markdown (obtained using ioutil.ReadFile).

It then splits the Markdown by \n (removing the empty lines) and converts:

Bold and italic text (***,**,*)
Strikethrough text (~~blah blah blah~~)
Underscored text (__blah blah blah__)
Links ([https://example.com](Example Link))
Blockquotes (> sample quote by an important person)
Inline code (`abcccc`)
Headings (h1-h6)

While some of the supported features aren't exactly standard, this function works and outputs the expected result without any errors. However, as a new Go programmer, and with this being my first "real" Go project, I'd like to know whether my code could be optimised for better performance, maintainability and readability.

Here are a few questions I have regarding optimisation:

Would it make a difference to performance if I reduced the amount of imports?
Would it improve readability if I put the regexp.MustCompile functions into variables above the Markdown function?
Would it improve performance if I used individual regexes to convert Markdown headings instead of using for i := 6; i >= 1; i-- {...}?
If not, is there a way to convert i (an integer) to a string without using strconv.Itoa(i) (to help reduce the amount of imports)?

Here is my code:

package parse

import (
	"regexp"
	"strconv"
	"strings"
)

func Markdown(raw string) string {
	// ignore empty lines with "string.Split(...)"
	lines := strings.FieldsFunc(raw, func(c rune) bool {
		return c == '\n'
	})
	for i, line := range lines {
		// wrap bold and italic text in "<b>" and "<i>" elements
		line = regexp.MustCompile(`\*\*\*(.*?)\*\*\*`).ReplaceAllString(line, `<b><i>$1</i></b>`)
		line = regexp.MustCompile(`\*\*(.*?)\*\*`).ReplaceAllString(line, `<b>$1</b>`)
		line = regexp.MustCompile(`\*(.*?)\*`).ReplaceAllString(line, `<i>$1</i>`)
		// wrap strikethrough text in "<s>" tags
		line = regexp.MustCompile(`\~\~(.*?)\~\~`).ReplaceAllString(line, `<s>$1</s>`)
		// wrap underscored text in "<u>" tags
		line = regexp.MustCompile(`__(.*?)__`).ReplaceAllString(line, `<u>$1</u>`)
		// convert links to anchor tags
		line = regexp.MustCompile(`\[(.*?)\]\((.*?)\)[^\)]`).ReplaceAllString(line, `<a href="$2">$1</a>`)
		// escape and wrap blockquotes in "<blockquote>" tags
		line = regexp.MustCompile(`^\>(\s|)`).ReplaceAllString(line, `&gt;`)
		line = regexp.MustCompile(`\&gt\;(.*?)$`).ReplaceAllString(line, `<blockquote>$1</blockquote>`)
		// wrap the content of backticks inside of "<code>" tags
		line = regexp.MustCompile("`(.*?)`").ReplaceAllString(line, `<code>$1</code>`)
		// convert headings
		for i := 6; i >= 1; i-- {
			size, md_header := strconv.Itoa(i), strings.Repeat("#", i)
			line = regexp.MustCompile(`^`+md_header+`(\s|)(.*?)$`).ReplaceAllString(line, `<h`+size+`>$2</h`+size+`>`)
		}
		// update the line
		lines[i] = line
	}
	// return the joined lines
	return strings.Join(lines, "\n")
}

Answer

Performance

Regex

regexp.MustCompile() is very expensive! Do not call it inside a loop.

Instead, compile each regexp only once, as a package-level variable:

var (
 boldItalicReg = regexp.MustCompile(`\*\*\*(.*?)\*\*\*`)
 boldReg = regexp.MustCompile(`\*\*(.*?)\*\*`)
 ...
)
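
To see the difference on your own machine, here is a minimal sketch that compares the two approaches with testing.Benchmark. The function names boldCompileEachCall and boldPrecompiled are made up for this illustration, and the timings will vary with hardware and Go version.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// boldReg is compiled once, when the package is initialised.
var boldReg = regexp.MustCompile(`\*\*(.*?)\*\*`)

// boldCompileEachCall mirrors the original approach: the pattern is
// recompiled on every call.
func boldCompileEachCall(line string) string {
	return regexp.MustCompile(`\*\*(.*?)\*\*`).ReplaceAllString(line, `<b>$1</b>`)
}

// boldPrecompiled reuses the package-level regexp.
func boldPrecompiled(line string) string {
	return boldReg.ReplaceAllString(line, `<b>$1</b>`)
}

func main() {
	const line = "some **bold** text"

	// Both versions produce the same output; only the cost differs.
	fmt.Println(boldCompileEachCall(line) == boldPrecompiled(line))

	slow := testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			boldCompileEachCall(line)
		}
	})
	fast := testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			boldPrecompiled(line)
		}
	})
	fmt.Println("compile per call:", slow)
	fmt.Println("precompiled:     ", fast)
}
```

A regexp.Regexp is safe for concurrent use by multiple goroutines, so sharing one package-level value is fine even if the function is called from several HTTP handlers at once.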

Headers

If a line is a header, it will start with a #. We can check for this before calling ReplaceAllString() six times. All we need to do is trim the line and then check whether it starts with #:

line = strings.TrimSpace(line)
if strings.HasPrefix(line, "#") {
 // convert headings
 ...
}

We could go further and unroll the loop to avoid unnecessary allocations:

// count only the leading '#' characters, not every '#' in the line
count := len(line) - len(strings.TrimLeft(line, "#"))
switch count {
case 1:
	line = h1Reg.ReplaceAllString(line, `<h1>$2</h1>`)
case 2:
	...
}
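
Once the number of leading # characters is known, the heading regexps are arguably not needed at all. The following is a sketch of that idea, not part of the benchmarked code; the helper name heading is hypothetical, and unlike the `(\s|)` group in the regexps it strips all surrounding whitespace from the title rather than a single character.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// heading converts a Markdown heading line to HTML without a regexp.
// Lines that do not start with 1 to 6 '#' characters are returned unchanged.
func heading(line string) string {
	rest := strings.TrimLeft(line, "#")
	level := len(line) - len(rest) // number of leading '#' only
	if level < 1 || level > 6 {
		return line
	}
	size := strconv.Itoa(level)
	return "<h" + size + ">" + strings.TrimSpace(rest) + "</h" + size + ">"
}

func main() {
	fmt.Println(heading("## Usage"))              // <h2>Usage</h2>
	fmt.Println(heading("# Tags: #go #markdown")) // <h1>Tags: #go #markdown</h1>
	fmt.Println(heading("plain text"))            // plain text
}
```

The second call shows why only the leading characters should be counted: a '#' later in the line must not change the heading level.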

Use a scanner

The idiomatic way to read a file line by line in Go is to use a bufio.Scanner. It takes an io.Reader as its parameter, so you can pass your Markdown file directly instead of converting it into a string first:

func NewMarkdown(input io.Reader) string {
 scanner := bufio.NewScanner(input)
 for scanner.Scan() {
 line := scanner.Text()
 ...
 }
}

Use `[]byte` instead of `string`

In Go, a string is a read-only slice of bytes. scanner.Text() allocates a new string for every line, whereas scanner.Bytes() returns a slice into the scanner's buffer without allocating (valid only until the next call to Scan), so use []byte instead of string when you can:

line := scanner.Bytes()
line = boldItalicReg.ReplaceAll(line, []byte(`<b><i>$1</i></b>`))
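
As a small variation, the replacement template can also be converted to []byte once, next to the regexp, instead of on every line. This is only a sketch: boldRepl and boldLines are names made up here, and whether the change is measurable would need a benchmark.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var (
	boldReg  = regexp.MustCompile(`\*\*(.*?)\*\*`)
	boldRepl = []byte(`<b>$1</b>`) // converted once, reused for every line
)

// boldLines applies the bold rule to every line of src.
func boldLines(src string) []string {
	var out []string
	scanner := bufio.NewScanner(strings.NewReader(src))
	for scanner.Scan() {
		// scanner.Bytes() is only valid until the next Scan, but
		// ReplaceAll returns a fresh copy, so the result is safe to keep.
		line := boldReg.ReplaceAll(scanner.Bytes(), boldRepl)
		out = append(out, string(line))
	}
	return out
}

func main() {
	for _, l := range boldLines("a **b** c\nplain") {
		fmt.Println(l)
	}
}
```

If a line is stored without going through ReplaceAll or a string conversion, it has to be copied first, because the scanner may overwrite its buffer on the next call to Scan.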

Write result to a `bytes.Buffer`

Instead of strings.Join(), we can write each line to a buffer, which further reduces the number of allocations:

buf := bytes.NewBuffer(nil)
scanner := bufio.NewScanner(input)
for scanner.Scan() {
 line := scanner.Bytes()
 ...
 buf.Write(line)
 buf.WriteByte('\n')
}
return buf.String()

Final code:

package parse

import (
	"bufio"
	"bytes"
	"io"
	"regexp"
)

var (
	boldItalicReg = regexp.MustCompile(`\*\*\*(.*?)\*\*\*`)
	boldReg       = regexp.MustCompile(`\*\*(.*?)\*\*`)
	italicReg     = regexp.MustCompile(`\*(.*?)\*`)
	strikeReg     = regexp.MustCompile(`\~\~(.*?)\~\~`)
	underscoreReg = regexp.MustCompile(`__(.*?)__`)
	anchorReg     = regexp.MustCompile(`\[(.*?)\]\((.*?)\)[^\)]`)
	escapeReg     = regexp.MustCompile(`^\>(\s|)`)
	blockquoteReg = regexp.MustCompile(`\&gt\;(.*?)$`)
	backtipReg    = regexp.MustCompile("`(.*?)`")
	h1Reg         = regexp.MustCompile(`^#(\s|)(.*?)$`)
	h2Reg         = regexp.MustCompile(`^##(\s|)(.*?)$`)
	h3Reg         = regexp.MustCompile(`^###(\s|)(.*?)$`)
	h4Reg         = regexp.MustCompile(`^####(\s|)(.*?)$`)
	h5Reg         = regexp.MustCompile(`^#####(\s|)(.*?)$`)
	h6Reg         = regexp.MustCompile(`^######(\s|)(.*?)$`)
)

func NewMarkdown(input io.Reader) string {
	buf := bytes.NewBuffer(nil)
	scanner := bufio.NewScanner(input)
	for scanner.Scan() {
		line := bytes.TrimSpace(scanner.Bytes())
		if len(line) == 0 {
			buf.WriteByte('\n')
			continue
		}
		// wrap bold and italic text in "<b>" and "<i>" elements
		line = boldItalicReg.ReplaceAll(line, []byte(`<b><i>$1</i></b>`))
		line = boldReg.ReplaceAll(line, []byte(`<b>$1</b>`))
		line = italicReg.ReplaceAll(line, []byte(`<i>$1</i>`))
		// wrap strikethrough text in "<s>" tags
		line = strikeReg.ReplaceAll(line, []byte(`<s>$1</s>`))
		// wrap underscored text in "<u>" tags
		line = underscoreReg.ReplaceAll(line, []byte(`<u>$1</u>`))
		// convert links to anchor tags
		line = anchorReg.ReplaceAll(line, []byte(`<a href="$2">$1</a>`))
		// escape and wrap blockquotes in "<blockquote>" tags
		line = escapeReg.ReplaceAll(line, []byte(`&gt;`))
		line = blockquoteReg.ReplaceAll(line, []byte(`<blockquote>$1</blockquote>`))
		// wrap the content of backticks inside of "<code>" tags
		line = backtipReg.ReplaceAll(line, []byte(`<code>$1</code>`))
		// convert headings
		if line[0] == '#' {
			// count only the leading '#' characters, not every '#' in the line
			count := len(line) - len(bytes.TrimLeft(line, "#"))
			switch count {
			case 1:
				line = h1Reg.ReplaceAll(line, []byte(`<h1>$2</h1>`))
			case 2:
				line = h2Reg.ReplaceAll(line, []byte(`<h2>$2</h2>`))
			case 3:
				line = h3Reg.ReplaceAll(line, []byte(`<h3>$2</h3>`))
			case 4:
				line = h4Reg.ReplaceAll(line, []byte(`<h4>$2</h4>`))
			case 5:
				line = h5Reg.ReplaceAll(line, []byte(`<h5>$2</h5>`))
			case 6:
				line = h6Reg.ReplaceAll(line, []byte(`<h6>$2</h6>`))
			}
		}
		buf.Write(line)
		buf.WriteByte('\n')
	}
	return buf.String()
}
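
For maintainability, the run of near-identical ReplaceAll calls could also be driven by a table. This is an alternative layout rather than the benchmarked code: the rule type, inlineRules and applyInline are hypothetical names, and only the inline rules are included to keep the sketch short.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a compiled pattern with its replacement template.
type rule struct {
	re   *regexp.Regexp
	repl []byte
}

// inlineRules is applied in order; the order matters because `***` must be
// handled before `**`, and `**` before `*`.
var inlineRules = []rule{
	{regexp.MustCompile(`\*\*\*(.*?)\*\*\*`), []byte(`<b><i>$1</i></b>`)},
	{regexp.MustCompile(`\*\*(.*?)\*\*`), []byte(`<b>$1</b>`)},
	{regexp.MustCompile(`\*(.*?)\*`), []byte(`<i>$1</i>`)},
	{regexp.MustCompile(`~~(.*?)~~`), []byte(`<s>$1</s>`)},
	{regexp.MustCompile(`__(.*?)__`), []byte(`<u>$1</u>`)},
	{regexp.MustCompile("`(.*?)`"), []byte(`<code>$1</code>`)},
}

// applyInline runs every inline rule over the line, in table order.
func applyInline(line []byte) []byte {
	for _, r := range inlineRules {
		line = r.re.ReplaceAll(line, r.repl)
	}
	return line
}

func main() {
	fmt.Printf("%s\n", applyInline([]byte("***a*** **b** *c* ~~d~~ __e__ `f`")))
	// <b><i>a</i></b> <b>b</b> <i>c</i> <s>d</s> <u>e</u> <code>f</code>
}
```

Adding a new inline feature then means adding one line to the table, and the ordering constraint is documented in a single place.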

Benchmarks

I used the following code for the benchmarks, on a 20 kB Markdown file:

package parse

import (
	"io/ioutil"
	"os"
	"testing"
)

func BenchmarkMarkdown(b *testing.B) {
	md, err := ioutil.ReadFile("README.md")
	if err != nil {
		b.Fatal(err) // stop here instead of benchmarking an empty input
	}
	raw := string(md)
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		_ = Markdown(raw)
	}
}

func BenchmarkMarkdownNew(b *testing.B) {
	for n := 0; n < b.N; n++ {
		file, err := os.Open("README.md")
		if err != nil {
			b.Fatal(err)
		}
		_ = NewMarkdown(file)
		file.Close()
	}
}

Results:

> go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkMarkdown-4 10 104990431 ns/op 364617427 B/op 493813 allocs/op
BenchmarkMarkdownNew-4 1000 1464745 ns/op 379376 B/op 11085 allocs/op

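benchstat diff (each benchmark was run only once, n=1+1, so benchstat cannot establish significance and prints ~ in the delta column):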

name old time/op new time/op delta
Markdown-4 105ms ± 0% 1ms ± 0% ~ (p=1.000 n=1+1)
name old alloc/op new alloc/op delta
Markdown-4 365MB ± 0% 0MB ± 0% ~ (p=1.000 n=1+1)
name old allocs/op new allocs/op delta
Markdown-4 494k ± 0% 11k ± 0% ~ (p=1.000 n=1+1)

felix · Accepted Answer · 2018-12-28 10:48:27Z
