I am trying to make an efficient algorithm for removing script tags from an HTML string. Can someone point out any flaws in this? This seems to be the best I could think of.
func removeScripts(s string) string {
    startingScriptTag := "<script"
    endingScriptTag := "</script>"
    var script string
    for {
        startingScriptTagIndex := strings.Index(s, startingScriptTag)
        endingScriptTagIndex := strings.Index(s, endingScriptTag)
        if startingScriptTagIndex > -1 && endingScriptTagIndex > -1 {
            script = s[startingScriptTagIndex : endingScriptTagIndex+len(endingScriptTag)]
            s = strings.Replace(s, script, "", 1)
            continue
        }
        break
    }
    return s
}
2 Answers
As per usual, I'd say the best way to reliably get rid of any script tags in an HTML string is to use a parser. HTML is a bit too complex to consume with your standard string functions and regular expressions. It's a hierarchical language, best processed as such. Thankfully, Go has a package for this, and it makes removing script tags fantastically easy:
package main

import (
    "bytes"
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html" // go get -u golang.org/x/net/html
)

// htmlString stands in for whatever HTML input you have.
var htmlString = `<p>keep me</p><script>alert("drop me")</script>`

func main() {
    doc, err := html.Parse(strings.NewReader(htmlString))
    if err != nil {
        log.Fatal(err)
    }
    removeScript(doc)
    buf := bytes.NewBuffer([]byte{})
    if err := html.Render(buf, doc); err != nil {
        log.Fatal(err)
    }
    fmt.Println(buf.String())
}

func removeScript(n *html.Node) {
    // if node is a script tag, detach it from its parent
    if n.Type == html.ElementNode && n.Data == "script" {
        n.Parent.RemoveChild(n)
        return // script tag is gone...
    }
    // traverse the DOM; grab the next sibling before recursing, because
    // removing a child node clears its NextSibling pointer
    for c := n.FirstChild; c != nil; {
        next := c.NextSibling
        removeScript(c)
        c = next
    }
}
The use of n.Data is not a typo, BTW. The field name is a bit unfortunate, but as the doc pages state:
A Node consists of a NodeType and some Data (tag name for element nodes, content for text).
This code was not tested. It's loosely based on the parse example in the official godoc pages.
Although not relevant in this case, it is worth looking into the tokenizer API, too. It is a lower-level API that can help you process an HTML stream (e.g. parsing/validating a large file as a stream). You can use it to check how many script tags there are, for example:
z := html.NewTokenizer(strings.NewReader(htmlString))
tagCount := 0
for {
    tt := z.Next()
    switch tt {
    case html.ErrorToken:
        return z.Err() // io.EOF once the whole input has been consumed
    // case html.TextToken: ignore, we don't need it here
    case html.StartTagToken: // , html.EndTagToken: ignore end-tags to make life easier
        tn, _ := z.TagName()
        if string(tn) == "script" {
            tagCount++
        }
    }
}
Do with it as you like. Again, in this case there's no reason why you would use the tokenizer, I think, unless you want to manually write all tags that aren't script tags to a separate buffer and process them some more. Just thought it was worth mentioning here...
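If you do want to go that route, here is a rough, untested sketch of that buffer-copying idea, assuming the x/net/html tokenizer API (the name stripScriptsStreaming is just made up for the example):
package main

import (
    "bytes"
    "fmt"
    "io"
    "strings"

    "golang.org/x/net/html"
)

// stripScriptsStreaming copies every token except <script> elements (and the
// raw text inside them) to a buffer, using the tokenizer instead of a full parse.
func stripScriptsStreaming(r io.Reader) (string, error) {
    z := html.NewTokenizer(r)
    var buf bytes.Buffer
    skipping := false // true while we are inside a <script> element
    for {
        tt := z.Next()
        switch tt {
        case html.ErrorToken:
            if z.Err() == io.EOF {
                return buf.String(), nil // reached the end of the input
            }
            return "", z.Err()
        case html.StartTagToken, html.EndTagToken:
            tn, _ := z.TagName()
            if string(tn) == "script" {
                skipping = tt == html.StartTagToken
                continue // drop the script tags themselves
            }
            if !skipping {
                buf.Write(z.Raw()) // copy the raw bytes of any other tag
            }
        default:
            if !skipping {
                buf.Write(z.Raw()) // text, comments, doctype, self-closing tags, ...
            }
        }
    }
}

func main() {
    out, err := stripScriptsStreaming(strings.NewReader(`<p>hi</p><script>alert(1)</script>`))
    if err != nil {
        panic(err)
    }
    fmt.Println(out) // <p>hi</p>
}
The tokenizer treats everything between <script> and </script> as one raw text token, which is why a simple skipping flag is enough here.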
Is the program correct? Not if the input is imperfect. For example, malform a start tag, or switch the script start and end tags:
html := `aaaa<scriptxxx</script>bbb`
fmt.Println(html)
fmt.Println(removeScripts(html))
html = `aaaa</script>xxx<script>bbb`
fmt.Println(html)
fmt.Println(removeScripts(html))
Output:
aaaa<scriptxxx</script>bbb
aaaabbb
aaaa</script>xxx<script>bbb
panic: runtime error: slice bounds out of range
The Go programming language was designed to operate at Google scale. Go programs are usually written to be reasonably efficient. For example, assume that your function is used by Google's search web crawler on billions of HTML pages (with multiple scripts per page) per day.
I see opportunities to make your function more efficient by using CPU time and memory only when necessary. To estimate how much your function might be improved, I ran some Go benchmarks.
old.txt (Lansana):
goos: linux
goarch: amd64
pkg: cr/script
BenchmarkLansana-4 2000 604549 ns/op 79872 B/op 16 allocs/op
new.txt (PeterSO):
goos: linux
goarch: amd64
pkg: cr/script
BenchmarkPeterSO-4 100000 11039 ns/op 10240 B/op 2 allocs/op
old.txt (Lansana) versus new.txt (PeterSO):
benchmark old ns/op new ns/op delta
BenchmarkScripts-4 604549 11039 -98.17%
benchmark old allocs new allocs delta
BenchmarkScripts-4 16 2 -87.50%
benchmark old bytes new bytes delta
BenchmarkScripts-4 79872 10240 -87.18%
You should try to make your function more efficient. Don't restart at the beginning, don't make unnecessary string and other allocations, don't copy unnecessarily, and so on. Benchmark critical functions and methods.
script_test.go:
package main

import (
    "strings"
    "testing"
)

// benchmark
var (
    scriptElement = `<script type="text/javascript">` + strings.Repeat(` JavaScript `, 8) + `</script>`
    htmlElement   = ` ` + scriptElement + strings.Repeat(`X`, 1024) + scriptElement + ` `
    html          = strings.Repeat(htmlElement, 4)
)

// removeScripts removes all HTML script elements.
func removeScriptsPeterSO(s string) string {
    const (
        startTag = "<script"
        endTag   = "</script>"
    )
    start := strings.Index(s, startTag)
    if start < 0 {
        return s
    }
    b := make([]byte, start, len(s))
    copy(b, s)
    for {
        end := strings.Index(s[start+len(startTag):], endTag)
        if end < 0 {
            b = append(b, s[start:]...)
            break
        }
        end += (start + len(startTag)) + len(endTag)
        start = strings.Index(s[end:], startTag)
        if start < 0 {
            b = append(b, s[end:]...)
            break
        }
        start += end
        b = append(b, s[end:start]...)
    }
    return string(b)
}

func BenchmarkPeterSO(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        removeScriptsPeterSO(html)
    }
}

func removeScriptsLansana(s string) string {
    startingScriptTag := "<script"
    endingScriptTag := "</script>"
    var script string
    for {
        startingScriptTagIndex := strings.Index(s, startingScriptTag)
        endingScriptTagIndex := strings.Index(s, endingScriptTag)
        if startingScriptTagIndex > -1 && endingScriptTagIndex > -1 {
            script = s[startingScriptTagIndex : endingScriptTagIndex+len(endingScriptTag)]
            s = strings.Replace(s, script, "", 1)
            continue
        }
        break
    }
    return s
}

func BenchmarkLansana(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        removeScriptsLansana(html)
    }
}
For a program to be correct, maintainable, and reasonably efficient, it must be readable. This is not readable:
func removeScripts(s string) string {
    script = s[startingScriptTagIndex : endingScriptTagIndex+len(endingScriptTag)]
}
This is more readable:
func removeScripts(s string) string {
    script = s[start : end+len(endTag)]
}
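For comparison, here is the whole function with nothing changed but the names (and the tag strings moved into a const block); this is only a readability sketch, so the correctness problems shown above are deliberately left in place:
func removeScripts(s string) string {
    const (
        startTag = "<script"
        endTag   = "</script>"
    )
    var script string
    for {
        start := strings.Index(s, startTag)
        end := strings.Index(s, endTag)
        if start > -1 && end > -1 {
            script = s[start : end+len(endTag)]
            s = strings.Replace(s, script, "", 1)
            continue
        }
        break
    }
    return s
}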
- Wow, thanks for the elaborate answer! I thought I was being clever by using some dynamic programming/memoization but apparently it's super inefficient! I come from a non-CS background (self taught), so efficiency in terms of time/space complexity is still something I'm working to improve on every day. I will come back and read through your code when I get home to see how it actually works... – Lansana, Apr 22, 2017 at 16:06
- What if some (admittedly sick) person decided to use client-side generated custom tags? For example, a code blog that styles code snippet blocks by wrapping them in a custom <script-block> tag? Using <script to find the opening tag wouldn't be reliable. Nor would you be able to handle an unfortunately placed space char. After years of web dev, I've seen all sorts of horrific types of tag-soup. Trust me: never assume that a tag is going to be as predictable to detect. A more realistic scenario to begin with: old <SCRIPT> tags (upper-case) & lower-case close tags will be a pain. Parse it. – Elias Van Ootegem, Apr 24, 2017 at 16:46
- < script and probably a number of other variations that will work in a browser (peterSO's answer has the same problem). Also, I wouldn't be surprised if there are ways to get browsers to run code even when there's no closing </script>. If security is your aim, I recommend goquery or, even better, bluemonday.
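For what it's worth, here is a minimal sketch of the bluemonday route that comment recommends, assuming the github.com/microcosm-cc/bluemonday package (UGCPolicy is one of its stock policies and does not allow script elements):
package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // UGCPolicy allows typical user-generated-content markup, but not <script>.
    p := bluemonday.UGCPolicy()
    fmt.Println(p.Sanitize(`<p>hello</p><script>alert("xss")</script>`))
    // expected: <p>hello</p> (the script element and its contents are dropped)
}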