I am trying to make an efficient algorithm for removing script tags from an HTML string. Can someone point out any flaws in this? This seems to be the best I could think of.
func removeScripts(s string) string {
    startingScriptTag := "<script"
    endingScriptTag := "</script>"
    var script string
    for {
        startingScriptTagIndex := strings.Index(s, startingScriptTag)
        endingScriptTagIndex := strings.Index(s, endingScriptTag)
        if startingScriptTagIndex > -1 && endingScriptTagIndex > -1 {
            script = s[startingScriptTagIndex : endingScriptTagIndex+len(endingScriptTag)]
            s = strings.Replace(s, script, "", 1)
            continue
        }
        break
    }
    return s
}
2 Answers
As per usual, I'd say the best way to reliably get rid of any script tags in an HTML string is to use a parser. HTML is a bit too complex to consume with your standard string functions and regular expressions. It's a hierarchical language, best processed as such. Thankfully, Go has a package for this, and it makes removing script tags fantastically easy:
package main

import (
    "bytes"
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html" // go get -u golang.org/x/net/html
)

// htmlString stands in for whatever HTML input you have.
var htmlString = `<p>keep me</p><script>alert("drop me")</script>`

func main() {
    doc, err := html.Parse(strings.NewReader(htmlString))
    if err != nil {
        log.Fatal(err)
    }
    removeScript(doc)
    buf := bytes.NewBuffer([]byte{})
    if err := html.Render(buf, doc); err != nil {
        log.Fatal(err)
    }
    fmt.Println(buf.String())
}

func removeScript(n *html.Node) {
    // if node is a script tag, detach it from its parent
    if n.Type == html.ElementNode && n.Data == "script" {
        n.Parent.RemoveChild(n)
        return // script tag is gone...
    }
    // traverse the DOM; grab the next sibling before recursing, because
    // removing a child node clears its NextSibling pointer
    for c := n.FirstChild; c != nil; {
        next := c.NextSibling
        removeScript(c)
        c = next
    }
}
The use of n.Data is not a typo, BTW. The field name is a bit unfortunate, but as the doc pages state:
A Node consists of a NodeType and some Data (tag name for element nodes, content for text).
This code was not tested. It's loosely based on the parse example in the official godoc pages.
Although not relevant in this case, it is worth looking into the tokenizer API, too. It is a lower-level API that can help you process an HTML stream (e.g. parsing/validating a large file as a stream). You can use it to check how many script tags there are, for example:
z := html.NewTokenizer(strings.NewReader(htmlString))
tagCount := 0
for {
    tt := z.Next()
    switch tt {
    case html.ErrorToken:
        return z.Err() // io.EOF once the whole input has been consumed
    // case html.TextToken: ignore, we don't need it here
    case html.StartTagToken: // , html.EndTagToken: ignore end-tags to make life easier
        tn, _ := z.TagName()
        if string(tn) == "script" {
            tagCount++
        }
    }
}
Do with it as you like. Again, in this case there's no reason why you would use the tokenizer, I think, unless you want to manually write all tags that aren't script tags to a separate buffer and process them some more. Just thought it was worth mentioning here...
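If you do want to go that route, here is a rough, untested sketch of that buffer-copying idea, assuming the x/net/html tokenizer API (the name stripScriptsStreaming is just made up for the example):
package main

import (
    "bytes"
    "fmt"
    "io"
    "strings"

    "golang.org/x/net/html"
)

// stripScriptsStreaming copies every token except <script> elements (and the
// raw text inside them) to a buffer, using the tokenizer instead of a full parse.
func stripScriptsStreaming(r io.Reader) (string, error) {
    z := html.NewTokenizer(r)
    var buf bytes.Buffer
    skipping := false // true while we are inside a <script> element
    for {
        tt := z.Next()
        switch tt {
        case html.ErrorToken:
            if z.Err() == io.EOF {
                return buf.String(), nil // reached the end of the input
            }
            return "", z.Err()
        case html.StartTagToken, html.EndTagToken:
            tn, _ := z.TagName()
            if string(tn) == "script" {
                skipping = tt == html.StartTagToken
                continue // drop the script tags themselves
            }
            if !skipping {
                buf.Write(z.Raw()) // copy the raw bytes of any other tag
            }
        default:
            if !skipping {
                buf.Write(z.Raw()) // text, comments, doctype, self-closing tags, ...
            }
        }
    }
}

func main() {
    out, err := stripScriptsStreaming(strings.NewReader(`<p>hi</p><script>alert(1)</script>`))
    if err != nil {
        panic(err)
    }
    fmt.Println(out) // <p>hi</p>
}
The tokenizer treats everything between <script> and </script> as one raw text token, which is why a simple skipping flag is enough here.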
Is the program correct? Not if the input is imperfect. For example, malform a start tag, or switch the script start and end tags:
html := `aaaa<scriptxxx</script>bbb`
fmt.Println(html)
fmt.Println(removeScripts(html))
html = `aaaa</script>xxx<script>bbb`
fmt.Println(html)
fmt.Println(removeScripts(html))
Output:
aaaa<scriptxxx</script>bbb
aaaabbb
aaaa</script>xxx<script>bbb
panic: runtime error: slice bounds out of range
The Go programming language was designed to operate at Google scale. Go programs are usually written to be reasonably efficient. For example, assume that your function is used by Google's search web crawler on billions of HTML pages (with multiple scripts per page) per day.
I see opportunities to make your function more efficient by using CPU time and memory only when necessary. To estimate how much your function might be improved, I ran some Go benchmarks.
old.txt (Lansana):
goos: linux
goarch: amd64
pkg: cr/script
BenchmarkLansana-4 2000 604549 ns/op 79872 B/op 16 allocs/op
new.txt (PeterSO):
goos: linux
goarch: amd64
pkg: cr/script
BenchmarkPeterSO-4 100000 11039 ns/op 10240 B/op 2 allocs/op
old.txt (Lansana) versus new.txt (PeterSO):
benchmark old ns/op new ns/op delta
BenchmarkScripts-4 604549 11039 -98.17%
benchmark old allocs new allocs delta
BenchmarkScripts-4 16 2 -87.50%
benchmark old bytes new bytes delta
BenchmarkScripts-4 79872 10240 -87.18%
You should try to make your function more efficient. Don't restart at the beginning, don't make unnecessary string and other allocations, don't copy unnecessarily, and so on. Benchmark critical functions and methods.
script_test.go:
package main

import (
    "strings"
    "testing"
)

// benchmark
var (
    scriptElement = `<script type="text/javascript">` + strings.Repeat(` JavaScript `, 8) + `</script>`
    htmlElement   = ` ` + scriptElement + strings.Repeat(`X`, 1024) + scriptElement + ` `
    html          = strings.Repeat(htmlElement, 4)
)

// removeScripts removes all HTML script elements.
func removeScriptsPeterSO(s string) string {
    const (
        startTag = "<script"
        endTag   = "</script>"
    )
    start := strings.Index(s, startTag)
    if start < 0 {
        return s
    }
    b := make([]byte, start, len(s))
    copy(b, s)
    for {
        end := strings.Index(s[start+len(startTag):], endTag)
        if end < 0 {
            b = append(b, s[start:]...)
            break
        }
        end += (start + len(startTag)) + len(endTag)
        start = strings.Index(s[end:], startTag)
        if start < 0 {
            b = append(b, s[end:]...)
            break
        }
        start += end
        b = append(b, s[end:start]...)
    }
    return string(b)
}

func BenchmarkPeterSO(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        removeScriptsPeterSO(html)
    }
}

func removeScriptsLansana(s string) string {
    startingScriptTag := "<script"
    endingScriptTag := "</script>"
    var script string
    for {
        startingScriptTagIndex := strings.Index(s, startingScriptTag)
        endingScriptTagIndex := strings.Index(s, endingScriptTag)
        if startingScriptTagIndex > -1 && endingScriptTagIndex > -1 {
            script = s[startingScriptTagIndex : endingScriptTagIndex+len(endingScriptTag)]
            s = strings.Replace(s, script, "", 1)
            continue
        }
        break
    }
    return s
}

func BenchmarkLansana(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        removeScriptsLansana(html)
    }
}
For a program to be correct, maintainable, and reasonably efficient, it must be readable. This is not readable:
func removeScripts(s string) string {
    script = s[startingScriptTagIndex : endingScriptTagIndex+len(endingScriptTag)]
}
This is more readable:
func removeScripts(s string) string {
    script = s[start : end+len(endTag)]
}
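For comparison, here is the whole function with nothing changed but the names (and the tag strings moved into a const block); this is only a readability sketch, so the correctness problems shown above are deliberately left in place:
func removeScripts(s string) string {
    const (
        startTag = "<script"
        endTag   = "</script>"
    )
    var script string
    for {
        start := strings.Index(s, startTag)
        end := strings.Index(s, endTag)
        if start > -1 && end > -1 {
            script = s[start : end+len(endTag)]
            s = strings.Replace(s, script, "", 1)
            continue
        }
        break
    }
    return s
}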
- Wow, thanks for the elaborate answer! I thought I was being clever by using some dynamic programming/memoization but apparently it's super inefficient! I come from a non-CS background (self taught), so efficiency in terms of time/space complexity is still something I'm working to improve on every day. I will come back and read through your code when I get home to see how it actually works... – Lansana, Apr 22, 2017 at 16:06
- What if some (admittedly sick) person decided to use client-side generated custom tags? For example, a code blog that styles code snippet blocks by wrapping them in a custom <script-block> tag? Using <script to find the opening tag wouldn't be reliable. Nor would you be able to handle an unfortunately placed space char. After years of web dev, I've seen all sorts of horrific types of tag-soup. Trust me: never assume that a tag is going to be as predictable to detect. A more realistic scenario to begin with: old <SCRIPT> tags (upper-case) & lower-case close tags will be a pain. Parse it. – Elias Van Ootegem, Apr 24, 2017 at 16:46
- < script and probably a number of other variations that will work in a browser (peterSO's answer has the same problem). Also, I wouldn't be surprised if there are ways to get browsers to run code even when there's no closing </script>. If security is your aim, I recommend goquery or, even better, bluemonday.
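For what it's worth, here is a minimal sketch of the bluemonday route that comment recommends, assuming the github.com/microcosm-cc/bluemonday package (UGCPolicy is one of its stock policies and does not allow script elements):
package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // UGCPolicy allows typical user-generated-content markup, but not <script>.
    p := bluemonday.UGCPolicy()
    fmt.Println(p.Sanitize(`<p>hello</p><script>alert("xss")</script>`))
    // expected: <p>hello</p> (the script element and its contents are dropped)
}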