I have a slice of strings, and each string contains multiple `key=value` formatted messages. I want to pull all the keys out of the strings so I can use them as the header for a CSV file. I do not know all the potential key fields in advance, so I have to use regular expression matching to find them.
Here is my code:
package main

import (
	"fmt"
	"regexp"
)

func GetKeys(logs []string) []string {
	// topMatches is the final slice to be returned.
	// midMatches contains no duplicates, but the data is `key=`.
	// subMatches contains all initial matches.
	// initialRegex matches anything that looks like `key=`; this is because of the matching patterns.
	// cleanRegex massages `key=` to `key`.
	topMatches := []string{}
	midMatches := []string{}
	subMatches := []string{}
	initialRegex := regexp.MustCompile(`([a-zA-Z]{1,}\=)`)
	cleanRegex := regexp.MustCompile(`([a-zA-Z]{1,})`)
	// the nested loop for matches is because FindAllString
	// returns []string
	for _, i := range logs {
		matches := initialRegex.FindAllString(i, -1)
		for _, m := range matches {
			subMatches = append(subMatches, m)
		}
	}
	// remove duplicates.
	seen := map[string]string{}
	for _, x := range subMatches {
		if _, ok := seen[x]; !ok {
			midMatches = append(midMatches, x)
			seen[x] = x
		}
	}
	// this is where I remove the `=` character.
	for _, y := range midMatches {
		clean := cleanRegex.FindAllString(y, 1)
		topMatches = append(topMatches, clean[0])
	}
	return topMatches
}

func main() {
	y := []string{"key=value", "msg=payload", "test=yay", "msg=payload"}
	y = GetKeys(y)
	fmt.Println(y)
}
I think my code is inefficient because I cannot determine how to properly optimise the `initialRegex` regular expression to match just the key in the `key=value` format without matching the value as well.
Can my first regular expression, `initialRegex`, be optimised so I do not have to do a second matching loop to remove the `=` character?
Playground: http://play.golang.org/p/ONMf_cympM
2 Answers
You're not making good use of regular expressions. A single regex can do the job:
pattern := regexp.MustCompile(`([a-zA-Z]+)=`)
The parentheses `(...)` capture the part you are interested in.
You can use `result := pattern.FindAllStringSubmatch(s, -1)` to match a string against the regex pattern. The return value is a `[][]string` with one `[]string` slice per match: the 1st element is the entire matched string, and the 2nd, 3rd, ... elements hold the content of the capture groups. In this example we have one capture group `(...)`, so the key will be in `item[1]` of each `[]string` slice.
Instead of a `map[string]string` for `seen`, a `map[string]bool` would be more efficient, because a `bool` value takes less storage than a `string`.
Putting it together:
func GetKeys(logs []string) []string {
	var keys []string
	pattern := regexp.MustCompile(`([a-zA-Z]+)=`)
	seen := make(map[string]bool)
	for _, log := range logs {
		result := pattern.FindAllStringSubmatch(log, -1)
		for _, item := range result {
			key := item[1]
			if _, ok := seen[key]; !ok {
				keys = append(keys, key)
				seen[key] = true
			}
		}
	}
	return keys
}
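If the input strings are not guaranteed to match the pattern, no extra handling is strictly needed: `FindAllStringSubmatch` returns nil when nothing matches, so the inner loop simply does not run. If you still want an explicit guard, check each match inside the inner for loop, for example:
	if len(item) != 2 {
		continue
	}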
- I was reading about submatches, but I wasn't entirely sure if that's what I wanted or not because I didn't understand that was for capture groups. That makes a lot of sense. What is more efficient about a boolean over the string? (Sienna, Mar 4, 2016 at 20:32)
- @mynameismevin the storage of a `bool` is smaller than a `string` (janos, Mar 6, 2016 at 20:53)
- The original code used `FindAllString` to find multiple matches per message, but yours uses `FindStringSubmatch` which looks to return a single match. (David Harkness, Mar 7, 2016 at 21:36)
- A `map[string]struct{}` is even more efficient and uses less memory than `map[string]bool`. (OneOfOne, Mar 28, 2016 at 8:38)
- @OneOfOne really? how? and how would I change the line `seen[key] = true` to make that work? (janos, Mar 28, 2016 at 10:13)
I know this is an old question, but it popped up in my feed so I figured I'd contribute.
Out of curiosity, why use a regular expression at all? You could achieve the same thing using the standard `strings` package and keep things simple. Here's a Playground that outputs the same result as your Playground.
package main

import (
	"fmt"
	"strings"
)

func GetKeys(logs []string) []string {
	exists := make(map[string]bool)
	keys := make([]string, 0)
	for _, log := range logs {
		// Split always returns at least one element, so require two
		// to be sure the string actually contained an "=".
		parts := strings.Split(log, "=")
		if len(parts) >= 2 {
			k := parts[0]
			if !exists[k] {
				keys = append(keys, k)
				exists[k] = true
			}
		}
	}
	return keys
}

func main() {
	y := []string{"key=value", "msg=payload", "test=yay", "msg=payload"}
	fmt.Println(GetKeys(y))
}
`{1,}` is equivalent to `+`, and if the Go `regexp` package supports it, you can use positive look-ahead to detect but not capture the `=`: `[a-zA-Z]+(?=\=)`. IIRC, the second `=` doesn't need to be escaped since it has no special meaning outside of this context. Finally, I doubt you need the capturing group around the whole expression.
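On the "if the Go `regexp` package supports it" caveat: Go's `regexp` package uses RE2 syntax, which does not include look-ahead, so the look-ahead pattern is rejected at compile time. A small check, using a made-up helper `compiles`, shows the difference between the two patterns discussed in this thread:

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper that reports whether Go's regexp
// package accepts the given expression.
func compiles(expr string) bool {
	_, err := regexp.Compile(expr)
	return err == nil
}

func main() {
	fmt.Println(compiles(`[a-zA-Z]+(?=\=)`)) // look-ahead: prints false
	fmt.Println(compiles(`([a-zA-Z]+)=`))    // capture group: prints true
}
```

With `regexp.MustCompile` the look-ahead pattern would panic instead of returning an error, so a capture group read through `FindAllStringSubmatch` remains the practical way to drop the `=`.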