\$\begingroup\$

I recently decided to try to learn the Go language. In order to do this, I wrote a small program which parses a JSON configuration file, and generates N bson documents according to its properties.

For example, with this config file:

{
 "collection": "test",
 "count": 100,
 "content" : {
 "name": {
 "type": "string",
 "nullPercentage": 10,
 "length": 8
 },
 "count": {
 "type": "int",
 "nullPercentage": 30,
 "min": 1,
 "max": 200
 },
 "verified": {
 "type": "boolean",
 "nullPercentage": 0
 },
 "firstArray": {
 "type": "array",
 "nullPercentage": 10,
 "size": 3,
 "arrayContent": {
 "type": "string",
 "length": 3
 }
 },
 "firstObject": {
 "type": "object",
 "nullPercentage": 10,
 "objectContent": {
 "key1": { 
 "type": "string",
 "nullPercentage": 0,
 "length": 12
 }, 
 "key2": { 
 "type": "int",
 "nullPercentage": 50,
 "min": 10,
 "max": 20
 }
 }
 }
 }
}

the program will generate 100 BSON documents (the value of "count"), which look like this:

{
 "count": 55,
 "firstArray": [
 "tco",
 "nua",
 "uim"
 ],
 "name": "zninfepa",
 "verified": false
 },
 {
 "count": 67,
 "firstArray": [
 "djt",
 "cei",
 "lty"
 ],
 "firstObject": {
 "key1": "nbbogspsvqsw",
 "key2": 19
 },
 "verified": true
 },
 ...

These documents are then stored in a database (mongoDB), but I skip that part in this question to focus on the generation of bson documents. I wrote a small prototype in Java, and then tried to convert it to Go. Here is what I have so far:

package main
import (
 "encoding/json"
 "fmt"
 "gopkg.in/mgo.v2/bson"
 "io/ioutil"
 "math/rand"
 "os"
 "time"
)
const (
 letterBytes = "abcdefghijklmnopqrstuvwxyz"
 letterIdxBits = 6 // 6 bits to represent a letter index
 letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
 letterIdxMax = 63 / letterIdxBits // # of letter indices fitting in 63 bits
)
// GeneratorJSON structure containing all possible options
type GeneratorJSON struct {
 // Type of object to generate. string | int | boolean supported for the moment
 Type string `json:"type"`
 // For `string` type only. Specify the length of the string to generate
 Length int `json:"length"`
 // Percentage of documents that won't contain this field
 NullPercentage int `json:"nullPercentage"`
 // For `int` type only. Lower bound for the int to generate
 Min int `json:"min"`
 // For `int` type only. Higher bound for the int to generate
 Max int `json:"max"`
 // For `array` only. Size of the array
 Size int `json:"size"`
 // For `array` only. GeneratorJSON to fill the array. Need to
 // pass a pointer here to avoid 'invalid recursive type' error
 ArrayContent *GeneratorJSON `json:"arrayContent"`
 // For `object` only. List of GeneratorJSON to generate the content
 // of the object
 ObjectContent map[string]GeneratorJSON `json:"objectContent"`
}
// Collection structure storing global collection info
type Collection struct {
 // Collection name in the database
 Collection string `json:"collection"`
 // Number of documents to insert in the collection
 Count int `json:"count"`
 // Schema of the documents for this collection
 Content map[string]GeneratorJSON `json:"content"`
}
// Generatorer interface for all generator objects
type Generatorer interface {
 // Get a random value according to the generator type. string | int | boolean supported for the moment
 getValue(r rand.Rand) interface{}
 getCommonProperties() CommonProperties
}
// CommonProper interface for commonProperties object ( methods with same behavior for each generator)
type CommonProper interface {
 // Generate a pseudo-random boolean with `nullPercentage` chance of being false
 exist(r rand.Rand) bool
 // Get the key of the generator
 getKey() string
}
// CommonProperties store
type CommonProperties struct {
 key string
 nullPercentage int
}
// StringGenerator struct that implements Generatorer. Used to
// generate random string of `length` length
type StringGenerator struct {
 common CommonProperties
 length int
}
// IntGenerator struct that implements Generatorer. Used to
// generate random int between `min` and `max`
type IntGenerator struct {
 common CommonProperties
 min int
 max int
}
// BoolGenerator struct that implements Generatorer. Used to
// generate random bool
type BoolGenerator struct {
 common CommonProperties
}
// ArrayGenerator struct that implements Generatorer. Used to
// generate random array
type ArrayGenerator struct {
 common CommonProperties
 size int
 generator Generatorer
}
// ObjectGenerator struct that implements Generatorer. Used to
// generate random object
type ObjectGenerator struct {
 common CommonProperties
 generatorList []Generatorer
}
func (c CommonProperties) exist(r rand.Rand) bool { return r.Intn(100) > c.nullPercentage }
func (c CommonProperties) getKey() string { return c.key }
// getValue returns a random String of `g.length` length
func (g StringGenerator) getValue(r rand.Rand) interface{} {
 by := make([]byte, g.length)
 for i, cache, remain := g.length-1, r.Int63(), letterIdxMax; i >= 0; {
 if remain == 0 {
 cache, remain = r.Int63(), letterIdxMax
 }
 if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
 by[i] = letterBytes[idx]
 i--
 }
 cache >>= letterIdxBits
 remain--
 }
 return string(by)
}
// getValue returns a random int between `g.min` and `g.max`
func (g IntGenerator) getValue(r rand.Rand) interface{} { return r.Intn(g.max-g.min) + g.min }
// getValue returns a random boolean
func (g BoolGenerator) getValue(r rand.Rand) interface{} { return r.Int()%2 == 0 }
// getValue returns a random array
func (g ArrayGenerator) getValue(r rand.Rand) interface{} {
 array := make([]interface{}, g.size)
 for i := 0; i < g.size; i++ {
 array[i] = g.generator.getValue(r)
 }
 return array
}
// getValue returns a random object
func (g ObjectGenerator) getValue(r rand.Rand) interface{} {
 m := bson.M{}
 for _, gen := range g.generatorList {
 if gen.getCommonProperties().exist(r) {
 m[gen.getCommonProperties().getKey()] = gen.getValue(r)
 }
 }
 return m
}
func (g StringGenerator) getCommonProperties() CommonProperties { return g.common }
func (g IntGenerator) getCommonProperties() CommonProperties { return g.common }
func (g BoolGenerator) getCommonProperties() CommonProperties { return g.common }
func (g ArrayGenerator) getCommonProperties() CommonProperties { return g.common }
func (g ObjectGenerator) getCommonProperties() CommonProperties { return g.common }
// GetPropertyGenerator returns an array of generator from
// a list of GeneratorJSON
func GetPropertyGenerator(k string, v GeneratorJSON) Generatorer {
 switch v.Type {
 case "string":
 return StringGenerator{common: CommonProperties{key: k, nullPercentage: v.NullPercentage}, length: v.Length}
 case "int":
 return IntGenerator{common: CommonProperties{key: k, nullPercentage: v.NullPercentage}, min: v.Min, max: v.Max}
 case "boolean":
 return BoolGenerator{common: CommonProperties{key: k, nullPercentage: v.NullPercentage}}
 case "array":
 return ArrayGenerator{common: CommonProperties{key: k, nullPercentage: v.NullPercentage}, size: v.Size, generator: GetPropertyGenerator("", *v.ArrayContent)}
 case "object":
 var genArr = GeneratePropertyGeneratorList(v.ObjectContent)
 return ObjectGenerator{common: CommonProperties{key: k, nullPercentage: v.NullPercentage}, generatorList: genArr}
 default:
 return BoolGenerator{common: CommonProperties{key: k, nullPercentage: v.NullPercentage}}
 }
}
// GeneratePropertyGeneratorList create an array of generators from a JSON GeneratorJSON document
func GeneratePropertyGeneratorList(content map[string]GeneratorJSON) []Generatorer {
 genArr := make([]Generatorer, 0)
 for k, v := range content {
 genArr = append(genArr, GetPropertyGenerator(k, v))
 }
 return genArr
}
func main() {
 // Create a rand.Rand object to generate our random values
 var randSource = rand.New(rand.NewSource(time.Now().UnixNano()))
 // read the json config file
 file, e := ioutil.ReadFile("./config.json")
 if e != nil {
 fmt.Printf("File error: %v\n", e)
 os.Exit(1)
 }
 // map to a json object
 var collection Collection
 err := json.Unmarshal(file, &collection)
 if err != nil {
 panic(err)
 }
 // arrays that store all generators
 var genArr = GeneratePropertyGeneratorList(collection.Content)
 // counter for already generated documents
 count := 0
 // array that store 10 bson documents
 var docList [10]bson.M
 for count < collection.Count {
 for i := 0; i < 10; i++ {
 m := bson.M{}
 // iterate over generators to create values for each key of the bson document
 for _, v := range genArr {
 // check for exist before generating a value to avoid unnecessary computations
 if v.getCommonProperties().exist(*randSource) {
 m[v.getCommonProperties().getKey()] = v.getValue(*randSource)
 }
 }
 docList[i] = m
 // insert docs in database
 count += 10
 }
 }
 // pretty print last 10 generated Objects
 rawjson, err := json.MarshalIndent(docList, "", " ")
 if err != nil {
 panic("failed")
 }
 fmt.Printf("generated: %s", string(rawjson))
}

go vet and golint both return no warnings for this code. How can this be improved, in terms of readability first, and then in terms of performance?


EDIT

I also created some benchmarks:

test file :

package main
import (
 "encoding/json"
 "fmt"
 "gopkg.in/mgo.v2/bson"
 "io/ioutil"
 "math/rand"
 "os"
 "testing"
 "time"
)
func BenchmarkRandomString(b *testing.B) {
 var randSource = rand.New(rand.NewSource(time.Now().UnixNano()))
 stringGenerator := StringGenerator{common: CommonProperties{key: "key", nullPercentage: 0}, length: 5}
 for n := 0; n < b.N; n++ {
 stringGenerator.getValue(*randSource)
 }
}
func BenchmarkRandomInt(b *testing.B) {
 var randSource = rand.New(rand.NewSource(time.Now().UnixNano()))
 intGenerator := IntGenerator{common: CommonProperties{key: "key", nullPercentage: 0}, min: 0, max: 100}
 for n := 0; n < b.N; n++ {
 intGenerator.getValue(*randSource)
 }
}
func BenchmarkRandomBool(b *testing.B) {
 var randSource = rand.New(rand.NewSource(time.Now().UnixNano()))
 boolGenerator := BoolGenerator{common: CommonProperties{key: "key", nullPercentage: 0}}
 for n := 0; n < b.N; n++ {
 boolGenerator.getValue(*randSource)
 }
}
func BenchmarkJSONGeneration(b *testing.B) {
 var randSource = rand.New(rand.NewSource(time.Now().UnixNano()))
 file, e := ioutil.ReadFile("./config.json")
 if e != nil {
 fmt.Printf("File error: %v\n", e)
 os.Exit(1)
 }
 var collection Collection
 err := json.Unmarshal(file, &collection)
 if err != nil {
 panic(err)
 }
 var genArr = GeneratePropertyGeneratorList(collection.Content)
 var docList [1000]bson.M
 for n := 0; n < b.N; n++ {
 for i := 0; i < 1000; i++ {
 m := bson.M{}
 for _, v := range genArr {
 if v.getCommonProperties().exist(*randSource) {
 m[v.getCommonProperties().getKey()] = v.getValue(*randSource)
 }
 }
 docList[i] = m
 }
 }
}

and here are the results (go test -bench=.):

BenchmarkRandomString-4 10000000 221 ns/op
BenchmarkRandomInt-4 30000000 59.2 ns/op
BenchmarkRandomBool-4 50000000 37.8 ns/op
BenchmarkJSONGeneration-4 500 2516883 ns/op

A 'real scenario' benchmark (1000000 documents with the above config.json file) gives this:

generating json doc only : 4.5s
generating json doc + inserting in db : 17s

Edit 2:

I've spent some time improving this program and open-sourced it on GitHub. It's available here: feliixx/mgodatagen

Thanks everybody!

asked Jul 11, 2017 at 11:42
\$\endgroup\$
  • \$\begingroup\$ // insert docs in database and count += 10 should probably be after the next } (outside the for i := 0; i < 10; i++ loop) \$\endgroup\$ Commented Jul 13, 2017 at 18:18
  • \$\begingroup\$ You mentioned "performance" in your last question. Did you run any benchmark? Is the generation so slow in comparison to the insertion into the database? (you could make a buffered channel between one -or more - "document producers" and a "database inserter") \$\endgroup\$ Commented Jul 14, 2017 at 10:23
  • \$\begingroup\$ @oliverpool I added some benchmark results in my edit. Data generation is faster than insertion in the database, but it still takes some time (4.5s for 1000000 docs). A test with -cpuprofile shows that garbage collection takes 1/3 of the time, followed by the getValue() functions, but I don't know how to optimize that. I'll take a look at buffered channels, it seems to be an interesting idea! \$\endgroup\$ Commented Jul 16, 2017 at 15:05
  • \$\begingroup\$ Thanks for the benchmarks: I added performance suggestions in my answer \$\endgroup\$ Commented Jul 17, 2017 at 9:30

2 Answers

\$\begingroup\$

In your main, you "iterate over generators to create values for each key". You could actually create a "base" generator for this!

It would be an array generator (10 elements, 0 nullPercentage) of objects (being the objects that you generate).

Applying @ferada's answer also leads to much less code (I also renamed CommonProper to EmptyGenerator). It uses embedding (one could optimize further by using the BoolGenerator as base instead of EmptyGenerator).

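There was an off-by-one error in the exist function: the comparison should be >= g.nullPercentage and not >, because r.Intn(100) returns a value between 0 and 99 inclusive. With >, a field with a nullPercentage of 0 would still be missing from about 1% of the documents.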
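To make the off-by-one concrete, here is a minimal sketch that enumerates all 100 values r.Intn(100) can return and counts how many keep the field under each comparison. The helper names (existsStrict, existsInclusive, countPresent) are made up for this illustration and are not part of the reviewed program.

```go
package main

import "fmt"

// existsStrict mirrors the original check: draw > nullPercentage.
func existsStrict(draw, nullPercentage int) bool { return draw > nullPercentage }

// existsInclusive mirrors the corrected check: draw >= nullPercentage.
func existsInclusive(draw, nullPercentage int) bool { return draw >= nullPercentage }

// countPresent counts how many of the 100 possible draws (0..99)
// keep the field for the given check (hypothetical helper).
func countPresent(exists func(draw, nullPercentage int) bool, nullPercentage int) int {
	n := 0
	for draw := 0; draw < 100; draw++ {
		if exists(draw, nullPercentage) {
			n++
		}
	}
	return n
}

func main() {
	for _, p := range []int{0, 10, 50} {
		fmt.Printf("nullPercentage=%d: > keeps %d/100, >= keeps %d/100\n",
			p, countPresent(existsStrict, p), countPresent(existsInclusive, p))
	}
}
```

With nullPercentage 10, the strict comparison keeps 89 of the 100 possible draws, while the inclusive one keeps the intended 90.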

Here is what I came up with:

package main
import (
 "encoding/json"
 "fmt"
 "io/ioutil"
 "math/rand"
 "os"
 "time"
 "gopkg.in/mgo.v2/bson"
)
const (
 letterBytes = "abcdefghijklmnopqrstuvwxyz"
 letterIdxBits = 6 // 6 bits to represent a letter index
 letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
 letterIdxMax = 63 / letterIdxBits // # of letter indices fitting in 63 bits
)
// GeneratorJSON structure containing all possible options
type GeneratorJSON struct {
 // Type of object to generate. string | int | boolean supported for the moment
 Type string `json:"type"`
 // For `string` type only. Specify the length of the string to generate
 Length int `json:"length"`
 // Percentage of documents that won't contain this field
 NullPercentage int `json:"nullPercentage"`
 // For `int` type only. Lower bound for the int to generate
 Min int `json:"min"`
 // For `int` type only. Higher bound for the int to generate
 Max int `json:"max"`
 // For `array` only. Size of the array
 Size int `json:"size"`
 // For `array` only. GeneratorJSON to fill the array. Need to
 // pass a pointer here to avoid 'invalid recursive type' error
 ArrayContent *GeneratorJSON `json:"arrayContent"`
 // For `object` only. List of GeneratorJSON to generate the content
 // of the object
 ObjectContent map[string]GeneratorJSON `json:"objectContent"`
}
// Collection structure storing global collection info
type Collection struct {
 // Collection name in the database
 Collection string `json:"collection"`
 // Number of documents to insert in the collection
 Count int `json:"count"`
 // Schema of the documents for this collection
 Content map[string]GeneratorJSON `json:"content"`
}
// Generator interface for all generator objects
type Generator interface {
 Key() string
 // Get a random value according to the generator type. string | int | boolean supported for the moment
 Value(r rand.Rand) interface{}
 Exists(r rand.Rand) bool
}
// EmptyGenerator serves as base for the actual generators
type EmptyGenerator struct {
 key string
 nullPercentage int
}
// Key returns the key of the object
func (g EmptyGenerator) Key() string { return g.key }
// Exists returns true if the generation should be performed
func (g EmptyGenerator) Exists(r rand.Rand) bool { return r.Intn(100) >= g.nullPercentage }
// StringGenerator struct that implements Generator. Used to
// generate random string of `length` length
type StringGenerator struct {
 EmptyGenerator
 length int
}
// Value returns a random String of `g.length` length
func (g StringGenerator) Value(r rand.Rand) interface{} {
 by := make([]byte, g.length)
 for i, cache, remain := g.length-1, r.Int63(), letterIdxMax; i >= 0; {
 if remain == 0 {
 cache, remain = r.Int63(), letterIdxMax
 }
 if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
 by[i] = letterBytes[idx]
 i--
 }
 cache >>= letterIdxBits
 remain--
 }
 return string(by)
}
// IntGenerator struct that implements Generator. Used to
// generate random int between `min` and `max`
type IntGenerator struct {
 EmptyGenerator
 min int
 max int
}
// Value returns a random int between `g.min` and `g.max`
func (g IntGenerator) Value(r rand.Rand) interface{} { return r.Intn(g.max-g.min) + g.min }
// BoolGenerator struct that implements Generator. Used to
// generate random bool
type BoolGenerator struct {
 EmptyGenerator
}
// Value returns a random boolean
func (g BoolGenerator) Value(r rand.Rand) interface{} { return r.Int()%2 == 0 }
// ArrayGenerator struct that implements Generator. Used to
// generate random array
type ArrayGenerator struct {
 EmptyGenerator
 size int
 generator Generator
}
// Value returns a random array
func (g ArrayGenerator) Value(r rand.Rand) interface{} {
 array := make([]interface{}, g.size)
 for i := 0; i < g.size; i++ {
 array[i] = g.generator.Value(r)
 }
 return array
}
// ObjectGenerator struct that implements Generator. Used to
// generate random object
type ObjectGenerator struct {
 EmptyGenerator
 generators []Generator
}
// Value returns a random object
func (g ObjectGenerator) Value(r rand.Rand) interface{} {
 m := bson.M{}
 for _, gen := range g.generators {
 if gen.Exists(r) {
 m[gen.Key()] = gen.Value(r)
 }
 }
 return m
}
// NewGenerator returns a new Generator based on a JSON configuration
func NewGenerator(k string, v GeneratorJSON) Generator {
 eg := EmptyGenerator{key: k, nullPercentage: v.NullPercentage}
 switch v.Type {
 case "string":
 return StringGenerator{EmptyGenerator: eg, length: v.Length}
 case "int":
 return IntGenerator{EmptyGenerator: eg, min: v.Min, max: v.Max}
 case "boolean":
 return BoolGenerator{EmptyGenerator: eg}
 case "array":
 return ArrayGenerator{EmptyGenerator: eg, size: v.Size, generator: NewGenerator("", *v.ArrayContent)}
 case "object":
 return ObjectGenerator{EmptyGenerator: eg, generators: NewGeneratorsFromMap(v.ObjectContent)}
 default:
 return BoolGenerator{EmptyGenerator: eg}
 }
}
// NewGeneratorsFromMap creates a slice of generators based on a JSON configuration map
func NewGeneratorsFromMap(content map[string]GeneratorJSON) []Generator {
 genArr := make([]Generator, 0)
 for k, v := range content {
 genArr = append(genArr, NewGenerator(k, v))
 }
 return genArr
}
func main() {
 // Create a rand.Rand object to generate our random values
 var randSource = rand.New(rand.NewSource(time.Now().UnixNano()))
 // read the json config file
 file, e := ioutil.ReadFile("./config.json")
 if e != nil {
 fmt.Printf("File error: %v\n", e)
 os.Exit(1)
 }
 // map to a json object
 var collection Collection
 err := json.Unmarshal(file, &collection)
 if err != nil {
 panic(err)
 }
 // arrays that store all generators
 generator := baseGenerator(collection.Content)
 // counter for already generated documents
 count := 0
 // array that store 10 bson documents
 var docList []interface{}
 for count < collection.Count {
 docList = generator.Value(*randSource).([]interface{})
 // insert docs in database
 count += generator.size
 }
 // pretty print last 10 generated Objects
 rawjson, err := json.MarshalIndent(docList, "", " ")
 if err != nil {
 panic("failed")
 }
 fmt.Printf("generated: %s", string(rawjson))
}
func baseGenerator(content map[string]GeneratorJSON) ArrayGenerator {
 return ArrayGenerator{
 size: 10,
 generator: ObjectGenerator{
 generators: NewGeneratorsFromMap(content),
 },
 }
}

Edit regarding performance

I don't see simple changes which could drastically improve the speed.

There are some minor changes (from looking at the source of rand.go):

func (g BoolGenerator) Value(r rand.Rand) interface{} { return r.Int63()&1 == 0 }

The Exists method could also be faster if it didn't use Intn. For instance, it could use the last 7 bits (from 0 to 127) of Int63 (or multiply the probability by 10 and use the last 10 bits - from 0 to 1023).

But there is probably a bigger optimization regarding the generation + insertion in the database.

Currently the data is generated and inserted into the database sequentially. You could do it concurrently with a record channel (a chan []interface{}, since each generated value is a batch of documents) and two goroutines:

  • a producer which fills the channel
  • a consumer which inserts them into the database (could be the main goroutine)

Possible code:

record := make(chan []interface{}, 3) // 3 is the buffer size
// generate records concurrently
go func() {
 for count < collection.Count {
 record <- generator.Value(*randSource).([]interface{})
 count += generator.size
 }
 close(record)
}()
// save the records
for r := range record {
 _ = r
 // insert record in DB
}

If you have a multicore processor, this should improve the overall performance: instead of having 4.5s + 12.5s, it should be much closer to 12.5s (with some overhead for the first run and the synchronization)

answered Jul 13, 2017 at 19:08
\$\endgroup\$
  • \$\begingroup\$ Thanks for the time and effort you put in your answer, it definitely helped me a lot! \$\endgroup\$ Commented Jul 17, 2017 at 11:21
\$\begingroup\$

A few notes, in general it looks good to me.

  • Generatorer should probably just be Generator, exist should be exists. CommonProper is harder, perhaps WithCommonProperties or HasCommonProperties.
  • CommonProperties doesn't need the field name common in all the structs; you could simply embed it:

    type StringGenerator struct {
     CommonProperties
     length int
    }
    

    Though with that you'll have to mention the name twice when creating the objects, e.g. StringGenerator{CommonProperties: CommonProperties{key: k, nullPercentage: v.NullPercentage}, length: v.Length}.

    Also you'll only need one getCommonProperties with that: func (g CommonProperties) getCommonProperties() CommonProperties { return g } basically.
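A small sketch of how the embedding plays out, assuming the struct and method names from the question. The commonGetter interface and the getKey method on CommonProperties are added here only for the demonstration: methods declared on the embedded CommonProperties are promoted to every struct that embeds it, so each generator satisfies the interface without its own copy.

```go
package main

import "fmt"

// CommonProperties holds the fields shared by all generators.
type CommonProperties struct {
	key            string
	nullPercentage int
}

// Declared once; promoted to every struct embedding CommonProperties.
func (c CommonProperties) getKey() string                        { return c.key }
func (c CommonProperties) getCommonProperties() CommonProperties { return c }

// StringGenerator embeds CommonProperties instead of naming a field.
type StringGenerator struct {
	CommonProperties
	length int
}

// IntGenerator embeds CommonProperties in the same way.
type IntGenerator struct {
	CommonProperties
	min, max int
}

// commonGetter is a hypothetical interface used only for this demo.
type commonGetter interface {
	getKey() string
	getCommonProperties() CommonProperties
}

func main() {
	gens := []commonGetter{
		StringGenerator{CommonProperties: CommonProperties{key: "name", nullPercentage: 10}, length: 8},
		IntGenerator{CommonProperties: CommonProperties{key: "count", nullPercentage: 30}, min: 1, max: 200},
	}
	for _, g := range gens {
		fmt.Println(g.getKey(), g.getCommonProperties().nullPercentage)
	}
}
```

Running it prints "name 10" and then "count 30".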

answered Jul 13, 2017 at 17:22
\$\endgroup\$
