Return to Answer

deleted 195 characters in body

Source Link

edited Sep 3, 2012 at 18:33

Gareth Rees

edited Sep 3, 2012 at 18:33

Gareth Rees

50.1k
3
130
210

There's no test suite. I know it's hard for you to write test cases because of the randomness. You probably want to take a hint from Mennon van Slooten and provide a way for to change the random choices to a deterministic sequence of choices.
Your collections of example data (_male_first_name etc.) are stored as tuples, but it would be better to store these as lists. Tuples are fixed-size collections typically used as lightweight representations of records. Lists sets are variable-size collections typically containing similar items.
Your example data is very America-centric. Given that the purpose of your module is to produce example data for testing, it will be useful to have a wider variety of names, including names with accented letters, or in different scripts.
It is generally results in source code that is easier to read (with fewer quotation characters and commas) if you produce this kind of data using split. For example,
```
 _female_first_names = u"""
 Mary Patricia Linda Barbara Elizabeth
 Zoé Yến София فاطمة 明子 美玲 ยิ่งลักษณ์
 """.split()
```
The function _random_item is already in the Python library as random.choice.
The string "0123456789" is already in the Python library as string.digits.
It's not clear why _random_data needs to strip all initial @ signs from its argument before trying to look it up. You only call this function in four places: either with a literal argument (for example, _random_data('@LETTER_LOWER')) where you could just call the corresponding function directly, or in a context where the calleryou could have easily strippedstrip the @ themselves. It also fails to raise an error if the key is missing (so that mistakes like @NUBMER are not caught).

Your definitions of data will look nicer (fewer quotation marks) if you use the dict constructor:

 data = dict(
 NUMBER = lambda: _random_item("0123456789"),
 LETTER_UPPER = lambda: _random_item(string.uppercase),
 LETTER_LOWER = lambda: _random_item(string.lowercase),
 ...

Since most of the values in data are of the form lambda: _random_item(...), why not special-case them to save on boilerplate? You could write something like:

 def _random_data(key):
 """
 Construct a random data item for `key` and return it.
 Raise KeyError if `key` is unknown.
 """
 constructor = data[key]
 if isinstance(constructor, types.FunctionType):
 return constructor()
 else:
 return random.choice(constructor)

 increment = 0
 template_key = (id(template), key)
 if template_key in self._increments:
 increment = _increments[template_key]
 elif m.group(4):
 increment = int(m.group(4))
 _increments[template_key] = increment

OK, let's put all that together. (There are a couple of bonus improvements in here for you to spot for yourself!)

_constantly_zero = itertools.repeat(0)
class _Instantiator(object):
 def __init__(self):
 self._increments = dict()
 def instantiate(self, template, have_n = False, n = 1,
 increment = _constantly_zero):
 if isinstance(template, dict):
 generated = {}
 for key, value in template.iteritems():
 m = re.match(r"^(\w+)(?:\|(?:(\d+)-(\d+)|\+(\d+)))?$", key)
 if not m:
 raise ValueError("Bad key: {0}".format(key))
 have_n = False
 n = 1
 if m.group(2):
 have_n = True
 n = random.randint(int(m.group(2)), int(m.group(3)))
 template_key = (id(template), key)
 increment = _constantly_zero
 if template_key in self._increments:
 increment = self._increments[template_key]
 elif m.group(4):
 increment = itertools.count(start = 0, step = int(m.group(4)))
 self._increments[template_key] = increment
 generated[m.group(1)] = self.instantiate(value, have_n, n, increment)
 return generated
 elif isinstance(template, list):
 return [self.instantiate(template[0]) for _ in xrange(n)]
 elif isinstance(template, bool):
 if have_n:
 return [False, True][n]bool(n)
 else:
 return template
 elif isinstance(template, int):
 if have_n:
 return n
 else:
 return template + next(increment)
 elif isinstance(template, str) or isinstance(template, unicode):
 if template:
 pattern = re.compile(r'@([A-Z_0-9]+)')
 repl = lambda m: _random_data(m.group(1))
 return ''.join(pattern.sub(repl, template) for _ in xrange(n))
 else:
 return ''.join(random.choice(string.letters) for _ in xrange(n))
 else:
 return template
def instantiate_object(template):
 """
 Instantiate an object based on `template` and return it.
 """
 return _Instantiator().instantiate(template)

There's no test suite. I know it's hard for you to write test cases because of the randomness. You probably want to take a hint from Mennon van Slooten and provide a way for to change the random choices to a deterministic sequence of choices.
Your collections of example data (_male_first_name etc.) are stored as tuples, but it would be better to store these as lists. Tuples are fixed-size collections typically used as lightweight representations of records. Lists sets are variable-size collections typically containing similar items.
Your example data is very America-centric. Given that the purpose of your module is to produce example data for testing, it will be useful to have a wider variety of names, including names with accented letters, or in different scripts.
It is generally results in source code that is easier to read (with fewer quotation characters and commas) if you produce this kind of data using split. For example,
```
 _female_first_names = u"""
 Mary Patricia Linda Barbara Elizabeth
 Zoé Yến София فاطمة 明子 美玲 ยิ่งลักษณ์
 """.split()
```
The function _random_item is already in the Python library as random.choice.
The string "0123456789" is already in the Python library as string.digits.
It's not clear why _random_data needs to strip all initial @ signs from its argument before trying to look it up. You only call this function in four places: either with a literal argument (for example, _random_data('@LETTER_LOWER')) where you could just call the corresponding function directly, or in a context where the caller could have easily stripped the @ themselves. It also fails to raise an error if the key is missing (so that mistakes like @NUBMER are not caught).

Your definitions of data will look nicer (fewer quotation marks) if you use the dict constructor:

 data = dict(
 NUMBER = lambda: _random_item("0123456789"),
 LETTER_UPPER = lambda: _random_item(string.uppercase),
 LETTER_LOWER = lambda: _random_item(string.lowercase),
 ...

Since most of the values in data are of the form lambda: _random_item(...), why not special-case them to save boilerplate? You could write something like:

 def _random_data(key):
 """
 Construct a random data item for `key` and return it.
 Raise KeyError if `key` is unknown.
 """
 constructor = data[key]
 if isinstance(constructor, types.FunctionType):
 return constructor()
 else:
 return random.choice(constructor)

 increment = 0
 template_key = (id(template), key)
 if template_key in self._increments:
 increment = _increments[template_key]
 elif m.group(4):
 increment = int(m.group(4))
 _increments[template_key] = increment

OK, let's put all that together.

_constantly_zero = itertools.repeat(0)
class _Instantiator(object):
 def __init__(self):
 self._increments = dict()
 def instantiate(self, template, have_n = False, n = 1,
 increment = _constantly_zero):
 if isinstance(template, dict):
 generated = {}
 for key, value in template.iteritems():
 m = re.match(r"^(\w+)(?:\|(?:(\d+)-(\d+)|\+(\d+)))?$", key)
 if not m:
 raise ValueError("Bad key: {0}".format(key))
 have_n = False
 n = 1
 if m.group(2):
 have_n = True
 n = random.randint(int(m.group(2)), int(m.group(3)))
 template_key = (id(template), key)
 increment = _constantly_zero
 if template_key in self._increments:
 increment = self._increments[template_key]
 elif m.group(4):
 increment = itertools.count(start = 0, step = int(m.group(4)))
 self._increments[template_key] = increment
 generated[m.group(1)] = self.instantiate(value, have_n, n, increment)
 return generated
 elif isinstance(template, list):
 return [self.instantiate(template[0]) for _ in xrange(n)]
 elif isinstance(template, bool):
 if have_n:
 return [False, True][n]
 else:
 return template
 elif isinstance(template, int):
 if have_n:
 return n
 else:
 return template + next(increment)
 elif isinstance(template, str) or isinstance(template, unicode):
 if template:
 pattern = re.compile(r'@([A-Z_0-9]+)')
 repl = lambda m: _random_data(m.group(1))
 return ''.join(pattern.sub(repl, template) for _ in xrange(n))
 else:
 return ''.join(random.choice(string.letters) for _ in xrange(n))
 else:
 return template
def instantiate_object(template):
 """
 Instantiate an object based on `template` and return it.
 """
 return _Instantiator().instantiate(template)

There's no test suite. I know it's hard for you to write test cases because of the randomness. You probably want to take a hint from Mennon van Slooten and provide a way for to change the random choices to a deterministic sequence of choices.
Your collections of example data (_male_first_name etc.) are stored as tuples, but it would be better to store these as lists. Tuples are fixed-size collections typically used as lightweight representations of records. Lists sets are variable-size collections typically containing similar items.
Your example data is very America-centric. Given that the purpose of your module is to produce example data for testing, it will be useful to have a wider variety of names, including names with accented letters, or in different scripts.
It is generally results in source code that is easier to read (with fewer quotation characters and commas) if you produce this kind of data using split. For example,
```
 _female_first_names = u"""
 Mary Patricia Linda Barbara Elizabeth
 Zoé Yến София فاطمة 明子 美玲 ยิ่งลักษณ์
 """.split()
```
The function _random_item is already in the Python library as random.choice.
The string "0123456789" is already in the Python library as string.digits.
It's not clear why _random_data needs to strip all initial @ signs from its argument before trying to look it up. You only call this function in a context where you could easily strip the @. It also fails to raise an error if the key is missing (so that mistakes like @NUBMER are not caught).

Your definitions of data will look nicer (fewer quotation marks) if you use the dict constructor:

 data = dict(
 NUMBER = lambda: _random_item("0123456789"),
 LETTER_UPPER = lambda: _random_item(string.uppercase),
 LETTER_LOWER = lambda: _random_item(string.lowercase),
 ...

Since most of the values in data are of the form lambda: _random_item(...), why not special-case them to save on boilerplate? You could write something like:

 def _random_data(key):
 """
 Construct a random data item for `key` and return it.
 Raise KeyError if `key` is unknown.
 """
 constructor = data[key]
 if isinstance(constructor, types.FunctionType):
 return constructor()
 else:
 return random.choice(constructor)

 increment = 0
 template_key = (id(template), key)
 if template_key in _increments:
 increment = _increments[template_key]
 elif m.group(4):
 increment = int(m.group(4))
 _increments[template_key] = increment

OK, let's put all that together. (There are a couple of bonus improvements in here for you to spot for yourself!)

_constantly_zero = itertools.repeat(0)
class _Instantiator(object):
 def __init__(self):
 self._increments = dict()
 def instantiate(self, template, have_n = False, n = 1,
 increment = _constantly_zero):
 if isinstance(template, dict):
 generated = {}
 for key, value in template.iteritems():
 m = re.match(r"^(\w+)(?:\|(?:(\d+)-(\d+)|\+(\d+)))?$", key)
 if not m:
 raise ValueError("Bad key: {0}".format(key))
 have_n = False
 n = 1
 if m.group(2):
 have_n = True
 n = random.randint(int(m.group(2)), int(m.group(3)))
 template_key = (id(template), key)
 increment = _constantly_zero
 if template_key in self._increments:
 increment = self._increments[template_key]
 elif m.group(4):
 increment = itertools.count(start = 0, step = int(m.group(4)))
 self._increments[template_key] = increment
 generated[m.group(1)] = self.instantiate(value, have_n, n, increment)
 return generated
 elif isinstance(template, list):
 return [self.instantiate(template[0]) for _ in xrange(n)]
 elif isinstance(template, bool):
 if have_n:
 return bool(n)
 else:
 return template
 elif isinstance(template, int):
 if have_n:
 return n
 else:
 return template + next(increment)
 elif isinstance(template, str) or isinstance(template, unicode):
 if template:
 pattern = re.compile(r'@([A-Z_0-9]+)')
 repl = lambda m: _random_data(m.group(1))
 return ''.join(pattern.sub(repl, template) for _ in xrange(n))
 else:
 return ''.join(random.choice(string.letters) for _ in xrange(n))
 else:
 return template
def instantiate_object(template):
 """
 Instantiate an object based on `template` and return it.
 """
 return _Instantiator().instantiate(template)

oops, paragraph about `have_n` was curtailed

Source Link

edited Sep 3, 2012 at 18:17

Gareth Rees

edited Sep 3, 2012 at 18:17

Gareth Rees

50.1k
3
130
210

The name could be improved: generate has a special meaning in Python (referring to the output of a generator), and the json part is misleading since there's nothing JSON-specific about this function (it operates on Python objects, not on JSON representations). Since instantiate is often used for filling in a template, I think I would use a name like instantiate_object. (And instantiate_json for generate_json.)
You have to go through contortions because you are overloading the meaning ofThe variable length is misleadingly named because it's not always a length (it might just be a number). Perhaps n would be better. You also have to go through contortions because you are overloading its meaning: it's None if no lengthnumber was specified (in which case sometimes it needneeds to be treated as if it has the value 1), or else it's a number. It would be much clearer to have a separate variable (have_n in my code below) that indicates whether a number was supplied.
Your regular expression for finding repeat counts in JSON keys is \w+\|(\d+)-(\d+). This matches strings like "a|0-2trailing garbage". It's probably a good idea to anchor it to the end of the string (\w\|(\d+)-(\d+)$). Or maybe, depending on exactly what you want to match, to the beginning of the string as well: re.match(r'\w+\|(\d+)-(\d+)$', name). But then you need to decide what to do for non-matching keys. Maybe raise an exception?

The structure

 matches = re.search(...)
 if matches:
 ...

The name could be improved: generate has a special meaning in Python (referring to the output of a generator), and the json part is misleading since there's nothing JSON-specific about this function (it operates on Python objects, not on JSON representations). Since instantiate is often used for filling in a template, I think I would use a name like instantiate_object. (And instantiate_json for generate_json.)
You have to go through contortions because you are overloading the meaning of length: it's None if no length was specified (in which case sometimes it need to be treated as if it has the value 1)
Your regular expression for finding repeat counts in JSON keys is \w+\|(\d+)-(\d+). This matches strings like "a|0-2trailing garbage". It's probably a good idea to anchor it to the end of the string (\w\|(\d+)-(\d+)$). Or maybe, depending on exactly what you want to match, to the beginning of the string as well: re.match(r'\w+\|(\d+)-(\d+)$', name). But then you need to decide what to do for non-matching keys. Maybe raise an exception?

The structure

 matches = re.search(...)
 if matches:
 ...

The name could be improved: generate has a special meaning in Python (referring to the output of a generator), and the json part is misleading since there's nothing JSON-specific about this function (it operates on Python objects, not on JSON representations). Since instantiate is often used for filling in a template, I think I would use a name like instantiate_object. (And instantiate_json for generate_json.)
The variable length is misleadingly named because it's not always a length (it might just be a number). Perhaps n would be better. You also have to go through contortions because you are overloading its meaning: it's None if no number was specified (in which case sometimes it needs to be treated as if it has the value 1), or else it's a number. It would be much clearer to have a separate variable (have_n in my code below) that indicates whether a number was supplied.
Your regular expression for finding repeat counts in JSON keys is \w+\|(\d+)-(\d+). This matches strings like "a|0-2trailing garbage". It's probably a good idea to anchor it to the end of the string (\w\|(\d+)-(\d+)$). Or maybe, depending on exactly what you want to match, to the beginning of the string as well: re.match(r'\w+\|(\d+)-(\d+)$', name). But then you need to decide what to do for non-matching keys. Maybe raise an exception?

The structure

 matches = re.search(...)
 if matches:
 ...

add section headings

Source Link

edited Sep 3, 2012 at 18:04

Gareth Rees

edited Sep 3, 2012 at 18:04

Gareth Rees

50.1k
3
130
210

1. Introduction

This review grew to be very long, so I'll say up front that you shouldn't take the length of this to heart: your code is not bad, especially if you are new to Python. There's always a lot of things to say about a piece of code of this length, and idiomatic Python has a bunch of features (sets, generators, comprehensions, iterators) that will be unfamiliar to users of some other languages. So take everything I have to say with a pinch of salt (except for item #1 under "General comments", which is really the only big problem here).

2. General comments

There's no test suite. I know it's hard for you to write test cases because of the randomness. You probably want to take a hint from Mennon van Slooten and provide a way for to change the random choices to a deterministic sequence of choices.
Your collections of example data (_male_first_name etc.) are stored as tuples, but it would be better to store these as lists. Tuples are fixed-size collections typically used as lightweight representations of records. Lists sets are variable-size collections typically containing similar items.
Your example data is very America-centric. Given that the purpose of your module is to produce example data for testing, it will be useful to have a wider variety of names, including names with accented letters, or in different scripts.
It is generally results in source code that is easier to read (with fewer quotation characters and commas) if you produce this kind of data using split. For example,
```
 _female_first_names = u"""
 Mary Patricia Linda Barbara Elizabeth
 Zoé Yến София فاطمة 明子 美玲 ยิ่งลักษณ์
 """.split()
```
The function _random_item is already in the Python library as random.choice.
The string "0123456789" is already in the Python library as string.digits.
It's not clear why _random_data needs to strip all initial @ signs from its argument before trying to look it up. You only call this function in four places: either with a literal argument (for example, _random_data('@LETTER_LOWER')) where you could just call the corresponding function directly, or in a context where the caller could have easily stripped the @ themselves. It also fails to raise an error if the key is missing (so that mistakes like @NUBMER are not caught).

Your definitions of data will look nicer (fewer quotation marks) if you use the dict constructor:

 data = dict(
 NUMBER = lambda: _random_item("0123456789"),
 LETTER_UPPER = lambda: _random_item(string.uppercase),
 LETTER_LOWER = lambda: _random_item(string.lowercase),
 ...

Since most of the values in data are of the form lambda: _random_item(...), why not special-case them to save boilerplate? You could write something like:

 def _random_data(key):
 """
 Construct a random data item for `key` and return it.
 Raise KeyError if `key` is unknown.
 """
 constructor = data[key]
 if isinstance(constructor, types.FunctionType):
 return constructor()
 else:
 return random.choice(constructor)

Building a string by repeatedly extending it with += is a well-known anti-pattern in Python. This is because extending a string is inefficient in Python: a new string gets allocated and the old string copied across each time. This means that any algorithm that builds a string by repeated extension runs like O(n²). It is nearly always better to generate the items to be assembled into the string and then join them to produce the result. This technique also avoidavoids fencepost difficultieserrors like spurious extra spaces at the start or end. So your could write:
```
 def _lorem_ipsum():
 """
 Return a random paragraph of placeholder text.
 """
 length = random.randrange(2, len(_lorem) / 2)
 return ' '.join(random.choice(_lorem) for _ in xrange(length))
```
generate_json takes an optional argument name which is always ignored. You can omit this.

My remaining comments apply to generate_json_object.

3. Comments on `generate_json_object`

It is usually better in Python to test if an object belongs to a type t by writing isinstance(object, t) rather than type(object) is t. The reason is that object may belong to a subtype of t (for example, a defaultdict or OrderedDict instead of a plain dict). There is a gotcha here, which is the bool is a subtype of int, so you need to order your tests so that Booleans get tested before integers.
It is usually better in Python to test if an object belongs to a type t by writing isinstance(object, t) rather than type(object) is t. The reason is that object may belong to a subtype of t (for example, a defaultdict or OrderedDict instead of a plain dict).

There is a gotcha here, which is that bool is a subtype of int, so you need to order your tests so that Booleans get tested before integers.

The structure of generate_json_object involves various branches which assign an object to the variable generated. This is then returned at the end. Why not just return the object directly and avoid the local variable?
When you're building a list, you do this by repeatedly calling append. This is not quite as bad as repeatedly calling += on a string (lists are a bit more flexible), but you can simplify this code by using a list comprehension:
```
 elif isinstance(template, list):
 return [generate_json_object(template[0]) for _ in xrange(length)]
```

Where should this _increments dictionary be stored? We need a new one for each instantiation we do, but it needs to persist throughout the recursive series of calls. We could pass it around as another function parameter, but if you find yourself doing this, that's a sign that you need to use a class. Again, see the final version at the end for how this can be done.

4. Revised instantiation code

Your collections of example data (_male_first_name etc.) are stored as tuples, but it would be better to store these as lists. Tuples are fixed-size collections typically used as lightweight representations of records. Lists sets are variable-size collections typically containing similar items.
Your example data is very America-centric. Given that the purpose of your module is to produce example data for testing, it will be useful to have a wider variety of names, including names with accented letters, or in different scripts.
It is generally results in source code that is easier to read (with fewer quotation characters and commas) if you produce this kind of data using split. For example,
```
 _female_first_names = u"""
 Mary Patricia Linda Barbara Elizabeth
 Zoé Yến София فاطمة 明子 美玲 ยิ่งลักษณ์
 """.split()
```
The function _random_item is already in the Python library as random.choice.
The string "0123456789" is already in the Python library as string.digits.
It's not clear why _random_data needs to strip all initial @ signs from its argument before trying to look it up. You only call this function in four places: either with a literal argument (for example, _random_data('@LETTER_LOWER')) where you could just call the corresponding function directly, or in a context where the caller could have easily stripped the @ themselves. It also fails to raise an error if the key is missing (so that mistakes like @NUBMER are not caught).

Your definitions of data will look nicer (fewer quotation marks) if you use the dict constructor:

 data = dict(
 NUMBER = lambda: _random_item("0123456789"),
 LETTER_UPPER = lambda: _random_item(string.uppercase),
 LETTER_LOWER = lambda: _random_item(string.lowercase),
 ...

Since most of the values in data are of the form lambda: _random_item(...), why not special-case them to save boilerplate? You could write something like:

 def _random_data(key):
 """
 Construct a random data item for `key` and return it.
 Raise KeyError if `key` is unknown.
 """
 constructor = data[key]
 if isinstance(constructor, types.FunctionType):
 return constructor()
 else:
 return random.choice(constructor)

Building a string by repeatedly extending it with += is a well-known anti-pattern in Python. This is because extending a string is inefficient in Python: a new string gets allocated and the old string copied across each time. This means that any algorithm that builds a string by repeated extension runs like O(n²). It is nearly always better to generate the items to be assembled into the string and then join them to produce the result. This technique also avoid fencepost difficulties like spurious extra spaces at the start or end. So your could write:
```
 def _lorem_ipsum():
 """
 Return a random paragraph of placeholder text.
 """
 length = random.randrange(2, len(_lorem) / 2)
 return ' '.join(random.choice(_lorem) for _ in xrange(length))
```
generate_json takes an optional argument name which is always ignored. You can omit this.

My remaining comments apply to generate_json_object.

It is usually better in Python to test if an object belongs to a type t by writing isinstance(object, t) rather than type(object) is t. The reason is that object may belong to a subtype of t (for example, a defaultdict or OrderedDict instead of a plain dict). There is a gotcha here, which is the bool is a subtype of int, so you need to order your tests so that Booleans get tested before integers.
The structure of generate_json_object involves various branches which assign an object to the variable generated. This is then returned at the end. Why not just return the object directly and avoid the local variable?
When you're building a list, you do this by repeatedly calling append. This is not quite as bad as repeatedly calling += on a string (lists are a bit more flexible), but you can simplify this code by using a list comprehension:
```
 elif isinstance(template, list):
 return [generate_json_object(template[0]) for _ in xrange(length)]
```

Where should this _increments dictionary be stored? We need a new one for each instantiation we do, but it needs to persist throughout the recursive series of calls. We could pass it around as another function parameter, but if you find yourself doing this, that's a sign that you need to a class. Again, see the final version at the end for how this can be done.

1. Introduction

2. General comments

There's no test suite. I know it's hard for you to write test cases because of the randomness. You probably want to take a hint from Mennon van Slooten and provide a way for to change the random choices to a deterministic sequence of choices.
Your collections of example data (_male_first_name etc.) are stored as tuples, but it would be better to store these as lists. Tuples are fixed-size collections typically used as lightweight representations of records. Lists sets are variable-size collections typically containing similar items.
Your example data is very America-centric. Given that the purpose of your module is to produce example data for testing, it will be useful to have a wider variety of names, including names with accented letters, or in different scripts.
It is generally results in source code that is easier to read (with fewer quotation characters and commas) if you produce this kind of data using split. For example,
```
 _female_first_names = u"""
 Mary Patricia Linda Barbara Elizabeth
 Zoé Yến София فاطمة 明子 美玲 ยิ่งลักษณ์
 """.split()
```
The function _random_item is already in the Python library as random.choice.
The string "0123456789" is already in the Python library as string.digits.
It's not clear why _random_data needs to strip all initial @ signs from its argument before trying to look it up. You only call this function in four places: either with a literal argument (for example, _random_data('@LETTER_LOWER')) where you could just call the corresponding function directly, or in a context where the caller could have easily stripped the @ themselves. It also fails to raise an error if the key is missing (so that mistakes like @NUBMER are not caught).

Your definitions of data will look nicer (fewer quotation marks) if you use the dict constructor:

 data = dict(
 NUMBER = lambda: _random_item("0123456789"),
 LETTER_UPPER = lambda: _random_item(string.uppercase),
 LETTER_LOWER = lambda: _random_item(string.lowercase),
 ...

Since most of the values in data are of the form lambda: _random_item(...), why not special-case them to save boilerplate? You could write something like:

 def _random_data(key):
 """
 Construct a random data item for `key` and return it.
 Raise KeyError if `key` is unknown.
 """
 constructor = data[key]
 if isinstance(constructor, types.FunctionType):
 return constructor()
 else:
 return random.choice(constructor)

Building a string by repeatedly extending it with += is a well-known anti-pattern in Python. This is because extending a string is inefficient in Python: a new string gets allocated and the old string copied across each time. This means that any algorithm that builds a string by repeated extension runs like O(n²). It is nearly always better to generate the items to be assembled into the string and then join them to produce the result. This technique also avoids fencepost errors like spurious extra spaces at the start or end. So your could write:
```
 def _lorem_ipsum():
 """
 Return a random paragraph of placeholder text.
 """
 length = random.randrange(2, len(_lorem) / 2)
 return ' '.join(random.choice(_lorem) for _ in xrange(length))
```
generate_json takes an optional argument name which is always ignored. You can omit this.

3. Comments on `generate_json_object`

It is usually better in Python to test if an object belongs to a type t by writing isinstance(object, t) rather than type(object) is t. The reason is that object may belong to a subtype of t (for example, a defaultdict or OrderedDict instead of a plain dict).

There is a gotcha here, which is that bool is a subtype of int, so you need to order your tests so that Booleans get tested before integers.

The structure of generate_json_object involves various branches which assign an object to the variable generated. This is then returned at the end. Why not just return the object directly and avoid the local variable?
When you're building a list, you do this by repeatedly calling append. This is not quite as bad as repeatedly calling += on a string (lists are a bit more flexible), but you can simplify this code by using a list comprehension:
```
 elif isinstance(template, list):
 return [generate_json_object(template[0]) for _ in xrange(length)]
```