I have a MongoDB collection which, when loaded into Python via PyMongo, gives me each document as a Python dictionary. I am looking to transform it into a NumPy array.
For instance, if the JSON file looks like this:
{
    "_id" : ObjectId("57065024c3d1132426c4dd53"),
    "B" : {
        "BA" : 14,
        "BB" : 23,
        "BC" : 32,
        "BD" : 41
    },
    "A" : 50
}
{
    "_id" : ObjectId("57065024c3d1132426c4dd53"),
    "A" : 1,
    "B" : {
        "BA" : 1,
        "BB" : 2,
        "BC" : 3,
        "BD" : 4
    }
}
I'd like to get in return this 2x5 NumPy array (2 rows, 5 columns): np.array([[50,14,23,32,41], [1,1,2,3,4]]). In that case, the first column corresponds to "A", the second one to "BA", the third one to "BB", etc. Notice that the keys are not always stored in the same order.
My code, which does not work at all (and does not do what I want yet), looks like this:
import numpy as np
from pymongo import MongoClient

uri = "mongodb://localhost/test"
client = MongoClient(uri)
db = client.recodb
collection = db.recos
list1 = list(collection.find())
# Attempt: stack the values of each document (the nested "B" dictionary is not flattened)
array2 = np.vstack([[product[key] for key in product.keys()] for product in list1])
2 Answers
Assuming you've successfully loaded that JSON into Python, here's one way to create the NumPy array you want. My code has a minimal definition of ObjectId so that it won't raise a NameError on ObjectId entries.
sorted(d["B"].items()) produces a list of (key, value) tuples from the contents of a "B" dictionary, sorted by key. We then extract just the values from those tuples into a list, and append that list to a list containing the value from the "A" item.
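To make those intermediate steps visible, here is a small sketch on a single document. The names `pairs`, `values`, and `row` are only for illustration and do not appear in the answer's code; the "_id" field is left out to keep the example short.

```python
# One document shaped like those in the question, with "B" keys deliberately unordered.
d = {"B": {"BD": 41, "BA": 14, "BC": 32, "BB": 23}, "A": 50}

# Step 1: (key, value) tuples from the nested "B" dictionary, sorted by key.
pairs = sorted(d["B"].items())
print(pairs)   # [('BA', 14), ('BB', 23), ('BC', 32), ('BD', 41)]

# Step 2: keep only the values, in that sorted order.
values = [v for _, v in pairs]

# Step 3: put the "A" value in front to form one row of the final array.
row = [d["A"]] + values
print(row)     # [50, 14, 23, 32, 41]
```

Because the tuples are sorted by key, the order in which MongoDB happens to return the fields does not affect the column order.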
import numpy as np

class ObjectId(object):
    def __init__(self, objectid):
        self.objectid = objectid

    def __repr__(self):
        return 'ObjectId("{}")'.format(self.objectid)

data = [
    {
        "_id" : ObjectId("57065024c3d1132426c4dd53"),
        "B" : {
            "BA" : 14,
            "BB" : 23,
            "BC" : 32,
            "BD" : 41
        },
        "A" : 50
    },
    {
        "_id" : ObjectId("57065024c3d1132426c4dd53"),
        "A" : 1,
        "B" : {
            "BA" : 1,
            "BB" : 2,
            "BC" : 3,
            "BD" : 4
        }
    }
]

array2 = np.array([[d["A"]] + [v for _, v in sorted(d["B"].items())] for d in data])
print(array2)
output
[[50 14 23 32 41]
 [ 1  1  2  3  4]]
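Sorting by key works when every document has the same keys under "B". If that is not guaranteed, one option is to name the columns explicitly. The following is only a sketch under that assumption: `COLUMNS` and `doc_to_row` are hypothetical names, and filling absent fields with `np.nan` is a choice, not something the question asks for.

```python
import numpy as np

# Assumed column order: "A" first, then the keys nested under "B".
COLUMNS = ["A", "BA", "BB", "BC", "BD"]

def doc_to_row(doc, missing=np.nan):
    """Build one row in COLUMNS order; absent fields become `missing`."""
    nested = doc.get("B", {})
    row = [doc.get("A", missing)]
    row.extend(nested.get(key, missing) for key in COLUMNS[1:])
    return row

docs = [
    {"B": {"BA": 14, "BB": 23, "BC": 32, "BD": 41}, "A": 50},
    {"A": 1, "B": {"BD": 4, "BC": 3, "BA": 1}},  # "BB" is absent here
]

# dtype=float so that np.nan can be stored alongside the numbers.
array = np.array([doc_to_row(d) for d in docs], dtype=float)
print(array.shape)  # (2, 5)
```

With a fixed column list, a document that lacks a field still produces a row of the right length, so the rows can always be stacked.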
The flatdict module can sometimes be useful when working with MongoDB data structures. It will handle flattening the nested dictionary structure for you:
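import flatdict
import numpy as np

# `data` is the list of documents, e.g. list(collection.find())
columns = []
for d in data:
    flat = flatdict.FlatDict(d)
    del flat['_id']
    columns.append([item[1] for item in sorted(flat.items(), key=lambda item: item[0])])
array2 = np.vstack(columns)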
Of course this can be solved without flatdict too.
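As a rough sketch of what that could look like, the recursive generator below flattens nested dictionaries by hand. The name `flatten`, the ":" separator, and the placeholder "_id" strings are assumptions made for this example.

```python
import numpy as np

def flatten(d, prefix=""):
    """Yield (flat_key, value) pairs, descending into nested dictionaries."""
    for key, value in d.items():
        flat_key = prefix + ":" + key if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, flat_key)
        else:
            yield flat_key, value

# Sample documents; the "_id" values are placeholder strings.
data = [
    {"_id": "id-1", "B": {"BA": 14, "BB": 23, "BC": 32, "BD": 41}, "A": 50},
    {"_id": "id-2", "A": 1, "B": {"BA": 1, "BB": 2, "BC": 3, "BD": 4}},
]

rows = []
for d in data:
    flat = dict(flatten(d))
    flat.pop("_id", None)  # the identifier is not a numeric column
    # Sort by flattened key ("A", "B:BA", "B:BB", ...) so columns line up.
    rows.append([value for _, value in sorted(flat.items())])

array = np.vstack(rows)
print(array)
```

The flattened keys sort as "A", "B:BA", "B:BB", and so on, which gives the column order requested in the question.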
3 Comments
for d in data
ObjectId("57065024c3d1132426c4dd53") isn't a valid JSON item: it should be serialised as some kind of string, e.g. "ObjectId(\"57065024c3d1132426c4dd53\")".
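One way to illustrate that point with only the standard library is to pass `default=str` to `json.dumps`. The `ObjectId` class below is a stand-in written for this sketch, not the real `bson.ObjectId`; PyMongo also ships a `bson.json_util` module intended for this kind of conversion, which may be the better fit in real code.

```python
import json

class ObjectId(object):
    """Stand-in for bson.ObjectId, used only for illustration."""
    def __init__(self, objectid):
        self.objectid = objectid

    def __str__(self):
        return self.objectid

doc = {"_id": ObjectId("57065024c3d1132426c4dd53"), "A": 50}

# json.dumps calls `default` for any object it cannot serialise itself;
# here that turns the ObjectId into its string form.
text = json.dumps(doc, default=str)
print(text)  # {"_id": "57065024c3d1132426c4dd53", "A": 50}
```

Without the `default` argument, `json.dumps(doc)` raises a TypeError because the stand-in object is not JSON serialisable.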