I have a MongoDB collection which, when loaded into Python via PyMongo, comes back as Python dictionaries (one per document). I would like to turn it into a NumPy array.

For instance, if the JSON file looks like this:

{
    "_id" : ObjectId("57065024c3d1132426c4dd53"),
    "B" : {
        "BA" : 14,
        "BB" : 23,
        "BC" : 32,
        "BD" : 41
    },
    "A" : 50
}
{
    "_id" : ObjectId("57065024c3d1132426c4dd53"),
    "A" : 1,
    "B" : {
        "BA" : 1,
        "BB" : 2,
        "BC" : 3,
        "BD" : 4
    }
}

I'd like to get this 2×5 NumPy array (2 rows, 5 columns) in return: np.array([[50, 14, 23, 32, 41], [1, 1, 2, 3, 4]]). Here the first column corresponds to "A", the second to "BA", the third to "BB", and so on. Note that the keys do not always appear in the same order.

My code, which does not work at all (and does not do what I want yet), looks like this:

import numpy as np
from pymongo import MongoClient

uri = "mongodb://localhost/test"
client = MongoClient(uri)
db = client.recodb
collection = db.recos
list1 = list(collection.find())
array2 = np.vstack([[product[key] for key in product.keys()] for product in list1])
asked Oct 5, 2016 at 6:20
4
  • I don't know MongoDB, but that isn't a valid JSON object. Is it supposed to be a list of dictionaries? Also, ObjectId("57065024c3d1132426c4dd53") isn't a valid JSON item: it should be serialised as some kind of string, eg "ObjectId(\"57065024c3d1132426c4dd53\")". Commented Oct 5, 2016 at 6:42
  • That is how the file looks in RoboMongo, which I use to visualize this collection. Commented Oct 5, 2016 at 7:57
  • Then 'list1' is a list of dictionaries. About the slashes, I am not sure, but since I do not use the ObjectId in the end, it does not really matter. Commented Oct 5, 2016 at 8:04
  • MongoDB stores data in BSON format, not plain JSON. That is why those ObjectIds are there. It shouldn't matter for this question, though. Commented Oct 5, 2016 at 9:36

2 Answers

1

Assuming you've successfully loaded that JSON into Python, here's one way to create the Numpy array you want. My code has a minimal definition of ObjectId so that it won't raise a NameError on ObjectId entries.

sorted(d["B"].items())

produces a list of (key, value) tuples from the contents of a "B" dictionary, sorted by key. We then extract just the values from those tuples into a list, and append that list to a list containing the value from the "A" item.

import numpy as np

# Minimal stand-in for bson's ObjectId so the sample data below can be evaluated.
class ObjectId(object):
    def __init__(self, objectid):
        self.objectid = objectid

    def __repr__(self):
        return 'ObjectId("{}")'.format(self.objectid)

data = [
    {
        "_id" : ObjectId("57065024c3d1132426c4dd53"),
        "B" : {
            "BA" : 14,
            "BB" : 23,
            "BC" : 32,
            "BD" : 41
        },
        "A" : 50
    },
    {
        "_id" : ObjectId("57065024c3d1132426c4dd53"),
        "A" : 1,
        "B" : {
            "BA" : 1,
            "BB" : 2,
            "BC" : 3,
            "BD" : 4
        }
    }
]

# One row per document: the "A" value, then the "B" values ordered by key.
array2 = np.array([[d["A"]] + [v for _, v in sorted(d["B"].items())] for d in data])
print(array2)

output

[[50 14 23 32 41]
 [ 1  1  2  3  4]]
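
The sorted-key approach assumes that every "B" sub-dictionary carries exactly the same keys. As a hedged variant, the sketch below fixes the column order explicitly instead of relying on sorting. The `B_KEYS` list and the `row_from_doc` helper are names made up for this illustration (they are not part of PyMongo or NumPy), and the `_id` values are plain strings standing in for real ObjectIds.

```python
import numpy as np

# Assumption: the nested "B" dictionary always has these four keys.
B_KEYS = ["BA", "BB", "BC", "BD"]

def row_from_doc(doc):
    """Return [A, BA, BB, BC, BD] for one document (hypothetical helper)."""
    return [doc["A"]] + [doc["B"][key] for key in B_KEYS]

# Sample documents with keys deliberately in different orders.
docs = [
    {"_id": "57065024c3d1132426c4dd53",
     "B": {"BD": 41, "BA": 14, "BC": 32, "BB": 23},
     "A": 50},
    {"_id": "57065024c3d1132426c4dd53",
     "A": 1,
     "B": {"BA": 1, "BB": 2, "BC": 3, "BD": 4}},
]

explicit_array = np.array([row_from_doc(d) for d in docs])
print(explicit_array)
```

With this version a document that lacks one of the expected keys raises a KeyError rather than silently shifting the columns, which may or may not be what you want.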
answered Oct 5, 2016 at 9:51

1

The flatdict module can sometimes be useful when working with MongoDB data structures. It flattens the nested dictionary structure for you:

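import flatdict
import numpy as np

columns = []
for d in data:  # data is the list of documents, e.g. list(collection.find())
    flat = flatdict.FlatDict(d)  # nested keys become "B:BA", "B:BB", ...
    del flat['_id']  # drop the ObjectId
    # Sort by flattened key so every row has the same column order.
    columns.append([item[1] for item in sorted(flat.items(), key=lambda item: item[0])])
np.vstack(columns)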

Of course this can be solved without flatdict too.
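
For example, a small recursive helper can do the flattening. This is only a sketch of one possible approach: the `flatten` function and its `sep` parameter are hypothetical names, and the ":" separator is just chosen to mirror flatdict's default delimiter. The `_id` values are plain strings standing in for ObjectIds.

```python
import numpy as np

def flatten(doc, parent_key="", sep=":"):
    """Recursively flatten nested dicts into one level (hypothetical helper)."""
    flat = {}
    for key, value in doc.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Descend into the sub-dictionary, prefixing its keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

sample_docs = [
    {"_id": "57065024c3d1132426c4dd53",
     "B": {"BA": 14, "BB": 23, "BC": 32, "BD": 41},
     "A": 50},
    {"_id": "57065024c3d1132426c4dd53",
     "A": 1,
     "B": {"BA": 1, "BB": 2, "BC": 3, "BD": 4}},
]

rows = []
for d in sample_docs:
    flat = flatten(d)
    flat.pop("_id", None)  # the identifier is not wanted in the array
    # Sorted flattened keys ("A", "B:BA", "B:BB", ...) give a stable column order.
    rows.append([flat[k] for k in sorted(flat)])

flat_array = np.vstack(rows)
print(flat_array)
```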

answered Oct 5, 2016 at 9:50

3 Comments

Do you think it is possible to achieve this without using a loop on 'data'? The 'data' I am actually going to use contains 14000 elements.
Not sure what you mean about not looping on 'data'. One thing you could do to increase the speed (if it is needed) is to create the NumPy array first and then fill it with the elements from MongoDB. I would try the solution as is first, though, to make sure that I wasn't optimizing prematurely.
What I mean by looping on data is: is there a way to solve the problem without any loop of the form "for d in data"?
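
The preallocation idea from the comment above could look roughly like the sketch below. It still loops over the documents, since each one arrives as a separate Python dictionary, but it writes straight into an array created up front. The `docs_to_array` function and the `KEYS` list are hypothetical names, and the integer dtype is an assumption based on the sample values in the question.

```python
import numpy as np

# Assumption: every document has "A" plus these four keys under "B".
KEYS = ["BA", "BB", "BC", "BD"]

def docs_to_array(docs):
    """Fill a preallocated array, one row per document (hypothetical helper)."""
    docs = list(docs)  # also accepts a cursor-like iterable
    out = np.empty((len(docs), 1 + len(KEYS)), dtype=np.int64)
    for i, doc in enumerate(docs):
        out[i, 0] = doc["A"]
        out[i, 1:] = [doc["B"][k] for k in KEYS]
    return out

sample = [
    {"B": {"BA": 14, "BB": 23, "BC": 32, "BD": 41}, "A": 50},
    {"A": 1, "B": {"BA": 1, "BB": 2, "BC": 3, "BD": 4}},
]

prealloc_array = docs_to_array(sample)
print(prealloc_array)
```

For a collection of around 14000 documents, it is worth timing the plain list-comprehension version first; the loop itself may well not be the bottleneck.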
