Problem
I want to read in the data to dictionary
person = {
'name': 'John Doe',
'email': '[email protected]',
'age': 50,
'connected': False
}
The data comes from different formats:
Format A.
dict_a = {
'name': {
'first_name': 'John',
'last_name': 'Doe'
},
'workEmail': '[email protected]',
'age': 50,
'connected': False
}
Format B.
dict_b = {
'fullName': 'John Doe',
'workEmail': '[email protected]',
'age': 50,
'connected': False
}
There will be additional sources added in the future with additional structures.
Background
For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use their Item
or ItemLoader
, but it's ruled out in my case.
There could be potentially 5-10 different structures from which the data will be read from.
Implementation
/database/models.py
"""
Database mapping declarations for SQLAlchemy
"""
from sqlalchemy import Column, Integer, String, Boolean
from database.connection import Base
class PersonModel(Base):
__tablename__ = 'Person'
id = Column(Integer, primary_key=True)
name = Column(String)
email = Column(String)
age = Column(Integer)
connected = Column(Boolean)
/mappers/person.py
"""
Data mappers for Person
"""
# Abstract class for mapper
class Mapper(object):
def __init__(self, data):
self.data = data
# Data mapper for format A, maps the fields from dict_a to Person
class MapperA(Mapper):
def __init__(self, data):
self.name = ' '.join(data.get('name', {}).get(key) for key in ('first_name', 'last_name'))
self.email = data.get('workEmail')
self.age = data.get('age')
self.connected = data.get('connected')
@classmethod
def is_mapper_for(cls, data):
needed = {'name', 'workEmail'}
return needed.issubset(set(data))
# Data mapper for format B, maps the fields from dict_b to Person
class MapperB(Mapper):
def __init__(self, data):
self.name = data.get('fullName')
self.email = data.get('workEmail')
self.age = data.get('age')
self.connected = data.get('connected')
@classmethod
def is_mapper_for(cls, data):
needed = {'fullName', 'workEmail'}
return needed.issubset(set(data))
# Creates a Person instance base on the input data mapping
def Person(data):
for cls in Mapper.__subclasses__():
if cls.is_mapper_for(data):
return cls(data)
raise NotImplementedError
if __name__ == '__main__':
from database.connection import make_session
from database.models import PersonModel
# Sample data for example
dict_a = {
'name': {
'first_name': 'John',
'last_name': 'Doe'
},
'workEmail': '[email protected]',
'age': 50,
'connected': False
}
dict_b = {
'fullName': 'John Doe',
'workEmail': '[email protected]',
'age': 50,
'connected': False
}
# Instantiate Person from data
persons = [PersonModel(**Person(data).__dict__ for data in (dict_a, dict_b)]
with make_session() as session:
session.add_all(persons)
session.commit()
Question
I have limited experience in Python programming and I'm building my first scraper application for a data engineering project that needs to scale to storing hundreds of thousands of Persons from tens of different structures. I was wondering if:
- This is a good solution? What could be the drawbacks and problems down the line?
- Currently I've implemented different subclasses for the mapping. Is there a convention or industry standard for these types of situations?
Update
- For question 2, I found this question to be useful, but would still want to know if this approach in general is good.
- Added style improvement suggestions from @Reinderien
-
\$\begingroup\$ I'm happy that you're taking my feedback into account, but it's against site policy for you to edit your question's code as suggestions come in. A new question should be issued with the revised code. Before that, I'm going to edit my answer with more info. \$\endgroup\$Reinderien– Reinderien2019年01月08日 05:46:42 +00:00Commented Jan 8, 2019 at 5:46
1 Answer 1
Don't abuse inner lists
This:
self.name = ' '.join([data.get('name').get(key) for key in ['first_name', 'last_name']])
should be
self.name = ' '.join(data.get('name', {}}.get(key) for key in ('first_name', 'last_name'))
Note the following:
- Generators don't need to go in a list if they're just being passed to a function (
join
) that needs an iterable - Give an empty dictionary as the default for the first get so that the second get doesn't explode
- Use a tuple instead of the last list because the data are immutable
Use set logic
This:
return all(key in data for key in ('name', 'workEmail'))
is effectively asking "are both 'name' and 'workEmail' in data
?" There's a better way to ask this - with a set
.
needed = {'name', 'workEmail'}
return needed.issubset(set(data))
If data
can be stored as a set
once outside of this function, it will increase efficiency.
Read more here: https://docs.python.org/3/tutorial/datastructures.html#sets
Don't needlessly materialize generators
This:
# Instantiate Person from data
persons = [Person(data) for data in [dict_a, dict_b]]
# Store persons that fit the database model
persons = [PersonModel(**person.__dict__) for person in persons]
makes a generator, saves it to a list in memory, consumes that list, makes a second generator, and stores that generator in a second list in memory. Instead:
persons = [PersonModel(**Person(data).__dict__)
for data in (dict_a, dict_b)]
Again, the last inner list should be a tuple.
Parsing heuristics
It's not useful to write separate parsing classes for formats A and B in this case, because they aren't declared by the API so have no meaning. Write a translation routine for every member you extract from the JSON. Do a series of attempts against known paths in the data to get the members out.
-
\$\begingroup\$ Thanks for the suggestions! Makes sense as well. However, my core question is more on a higher level: is this kind of approach optimal for this challenge or perhaps there is a better or easier way to solve this kind of data mapping from sources. \$\endgroup\$maivel– maivel2019年01月08日 02:09:52 +00:00Commented Jan 8, 2019 at 2:09
-
\$\begingroup\$ @maivel How do you detect which format the data will follow? Do you know by filename, or does the user specify, or do you need to autodetect it? \$\endgroup\$Reinderien– Reinderien2019年01月08日 03:15:54 +00:00Commented Jan 8, 2019 at 3:15
-
1\$\begingroup\$ I would need to autodetect it. The API that provides me the
.json
which I will convert todict
is prone to changing their response layouts and naming conventions. \$\endgroup\$maivel– maivel2019年01月08日 05:40:16 +00:00Commented Jan 8, 2019 at 5:40 -
1\$\begingroup\$ @maivel refer to edited answer. \$\endgroup\$Reinderien– Reinderien2019年01月08日 05:52:12 +00:00Commented Jan 8, 2019 at 5:52