Sourcing data format from multiple different structures

Question 1

Problem

I want to read in the data to dictionary

person = {
 'name': 'John Doe',
 'email': '[email protected]',
 'age': 50,
 'connected': False
}

The data comes from different formats:

Format A.

dict_a = {
 'name': {
 'first_name': 'John',
 'last_name': 'Doe'
 },
 'workEmail': '[email protected]',
 'age': 50,
 'connected': False
}

Format B.

dict_b = {
 'fullName': 'John Doe',
 'workEmail': '[email protected]',
 'age': 50,
 'connected': False
}

There will be additional sources added in the future with additional structures.

Background

For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use their Item or ItemLoader, but it's ruled out in my case.

There could be potentially 5-10 different structures from which the data will be read from.

Implementation

`/database/models.py`

"""
Database mapping declarations for SQLAlchemy
"""
from sqlalchemy import Column, Integer, String, Boolean
from database.connection import Base
class PersonModel(Base):
 __tablename__ = 'Person'
 id = Column(Integer, primary_key=True)
 name = Column(String)
 email = Column(String)
 age = Column(Integer)
 connected = Column(Boolean)

`/mappers/person.py`

"""
Data mappers for Person
"""
# Abstract class for mapper
class Mapper(object):
 def __init__(self, data):
 self.data = data
# Data mapper for format A, maps the fields from dict_a to Person
class MapperA(Mapper):
 def __init__(self, data):
 self.name = ' '.join(data.get('name', {}).get(key) for key in ('first_name', 'last_name'))
 self.email = data.get('workEmail')
 self.age = data.get('age')
 self.connected = data.get('connected')
 @classmethod
 def is_mapper_for(cls, data):
 needed = {'name', 'workEmail'}
 return needed.issubset(set(data))
# Data mapper for format B, maps the fields from dict_b to Person
class MapperB(Mapper):
 def __init__(self, data):
 self.name = data.get('fullName')
 self.email = data.get('workEmail')
 self.age = data.get('age')
 self.connected = data.get('connected')
 @classmethod
 def is_mapper_for(cls, data):
 needed = {'fullName', 'workEmail'}
 return needed.issubset(set(data))
# Creates a Person instance base on the input data mapping
def Person(data):
 for cls in Mapper.__subclasses__():
 if cls.is_mapper_for(data):
 return cls(data)
 raise NotImplementedError
if __name__ == '__main__':
 from database.connection import make_session
 from database.models import PersonModel
 # Sample data for example
 dict_a = {
 'name': {
 'first_name': 'John',
 'last_name': 'Doe'
 },
 'workEmail': '[email protected]',
 'age': 50,
 'connected': False
 }
 dict_b = {
 'fullName': 'John Doe',
 'workEmail': '[email protected]',
 'age': 50,
 'connected': False
 }
 # Instantiate Person from data
 persons = [PersonModel(**Person(data).__dict__ for data in (dict_a, dict_b)]
 with make_session() as session:
 session.add_all(persons)
 session.commit()

Question

I have limited experience in Python programming and I'm building my first scraper application for a data engineering project that needs to scale to storing hundreds of thousands of Persons from tens of different structures. I was wondering if:

This is a good solution? What could be the drawbacks and problems down the line?
Currently I've implemented different subclasses for the mapping. Is there a convention or industry standard for these types of situations?

Update

For question 2, I found this question to be useful, but would still want to know if this approach in general is good.
Added style improvement suggestions from @Reinderien

Question 2

I'm happy that you're taking my feedback into account, but it's against site policy for you to edit your question's code as suggestions come in. A new question should be issued with the revised code. Before that, I'm going to edit my answer with more info.

Question 3

Don't abuse inner lists

This:

self.name = ' '.join([data.get('name').get(key) for key in ['first_name', 'last_name']])

should be

self.name = ' '.join(data.get('name', {}}.get(key) for key in ('first_name', 'last_name'))

Note the following:

Generators don't need to go in a list if they're just being passed to a function (join) that needs an iterable
Give an empty dictionary as the default for the first get so that the second get doesn't explode
Use a tuple instead of the last list because the data are immutable

Use set logic

This:

return all(key in data for key in ('name', 'workEmail'))

is effectively asking "are both 'name' and 'workEmail' in data?" There's a better way to ask this - with a set.

needed = {'name', 'workEmail'}
return needed.issubset(set(data))

If data can be stored as a set once outside of this function, it will increase efficiency.

Read more here: https://docs.python.org/3/tutorial/datastructures.html#sets

Don't needlessly materialize generators

This:

# Instantiate Person from data
persons = [Person(data) for data in [dict_a, dict_b]]
# Store persons that fit the database model
persons = [PersonModel(**person.__dict__) for person in persons]

makes a generator, saves it to a list in memory, consumes that list, makes a second generator, and stores that generator in a second list in memory. Instead:

persons = [PersonModel(**Person(data).__dict__)
 for data in (dict_a, dict_b)]

Again, the last inner list should be a tuple.

Parsing heuristics

It's not useful to write separate parsing classes for formats A and B in this case, because they aren't declared by the API so have no meaning. Write a translation routine for every member you extract from the JSON. Do a series of attempts against known paths in the data to get the members out.

Question 4

Thanks for the suggestions! Makes sense as well. However, my core question is more on a higher level: is this kind of approach optimal for this challenge or perhaps there is a better or easier way to solve this kind of data mapping from sources.

Question 5

@maivel How do you detect which format the data will follow? Do you know by filename, or does the user specify, or do you need to autodetect it?

Question 6

I would need to autodetect it. The API that provides me the .json which I will convert to dict is prone to changing their response layouts and naming conventions.

Question 7

@maivel refer to edited answer.

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2019-01-07 16:08:09Z

Don't abuse inner lists

This:

self.name = ' '.join([data.get('name').get(key) for key in ['first_name', 'last_name']])

should be

self.name = ' '.join(data.get('name', {}}.get(key) for key in ('first_name', 'last_name'))

Note the following:

Generators don't need to go in a list if they're just being passed to a function (join) that needs an iterable
Give an empty dictionary as the default for the first get so that the second get doesn't explode
Use a tuple instead of the last list because the data are immutable

Use set logic

This:

return all(key in data for key in ('name', 'workEmail'))

is effectively asking "are both 'name' and 'workEmail' in data?" There's a better way to ask this - with a set.

needed = {'name', 'workEmail'}
return needed.issubset(set(data))

If data can be stored as a set once outside of this function, it will increase efficiency.

Read more here: https://docs.python.org/3/tutorial/datastructures.html#sets

Don't needlessly materialize generators

This:

# Instantiate Person from data
persons = [Person(data) for data in [dict_a, dict_b]]
# Store persons that fit the database model
persons = [PersonModel(**person.__dict__) for person in persons]

makes a generator, saves it to a list in memory, consumes that list, makes a second generator, and stores that generator in a second list in memory. Instead:

persons = [PersonModel(**Person(data).__dict__)
 for data in (dict_a, dict_b)]

Again, the last inner list should be a tuple.

Parsing heuristics

It's not useful to write separate parsing classes for formats A and B in this case, because they aren't declared by the API so have no meaning. Write a translation routine for every member you extract from the JSON. Do a series of attempts against known paths in the data to get the members out.

Thanks for the suggestions! Makes sense as well. However, my core question is more on a higher level: is this kind of approach optimal for this challenge or perhaps there is a better or easier way to solve this kind of data mapping from sources.
@maivel How do you detect which format the data will follow? Do you know by filename, or does the user specify, or do you need to autodetect it?
I would need to autodetect it. The API that provides me the .json which I will convert to dict is prone to changing their response layouts and naming conventions.

Stack Exchange Network

Sourcing data format from multiple different structures

Problem

Background

Implementation

`/database/models.py`

`/mappers/person.py`

Question

1 Answer 1

Don't abuse inner lists

Use set logic

Don't needlessly materialize generators

Parsing heuristics

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Sourcing data format from multiple different structures

Problem

Background

Implementation

/database/models.py

/mappers/person.py

Question

1 Answer 1

Don't abuse inner lists

Use set logic

Don't needlessly materialize generators

Parsing heuristics

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`/database/models.py`

`/mappers/person.py`