1

I'm part of a team asked to perform some predictive analysis on a huge relational database. The data is a mess. Documentation ranges from mediocre to incorrect to absent. Information is scattered all over the tables.

For example, if I want to match addresses with telephone numbers, I can query three or four different tables, each one containing information unknown to the others, and maybe there is some information I shouldn't use.

To get data, the people I'm working with rely heavily on folklore: they know that in order to obtain phone numbers from addresses, you have to query this table and that one in a particular way, because John told them so a few years ago. And John knew it because Sam told him. And so on. The folklore is essentially never challenged, and it is often not quite right.

Retrieving information is a pain and we spend most of our time just extracting it from the database, without even trying to do something clever with it.

I'd like to establish some standard which we can use in all our projects. Moreover, I'd like it to improve as we gather the folklore. I don't want to create a "How to do it" super document which will probably spawn one million local variants. So basically, I think I want to encapsulate domain knowledge in "something."

I thought we could create tables that aggregate the scattered information in one place, then document and query those new tables from now on instead of relying on folklore. So no more three locations for telephone numbers and addresses: one TelephoneToAddress table.
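Concretely, I imagine something like the following, where crm_contacts and legacy_contacts are invented names standing in for the real source tables I can't show here:

    -- Hypothetical consolidation; crm_contacts and legacy_contacts stand in for
    -- whatever the real, scattered source tables turn out to be.
    CREATE TABLE TelephoneToAddress AS
    SELECT phone_number,
           street,
           city,
           'crm_contacts' AS source_table   -- keep provenance for later auditing
    FROM   crm_contacts
    UNION
    SELECT phone,
           street,
           city,
           'legacy_contacts'
    FROM   legacy_contacts;

The new table would then be documented as the one place to look for phone-number/address pairs.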

Does it make any sense? In the context of data exploitation, is it even a good idea?

Jonathan Eunice
asked Dec 6, 2016 at 5:52

4 Answers

1

One practical approach is to encapsulate what you learn about the data in database views, which provide a consistent, queryable interface to the underlying data.

This puts the logic into the database, where it can be reused, and expresses it in terms that the database experts will already be familiar with (i.e. SQL).
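For example, the folklore about where phone numbers live could be captured once in a view along these lines (the table and column names are invented here, since the real schema isn't known):

    -- Hypothetical view: encodes the folklore "prefer the CRM data, fall back
    -- to the legacy table" in one documented, queryable place.
    -- crm_contacts and legacy_contacts are placeholder names.
    CREATE VIEW TelephoneToAddress AS
    SELECT phone_number, street, city
    FROM   crm_contacts
    UNION
    SELECT phone, street, city
    FROM   legacy_contacts
    WHERE  phone NOT IN (SELECT phone_number FROM crm_contacts);

Consumers then simply SELECT from TelephoneToAddress, and the view's definition can be refined as the folklore is corrected, without anyone's queries having to change.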

answered Dec 6, 2016 at 9:26
0

Considering that you do not have much insight into how the data is organised: if I were you, I would collect the different pieces of folklore about how to reach the required data, and ask the people who hold them to model each one as a graph, with tables as nodes and the linking fields as edges.

Once you have prepared such a set of graphs, you can eliminate the redundant ones. For example, if there are three different ways to find phone numbers but you only want one, keep the path that performs best (or best satisfies whatever other constraint you have), set it as the standard, and deprecate the other graphs.

Once you have these graphs, use them as the model for creating your new tables.
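If you want to keep that bookkeeping in the database itself, one possible sketch is a pair of simple tables recording the paths and their edges (all names here are invented):

    -- Hypothetical tables for recording the folklore graphs.
    -- Each access_path is one known way of reaching a piece of data;
    -- each edge records which columns join which pair of tables.
    CREATE TABLE access_path (
        path_id INTEGER PRIMARY KEY,
        goal    VARCHAR(200),  -- e.g. 'phone numbers from addresses'
        status  VARCHAR(20)    -- e.g. 'standard', 'deprecated', 'unverified'
    );

    CREATE TABLE access_path_edge (
        path_id     INTEGER REFERENCES access_path(path_id),
        from_table  VARCHAR(100),
        from_column VARCHAR(100),
        to_table    VARCHAR(100),
        to_column   VARCHAR(100)
    );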

Additionally (or alternatively), since you have to do some predictive analysis, which will involve querying the data in a multitude of ways, a graph database may be a suitable home for the aggregated, standardised data. It would give you benefits like expressive queries and easier management of data relationships, which is where the problem seems to have stemmed from.

answered Dec 6, 2016 at 7:20
0

It does not make sense to put a dead person in nice clothes; it still won't be able to dance. If your data source is rotten, do not spend a penny on getting clean data out of it. Instead, consolidate the data where it originates and make that the single source. If you are forced to dance with the dead, you had better look for another job.

answered Dec 6, 2016 at 8:45
2
  • Colorful but questionable. If you're only going to dance with perfectly clean data, you'll never get a date to the ETL party. All data is dirty. Every dataset of interesting scale, duration, complexity, or value is prone to inconsistencies, errors, duplications, omissions, and elements of questionable provenance. Commented Dec 6, 2016 at 14:11
  • @JonathanEunice I know that all data is dirty. But there are more than 50 shades of grey here, and from a certain shade onward the data should not just be called dirty but a dump. And that is what I read in the OP's question. Commented Dec 6, 2016 at 15:23
-1

I would go with removing the redundant variants first and putting the data into a single table, as you thought of, rather than keeping on referring to the redundant data sources. Write a package that runs on an interval and does the cleaning.
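A minimal sketch of what such a periodic cleaning job might do, assuming a consolidated TelephoneToAddress table along the lines the question proposes (the source table names are invented):

    -- Hypothetical refresh, run on a schedule (cron, DBMS job scheduler, etc.).
    -- Rebuilds the consolidated table from the messy sources; UNION drops duplicates.
    DELETE FROM TelephoneToAddress;

    INSERT INTO TelephoneToAddress (phone_number, street, city)
    SELECT phone_number, street, city
    FROM (
        SELECT phone_number, street, city FROM crm_contacts
        UNION
        SELECT phone, street, city FROM legacy_contacts
    ) AS merged
    WHERE phone_number IS NOT NULL;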

answered Dec 6, 2016 at 7:47
