1

I'm part of a team asked to perform some predictive analysis on a huge relational database. The data is a mess. Documentation ranges from mediocre to incorrect to absent. Information is scattered all over the tables.

For example, if I want to match addresses with telephone numbers, I can query three or four different tables, each one containing information unknown to the others, and maybe there is some information I shouldn't use.

To get data, the people I'm working with rely heavily on folklore: they know that in order to obtain phone numbers from addresses, you have to query this table and that one in a particular way, because John told them so a few years ago. And John knew it because Sam told him. And so on. The folklore is essentially never challenged, and it is often not quite right.

Retrieving information is a pain and we spend most of our time just extracting it from the database, without even trying to do something clever with it.

I'd like to establish some standard which we can use in all our projects. Moreover, I'd like it to improve as we gather the folklore. I don't want to create a "How to do it" super document which will probably spawn one million local variants. So basically, I think I want to encapsulate domain knowledge in "something."

I thought we could create tables that aggregate the scattered information in one place, then document and query those new tables from now on instead of relying on folklore. So no more three locations for telephone numbers and addresses: one TelephoneToAddress table.
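Concretely, I imagine something like the following, where crm_contacts and legacy_contacts are invented names standing in for the real source tables I can't show here:

    -- Hypothetical consolidation; crm_contacts and legacy_contacts stand in for
    -- whatever the real, scattered source tables turn out to be.
    CREATE TABLE TelephoneToAddress AS
    SELECT phone_number,
           street,
           city,
           'crm_contacts' AS source_table   -- keep provenance for later auditing
    FROM   crm_contacts
    UNION
    SELECT phone,
           street,
           city,
           'legacy_contacts'
    FROM   legacy_contacts;

The new table would then be documented as the one place to look for phone-number/address pairs.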

Does it make any sense? In the context of data exploitation, is it even a good idea?

Jonathan Eunice
asked Dec 6, 2016 at 5:52

4 Answers

1

One practical approach is to encapsulate what you learn about the data in database views, which provide a consistent, queryable interface to the underlying data.

This puts the logic into the database, where it can be reused, and expresses it in terms that the database experts will already be familiar with (i.e. SQL).
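For example, the folklore about where phone numbers live could be captured once in a view along these lines (the table and column names are invented here, since the real schema isn't known):

    -- Hypothetical view: encodes the folklore "prefer the CRM data, fall back
    -- to the legacy table" in one documented, queryable place.
    -- crm_contacts and legacy_contacts are placeholder names.
    CREATE VIEW TelephoneToAddress AS
    SELECT phone_number, street, city
    FROM   crm_contacts
    UNION
    SELECT phone, street, city
    FROM   legacy_contacts
    WHERE  phone NOT IN (SELECT phone_number FROM crm_contacts);

Consumers then simply SELECT from TelephoneToAddress, and the view's definition can be refined as the folklore is corrected, without anyone's queries having to change.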

answered Dec 6, 2016 at 9:26
0

Considering that you do not have much insight into how the data is organised: if I were you, I would collect the different pieces of folklore about how to reach the required data, and ask the people who hold them to model each one as a graph, with tables as nodes and the linking fields as edges.

Once you have prepared such a set of graphs, you can eliminate the redundant ones. For example, if there are three different ways to find phone numbers but you only want one, keep the path that performs best (or best satisfies whatever other constraint you have), set it as the standard, and deprecate the other graphs.

Once you have these graphs, use them as the model for creating your new tables.
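If you want to keep that bookkeeping in the database itself, one possible sketch is a pair of simple tables recording the paths and their edges (all names here are invented):

    -- Hypothetical tables for recording the folklore graphs.
    -- Each access_path is one known way of reaching a piece of data;
    -- each edge records which columns join which pair of tables.
    CREATE TABLE access_path (
        path_id INTEGER PRIMARY KEY,
        goal    VARCHAR(200),  -- e.g. 'phone numbers from addresses'
        status  VARCHAR(20)    -- e.g. 'standard', 'deprecated', 'unverified'
    );

    CREATE TABLE access_path_edge (
        path_id     INTEGER REFERENCES access_path(path_id),
        from_table  VARCHAR(100),
        from_column VARCHAR(100),
        to_table    VARCHAR(100),
        to_column   VARCHAR(100)
    );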

Additionally (or alternatively), since you have to do some predictive analysis, which will involve querying the data in a multitude of ways, a graph database may be a suitable home for the aggregated, standardised data. It would give you benefits like expressive queries and easier management of data relationships, which is where the problem seems to have stemmed from.

answered Dec 6, 2016 at 7:20
0

It does not make sense to put a dead person in nice clothes; it still won't be able to dance. If your data source is rotten, do not spend a penny on getting clean data out of it. Instead, consolidate the data where it originates and make that the single source. If you are forced to dance with the dead, you had better look for another job.

answered Dec 6, 2016 at 8:45
2
  • Colorful but questionable. If you're only going to dance with perfectly clean data, you'll never get a date to the ETL party. All data is dirty. Every dataset of interesting scale, duration, complexity, or value is prone to inconsistencies, errors, duplications, omissions, and elements of questionable provenance. Commented Dec 6, 2016 at 14:11
  • @JonathanEunice I know that all data is dirty. But there are more than 50 shades of grey here, and from a certain shade onward the data should not just be called dirty but a dump. And that is what I read in the OP's question. Commented Dec 6, 2016 at 15:23
-1

I would go with removing the redundant variants first and putting the data into a single table, as you thought of, rather than keeping on referring to the redundant data sources. Write a package that runs on an interval and does the cleaning.
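A minimal sketch of what such a periodic cleaning job might do, assuming a consolidated TelephoneToAddress table along the lines the question proposes (the source table names are invented):

    -- Hypothetical refresh, run on a schedule (cron, DBMS job scheduler, etc.).
    -- Rebuilds the consolidated table from the messy sources; UNION drops duplicates.
    DELETE FROM TelephoneToAddress;

    INSERT INTO TelephoneToAddress (phone_number, street, city)
    SELECT phone_number, street, city
    FROM (
        SELECT phone_number, street, city FROM crm_contacts
        UNION
        SELECT phone, street, city FROM legacy_contacts
    ) AS merged
    WHERE phone_number IS NOT NULL;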

answered Dec 6, 2016 at 7:47
