Complex query with multiple normalized fields

Question 1

We have a fairly simple table structure, but with a LOT of fields per table (40+). The data is initially produced in plain-text, user-readable tables, and is then translated into higher-performance, easier-to-query tables before being installed for production use.

Wherever possible and reasonable, we translate certain fields into enumerated values and keep track of the enumerations in a MasterEnum table. Usually 20-25 of the 40 or so fields are enumerated.

Sample table structure:

Plain text version:

 | PartNumber | Manufacturer | SomeData   | SomeMoreData | SomeTextData     ...
 ----------------------------------------------------------------------------------
 | 1x9kdah    | GizmoCorp    | ThisIsData | OtherData    | ThisStaysText    ...
 | 8xcjkzh    | GadgetInc    | MoreData   | OtherData2   | ThisTooStaysText ...

Target table sample structure:

 | PartNumber | Manufacturer | SomeData | SomeMoreData | SomeTextData     ...
 -------------------------------------------------------------------------------------
 | 1x9kdah    | 1            | 1        | 1            | ThisStaysText    ...
 | 8xcjkzh    | 2            | 2        | 2            | ThisTooStaysText ...

Master Enumeration Table Structure

 | FieldName    | InputText  | ValueCode |
 ---------------------------------------------
 | Manufacturer | GizmoCorp  | 1         |
 | Manufacturer | GadgetInc  | 2         |
 | SomeData     | ThisIsData | 1         |
 | SomeData     | MoreData   | 2         |
 | SomeMoreData | OtherData  | 1         |
 | SomeMoreData | OtherData2 | 2         |

We have a means of doing this translation that works and works well; however it's a little on the slow side since all the processing is done in Java via Spring/Hibernate. My question is:

Is there a way to write a single query that would accomplish all of the above translations? (Note that we have an excellent way of keeping track of our field definitions programmatically, so generating complex SQL queries on the fly is not an issue.) If it is not possible in a single query, how would I structure queries to iterate over the individual fields and make sure that, as the translations happen, the data inserted into the new table remains associated with the correct rows?

Note that it is safe to assume the target table is always empty at the beginning of the process.

Question 2

I would suggest that before you do any database design work (and you need to - this design is atrocious) you read amazon.com/SQL-Antipatterns-Programming-Programmers-ebook/dp/…

Question 3

@HLGEM This is an in-place design that I need to make more efficient as is. Thank you for the recommendation, however.

Question 4

As others have pointed out, this is a Really Bad Idea. Still, if you insist, the SQL is not hugely complicated:

-- Plain-text source data.
CREATE TABLE RawData
(
    PartNumber VARCHAR(30) NOT NULL PRIMARY KEY,
    Manufacturer VARCHAR(30) NOT NULL,
    Data1 VARCHAR(30),
    Data2 VARCHAR(30),
    Data3 VARCHAR(30)
);

-- One row per (field, text value) pair, mapped to its enumerated ID.
CREATE TABLE Translations
(
    FieldName VARCHAR(30) NOT NULL,
    Value VARCHAR(30) NOT NULL,
    ID INT NOT NULL,
    PRIMARY KEY (FieldName, Value),
    UNIQUE (FieldName, ID)
);

-- Target table; the translated columns receive the IDs from Translations.
CREATE TABLE CleanData
(
    PartNumber VARCHAR(30) NOT NULL PRIMARY KEY,
    Manufacturer VARCHAR(30) NOT NULL,
    Data1 VARCHAR(30),
    Data2 VARCHAR(30),
    Data3 VARCHAR(30)
);

-- One join against Translations per enumerated field, each under its own alias.
INSERT INTO CleanData (PartNumber, Manufacturer, Data1, Data2, Data3)
    SELECT
        RD.PartNumber,
        TMfr.ID AS Manufacturer,
        TDt1.ID AS Data1,
        TDt2.ID AS Data2,
        TDt3.ID AS Data3
    FROM
        RawData AS RD
        LEFT JOIN Translations AS TMfr ON RD.Manufacturer = TMfr.Value AND TMfr.FieldName = 'Manufacturer'
        LEFT JOIN Translations AS TDt1 ON RD.Data1 = TDt1.Value AND TDt1.FieldName = 'Data1'
        LEFT JOIN Translations AS TDt2 ON RD.Data2 = TDt2.Value AND TDt2.FieldName = 'Data2'
        LEFT JOIN Translations AS TDt3 ON RD.Data3 = TDt3.Value AND TDt3.FieldName = 'Data3';

Extend to the complete set of fields. May Codd have mercy on your soul.
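Since the asker already tracks field definitions programmatically, "extending to the complete set of fields" could be automated along these lines. This is only a rough sketch, not the asker's actual tooling: `build_translate_sql` is a hypothetical helper, the table names are passed in as assumptions, and it runs against an in-memory SQLite database purely so the example is self-contained (the generated statement is plain SQL and should carry over to MySQL). Column names are interpolated directly, so they must come from trusted definitions, never from user input.

```python
import sqlite3


def build_translate_sql(key_cols, enum_cols, text_cols,
                        raw="RawData", clean="CleanData", trans="Translations"):
    """Build one INSERT ... SELECT with a LEFT JOIN per enumerated column."""
    all_cols = key_cols + enum_cols + text_cols
    select = [f"RD.{c}" for c in key_cols]
    joins = []
    for i, col in enumerate(enum_cols):
        alias = f"T{i}"  # a distinct alias for every join against the same table
        select.append(f"{alias}.ID AS {col}")
        joins.append(
            f"LEFT JOIN {trans} AS {alias} "
            f"ON RD.{col} = {alias}.Value AND {alias}.FieldName = '{col}'"
        )
    select += [f"RD.{c}" for c in text_cols]  # text columns are copied unchanged
    return (
        f"INSERT INTO {clean} ({', '.join(all_cols)})\n"
        f"SELECT {', '.join(select)}\n"
        f"FROM {raw} AS RD\n" + "\n".join(joins)
    )


# Tiny demo schema modelled on the sample tables in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Manufacturer TEXT NOT NULL,
                      SomeData TEXT, SomeTextData TEXT);
CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                           ID INTEGER NOT NULL,
                           PRIMARY KEY (FieldName, Value), UNIQUE (FieldName, ID));
CREATE TABLE CleanData (PartNumber TEXT PRIMARY KEY, Manufacturer INTEGER NOT NULL,
                        SomeData INTEGER, SomeTextData TEXT);
INSERT INTO RawData VALUES
    ('1x9kdah', 'GizmoCorp', 'ThisIsData', 'ThisStaysText'),
    ('8xcjkzh', 'GadgetInc', 'MoreData', 'ThisTooStaysText');
INSERT INTO Translations VALUES
    ('Manufacturer', 'GizmoCorp', 1), ('Manufacturer', 'GadgetInc', 2),
    ('SomeData', 'ThisIsData', 1), ('SomeData', 'MoreData', 2);
""")

sql = build_translate_sql(["PartNumber"], ["Manufacturer", "SomeData"], ["SomeTextData"])
conn.execute(sql)
rows = conn.execute("SELECT * FROM CleanData ORDER BY PartNumber").fetchall()
print(rows)
```

Because every row is produced by a single SELECT over RawData, each translated value stays on the row it came from; no per-field passes are needed.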

Question 5

For clarification, why would I use LEFT JOIN as opposed to INNER JOIN? Also, only PartNumber and Manufacturer are non-null in the data; does that affect your answer?

Question 6

By LEFT JOINing, you get the ID of the substitute value, if one exists. I'm assuming that not every value has a substitute, but if that is the case, then you can use INNER JOIN and replace COALESCE(X, Y) with just X.
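To make the difference concrete, here is a small sketch comparing the join types on a value that has no substitute. The `COALESCE(X, Y)` mentioned above does not appear in the query as shown; the sketch assumes it meant `COALESCE(T.ID, RD.Data1)`, i.e. falling back to the raw text when no ID exists. The sample value `NoSubstituteYet` is made up, and SQLite is used only to keep the example runnable on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Data1 TEXT);
CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                           ID INTEGER NOT NULL,
                           PRIMARY KEY (FieldName, Value), UNIQUE (FieldName, ID));
INSERT INTO RawData VALUES ('p1', 'ThisIsData'), ('p2', 'NoSubstituteYet');
INSERT INTO Translations VALUES ('Data1', 'ThisIsData', 1);
""")

JOIN_ON = "ON RD.Data1 = T.Value AND T.FieldName = 'Data1'"


def run(select_expr, join_kind):
    """Run the same translation query with a given select expression and join type."""
    sql = (f"SELECT RD.PartNumber, {select_expr} FROM RawData AS RD "
           f"{join_kind} JOIN Translations AS T {JOIN_ON} ORDER BY RD.PartNumber")
    return conn.execute(sql).fetchall()


left_rows = run("T.ID", "LEFT")                           # p2 is kept, ID is NULL
inner_rows = run("T.ID", "INNER")                         # p2 is dropped entirely
fallback_rows = run("COALESCE(T.ID, RD.Data1)", "LEFT")   # p2 keeps its raw text

print(left_rows)
print(inner_rows)
print(fallback_rows)
```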

Question 7

Regarding NULLs, good point, I just assumed the fields were NOT NULL. Edited.

Question 8

By the time this data gets to the point of running this query, there will be a substitute value. We automatically generate those as needed as well, and that step happens first
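The thread does not show how the substitutes are generated, so the following is only a guess at what such a first step might look like: find the distinct raw values with no entry yet, and give each the next free ID for that field. `add_missing_translations` and the sample manufacturer `AcmeLtd` are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Manufacturer TEXT NOT NULL);
CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                           ID INTEGER NOT NULL,
                           PRIMARY KEY (FieldName, Value), UNIQUE (FieldName, ID));
INSERT INTO RawData VALUES
    ('1x9kdah', 'GizmoCorp'), ('8xcjkzh', 'GadgetInc'), ('3qwerty', 'AcmeLtd');
INSERT INTO Translations VALUES ('Manufacturer', 'GizmoCorp', 1);
""")


def add_missing_translations(conn, field):
    """Assign the next free IDs to raw values of `field` that have no substitute yet.

    `field` is interpolated as a column name, so it must come from trusted
    field definitions, never from user input.
    """
    next_id = conn.execute(
        "SELECT COALESCE(MAX(ID), 0) + 1 FROM Translations WHERE FieldName = ?",
        (field,),
    ).fetchone()[0]
    missing = conn.execute(
        f"SELECT DISTINCT RD.{field} FROM RawData AS RD "
        f"LEFT JOIN Translations AS T ON T.FieldName = ? AND T.Value = RD.{field} "
        f"WHERE RD.{field} IS NOT NULL AND T.ID IS NULL ORDER BY RD.{field}",
        (field,),
    ).fetchall()
    conn.executemany(
        "INSERT INTO Translations (FieldName, Value, ID) VALUES (?, ?, ?)",
        [(field, value, next_id + i) for i, (value,) in enumerate(missing)],
    )
    return len(missing)


added = add_missing_translations(conn, "Manufacturer")
codes = conn.execute(
    "SELECT Value, ID FROM Translations WHERE FieldName = 'Manufacturer' ORDER BY Value"
).fetchall()
print(added, codes)
```

Once this step has run for every enumerated field, every non-null raw value has a match in Translations.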

Question 9

If, say, Data2 is null for a particular row and I'm selecting X as you mentioned, is the query going to be smart enough to insert null or do I need to add another condition?
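A quick way to check this is to run both join types against a row whose Data2 is NULL. In SQL, `NULL = anything` is never true, so a NULL never matches a Translations row: an INNER JOIN on that column drops the whole row, while a LEFT JOIN keeps it and inserts NULL. The sketch below assumes a cut-down schema (only Manufacturer and Data2) and uses SQLite so it runs on its own; the same NULL comparison rule applies in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Manufacturer TEXT NOT NULL,
                      Data2 TEXT);
CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                           ID INTEGER NOT NULL,
                           PRIMARY KEY (FieldName, Value), UNIQUE (FieldName, ID));
CREATE TABLE CleanData (PartNumber TEXT PRIMARY KEY, Manufacturer INTEGER NOT NULL,
                        Data2 INTEGER);
INSERT INTO RawData VALUES
    ('1x9kdah', 'GizmoCorp', 'MoreData'),
    ('8xcjkzh', 'GadgetInc', NULL);
INSERT INTO Translations VALUES
    ('Manufacturer', 'GizmoCorp', 1), ('Manufacturer', 'GadgetInc', 2),
    ('Data2', 'MoreData', 2);
""")

# Manufacturer is NOT NULL and always has a substitute, so INNER JOIN is safe there.
# Data2 is nullable, so the join type for it is left as a parameter.
QUERY = """
SELECT RD.PartNumber, TMfr.ID, TDt2.ID
FROM RawData AS RD
INNER JOIN Translations AS TMfr
    ON RD.Manufacturer = TMfr.Value AND TMfr.FieldName = 'Manufacturer'
{kind} JOIN Translations AS TDt2
    ON RD.Data2 = TDt2.Value AND TDt2.FieldName = 'Data2'
ORDER BY RD.PartNumber
"""

# INNER JOIN on the nullable column: the row with NULL Data2 disappears.
inner_rows = conn.execute(QUERY.format(kind="INNER")).fetchall()

# LEFT JOIN on the nullable column: the row survives and NULL is inserted.
conn.execute("INSERT INTO CleanData (PartNumber, Manufacturer, Data2)"
             + QUERY.format(kind="LEFT"))
clean_rows = conn.execute("SELECT * FROM CleanData ORDER BY PartNumber").fetchall()

print(inner_rows)
print(clean_rows)
```

So with nullable columns, keeping LEFT JOIN for those columns is the simpler option; no extra condition is needed.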

Question 10

Not many people like them, but I would suggest you use a cursor in a stored procedure to loop through the fields and normalise them into the target table (silly ETL process).

You could probably do better with a separate lookup table for each type of field, but it depends on your scenario.

Question 11

Creating separate lookup tables for each type of field may be possible, though I've already got something like 116 tables for this application, so adding a ton more might be a little painful. I'm not familiar with stored procedures - how would that work? And how much maintenance would be involved when you factor in about 40 different tables along the lines of what I described in my original question?

Question 12

Sorry for the late reply. If you are concerned about the number of tables, then you are probably better off keeping the one you have instead of splitting it. If you write your stored procedures properly, your maintenance would be minimal (unless new types / methods / fields are added). Here is the basic syntax of a cursor within a stored procedure: dev.mysql.com/doc/refman/5.0/en/cursors.html

Question 13

Could you create a view (or multiple views) based on the query that produces the results you want? If the main bottleneck is speed, having Spring/Hibernate go against a simple-looking view might be better. This might not work if you want to write to the "translated" versions of the tables, but I think views could at least help in situations where you just want to display the data.

Question 14

No, unfortunately a view won't help here, for the sole reason that we are indeed writing the translated table. The data from the translated fields is never simply displayed but queried against, and the enumerations are intended to speed up those queries on some really monstrous tables. The values are then reverse-translated on the fly for user display.
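For the read side described here, the reverse translation is the same join turned around: match on ID instead of Value. The sketch below is an assumption about how that could look, not the asker's implementation. `DisplayData` and `find_by_manufacturer` are hypothetical names, the schema is cut down to two enumerated columns, and SQLite is used only so the example runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                           ID INTEGER NOT NULL,
                           PRIMARY KEY (FieldName, Value), UNIQUE (FieldName, ID));
CREATE TABLE CleanData (PartNumber TEXT PRIMARY KEY, Manufacturer INTEGER NOT NULL,
                        Data1 INTEGER);
INSERT INTO Translations VALUES
    ('Manufacturer', 'GizmoCorp', 1), ('Manufacturer', 'GadgetInc', 2),
    ('Data1', 'ThisIsData', 1);
INSERT INTO CleanData VALUES ('1x9kdah', 1, 1), ('8xcjkzh', 2, NULL);

-- Hypothetical display view: joins on ID (not Value) to recover the original text.
CREATE VIEW DisplayData AS
SELECT CD.PartNumber, TMfr.Value AS Manufacturer, TDt1.Value AS Data1
FROM CleanData AS CD
LEFT JOIN Translations AS TMfr
    ON CD.Manufacturer = TMfr.ID AND TMfr.FieldName = 'Manufacturer'
LEFT JOIN Translations AS TDt1
    ON CD.Data1 = TDt1.ID AND TDt1.FieldName = 'Data1';
""")


def find_by_manufacturer(name):
    """Translate the search text to its code once, then filter on the integer column."""
    row = conn.execute(
        "SELECT ID FROM Translations WHERE FieldName = 'Manufacturer' AND Value = ?",
        (name,),
    ).fetchone()
    if row is None:
        return []  # unknown text cannot match any translated row
    return conn.execute(
        "SELECT PartNumber FROM CleanData WHERE Manufacturer = ? ORDER BY PartNumber",
        (row[0],),
    ).fetchall()


display_rows = conn.execute("SELECT * FROM DisplayData ORDER BY PartNumber").fetchall()
gadget_parts = find_by_manufacturer("GadgetInc")
print(display_rows)
print(gadget_parts)
```

The heavy queries still run against the integer columns of CleanData; the view is only touched when text has to be shown to a user.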

Question 15

@StormeHawke: It seems that MySQL has the ability to insert/update to a view: dev.mysql.com/doc/refman/5.0/en/view-updatability.html (I have never tried this but maybe it will help you?)

Question 16

Since views are nothing more than stored select statements, I don't really see how this could help since my fundamental problem is writing a select query that accomplishes my goals

Question 17

@StormeHawke: Sorry, I thought the problem was using Spring and Hibernate to produce the list of translated enum values, so I suggested the view to make it easier for the ORM to handle (since it queries against a single view instead of figuring out all of the joins on its own). But... it looks like I misunderstood what you were asking.

score 2 · Accepted Answer · 2013-03-08 19:57:29Z
