We have a fairly simple table structure, but with a LOT of fields per table (40+). This data is initially produced in plain-text, user-readable tables, and is then translated into higher-performance, easier-to-query tables before being installed for use in production.
What we do is, wherever possible and reasonable we translate certain fields into enumerated values, and keep track of the enumerations in a MasterEnum table. There are usually 20-25 enumerated fields out of 40 or so.
Sample table structure:
Plain text version:
| PartNumber | Manufacturer | SomeData   | SomeMoreData | SomeTextData     ...
----------------------------------------------------------------------------------
| 1x9kdah    | GizmoCorp    | ThisIsData | OtherData    | ThisStaysText    ...
| 8xcjkzh    | GadgetInc    | MoreData   | OtherData2   | ThisTooStaysText ...
Target table sample structure:
| PartNumber | Manufacturer | SomeData | SomeMoreData | SomeTextData     ...
-------------------------------------------------------------------------------------
| 1x9kdah    | 1            | 1        | 1            | ThisStaysText    ...
| 8xcjkzh    | 2            | 2        | 2            | ThisTooStaysText ...
Master Enumeration Table Structure
| FieldName    | InputText  | ValueCode |
---------------------------------------------
| Manufacturer | GizmoCorp  | 1         |
| Manufacturer | GadgetInc  | 2         |
| SomeData     | ThisIsData | 1         |
| SomeData     | MoreData   | 2         |
| SomeMoreData | OtherData  | 1         |
| SomeMoreData | OtherData2 | 2         |
We have a means of doing this translation that works, and works well; however, it's a little on the slow side since all the processing is done in Java via Spring/Hibernate. My question is:
Is there a way to write a single query that would accomplish all the above translations? (Note that we have an excellent way of keeping track of our field definitions programmatically, so generating complex SQL queries on the fly is not an issue.) If it is not possible to do it in a single query, how would I structure queries to iterate over the individual fields and make sure that, as the translations happen, the data inserted into the new table remains associated with the correct rows?
Note that it is safe to assume the target table is always empty at the beginning of the process.
- I would suggest that before you do any database design work (and you need to; this design is atrocious) you read amazon.com/SQL-Antipatterns-Programming-Programmers-ebook/dp/… – HLGEM, Mar 8, 2013 at 18:55
- @HLGEM This is an in-place design that I need to make more efficient as is. Thank you for the recommendation, however. – StormeHawke, Mar 8, 2013 at 21:11
3 Answers
As others have pointed out, this is a Really Bad Idea. Still, if you insist, the SQL is not hugely complicated:
-- Source table: human-readable text values
CREATE TABLE RawData
(
    PartNumber   VARCHAR(30) NOT NULL PRIMARY KEY,
    Manufacturer VARCHAR(30) NOT NULL,
    Data1        VARCHAR(30),
    Data2        VARCHAR(30),
    Data3        VARCHAR(30)
);

-- Lookup table: one row per (field, text value) pair
CREATE TABLE Translations
(
    FieldName VARCHAR(30) NOT NULL,
    Value     VARCHAR(30) NOT NULL,
    ID        INT         NOT NULL,
    PRIMARY KEY (FieldName, Value),
    UNIQUE (FieldName, ID)
);

-- Target table: receives the translated rows
CREATE TABLE CleanData
(
    PartNumber   VARCHAR(30) NOT NULL PRIMARY KEY,
    Manufacturer VARCHAR(30) NOT NULL,
    Data1        VARCHAR(30),
    Data2        VARCHAR(30),
    Data3        VARCHAR(30)
);

INSERT INTO CleanData (PartNumber, Manufacturer, Data1, Data2, Data3)
SELECT
    RD.PartNumber,
    TMfr.ID AS Manufacturer,
    TDt1.ID AS Data1,
    TDt2.ID AS Data2,
    TDt3.ID AS Data3
FROM
    RawData AS RD
    LEFT JOIN Translations AS TMfr ON RD.Manufacturer = TMfr.Value AND TMfr.FieldName = 'Manufacturer'
    LEFT JOIN Translations AS TDt1 ON RD.Data1 = TDt1.Value AND TDt1.FieldName = 'Data1'
    LEFT JOIN Translations AS TDt2 ON RD.Data2 = TDt2.Value AND TDt2.FieldName = 'Data2'
    LEFT JOIN Translations AS TDt3 ON RD.Data3 = TDt3.Value AND TDt3.FieldName = 'Data3';
Extend to the complete set of fields. May Codd have mercy on your soul.
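Since the asker mentions that field definitions are tracked programmatically, extending the query to every enumerated field could be done by generating one join per field. The following is only a rough sketch under stated assumptions: it uses Python and an in-memory SQLite database purely so it is self-contained (the thread's actual stack is Java and, apparently, MySQL), and `ENUM_FIELDS`, `build_translation_sql`, and `run_demo` are hypothetical names. Field names are interpolated directly into the SQL, so they must come from trusted definitions, never from user input.

```python
import sqlite3

# Hypothetical list of enumerated columns; in practice this would come from
# the application's own field definitions.
ENUM_FIELDS = ["Manufacturer", "Data1", "Data2", "Data3"]


def build_translation_sql(fields):
    """Build one INSERT ... SELECT with one LEFT JOIN per enumerated field."""
    columns = ", ".join(["PartNumber"] + fields)
    selects = ["RD.PartNumber"]
    joins = []
    for i, field in enumerate(fields):
        alias = f"T{i}"  # a distinct alias for each join on Translations
        selects.append(f"{alias}.ID AS {field}")
        joins.append(
            f"LEFT JOIN Translations AS {alias} "
            f"ON RD.{field} = {alias}.Value AND {alias}.FieldName = '{field}'"
        )
    return (
        f"INSERT INTO CleanData ({columns})\n"
        f"SELECT {', '.join(selects)}\n"
        "FROM RawData AS RD\n" + "\n".join(joins)
    )


def run_demo():
    """Create the sample tables in memory, run the generated query, return CleanData."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Manufacturer TEXT NOT NULL,
                              Data1 TEXT, Data2 TEXT, Data3 TEXT);
        CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                                   ID INTEGER NOT NULL,
                                   PRIMARY KEY (FieldName, Value), UNIQUE (FieldName, ID));
        CREATE TABLE CleanData (PartNumber TEXT PRIMARY KEY, Manufacturer INTEGER,
                                Data1 INTEGER, Data2 INTEGER, Data3 INTEGER);
        INSERT INTO RawData VALUES
            ('1x9kdah', 'GizmoCorp', 'ThisIsData', 'OtherData', NULL),
            ('8xcjkzh', 'GadgetInc', 'MoreData', 'OtherData2', NULL);
        INSERT INTO Translations VALUES
            ('Manufacturer', 'GizmoCorp', 1), ('Manufacturer', 'GadgetInc', 2),
            ('Data1', 'ThisIsData', 1), ('Data1', 'MoreData', 2),
            ('Data2', 'OtherData', 1), ('Data2', 'OtherData2', 2);
    """)
    conn.execute(build_translation_sql(ENUM_FIELDS))
    return conn.execute("SELECT * FROM CleanData ORDER BY PartNumber").fetchall()


if __name__ == "__main__":
    print(build_translation_sql(ENUM_FIELDS))
    print(run_demo())
```

Because every join hangs off the same `RawData` row, all translated values for a part land in the same inserted row, which is the row-association guarantee the question asks about.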
- For clarification, why would I use LEFT JOIN as opposed to INNER JOIN? Also, only PN and MFG are non-null in the data; does that affect your answer? – StormeHawke, Mar 8, 2013 at 20:25
- By LEFT JOINing, you get the ID of the substitute value, if one exists. I'm assuming that not every value has a substitute, but if that is the case, then you can use INNER JOIN and replace COALESCE(X, Y) with just X. – Jon of All Trades, Mar 8, 2013 at 20:28
- Regarding NULLs, good point, I just assumed the fields were NOT NULL. Edited. – Jon of All Trades, Mar 8, 2013 at 20:29
- By the time this data gets to the point of running this query, there will be a substitute value. We automatically generate those as needed as well, and that step happens first. – StormeHawke, Mar 8, 2013 at 20:37
- If, say, Data2 is null for a particular row and I'm selecting X as you mentioned, is the query going to be smart enough to insert null or do I need to add another condition? – StormeHawke, Mar 8, 2013 at 20:40
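The NULL question in the last comment can be checked with a small experiment. This is a sketch only, run against an in-memory SQLite database from Python for convenience rather than the asker's own server; the table contents are made-up sample rows. It relies on the standard SQL rule that `NULL = anything` is never true, so a NULL source value matches no `Translations` row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Data2 TEXT);
    CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                               ID INTEGER NOT NULL);
    -- The second part has no Data2 value at all.
    INSERT INTO RawData VALUES ('1x9kdah', 'OtherData'), ('8xcjkzh', NULL);
    INSERT INTO Translations VALUES ('Data2', 'OtherData', 1);
""")

# The same query is run twice; only the join type differs.
JOIN_SQL = """
    SELECT RD.PartNumber, T.ID
    FROM RawData AS RD
    {join} Translations AS T ON RD.Data2 = T.Value AND T.FieldName = 'Data2'
    ORDER BY RD.PartNumber
"""

left_rows = conn.execute(JOIN_SQL.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(JOIN_SQL.format(join="INNER JOIN")).fetchall()

print("LEFT JOIN: ", left_rows)
print("INNER JOIN:", inner_rows)
```

With `LEFT JOIN` the row with the NULL `Data2` is kept and its translated value is NULL, so no extra condition is needed. With `INNER JOIN` that row disappears from the result entirely, which suggests keeping `LEFT JOIN` for any nullable field even when every non-null value is guaranteed a substitute.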
Not many people like them, but I would suggest you use a cursor in a stored procedure to loop through the fields and normalise them into the target table (silly ETL process).
You could probably do better with separate lookup tables for each type of field, but it depends on your scenario.
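For comparison, the field-by-field idea can be sketched without a stored procedure. Note that this is a swapped-in technique, not the cursor approach suggested above: a plain loop in the calling program issues one `UPDATE` per field, and MySQL cursor syntax is not shown. It is a minimal sketch assuming an in-memory SQLite database driven from Python, with `ENUM_FIELDS` as a hypothetical, trusted field list.

```python
import sqlite3

ENUM_FIELDS = ["Manufacturer", "Data1"]  # hypothetical, trusted field list

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Manufacturer TEXT, Data1 TEXT);
    CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                               ID INTEGER NOT NULL);
    CREATE TABLE CleanData (PartNumber TEXT PRIMARY KEY, Manufacturer INTEGER,
                            Data1 INTEGER);
    INSERT INTO RawData VALUES
        ('1x9kdah', 'GizmoCorp', 'ThisIsData'),
        ('8xcjkzh', 'GadgetInc', NULL);
    INSERT INTO Translations VALUES
        ('Manufacturer', 'GizmoCorp', 1), ('Manufacturer', 'GadgetInc', 2),
        ('Data1', 'ThisIsData', 1);
""")

# Step 1: seed the (empty) target with one row per part, so that every later
# update has a key to match on.
conn.execute("INSERT INTO CleanData (PartNumber) SELECT PartNumber FROM RawData")

# Step 2: one UPDATE per field. The correlation on PartNumber is what keeps
# each translated value attached to the correct row.
for field in ENUM_FIELDS:
    conn.execute(
        f"""
        UPDATE CleanData
        SET {field} = (
            SELECT T.ID
            FROM RawData AS RD
            JOIN Translations AS T
              ON T.Value = RD.{field} AND T.FieldName = ?
            WHERE RD.PartNumber = CleanData.PartNumber
        )
        """,
        (field,),
    )

clean_rows = conn.execute("SELECT * FROM CleanData ORDER BY PartNumber").fetchall()
print(clean_rows)
```

This makes one pass over the target per field, so with 20-25 enumerated fields it is likely slower than a single multi-join `INSERT ... SELECT`; it mainly illustrates how the primary key preserves row association when fields are processed one at a time.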
- Creating separate lookup tables for each type of field may be possible, though I've already got something like 116 tables for this application, so adding a ton more might be a little painful. I'm not familiar with stored procedures; how would that work? And how much maintenance would be involved when you factor in about 40 different tables along the lines of what I described in my original question? – StormeHawke, Mar 8, 2013 at 17:24
- Sorry for the late reply. If you are concerned about the number of tables, then you are probably better off just keeping the one you have instead of splitting it. If you write your stored procedures properly, your maintenance would be minimal (unless new types / methods / fields are added). Here is the basic syntax of a cursor within a stored procedure: dev.mysql.com/doc/refman/5.0/en/cursors.html – RoKa, Mar 13, 2013 at 9:16
Could you create a view (or multiple views) based on the query that produces the results you want? If the main bottleneck is speed, having Spring/Hibernate go against a simple-looking view might be better. This might not work if you want to write to the "translated" versions of the tables, but I think using views could at least help in situations where you just want to display the data.
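As a rough illustration of the view idea, the translating query can be wrapped in a view so the ORM sees a single flat relation. This is a sketch under assumptions: `TranslatedParts` is a hypothetical view name, only one enumerated field is shown, and an in-memory SQLite database driven from Python stands in for the real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RawData (PartNumber TEXT PRIMARY KEY, Manufacturer TEXT NOT NULL);
    CREATE TABLE Translations (FieldName TEXT NOT NULL, Value TEXT NOT NULL,
                               ID INTEGER NOT NULL);
    INSERT INTO RawData VALUES ('1x9kdah', 'GizmoCorp'), ('8xcjkzh', 'GadgetInc');
    INSERT INTO Translations VALUES
        ('Manufacturer', 'GizmoCorp', 1), ('Manufacturer', 'GadgetInc', 2);

    -- TranslatedParts is a hypothetical name; the view stores only the query,
    -- not the translated data.
    CREATE VIEW TranslatedParts AS
    SELECT RD.PartNumber, TMfr.ID AS Manufacturer
    FROM RawData AS RD
    LEFT JOIN Translations AS TMfr
        ON RD.Manufacturer = TMfr.Value AND TMfr.FieldName = 'Manufacturer';
""")

# Callers query the view as if it were the translated table.
view_rows = conn.execute(
    "SELECT PartNumber, Manufacturer FROM TranslatedParts ORDER BY PartNumber"
).fetchall()
print(view_rows)
```

The joins still run on every read, so a plain view like this hides the complexity from the caller but does not by itself give the query-speed benefit of a materialised, translated table.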
- No, unfortunately a view won't help here, for the sole reason that we are indeed writing to the translated table. The data from the translated fields is never simply displayed, but queried against, and the enumerations are intended to speed up said queries on some really monstrous tables. The values are then reverse-translated on the fly for user display. – StormeHawke, Mar 8, 2013 at 17:22
- @StormeHawke: It seems that MySQL has the ability to insert/update to a view: dev.mysql.com/doc/refman/5.0/en/view-updatability.html (I have never tried this but maybe it will help you?) – FrustratedWithFormsDesigner, Mar 8, 2013 at 17:25
- Since views are nothing more than stored select statements, I don't really see how this could help, since my fundamental problem is writing a select query that accomplishes my goals. – StormeHawke, Mar 8, 2013 at 17:42
- @StormeHawke: Sorry, I thought the problem was using Spring and Hibernate to produce the list of translated enum values, so I suggested the view to make it easier for the ORM to handle (since it queries against a single view instead of figuring out all of the joins on its own). But... it looks like I misunderstood what you were asking. – FrustratedWithFormsDesigner, Mar 8, 2013 at 22:04