I use MariaDB to store data using EAV model. I need EAV, because each attribute of each entity can have its own validity (validFrom
, validTo
).
For example, I have entity person and I track changes made to their homeCity
and degree
attributes:
attributes
table:
attributeId name
------------------------
1 'homeCity'
2 'degree'
values
table:
valueId entityId attributeId value validFrom validTo
------------------------------------------------------------------
1 1 1 'London' 2022年01月01日 2022年12月31日
2 1 1 'Paris' 2023年01月01日 NULL
3 1 2 'BSc' 2022年01月01日 2022年06月30日
4 1 2 'MSc' 2022年07月01日 NULL
5 2 1 'Madrid' 2022年01月01日 NULL
The problem is that I want to query the data conveniently (using WHERE
clauses, JOIN
s with other tables etc.), which is possible only if it is a standard SQL table with rows and columns looking like this:
entityId homeCity degree
----------------------------
1 'Paris' 'Msc'
2 'Madrid' NULL
So I need to programmatically create/update such table from EAV model data after each EAV data update (or maybe create/update it only when user really requests the data).
Another problem is that such table would be valid only for certain date (typically for the current date). If I want to query data valid to a different date (e.g. 2022年01月01日
), I need to dynamically create another table:
entityId homeCity degree
----------------------------
1 'London' 'Bsc'
2 'Madrid' NULL
Is it a good approach to dynamically create a table if user asks to view data valid to a distinct date?
Another solution would be to store all validity data (i.e. all versions of given entity) in a single dynamically created table:
id entityId homeCity degree validFrom validTo
--------------------------------------------------------
1 1 'London' 'MSc' 2022年01月01日 2022年06月30日
2 1 'London' 'BSc' 2022年07月01日 2022年12月31日
3 1 'Paris' 'BSc' 2023年01月01日 NULL
4 2 'Madrid' NULL 2022年01月01日 NULL
In the query I just need to specify filter on validity: WHERE '2022-01-01' BETWEEN validFrom AND IFNULL(validTo, '9999-01-01')
. This solution is ideal for querying, but creating the table is slower, because it is necessary to compute all possible date versions of the entity.
Is this a good approach or is it better not to use EAV model at all?
Some details on my business domain: It is an app of school agenda (students, classes, etc). Each entity has about 50 attributes, almost every of them is potentially changeable (surname, e-mail, nationality...). Some of them only change rarely (home address), some change regularly (e.g. every year - school class which student attends). Attributes usually change individually, but it is not a rule - e.g. when home address changes, ale parts of it change at once (street, city, etc.). I don't expect many users editing the entity at the same time.
3 Answers 3
In a comment, you wrote
Business domain of the app is school agenda (students, classes, etc). Each entity has about 50 attributes, almost every of them is potentially changeable (surname, e-mail, nationality...). Some of them only change rarely (home address), some change regularly (e.g. every year - school class which student attends). Attributes usually change individually, but it is not a rule - e.g. when home address changes, ale parts of it change at once (street, city, etc.). I don't expect many users editing the entity at the same time
which tells me several things which are in no way obvious from your original question:
your entities are too large. I expect a relational model with 3NF in mind to have smaller entitities. Since you are considerung a redesign of your data structures, you can probably resolve each of your current entities into several parent-child tables .
once you have splitted your current entities into smaller ones, it will most probably be sufficient to model validity at the entity level, not at the attribute level.
For example, to keep track of adresses over time, you don't have to store the the information when the city, the street or the zip code changes individually. Same holds for "contact information" of a person (combination of email addresses and phone numbers).
Other kind of information is probably not required to be kept in a historic fashion, the parent entity of a person (with child entities adress and contact information) may contain a unique ID for this person which models its identity and must not be changed over time.
your data model does not have the requirement of incorporating different attributes for each entity record. The attributes are all the same for all records of one entity.
So in short: you do not need an EAV, you only need a uniform approach of modeling history in your relational model for certain entities. Looking through older Q&As on this site and Stackoverflow, I found these might be helpful as a good start:
Database design - maintaining versions of an object
https://stackoverflow.com/questions/3874199/how-to-store-historical-data
The result will be a model which allows you to query data exactly the way you asked for it.
Finally, since it seems you have an existing system, how do you get from here to there? One approach may be indeed to create a new "classic" relational model in a separate database (or DB schema) based on the ideas from above, and then fill the classic model from your current EAV-based system - once initially, and then dynamically by triggers which are fired whenever a change to the EAV records happen. From what you wrote in the question, I guess that is what you already had in mind.
At a later point in time, you may consider to throw your EAV over board and rewrite your application using the "classic" relational model exclusively. But beware, I don't know if that is economically feasible for you - the effort of such big rewrites are often underestimated.
-
1This is a very good point. I was considering to track history on main entity level (versions of person as whole) and on attribute level (versions of person's attributes). The best solution is to split entity into subentities and track history on subentities level. With attribute level solution, the problem is I have to join every attribute to the main table separately. With main entity level solution, when entity changes, whole row with all columns has to be duplicated, which leads to data redundancy. With subentities level solution, both of these problems are solved.user14967413– user149674132023年12月05日 12:05:01 +00:00Commented Dec 5, 2023 at 12:05
-
Fortunately, I currently don't use EAV model. I use relation model, just with limited history tracking possibilities. Thus splitting into subentities is much easier for me than introducing EAV model. Maybe just UI design of subentities history tracking is a challenge. I will consider the final solution and choose the accepted answer then. :-)user14967413– user149674132023年12月05日 12:15:34 +00:00Commented Dec 5, 2023 at 12:15
-
The issue here is that user must understand the concept of subentities. If he wants to enter e-mail valid in some period of time, he doesn't only care about historic changes of e-mail, but about historic changes of the whole contacts subentity. E.g. if he wants to enter e-mail valid from '2023年03月01日', he creates a new subentity instance valid from '2023年03月01日' and containing this e-mail. However there already is another subentity instance valid from '2023年04月01日' which represents changed telephone number. User must understand that he has to change the e-mail in both instances.user14967413– user149674132023年12月05日 12:32:35 +00:00Commented Dec 5, 2023 at 12:32
Is this a good approach or is it better not to use EAV model at all?
I think I'd be on strong ground in saying that most practitioners disfavour EAV, and find that its challenges outweigh its sole advantage which is the ability to add or remove attributes dynamically without changing the explicit schema of the database.
I worked with an EAV database several years ago, apparently written by a postgraduate student who the company had commissioned some years earlier. Not only was it rather more difficult than the norm to navigate and devise queries for, but it's performance was extremely poor and sluggish for workloads and data volumes that were quite ordinary.
And in practice, although the database could handle attributes being added dynamically, the screens and paginated reports could not adapt dynamically. They all still relied on a kind of latent database schema fixed and encoded in the EAV data, and whenever the available attributes changed these peripheral artefacts had to be considered and reworked just as much as if the explicit schema of the database had changed.
So the costs were very real and the advantages were a complete mirage.
My advice would be to model your attributes in standard ways. If you need a history per attribute, then use a table per attribute and join.
If you need to add or remove attributes, then accept the need to alter the explicit database definition by adding or removing the associated tables.
-
1I'd agree, EAV databases are usually a case of "if everything you have is a hammer, ...". But now we have a really rich ecosystem of mature NoSQL databases (including JSON support in nearly all RDBMS), so there tend to be better ways of dealing with more dynamic data. And in a lot of cases, there actually is a reasonable schema.amon– amon2023年12月02日 17:32:59 +00:00Commented Dec 2, 2023 at 17:32
-
5@amon, a JSON column is alright if you just need a dumping ground for residual attributes, but honestly, the need to define structure for data storage is one of the essential and irreducible problems in automatic data processing. Like I say in my answer, if you don't define that structure explicitly (and keep it maintained), you'll instead end up relying on a latent structure - which has all the same rigidities of the explicit structure, but none of the visibility. It just doesn't work like it's proponents seem to think.Steve– Steve2023年12月02日 19:56:36 +00:00Commented Dec 2, 2023 at 19:56
-
I agree I should use standard columns for most attributes and only certain attributes should track historic changes. However it is not convenient to use "table per attribute" solution with separate table for every history tracking attribute. Rather I would store all attributes values in the same
values
table. It is no problem to join them to the main table, as I showed in my answer. Of course, it is necessary to consider thatvalues
table can store different data types - there can be special column for each data type.user14967413– user149674132023年12月05日 12:38:22 +00:00Commented Dec 5, 2023 at 12:38 -
1@user14967413, as I say, the general experience of those who use databases commercially, is that the "convenience" of EAV is a mirage. Because EAV seems to make database schema design easier, and because that schema design tends to be amongst the first things that are designed in a database application, the severe pain that EAV causes always occurs further down the road in a different place than where its modest advantages appeared. If it is inconvenient to model your problem in standard ways, then that is either because it is a hard problem, or because there is something absurd about it.Steve– Steve2023年12月05日 13:15:48 +00:00Commented Dec 5, 2023 at 13:15
-
3@user14967413, if they aren't database programmers then they won't be competent in the essential skills of database programmers, such as storage schema design and maintenance. From my point of view, you've revealed that we are now indeed in the realm of the "absurd" I referred to earlier - you have grandiose ideas that EAV will mean you don't have to hire programmers anymore, but instead replace them with "trained" staff who prefer "no code", who will then maintain the latent schema encoded in the EAV attributes. Utter pie in the sky.Steve– Steve2023年12月05日 18:01:00 +00:00Commented Dec 5, 2023 at 18:01
I realized that EAV model can be transformed into a standard SQL table "on the fly" by creating derived table. In the derived table I use "brute force" join solution where every attribute from EAV is loaded into separate table column by separate LEFT JOIN
.
Here is an example using WITH
for derived table definition. I assume there is a static table personsMain
which defines a list of persons.
WITH
-- derived table definition
persons AS (
SELECT
p.id,
v1.value AS 'homeCity',
v2.value AS 'degree'
FROM personsMain p
LEFT JOIN `values` v1
ON v1.entityId = p.id
AND v1.attributeId = 1
AND CURDATE() BETWEEN v1.validFrom AND IFNULL(v1.validTo, '9999-01-01')
LEFT JOIN `values` v2
ON v2.entityId = p.id
AND v2.attributeId = 2
AND CURDATE() BETWEEN v2.validFrom AND IFNULL(v2.validTo, '9999-01-01')
)
-- any custom query using the derived table
SELECT *
FROM persons
GROUP BY homeCity
This solution is not ideal.
Pros
- no need to create a table from EAV data and maintain it - table gets created on the fly
- data loaded from EAV is always up to date
- it is possible to work with multiple such derived tables in the query (just define them in the
WITH
part)
Cons
- each query is very large and contains repetitive code
- this approach is slow when there is many attributes loaded from EAV - it is necessary to have majority of attributes in the main table (
personsMain
) and load from EAV only those attributes that really change often - this approach it is very slow when working with multiple derived tables and joining them (unless you restrict number of rows by filter condition)
-
1"each query is very large and contains repetitive code" - define a
VIEW
!Bergi– Bergi2023年12月05日 13:33:16 +00:00Commented Dec 5, 2023 at 13:33 -
@Bergi Good idea, however I need to query EAV data valid to an arbitrary date. I.e. I need to replace
CURDATE()
with somehing like2023年01月01日
.VIEW
unfortunately doesn't accept arguments, so I really need to specify the literal query.user14967413– user149674132023年12月05日 13:43:22 +00:00Commented Dec 5, 2023 at 13:43 -
1Hm, you could create a view representing all time ranges, with a
valid
column, that you'd then filter on in each query, however theJOIN
for that range intersection would be rather complicated. Or alternatively, you could create an SQLFUNCTION
that takes the date as an argument. In either case, I'm not certain how well the optimiser can inline them in your query, you'd need to try it out.Bergi– Bergi2023年12月05日 13:49:14 +00:00Commented Dec 5, 2023 at 13:49 -
@Bergi I was also considering using
FUNCTION
, however my RDBMS (MariaDB) doesn't support table-valued functions (functions returning a table).user14967413– user149674132023年12月05日 14:00:25 +00:00Commented Dec 5, 2023 at 14:00
valueId
?