74

We process a routine data feed from a client who just refactored their database from a form that seems familiar (one row per entity, one column per attribute) to one that seems unfamiliar to me (one row per entity per attribute):

Before: one column per attribute

ID Ht_cm wt_kg Age_yr ... 
1 190 82 43 ...
2 170 60 22 ...
3 205 90 51 ...

After: one column for all attributes

ID Metric Value
 1 Ht_cm 190
 1 Wt_kg 82
 1 Age_yr 43
 1 ...
 2 Ht_cm 170
 2 Wt_kg 60
 2 Age_yr 22
 2 ...
 3 Ht_cm 205
 3 Wt_kg 90
 3 Age_yr 51
 3 ...

Is there a name for this database structure? What are the relative advantages? The old way seems easier to place validity constraints on specific attributes (non-null, non-negative, etc.) and easier to calculate averages. But I can see how it might be easier to add new attributes without refactoring the database. Is this a standard/preferred way of structuring data?

Evan Carroll
65.7k50 gold badges259 silver badges510 bronze badges
asked Jul 12, 2012 at 2:41

5 Answers 5

101
+100

It's called Entity-Attribute-Value (also sometimes 'name-value pairs') and it's a classic case of "a round peg in a square hole" when people use the EAV pattern in a relational database.

Here's a list of why you shouldn't use EAV:

  • You can't use data types. It doesn't matter if the value is a date, a number or money (decimal). It's always going to be coerced to varchar. This can be anything from a minor performance problem to a massive gut-ache (ever had to chase down a one-cent variation in a monthly roll-up report?).
  • You can't (easily) enforce constraints. It requires a ridiculous amount of code to enforce "Everyone needs to have a height between 0 and 3 metres" or "Age must be not null and>= 0", as opposed to the 1-2 lines that each of those constraints would be in a properly-modelled system.
  • Related to above, you can't easily guarantee that you get the information you need for each client (age might be missing from one, then the next might be missing their height etc.). You can do it, but it's a hell of a lot more difficult than SELECT height, weight, age FROM Client where height is null or weight is null.
  • Related again, duplicate data is a lot harder to detect (what happens if they give you two ages for one client? De-EAVing the data, as below, will give you two rows of results if you have one attribute doubled. If one client has two separate entries for two attributes, you'll get four rows from the query below).
  • You can't even guarantee that the attribute names are consistent. "Age_yr" might become "AGE_IN_YEARS" or "age". (Admittedly this is less of a problem when you're receiving an extract versus when people are inserting data, but still.)
  • Any sort of nontrivial query is a complete disaster. To relationalise a three-attribute EAV system so you can query it in a rational fashion requires three joins of the EAV table.

Compare:

SELECT cID.ID AS [ID], cH.Value AS [Height], cW.Value AS [Weight], cA.Value AS [Age]
FROM (SELECT DISTINCT ID FROM Client) cID 
 LEFT OUTER JOIN 
 Client cW ON cID.ID = cW.ID AND cW.Metric = "Wt_kg" 
 LEFT OUTER JOIN 
 Client cH ON cID.ID = cH.ID AND cW.Metric = "Ht_cm" 
 LEFT OUTER JOIN 
 Client cA ON cID.ID = cA.ID AND cW.Metric = "Age_yr"

To:

SELECT c.ID, c.Ht_cm, c.Wt_kg, c.Age_yr
FROM Client c

Here's a (very short) list of when you should use EAV:

  • When there's absolutely no way around it and you have to support schema-less data in your database.
  • When you just need to store "stuff" and don't expect to have to need it in a more structured form. Beware, though, the monster called "changing requirements".

I know I just spent this entire post detailing why EAV is a terrible idea in most cases - but there are a few cases where it's needed/unavoidable. however, most of the time (including the example above), it's going to be far more hassle than it's worth. If you have a requirement for wide support of EAV-type data input, you should look at storing them in a key-value system, e.g. Hadoop/HBase, CouchDB, MongoDB, Cassandra, BerkeleyDB.

answered Jul 12, 2012 at 3:57
2
  • 4
    Agreed, but I would add that EAV is also a good approach to use when you are keeping a list of things which are semantically irrelevant to your system (not just schema-less). For example an online product catalog where the product features need to be stored and listed. You have a list of key/value pairs to regurgitate, but the system doesn't actually know or care about what those keys or values are about. In that situation, the perils of EAV are irrelevant. Commented Jul 12, 2012 at 12:03
  • 10
    @JoelBrown You don't care NOW, but if down the road a VP asks to know how many shirts in the catalog have both brown buttons and button down collars, it'll be a bitch of a query to write. EAV itself normally indicates a lack of planning or foresight. Commented Jul 12, 2012 at 12:36
19

Entity Attribute Value (EAV)

It is considered to be an anti-pattern by many, including me.

Here are your alternatives:

  1. use database table inheritance

  2. use XML data and SQLXML functions

  3. use a nosql database, like HBase

answered Jul 12, 2012 at 2:54
1
  • 3
    Definitely an anti-pattern for most use cases. If you have a really small data set and performance doesn't matter it may work for you. Commented Jul 12, 2012 at 2:58
18

In PostgreSQL, one very good way to deal with EAV structures is the additional module hstore, available for version 8.4 or later. The manual:

This module implements the hstore data type for storing sets of key/value pairs within a single PostgreSQL value. This can be useful in various scenarios, such as rows with many attributes that are rarely examined, or semi-structured data. Keys and values are simply text strings.

Requires the additional module hstore. See:

Since Postgres 9.2 there is also the json type and a host of functionality to go with it (most of it added with 9.3).

Postgres 9.4 adds the (largely superior) "binary JSON" data type jsonb. With advanced index options.

answered Nov 13, 2012 at 10:25
11

Funny to see how EAV db model is criticized and even considered as an "anti-pattern" by some.

As far as I'm concerned, the major downsides are :

  • Learning curve is steeper if you get on a project which already started using EAV a while ago. Indeed, the queries are tough as you greatly increase the number of joins (and tables) and so it will ask more time for you to understand . Just have a look at the Magento project and see how the dev external to the project have a tough time working on the DB, yet the documentation is well sustained.
  • Not suited for reporting, if you need to get the number of people who's name started with "M" etc...

However, you should definitely not discard this solution, and here is why :

  • Simon talked about the monster called "changing requirements". I like this expression :). And IMHO this is precisely why EAV may be a good candidate, because this is well suited for "change", as you can add as many attributes as you wish quite easily. Of course it depends on the requirements we're changing. If we're talking about a whole new business, of course you will have to review your dataModel, but EAV offers a lot of flexibility. Just because it asks for more rigour, doesn't mean this is less interesting.
  • It was also said that "You can't use data types." : This is wrong. You may very well have several value tables, one for each dataType. You then have to specify in your attribute table which sort of dataType is your attribute. In fact, a mix of classic relationnal/EAV with class relationship offers a lot of interesting potential in dataBase design.
answered Jan 6, 2013 at 23:25
0
10

If you have a database that is using the EAV structure, it is possible to query the data a variety of ways.

@Simon's answer already shows how to perform a query using multiple joins.

Sample Data Used:

CREATE TABLE yourtable ([ID] int, [Metric] varchar(6), [Value] int);
INSERT INTO yourtable ([ID], [Metric], [Value])
VALUES (1, 'Ht_cm', 190),
 (1, 'Wt_kg', 82),
 (1, 'Age_yr', 43),
 (2, 'Ht_cm', 170),
 (2, 'Wt_kg', 60),
 (2, 'Age_yr', 22),
 (3, 'Ht_cm', 205),
 (3, 'Wt_kg', 90),
 (3, 'Age_yr', 51);

If you are using an RDBMS that has a PIVOT function (SQL Server 2005+/Oracle 11g+) then you can query the data the following way:

select id, Ht_cm, Wt_kg, Age_yr
from
(
 select id, metric, value
 from yourtable
) src
pivot
(
 max(value)
 for metric in (Ht_cm, Wt_kg, Age_yr)
) piv;

See SQL Fiddle with Demo

If you do not have access to a PIVOT function, then you can use an aggregate function with a CASE statement to return the data:

select id,
 max(case when metric ='Ht_cm' then value else null end) Ht_cm,
 max(case when metric ='Wt_kg' then value else null end) Wt_kg,
 max(case when metric ='Age_yr' then value else null end) Age_yr
from yourtable
group by id

See SQL Fiddle with Demo

Both of these queries will return data in the result:

| ID | HT_CM | WT_KG | AGE_YR |
-------------------------------
| 1 | 190 | 82 | 43 |
| 2 | 170 | 60 | 22 |
| 3 | 205 | 90 | 51 |
answered Dec 24, 2012 at 22:24

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.