Does clustered index fragmentation happen in relational DBMS?

Question 1

Is it possible that, after a certain pattern of random inserts and deletes, the leaf data pages of a clustered index become fragmented?

That is, can the physical order stop reflecting the logical order (say, by an INT primary key) imposed by the clustered index? If so, range queries would require random I/O even after finding the beginning of the interval.

Most university courses (e.g. CMU's Introduction to Database Systems by Andy Pavlo) say that data is physically ordered according to the key. While that is probably a good approximation of reality, it looks unrealistic to me given the cost of the frequent file defragmentation it would require.

Answer 1

YES

You can create clustered indexes on natural keys, or on other keys that are not inserted in increasing order. Examples include an e-mail address or a social security number. These make "good enough" natural keys in lots of cases. Or maybe it makes more sense to have the data physically stored in an order other than by an ID value. I would argue that such columns should still NOT be the clustered key, but the choice can make sense, and it is one way that a clustered index could become fragmented.

You get fragmentation there because a new insert may arrive out of order: its key belongs in the middle of the tree, but if the target page is full, the page splits and the new page is written wherever there is free space (often at the end of the file), not next to its logical neighbors.
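To make the out-of-order case concrete, here is a minimal sketch of a toy clustered index. It is not how any particular engine works: `PAGE_CAPACITY`, `insert`, `build`, and `out_of_order_fraction` are hypothetical names, and the model simply assumes a split always allocates the new page at the end of the file.

```python
import bisect
import random

PAGE_CAPACITY = 4  # hypothetical rows per page, tiny so splits happen quickly


def insert(pages, key):
    """Insert a key into a toy clustered index.

    `pages` is kept in logical (key) order. Each page records the physical
    `slot` it was allocated at and its sorted `keys`.
    """
    if not pages:
        pages.append({"slot": 0, "keys": [key]})
        return
    # Find the page whose key range should hold the new key.
    firsts = [p["keys"][0] for p in pages]
    i = max(bisect.bisect_right(firsts, key) - 1, 0)
    page = pages[i]
    bisect.insort(page["keys"], key)
    if len(page["keys"]) > PAGE_CAPACITY:
        # Page split: the upper half moves to a new page that is allocated
        # at the end of the file (assumption of this toy model).
        mid = len(page["keys"]) // 2
        new_page = {"slot": len(pages), "keys": page["keys"][mid:]}
        page["keys"] = page["keys"][:mid]
        pages.insert(i + 1, new_page)


def build(keys):
    pages = []
    for k in keys:
        insert(pages, k)
    return pages


def out_of_order_fraction(pages):
    """Fraction of logically adjacent pages that are not physically adjacent."""
    if len(pages) < 2:
        return 0.0
    jumps = sum(1 for a, b in zip(pages, pages[1:]) if b["slot"] != a["slot"] + 1)
    return jumps / (len(pages) - 1)


if __name__ == "__main__":
    ascending = list(range(500))
    shuffled = ascending[:]
    random.Random(1).shuffle(shuffled)
    print("ascending inserts:", out_of_order_fraction(build(ascending)))
    print("random inserts:   ", out_of_order_fraction(build(shuffled)))
```

In this model, ascending keys keep the logical and physical page order identical, while random keys leave most neighboring pages physically far apart, which is the situation where a range scan turns into random I/O.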

You can ALSO get fragmentation from a more typical table design if you have deletes. The scenario: you have a typical table with an auto-incrementing ID field. This is the best-practice choice most of the time. You create the clustered index on that ID field. This is fine if all you do is insert. All new records go to the end of the tree and are in order by the ID field.

But let's say you delete records. Now you have fragmentation because the data pages are no longer full. Or pages that used to hold data are now empty, so the pages that remain are no longer contiguous.

Additional note: you can also get fragmentation by updating a record. Say you have a variable-width column of up to 100 characters that was empty when the row was created. When you go back and update it from NULL to a value, the row grows, and that could push parts of that row onto another page, which also causes fragmentation.
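A small sketch of the delete case, under the assumption that the engine neither merges nor reclaims pages after a delete (real engines may do some of that). `PAGE_CAPACITY` and the function names are hypothetical.

```python
import random

PAGE_CAPACITY = 100  # hypothetical rows per page


def load_sequential(n_rows):
    """Pack ids 1..n_rows into full pages, as append-only inserts would."""
    ids = list(range(1, n_rows + 1))
    return [ids[i:i + PAGE_CAPACITY] for i in range(0, n_rows, PAGE_CAPACITY)]


def delete_rows(pages, doomed):
    """Return the pages with the doomed ids removed; no page is merged or freed."""
    doomed = set(doomed)
    return [[k for k in page if k not in doomed] for page in pages]


def average_fill(pages):
    """Rows stored divided by the rows the allocated pages could hold."""
    return sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)


def pages_if_repacked(pages):
    """Pages needed if the surviving rows were rewritten into full pages."""
    rows = sum(len(p) for p in pages)
    return -(-rows // PAGE_CAPACITY)  # ceiling division


if __name__ == "__main__":
    pages = load_sequential(10_000)
    doomed = random.Random(0).sample(range(1, 10_001), 5_000)
    after = delete_rows(pages, doomed)
    print("pages allocated:", len(after))
    print("average fill:", average_fill(after))
    print("pages if repacked:", pages_if_repacked(after))
```

A scan over the surviving rows still has to read every allocated page, even though half as many pages would hold the same data after a rebuild.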
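The update case can be sketched with simple arithmetic. The page size, the row sizes, and the helper names below are all assumptions chosen for illustration, not figures from any specific engine.

```python
PAGE_BYTES = 8192  # hypothetical usable bytes per page


def pack(row_sizes):
    """Greedily pack rows (in key order) into pages by byte size."""
    pages, current, used = [], [], 0
    for size in row_sizes:
        if current and used + size > PAGE_BYTES:
            pages.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        pages.append(current)
    return pages


def overflowing_pages(pages, grow_by):
    """Count pages that no longer fit once every row grows by `grow_by` bytes."""
    return sum(1 for page in pages if sum(page) + grow_by * len(page) > PAGE_BYTES)


if __name__ == "__main__":
    before = pack([40] * 1000)   # rows inserted with the wide column NULL
    after = pack([140] * 1000)   # same rows once the column holds 100 bytes
    print("pages before update:", len(before))
    print("pages that overflow after update:", overflowing_pages(before, 100))
    print("pages needed after update:", len(after))
```

Every page that overflows has to shed rows to some other page, and those extra pages are allocated long after the original ones, so they are unlikely to sit next to them on disk.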

Answer 2

Disk layout is dependent on the vendor.

Here is some specific information about the InnoDB Engine in MySQL/MariaDB. (Note: Other vendors do not necessarily follow the same design.)

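There is always a Primary Key. (If you do not declare one, InnoDB uses the first UNIQUE index whose columns are all NOT NULL, or else generates a hidden one.)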
The PK is always clustered and always UNIQUE.
The data is ordered by the PK and is stored in B+Trees in 16KB blocks.
As Jonathan says, inserts, deletes, and even updates can mess with what is in a block.
If too much data is put in a block, it is split into 2 blocks.
Some attempt is made to combine two logically adjacent blocks when they become mostly empty.
I use an Auto_Increment ID only about 1 time in 3.
The optimal indexes for a many-to-many mapping table linking tables A and B are PRIMARY KEY(a_id, b_id), INDEX(b_id, a_id). Note that adding an ID to this table degrades performance.
A BTree does fragment, but it is not worth worrying about. It gravitates toward the average block being about 69% full.
InnoDB does not have a "Row number", unlike many other vendors.
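The 69% figure is close to ln 2 ≈ 0.693, the classical utilization result for B-trees under random inserts. Below is a rough sketch that models only the leaf level and splits a full page in half; it is not InnoDB's actual split logic, and `average_fill_after_random_inserts` and its parameters are hypothetical.

```python
import bisect
import random


def average_fill_after_random_inserts(n_keys, capacity, seed=0):
    """Insert distinct keys in random order and return the average leaf fill."""
    rng = random.Random(seed)
    keys = rng.sample(range(10 * n_keys), n_keys)  # distinct keys, random order
    pages = [[keys[0]]]  # leaf pages in key order, each a sorted list
    lows = [keys[0]]     # lows[i] is the smallest key on pages[i]
    for key in keys[1:]:
        # Locate the page whose key range should hold the new key.
        i = max(bisect.bisect_right(lows, key) - 1, 0)
        page = pages[i]
        bisect.insort(page, key)
        lows[i] = page[0]
        if len(page) > capacity:
            # Overflow: split the page in half (assumption of this model).
            mid = len(page) // 2
            upper = page[mid:]
            del page[mid:]
            pages.insert(i + 1, upper)
            lows.insert(i + 1, upper[0])
    return n_keys / (len(pages) * capacity)


if __name__ == "__main__":
    fill = average_fill_after_random_inserts(n_keys=20_000, capacity=32)
    print(f"average leaf fill: {fill:.1%}")
```

With these settings the fill should settle somewhere around two thirds, which is why a freshly rebuilt index tends to drift back toward that level under a random insert workload.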

Accepted answer (the "YES" answer above): Jonathan Fite, 2023-04-12 19:42:54Z
