Is it possible that, after a certain pattern of random inserts and deletes, the leaf data nodes of a clustered index become fragmented?

I.e., that the physical order no longer reflects the logical order (by, say, an INT primary key) imposed by the clustered index? In that case, range queries would require random I/O even after finding the beginning of the interval.

Most university courses (e.g., CMU's Introduction to Database Systems by Andy Pavlo) say that data is physically ordered according to the key. While that is certainly a close approximation of reality, it looks unrealistic to me considering the cost of the frequent file defragmentation that would be required.

Paul White
asked Apr 12, 2023 at 19:32

2 Answers

YES

You can create clustered indexes on natural keys, or on keys that are not inserted in increasing order: for example, e-mail addresses or social security numbers. These make "good enough" natural keys in many cases. Or maybe it makes more sense to physically store the data in some order other than by an ID value. I would argue that such columns should still NOT be the clustered key, but the choice can make sense, and it is one way a clustered index can become fragmented.

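You get fragmentation there because a new row may arrive out of order: it belongs in the middle of the index, but if its target page is full, the page splits and the new page is written wherever space is free, typically at the end of the file.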
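To make the effect concrete, here is a minimal toy simulation in Python. It is not how any particular engine allocates pages; it assumes a hypothetical page capacity of 8 rows, a 50/50 split, and that every new page is allocated at the end of the file. It compares in-order inserts with the same keys inserted in random order, and reports the fraction of "next leaf" links that do not point to the physically following page.

```python
import bisect
import random

PAGE_CAPACITY = 8  # hypothetical number of rows per leaf page


def build_leaf_chain(keys, capacity=PAGE_CAPACITY):
    """Insert keys one by one; return physical page numbers in logical order.

    Toy model (an assumption, not a real engine): a full page splits in
    half and the new page always gets the next free page number.
    """
    physical = [0]               # physical page number of each leaf, logical order
    rows = [[]]                  # sorted keys stored on each leaf
    low_keys = [float("-inf")]   # smallest key each leaf is responsible for
    next_free = 1
    for key in keys:
        i = bisect.bisect_right(low_keys, key) - 1   # leaf covering this key
        bisect.insort(rows[i], key)
        if len(rows[i]) > capacity:                  # page overflow: split it
            mid = len(rows[i]) // 2
            upper = rows[i][mid:]
            rows[i] = rows[i][:mid]
            physical.insert(i + 1, next_free)
            rows.insert(i + 1, upper)
            low_keys.insert(i + 1, upper[0])
            next_free += 1
    return physical


def out_of_order_fraction(physical):
    """Fraction of logical neighbours that are not physically adjacent."""
    if len(physical) < 2:
        return 0.0
    jumps = sum(1 for a, b in zip(physical, physical[1:]) if b != a + 1)
    return jumps / (len(physical) - 1)


def demo(n=2000, seed=0):
    sequential = list(range(n))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)
    return (out_of_order_fraction(build_leaf_chain(sequential)),
            out_of_order_fraction(build_leaf_chain(shuffled)))


if __name__ == "__main__":
    seq, rnd = demo()
    print(f"sequential inserts: {seq:.2f} of leaf links out of physical order")
    print(f"random inserts:     {rnd:.2f} of leaf links out of physical order")
```

In this model the in-order load keeps every leaf physically next to its logical neighbour, while the random load leaves most links pointing somewhere else in the file, which is what turns a range scan into random I/O.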

You can ALSO get fragmentation from a more typical table design if you have deletes. Suppose you have a typical table with an auto-incrementing ID column, which is the best-practice choice most of the time, and you create the clustered index on that column. This is fine if all you do is insert: all new records go at the end of the tree, in ID order.

But let's say you delete records. Now you have fragmentation because the data pages are no longer full, or because pages that have been emptied leave gaps, so the remaining pages are no longer contiguous.
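A rough sketch of the delete case, again as a toy model: it assumes 100 rows per page (a made-up figure), a table loaded in ID order so every page starts full, and no page merging or space reclamation by the engine. After half the rows are deleted at random, it reports how many pages are still in use, how many would be needed if the survivors were packed tightly, and the average fill of the pages in use.

```python
import math
import random

ROWS_PER_PAGE = 100  # hypothetical rows per data page


def fill_after_deletes(n_rows=10_000, delete_fraction=0.5, seed=0):
    """Return (pages_in_use, pages_needed, avg_fill) after random deletes.

    Assumes rows were loaded in ID order into full pages and that the
    engine never merges or frees partially empty pages.
    """
    rng = random.Random(seed)
    deleted = set(rng.sample(range(n_rows), int(n_rows * delete_fraction)))
    n_pages = math.ceil(n_rows / ROWS_PER_PAGE)
    live = [0] * n_pages                      # surviving rows per page
    for row_id in range(n_rows):
        if row_id not in deleted:
            live[row_id // ROWS_PER_PAGE] += 1
    pages_in_use = sum(1 for count in live if count > 0)
    pages_needed = math.ceil(sum(live) / ROWS_PER_PAGE)
    avg_fill = sum(live) / (pages_in_use * ROWS_PER_PAGE) if pages_in_use else 0.0
    return pages_in_use, pages_needed, avg_fill


if __name__ == "__main__":
    in_use, needed, fill = fill_after_deletes()
    print(f"pages in use: {in_use}, pages needed if packed: {needed}, "
          f"average fill: {fill:.0%}")
```

The point of the sketch is only that a scan still touches every half-empty page; real engines may reclaim or merge pages to varying degrees.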

Additional note: you can also get fragmentation by updating a record. Suppose you have a variable-width column of up to 100 characters that was initially NULL. When you go back and update it from NULL to a value, the larger row may no longer fit on its page, which can push parts of that row onto another page and also causes fragmentation.
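The arithmetic behind the update case can be sketched as follows. All sizes are hypothetical (8000 usable bytes per page, 40-byte rows while the column is NULL, 100 bytes added by the update), and the model ignores per-row overhead and engine-specific overflow handling.

```python
import math

PAGE_BYTES = 8000        # hypothetical usable bytes per page
ROW_BYTES_BEFORE = 40    # hypothetical row size while the column is NULL
GROWTH_BYTES = 100       # bytes added when the column gets a value


def pages_after_update(page_bytes=PAGE_BYTES,
                       row_bytes_before=ROW_BYTES_BEFORE,
                       growth_bytes=GROWTH_BYTES):
    """Return (rows_on_full_page, rows_that_fit_after, pages_needed_after).

    Starts from one completely full page and grows every row on it.
    """
    rows_on_page = page_bytes // row_bytes_before
    rows_fit_after = page_bytes // (row_bytes_before + growth_bytes)
    pages_needed = math.ceil(rows_on_page / rows_fit_after)
    return rows_on_page, rows_fit_after, pages_needed


if __name__ == "__main__":
    before, after, pages = pages_after_update()
    print(f"{before} rows filled one page; after the update only {after} fit, "
          f"so the same rows need {pages} pages")
```

With these made-up numbers, the rows that used to share one page end up spread over several, and the extra pages are allocated wherever the engine finds room.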

answered Apr 12, 2023 at 19:42

Disk layout is dependent on the vendor.

Here is some specific information about the InnoDB Engine in MySQL/MariaDB. (Note: Other vendors do not necessarily follow the same design.)

  • There is always a Primary Key.
  • The PK is always clustered and always UNIQUE.
  • The data is ordered by the PK and is stored in B+Trees in 16KB blocks.
  • As Jonathan says, inserts, deletes, and even updates can mess with what is in a block.
  • If too much data is put in a block, it is split into 2 blocks.
  • Some attempt is made to combine two logically adjacent blocks when they become mostly empty.
  • I use an Auto_Increment ID only about 1 time in 3.
  • The optimal indexes for a many-to-many mapping table linking tables A and B are PRIMARY KEY(a_id, b_id), INDEX(b_id, a_id). Note that adding an ID to this table degrades performance.
  • A BTree does fragment, but it is not worth worrying about. It gravitates toward the average block being about 69% full.
  • InnoDB does not have a "Row number", unlike many other vendors.
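
The "about 69% full" figure matches the classical result for B-trees built by random insertions, where average utilization tends toward ln 2 ≈ 0.693. A small Python sketch can reproduce the ballpark. It is a toy model, not InnoDB: it assumes a hypothetical 32 rows per leaf block, inserts only, a 50/50 split on overflow, and no block merging.

```python
import bisect
import random

BLOCK_CAPACITY = 32  # hypothetical rows per leaf block


def average_leaf_fill(n_keys=20_000, capacity=BLOCK_CAPACITY, seed=0):
    """Insert n_keys distinct keys in random order; return average leaf fill."""
    keys = list(range(n_keys))
    random.Random(seed).shuffle(keys)
    rows = [[]]                  # sorted keys per leaf, leaves in key order
    low_keys = [float("-inf")]   # smallest key each leaf is responsible for
    for key in keys:
        i = bisect.bisect_right(low_keys, key) - 1   # leaf covering this key
        bisect.insort(rows[i], key)
        if len(rows[i]) > capacity:                  # overflow: split in half
            mid = len(rows[i]) // 2
            upper = rows[i][mid:]
            rows[i] = rows[i][:mid]
            rows.insert(i + 1, upper)
            low_keys.insert(i + 1, upper[0])
    return n_keys / (len(rows) * capacity)


if __name__ == "__main__":
    print(f"average leaf fill after random inserts: {average_leaf_fill():.1%}")
```

Only the leaf level is modeled, since that is where the rows live; the exact value depends on the block capacity and on the mix of inserts and deletes.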
answered Apr 13, 2023 at 3:30
