Find whether a column of a PostgreSQL table is empty at minimum cost

Question 1

I need to find out whether any values have been written to a column, at minimum cost.

There is a table with ~1B rows and about 100 columns of data. I need to find the columns that have no data at all (all NULLs) and delete them. A query like "SELECT * FROM my_table WHERE my_column IS NOT NULL LIMIT 1" takes 1-2 minutes, and I'm looking for a faster solution.

As far as I know (but not sure), if there are no data at all for the column, there is some property in the database saying that the chunk or the table has no data for this column, at least this happens when adding a new column to the existing table, so the database doesn't update all existing rows if the default value of a new column is NULL. I wonder if there is a fast way to get a result based on this info.

P.S. I use the TimescaleDB extension and the PK is a timestamp, if that changes anything.

Question 2

Can you share a sample design and data? I believe these kinds of columns could benefit from a unique index.

Question 3

Hi, and welcome to the forum! You could use a constraint CHECK... (col1 IS NOT NULL AND col2 IS NOT NULL AND col3....) - either that or a trigger - having every field as NULL doesn't make much sense...

Question 4

It's probably worth mentioning pg_stats.null_frac, which can help identify the columns that need the full scan. You still need the full scan from jjanes' answer, though: pg_stats is filled by ANALYZE from a random sample of rows and may be out of date, so it can't be trusted to hold for every row until you perform the regular scan.
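A minimal sketch of how null_frac could narrow the search: read the statistics once, then run the exact check only for the columns reported as entirely NULL. The query reads the real pg_stats view; the helper name null_candidates, the 'public' schema and the row format are assumptions made for illustration.

```python
# Statistics for every analyzed column of the table. null_frac is an
# estimate, so each candidate still needs an exact check before it is
# dropped.
CANDIDATES_SQL = """
SELECT attname, null_frac
FROM pg_stats
WHERE schemaname = 'public'      -- assumed schema
  AND tablename = 'my_table'
"""


def null_candidates(stats_rows):
    """Return column names whose estimated null fraction is 1.0.

    stats_rows: iterable of (attname, null_frac) tuples, as a database
    driver would return them for CANDIDATES_SQL.
    """
    return [name for name, null_frac in stats_rows if null_frac >= 1.0]
```

Columns with null_frac below 1 had at least one non-NULL value in the sample when the statistics were gathered, so they can be skipped as long as rows have not been deleted or set to NULL since then. Columns missing from pg_stats have not been analyzed and would need the exact check as well.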

Question 5

Thanks to all contributors! It looks like a full table scan is unavoidable. pg_stats is not a guaranteed method, so in the end it may still lead to a full scan. Regarding constraints, I can't introduce them: rows are allowed to have NULL values. About the design: the table holds measured real-time signals. Columns can be added dynamically by the application. Columns with data must not be deleted, but totally unused ones should be deleted on user request. It looks like I have two options: accept a full scan, or implement some flag fields and manage them at the application level.

Question 6

"As far as I know (but not sure), if there are no data at all for the column, there is some property in the database saying that the chunk or the table has no data for this column, at least this happens when adding a new column to the existing table, so the database doesn't update all existing rows if the default value of a new column is NULL"

If the row data ends early, PostgreSQL assumes that all columns after that point are NULL for that particular row or, in newer versions (PostgreSQL 11 and later), hold whatever DEFAULT was in effect when the column was added. But this says nothing at all about what might be in other rows.

You can't avoid scanning the table unless you had a very strange constraint that forces all the values to be NULL. But you should be able to scan the table just once, not once for every column.

select count(col1), count(col2), count(col3)...count(col100) from the_table;
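With around 100 columns, the single-pass query is easier to generate than to type. A minimal sketch, assuming the column names are already known (for example, read from information_schema.columns); build_count_query and empty_columns are hypothetical helper names, and executing the SQL is left to whatever driver the application uses.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_count_query(table: str, columns: List[str]) -> str:
    """Build one SELECT that counts non-NULL values for every column.

    count(col) ignores NULLs, so a result of 0 means the column is empty.
    `table` is assumed to be an unqualified table name.
    """
    if not columns:
        raise ValueError("at least one column is required")
    counts = ", ".join("count({})".format(quote_ident(c)) for c in columns)
    return "SELECT {} FROM {};".format(counts, quote_ident(table))


def empty_columns(columns: List[str], counts: List[int]) -> List[str]:
    """Pair each column with its count and keep those with no values."""
    return [c for c, n in zip(columns, counts) if n == 0]
```

The single result row comes back in the same order as the column list, so empty_columns(columns, row) gives the names that are safe to consider for deletion.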

Question 7

Thanks for the tip with "select count". I tested it. Unfortunately, it takes even longer than the query in my post: the "select count...." query took 9 min 40 sec. Of course, that depends on the number of columns tested, but I don't want to accept minutes to execute a query. I'll try to build something in the application layer (introduce a new field as a flag and update it when data is added to the main table). It's sad that Postgres doesn't provide a general flag saying that a column is not empty.
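The application-level flag idea could look roughly like this. It is only a sketch under stated assumptions: ColumnUsageTracker is a hypothetical class, rows are assumed to pass through the application as dicts before being inserted, and in practice the flags would have to be persisted (for instance in a small side table) rather than kept in memory.

```python
class ColumnUsageTracker:
    """Remember which columns have ever received a non-NULL value."""

    def __init__(self, columns):
        # Every known column starts out as unused.
        self._used = {name: False for name in columns}

    def add_column(self, name):
        # A dynamically added column starts unused; keep an existing flag.
        self._used.setdefault(name, False)

    def record(self, row):
        """Call for every row written; row maps column name to value."""
        for name, value in row.items():
            if value is not None:
                self._used[name] = True

    def unused_columns(self):
        """Columns that have never been given a non-NULL value."""
        return sorted(name for name, used in self._used.items() if not used)
```

Checking whether a column may be deleted then becomes a lookup instead of a table scan, at the cost of keeping the flags in sync with every write path.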

Answer 1 · jjanes · 2021-04-08 16:48:51Z

