I need to find out whether any values have been written to a column, at minimum cost.
There is a table with ~1B rows and about 100 columns of data. I need to find the columns that contain no data at all (all NULLs) and drop them. A query like "SELECT * FROM my_table WHERE my_column IS NOT NULL LIMIT 1" takes 1-2 minutes per column, so I'm looking for a faster solution.
As far as I know (but not sure), if there are no data at all for the column, there is some property in the database saying that the chunk or the table has no data for this column, at least this happens when adding a new column to the existing table, so the database doesn't update all existing rows if the default value of a new column is NULL. I wonder if there is a fast way to get a result based on this info.
P.S. I use the TimescaleDB extension and the PK is a timestamp, if that changes anything.
1 Answer
"As far as I know (but not sure), if there are no data at all for the column, there is some property in the database saying that the chunk or the table has no data for this column, at least this happens when adding a new column to the existing table, so the database doesn't update all existing rows if the default value of a new column is NULL"
If a row's data ends early, PostgreSQL assumes that all columns after that point are NULL for that particular row or, in newer versions, hold whatever DEFAULT was in effect when the column was added. But this says nothing at all about what might be in other rows.
You can't avoid scanning the table unless you had a very strange constraint that forces all the values to be NULL. But you should be able to scan the table just once, not once for every column.
select count(col1), count(col2), count(col3), ..., count(col100) from the_table;
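Writing out a hundred count() calls by hand is tedious, so the query can be generated instead. Below is a minimal sketch in Python. It uses an in-memory SQLite database purely as a stand-in for PostgreSQL (count(col) ignores NULLs in both), and the table name, column names, and helper functions are made up for illustration.

```python
import sqlite3


def build_count_query(table, columns):
    """Build one SELECT that counts the non-NULL values of every column."""
    # Double-quote identifiers so unusual column names don't break the query.
    def q(name):
        return '"' + name.replace('"', '""') + '"'

    counts = ", ".join(f"count({q(c)}) AS {q(c)}" for c in columns)
    return f"SELECT {counts} FROM {q(table)}"


def empty_columns(conn, table, columns):
    """Return the columns whose non-NULL count is zero, using a single scan."""
    row = conn.execute(build_count_query(table, columns)).fetchone()
    return [c for c, n in zip(columns, row) if n == 0]


# Hypothetical demo data: col2 is NULL in every row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (col1 INTEGER, col2 TEXT, col3 REAL)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [(1, None, 0.5), (2, None, None), (None, None, 1.5)],
)
result = empty_columns(conn, "the_table", ["col1", "col2", "col3"])
print(result)  # ['col2']
```

The string produced by build_count_query should also be valid PostgreSQL syntax, but the run time on a 1B-row table is still that of a full scan.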
Thanks for the tip with "select count". I tested it. Unfortunately, it takes even longer than the query in my post: the "select count..." query took 9 min 40 sec. Of course, that depends on the number of columns being tested, but I don't want to accept a query that takes minutes to run. I'll try to build something in the application layer (introduce a new field as a flag and update it when data is added to the main table). It's sad that Postgres doesn't provide a general flag saying that a column is not empty. – Ostap, Apr 9, 2021 at 9:45
There is also pg_stats.null_frac, which might be helpful in finding which columns you need to perform the full scan for. But you still need to do the full scan as in jjanes' answer: these statistics are estimates that ANALYZE computes from a sample of rows, and they can be out of date, so they can't be trusted to actually hold for every row until you perform the regular scan.
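The shortlisting step could look roughly like the Python sketch below. The pg_stats view and its schemaname, tablename, attname and null_frac columns are part of PostgreSQL; the %s placeholders assume a psycopg-style driver, and the shortlist helper and the sample rows are hypothetical.

```python
# Query to run against PostgreSQL; pg_stats is filled in by ANALYZE.
# The %s placeholders assume a psycopg-style driver (an assumption).
PG_STATS_QUERY = """
SELECT attname, null_frac
FROM pg_stats
WHERE schemaname = %s AND tablename = %s
"""


def shortlist(stats_rows):
    """Keep the columns whose sampled values were all NULL (null_frac = 1)."""
    return [name for name, null_frac in stats_rows if null_frac >= 1.0]


# Hypothetical rows, shaped like the result of PG_STATS_QUERY.
sample = [("col1", 0.0), ("col2", 1.0), ("col3", 0.98)]
candidates = shortlist(sample)
print(candidates)  # ['col2']
```

Only the shortlisted columns then need the count() check, which keeps the full scan to a single pass over a much shorter column list.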