3

I created the following table and inserted some values into it as follows:

CREATE TABLE query_all_lexeme (
 payload text,
 normalized tsvector GENERATED ALWAYS AS (to_tsvector('english', payload)) STORED
 );
 
INSERT INTO query_all_lexeme (payload)
 VALUES ('fat cats ate rats');
INSERT INTO query_all_lexeme (payload)
 VALUES ('summarize the functions and operators that are provided for full text searching');
INSERT INTO query_all_lexeme (payload)
 VALUES ('Constructs a phrase query');
INSERT INTO query_all_lexeme (payload)
SELECT
 'Constructs a phrase query this is a test'
FROM
 generate_series(1, 10000);
INSERT INTO query_all_lexeme (payload)
SELECT
 'Constructs a phrase query this is a test'
FROM
 generate_series(1, 100);

Then I created a gin index:

CREATE INDEX query_all_lexeme_vector ON query_all_lexeme USING gin (normalized);

And then I run the query below to get gin index info:

SELECT * FROM
 gin_metapage_info (get_raw_page ('query_all_lexeme_vector', 0)) \gx

Result:

+-[ RECORD 1 ]-----+------------+
| pending_head | 4294967295 |
| pending_tail | 4294967295 |
| tail_free_size | 0 |
| n_pending_pages | 0 |
| n_pending_tuples | 0 |
| n_total_pages | 14 |
| n_entry_pages | 1 |
| n_data_pages | 12 |
| n_entries | 15 |
| version | 2 |
+------------------+------------+
WITH cte AS (
 SELECT
 flags,
 p
 FROM
 generate_series(1, 13) AS p,
 gin_page_opaque_info (get_raw_page ('query_all_lexeme_vector', p)))
 SELECT
 array_agg(p)
 FROM
 cte
 WHERE
 flags::text = '{data,leaf,compressed}';

return

+----------------------+
| array_agg |
+----------------------+
| {3,4,6,7,9,10,12,13} |
+----------------------+

In the following query, I should expect at least one row for the column gin_tid_vs_table_tid value that is true. However, the column values are all false.

WITH cte AS (
 SELECT
 (unnest(normalized)).lexeme AS elements
 , array_agg(ctid) AS ctids
 FROM
 query_all_lexeme
 GROUP BY
 1
)
SELECT
 elements
 , pg_typeof(ctids)
 , ctids = (
 SELECT
 tids
 FROM
 gin_leafpage_items (get_raw_page ('query_all_lexeme_vector' , 3))
 ORDER BY
 1
 LIMIT 1) AS gin_tid_vs_table_tid
FROM
 cte;

I have already run vacuum analyze. Now the data is very stable (only SELECT), and the tid value is stabilized. So why the last query's column gin_tid_vs_table_tid has false values? My logic is like gin index stored lexemes and lexemes' corresponding tid (physical tuple location). So the gin index tid should be equal as array_agg(ctid) with the same lexeme.

asked Jun 25, 2022 at 10:21

1 Answer 1

2

Your logic would only apply if one index entry were guaranteed to contain all the ctids for a given lexeme. There is no such guarantee, and can't be because index tuples have a strictly bounded size which is far less than enough to contain all possible ctids.

Maybe you could switch from array equality to overlaps or contains (&&, @>).

answered Jun 25, 2022 at 14:08
1
  • but the flags is compressed? also in the menu: A GIN index stores a set of (key, posting list) pairs, where a posting list is a set of row IDs in which the key occurs. The same row ID can appear in multiple posting lists, since an item can contain more than one key. Each key value is stored only once, so a GIN index is very compact for cases where the same key appears many times. Commented Jun 27, 2022 at 4:36

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.