# Auto-Increment Column

In Doris, an auto-increment column automatically generates a unique numeric value for each row of data, and is commonly used to create row identifiers such as primary keys. Each time a new record is inserted, the column is assigned the next value automatically, so there is no need to specify it manually. Auto-increment columns help ensure data uniqueness and consistency, simplify the insertion process, reduce manual error, and improve data management efficiency. This makes them a natural choice for scenarios that require unique identifiers, such as user IDs.
## Functionality
For tables with an auto-increment column, Doris processes data writes as follows:

- **Auto-Population (Column Excluded):** If the written data does not include the auto-increment column, Doris generates and populates unique values for this column.
- **Partial Specification (Column Included):**
  - **Null Values:** Doris replaces null values in the written data with system-generated unique values.
  - **Non-Null Values:** User-provided values remain unchanged.

> **Attention:** User-provided non-null values can disrupt the uniqueness of the auto-increment column.
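These cases can be condensed into one sketch (assuming a table `t` whose `id` column is declared `BIGINT NOT NULL AUTO_INCREMENT`, plus an ordinary `value` column; the Loading section below shows full, runnable examples):

```sql
-- Case 1: auto-increment column omitted; Doris generates and fills id
insert into t(value) values (10);

-- Case 2: null supplied for the auto-increment column; Doris replaces it
insert into t(id, value) values (null, 20);

-- Case 3: non-null value supplied; kept unchanged (uniqueness is no longer
-- guaranteed once user-provided values enter the column)
insert into t(id, value) values (42, 30);
```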
### Uniqueness
Doris guarantees table-wide uniqueness for the values it generates in the auto-increment column. However:

- **Guaranteed Uniqueness:** This applies only to system-generated values.
- **User-Provided Values:** Doris does not validate or enforce uniqueness for values specified by users in the auto-increment column. This may result in duplicate entries.
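For example (a sketch assuming a Duplicate-model table `t` with an auto-increment `id` column and a `value` column), a user-supplied value can collide with a system-generated one:

```sql
-- The user supplies id = 5; Doris stores it without any uniqueness check
insert into t(id, value) values (5, 1);

-- If the system-generated sequence later reaches 5, this row may also be
-- assigned id = 5, leaving two rows with the same auto-increment value
insert into t(value) values (2);
```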
### Density
Auto-increment values generated by Doris are generally dense, with some caveats:

- **Potential Gaps:** Gaps may appear due to performance optimizations. Each backend node (BE) pre-allocates a block of unique values for efficiency, and these blocks do not overlap between nodes.
- **Non-Chronological Values:** Doris does not guarantee that values generated by later writes are larger than those from earlier writes.

> **Note:** Auto-increment values cannot be used to infer the chronological order of writes.
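As an illustration of how gaps arise (the values below are invented; actual block sizes are an internal implementation detail), rows written through different BEs draw from disjoint pre-allocated blocks, so the column stays dense within a block but can jump between blocks:

```sql
-- Suppose one BE holds a pre-allocated block starting at 1 and another BE
-- holds a disjoint block starting much higher; after two loads:
select id from t order by id;
-- A possible result: 1, 2, 3, 100001, 100002
```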
## Syntax
To use an auto-increment column, add the `AUTO_INCREMENT` attribute to the corresponding column in the `CREATE TABLE` statement. To manually specify a starting value, use `AUTO_INCREMENT(start_value)` when creating the table; if not specified, the default starting value is 1.
### Examples
1. Creating a Duplicate table with an auto-increment column as the key column:

```sql
CREATE TABLE `demo`.`tbl` (
    `id` BIGINT NOT NULL AUTO_INCREMENT,
    `value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
```
2. Creating a Duplicate table with an auto-increment column as the key column, and setting the starting value to 100:
```sql
CREATE TABLE `demo`.`tbl` (
    `id` BIGINT NOT NULL AUTO_INCREMENT(100),
    `value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
```
3. Creating a Duplicate table with an auto-increment column as one of the value columns:

```sql
CREATE TABLE `demo`.`tbl` (
    `uid` BIGINT NOT NULL,
    `name` BIGINT NOT NULL,
    `id` BIGINT NOT NULL AUTO_INCREMENT,
    `value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`uid`, `name`)
DISTRIBUTED BY HASH(`uid`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
```
4. Creating a Unique table with an auto-increment column as the key column:

```sql
CREATE TABLE `demo`.`tbl` (
    `id` BIGINT NOT NULL AUTO_INCREMENT,
    `name` varchar(65533) NOT NULL,
    `value` int(11) NOT NULL
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "enable_unique_key_merge_on_write" = "true"
);
```
5. Creating a Unique table with an auto-increment column as one of the value columns:

```sql
CREATE TABLE `demo`.`tbl` (
    `text` varchar(65533) NOT NULL,
    `id` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`text`)
DISTRIBUTED BY HASH(`text`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "enable_unique_key_merge_on_write" = "true"
);
```
## Constraints and Limitations
- Auto-increment columns can only be used in Duplicate or Unique model tables.
- A table can have only one auto-increment column.
- The auto-increment column must be of type `BIGINT` and cannot be `NULL`.
- The manually specified starting value for an auto-increment column must be 0 or greater.
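As an example of the last constraint, a starting value of 0 is the smallest that can be specified (a sketch; the table name is illustrative):

```sql
CREATE TABLE `demo`.`tbl_start_zero` (
    `id` BIGINT NOT NULL AUTO_INCREMENT(0),  -- 0 is the minimum allowed start
    `value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
```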
## Usage

### Loading

Consider the table below:
```sql
CREATE TABLE `demo`.`tbl` (
    `id` BIGINT NOT NULL AUTO_INCREMENT,
    `name` varchar(65533) NOT NULL,
    `value` int(11) NOT NULL
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "enable_unique_key_merge_on_write" = "true"
);
```
When using the `INSERT INTO` statement to write data without including the auto-increment column `id`, Doris automatically generates and fills in unique values for that column:
```sql
insert into tbl(name, value) values ("Bob", 10), ("Alice", 20), ("Jack", 30);
select * from tbl order by id;
```

```
+------+-------+-------+
| id   | name  | value |
+------+-------+-------+
|    1 | Bob   |    10 |
|    2 | Alice |    20 |
|    3 | Jack  |    30 |
+------+-------+-------+
```
Similarly, when using Stream Load to load the file `test.csv` without specifying the auto-increment column `id`, Doris automatically populates the `id` column with generated values.
test.csv:

```
Tom, 40
John, 50
```

```shell
curl --location-trusted -u user:passwd -H "columns:name,value" -H "column_separator:," -T ./test.csv http://{host}:{port}/api/{db}/tbl/_stream_load
```
```sql
select * from tbl order by id;
```

```
+------+-------+-------+
| id   | name  | value |
+------+-------+-------+
|    1 | Bob   |    10 |
|    2 | Alice |    20 |
|    3 | Jack  |    30 |
|    4 | Tom   |    40 |
|    5 | John  |    50 |
+------+-------+-------+
```
When writing data using the `INSERT INTO` statement and specifying the auto-increment column `id`, any null values written to that column are replaced with generated values:
```sql
insert into tbl(id, name, value) values (null, "Doris", 60), (null, "Nereids", 70);
select * from tbl order by id;
```

```
+------+---------+-------+
| id   | name    | value |
+------+---------+-------+
|    1 | Bob     |    10 |
|    2 | Alice   |    20 |
|    3 | Jack    |    30 |
|    4 | Tom     |    40 |
|    5 | John    |    50 |
|    6 | Doris   |    60 |
|    7 | Nereids |    70 |
+------+---------+-------+
```
### Partial Update
When performing a partial update on a merge-on-write Unique table that has an auto-increment column:

If the auto-increment column is a key column, users must explicitly specify it during partial updates; the target columns of a partial update must therefore include the auto-increment column. In this case, the behavior matches that of standard partial updates.
```sql
CREATE TABLE `demo`.`tbl2` (
    `id` BIGINT NOT NULL AUTO_INCREMENT,
    `name` varchar(65533) NOT NULL,
    `value` int(11) NOT NULL DEFAULT "0"
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "enable_unique_key_merge_on_write" = "true"
);
```
```sql
insert into tbl2(id, name, value) values (1, "Bob", 10), (2, "Alice", 20), (3, "Jack", 30);
select * from tbl2 order by id;
```

```
+------+-------+-------+
| id   | name  | value |
+------+-------+-------+
|    1 | Bob   |    10 |
|    2 | Alice |    20 |
|    3 | Jack  |    30 |
+------+-------+-------+
```
```sql
set enable_unique_key_partial_update=true;
set enable_insert_strict=false;
insert into tbl2(id, name) values (1, "modified"), (4, "added");
select * from tbl2 order by id;
```

```
+------+----------+-------+
| id   | name     | value |
+------+----------+-------+
|    1 | modified |    10 |
|    2 | Alice    |    20 |
|    3 | Jack     |    30 |
|    4 | added    |     0 |
+------+----------+-------+
```
If the auto-increment column is a non-key column and no value is provided, its value is taken from the existing rows in the table. If a value is specified for the auto-increment column, null values in the written data are replaced with generated values, while non-null values remain unchanged. These records are then processed according to the usual partial-update semantics.
```sql
CREATE TABLE `demo`.`tbl3` (
    `id` BIGINT NOT NULL,
    `name` varchar(100) NOT NULL,
    `score` BIGINT NOT NULL,
    `aid` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "enable_unique_key_merge_on_write" = "true"
);
```
```sql
insert into tbl3(id, name, score) values (1, "Doris", 100), (2, "Nereids", 200), (3, "Bob", 300);
select * from tbl3 order by id;
```

```
+------+---------+-------+------+
| id   | name    | score | aid  |
+------+---------+-------+------+
|    1 | Doris   |   100 |    0 |
|    2 | Nereids |   200 |    1 |
|    3 | Bob     |   300 |    2 |
+------+---------+-------+------+
```
```sql
set enable_unique_key_partial_update=true;
set enable_insert_strict=false;
insert into tbl3(id, score) values (1, 999), (2, 888);
select * from tbl3 order by id;
```

```
+------+---------+-------+------+
| id   | name    | score | aid  |
+------+---------+-------+------+
|    1 | Doris   |   999 |    0 |
|    2 | Nereids |   888 |    1 |
|    3 | Bob     |   300 |    2 |
+------+---------+-------+------+
```
```sql
insert into tbl3(id, aid) values (1, 1000), (3, 500);
select * from tbl3 order by id;
```

```
+------+---------+-------+------+
| id   | name    | score | aid  |
+------+---------+-------+------+
|    1 | Doris   |   999 | 1000 |
|    2 | Nereids |   888 |    1 |
|    3 | Bob     |   300 |  500 |
+------+---------+-------+------+
```
## Usage Scenarios

### Dictionary Encoding
Using bitmaps for audience analysis in user profiling requires building a user dictionary, in which each user is assigned a unique integer as their dictionary value. Densely distributed dictionary values improve the performance of bitmap operations.

For example, in an offline UV (Unique Visitors) and PV (Page Views) analysis scenario, consider the following detailed user behavior table:
```sql
CREATE TABLE `demo`.`dwd_dup_tbl` (
    `user_id` varchar(50) NOT NULL,
    `dim1` varchar(50) NOT NULL,
    `dim2` varchar(50) NOT NULL,
    `dim3` varchar(50) NOT NULL,
    `dim4` varchar(50) NOT NULL,
    `dim5` varchar(50) NOT NULL,
    `visit_time` DATE NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
```
Use an auto-increment column to create the following dictionary table:
```sql
CREATE TABLE `demo`.`dictionary_tbl` (
    `user_id` varchar(50) NOT NULL,
    `aid` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "enable_unique_key_merge_on_write" = "true"
);
```
Write the user_id values from existing data into the dictionary table to map user_id to corresponding integer values:
```sql
insert into dictionary_tbl(user_id)
select user_id from dwd_dup_tbl group by user_id;
```
Alternatively, write only the user_id values from incremental data into the dictionary table:

```sql
insert into dictionary_tbl(user_id)
select dwd_dup_tbl.user_id from dwd_dup_tbl left join dictionary_tbl
on dwd_dup_tbl.user_id = dictionary_tbl.user_id
where dwd_dup_tbl.visit_time > '2023-12-10' and dictionary_tbl.user_id is NULL;
```
In practical applications, Flink connectors can be used to write data into Doris.
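As a minimal sketch of that path (the FE address and credentials are placeholders; verify option names against your flink-doris-connector version):

```sql
-- Flink SQL: a hypothetical sink table streaming user_id values into
-- demo.dictionary_tbl through the Doris Flink connector
CREATE TABLE doris_dictionary_sink (
    user_id STRING
) WITH (
    'connector' = 'doris',
    'fenodes' = 'fe_host:8030',                 -- Doris FE HTTP address (placeholder)
    'table.identifier' = 'demo.dictionary_tbl',
    'username' = 'user',
    'password' = 'passwd'
);
```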
To store aggregated results for the statistical dimensions dim1, dim3, and dim5, create the following table:
```sql
CREATE TABLE `demo`.`dws_agg_tbl` (
    `dim1` varchar(50) NOT NULL,
    `dim3` varchar(50) NOT NULL,
    `dim5` varchar(50) NOT NULL,
    `user_id_bitmap` BITMAP BITMAP_UNION NOT NULL,
    `pv` BIGINT SUM NOT NULL
) ENGINE=OLAP
AGGREGATE KEY(`dim1`, `dim3`, `dim5`)
DISTRIBUTED BY HASH(`dim1`) BUCKETS 32
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
```
Save the aggregated data into the results table.
```sql
insert into dws_agg_tbl
select dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5, BITMAP_UNION(TO_BITMAP(dictionary_tbl.aid)), COUNT(1)
from dwd_dup_tbl INNER JOIN dictionary_tbl on dwd_dup_tbl.user_id = dictionary_tbl.user_id
group by dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5;
```
Execute UV and PV queries with the following statement:

```sql
select dim1, dim3, dim5, bitmap_count(user_id_bitmap) as uv, pv from dws_agg_tbl;
```
### Efficient Pagination
Pagination is often required when displaying data on a page. Traditional pagination typically uses `LIMIT`, `OFFSET`, and `ORDER BY` in SQL queries. For example, consider the following business table used for display:
```sql
CREATE TABLE `demo`.`records_tbl` (
    `key` int(11) NOT NULL COMMENT "",
    `name` varchar(26) NOT NULL COMMENT "",
    `address` varchar(41) NOT NULL COMMENT "",
    `city` varchar(11) NOT NULL COMMENT "",
    `nation` varchar(16) NOT NULL COMMENT "",
    `region` varchar(13) NOT NULL COMMENT "",
    `phone` varchar(16) NOT NULL COMMENT "",
    `mktsegment` varchar(11) NOT NULL COMMENT ""
) DUPLICATE KEY(`key`, `name`)
DISTRIBUTED BY HASH(`key`) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);
```
Assuming 100 records are displayed per page, the following query fetches the first page:

```sql
select * from records_tbl order by `key`, `name` limit 100;
```
To fetch data for the second page:

```sql
select * from records_tbl order by `key`, `name` limit 100 offset 100;
```
However, deep pagination queries (with large offsets) are inefficient with this method: even when only a few rows are needed, all preceding data is read into memory and sorted first. With an auto-increment column, each row carries a unique value, so a predicate of the form `where unique_value > x limit y` can filter out a large portion of the data in advance, making pagination more efficient.
To illustrate this, an auto-increment column is added to the business table, giving each row a unique identifier:
```sql
CREATE TABLE `demo`.`records_tbl2` (
    `key` int(11) NOT NULL COMMENT "",
    `name` varchar(26) NOT NULL COMMENT "",
    `address` varchar(41) NOT NULL COMMENT "",
    `city` varchar(11) NOT NULL COMMENT "",
    `nation` varchar(16) NOT NULL COMMENT "",
    `region` varchar(13) NOT NULL COMMENT "",
    `phone` varchar(16) NOT NULL COMMENT "",
    `mktsegment` varchar(11) NOT NULL COMMENT "",
    `unique_value` BIGINT NOT NULL AUTO_INCREMENT
) DUPLICATE KEY(`key`, `name`)
DISTRIBUTED BY HASH(`key`) BUCKETS 10
PROPERTIES (
    "replication_num" = "3"
);
```
For pagination with 100 records per page, the following query fetches the first page:

```sql
select * from records_tbl2 order by unique_value limit 100;
```
Record the maximum unique_value from the returned results; assume it is 99. The following query then fetches the second page:

```sql
select * from records_tbl2 where unique_value > 99 order by unique_value limit 100;
```
When jumping directly to a deep page, the maximum unique_value from the previous page may not be at hand (for example, when fetching data starting from the 101st page). In that case, a subquery can locate the anchor row first: with 100 rows per page, the last row of page 100 is the 10,000th row, so the subquery skips (101 - 1) * 100 - 1 = 9999 rows and takes the next one:
```sql
select `key`, name, address, city, nation, region, phone, mktsegment
from records_tbl2, (select unique_value as max_value from records_tbl2 order by unique_value limit 1 offset 9999) as previous_data
where records_tbl2.unique_value > previous_data.max_value
order by records_tbl2.unique_value limit 100;
```