In the query
select pricing_date,count(*)
from prices
where pricing_date>='2022-12-01'
group by pricing_date;
I receive results with multiple pricing_date groups of the same date:
+--------------+----------+
| pricing_date | count(*) |
|--------------+----------|
| 2022年12月01日 | 40713 |
| 2022年12月05日 | 1362 |
| 2022年12月06日 | 448 |
| 2022年12月01日 | 112 |
| 2022年12月05日 | 790 |
| 2022年12月06日 | 1007 |
| 2022年12月05日 | 905 |
| 2022年12月06日 | 225 |
etc., for 400 rows. Here's the result of explain
:
explain format=json select pricing_date,count(*) from prices where pricing_date>='2022-12-01' group by pricing_date;
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "397070.02"
},
"grouping_operation": {
"using_filesort": false,
"table": {
"table_name": "prices",
"access_type": "ALL",
"possible_keys": [
"pd_id",
"id_pd"
],
"rows_examined_per_scan": 520926311,
"rows_produced_per_join": 1914444,
"filtered": "0.37",
"Using": "parallel query (1 columns, 1 filters, 0 exprs; 0 extra)",
"cost_info": {
"read_cost": "205625.62",
"eval_cost": "191444.40",
"prefix_cost": "397070.02",
"data_read_per_join": "759M"
},
"used_columns": [
"pricing_date"
],
"attached_condition": "(`prices`.`pricing_date` >= DATE'2022-12-01')"
}
}
}
}
However, if I change the date slightly the 2022年12月05日 and 2022年12月06日 dates appear to group as expected:
select pricing_date,count(*)
from prices
where pricing_date>='2022-12-05'
group by pricing_date;
+--------------+----------+
| pricing_date | count(*) |
|--------------+----------|
| 2022年12月05日 | 149752 |
| 2022年12月06日 | 223728 |
+--------------+----------+
here's the result of explain:
explain select pricing_date,count(*) from prices where pricing_date>='2022-12-04' group by pricing_date;
+----+-------------+-----------------------+------------+-------+-----------------+---------+---------+--------+--------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----+-------------+-----------------------+------------+-------+-----------------+---------+---------+--------+--------+----------+--------------------------|
| 1 | SIMPLE | prices | <null> | range | pd_id,id_pd | pd_id | 4 | <null> | 976306 | 100.0 | Using where; Using index |
+----+-------------+-----------------------+------------+-------+-----------------+---------+---------+--------+--------+----------+--------------------------+
Why doesn't the first query accurately group all the dates with the same date into the same group? I see that the second query uses a key as expected, but it doesn't seem like group by should fail on non-indexed results.
Here's the table schema:
+-----------+--------------------------------------------------------------------+
| Table | Create Table |
|-----------+--------------------------------------------------------------------|
| prices | CREATE TABLE `prices` ( |
| | `pricing_date` date DEFAULT NULL, |
| | `id` varchar(12) DEFAULT NULL, |
| | UNIQUE KEY `pd_id` (`pricing_date`,`id`), |
| | KEY `id_pd` (`id`,`pricing_date`) |
| | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-----------+--------------------------------------------------------------------+
I'm using mysql 8.0.23 in AWS Aurora serverless v2 if that's relevant.
1 Answer 1
As suggested by Rick James, turning off Aurora's parallel query seems to solve the issue and give the expected results.
select date(pricing_date) as new_pricing_date, count(*) from prices where date(pricing_date)>='2022年12月01日' group by new_pricing_date;
, does it change the results ?select pricing_date,count(*) from prices force index (pd_id) where pricing_date>='2022年12月01日' group by pricing_date;
(2.3 seconds vs 493 seconds)