We have a table that currently has every postal address for a single state, currently around 13 mil records. Among postal addresses, it also contains spatial coordinates for each address along with meta-data for our software. Our software searches for records via polygons OR full text using address fields(address, city, state, zip). We will be adding additional states soon, so we are having to start think about table performance with these extra states. When all states are imported it will eventually contain around 155 mil records.
I am looking for what would be the best approach to load all these records, and still be able to search via spatial index OR full text index with great performance. If I am using the wrong approach for something like this, by all means please let me know.
My thought is to use partitioning for the table. The major downside I see with partitioning in MySQL, is that a partitioned table does not support spatial or full text indexes. To accomplish this I thought about pulling out the spatial column and the few other columns used for full text indexes, into a separate table. Then use a join on that table against the main table. This allows us to keep the main table partitioned, and then use the new table for strictly spatial and full text indexed searches.
Using MySql 8.0.26 with replicas
Current
CREATE TABLE `Addresses` (
`ID` BIGINT UNSIGNED NOT NULL,
`AddressFull` VARCHAR(150) NULL DEFAULT NULL,
`AddressCity` VARCHAR(50) NULL DEFAULT NULL,
`AddressState` VARCHAR(2) NULL DEFAULT NULL,
`AddressZIP` VARCHAR(5) NULL DEFAULT NULL,
`AddressZIP4` VARCHAR(4) NULL DEFAULT NULL,
`Latitude` FLOAT NULL DEFAULT NULL,
`Longitude` FLOAT NULL DEFAULT NULL,
`Meta1` VARCHAR(100) NULL DEFAULT NULL,
`Meta2` VARCHAR(100) NULL DEFAULT NULL,
`Meta3` VARCHAR(100) NULL DEFAULT NULL,
`Meta4` VARCHAR(100) NULL DEFAULT NULL,
.... 20+ other columns
`geo` point GENERATED ALWAYS AS (st_srid(point(ifnull(`Longitude`,0),ifnull(`Latitude`, 0)),4326)) STORED NOT NULL,
`created` DATETIME NOT NULL DEFAULT 'utc_timestamp()',
`modified` DATETIME NOT NULL DEFAULT 'utc_timestamp()',
PRIMARY KEY (`ID`),
INDEX idx_addresses_zip(`AddressZIP`),
.... 5 other indexes
SPATIAL INDEX spidx_addresses(`geo`),
FULLTEXT INDEX `txt_Meta1_Meta2` (`Meta1`, `Meta2`),
FULLTEXT INDEX `txt_addressSearch` (`AddressFull`, `AddressCity`, `AddressState`, `AddressZIP`)
)
COLLATE='utf8mb4_0900_ai_ci'
ENGINE=InnoDB;
Proposed
CREATE TABLE `Addresses` (
`ID` BIGINT UNSIGNED NOT NULL,
`AddressFull` VARCHAR(150) NULL DEFAULT NULL,
`AddressCity` VARCHAR(50) NULL DEFAULT NULL,
`AddressState` VARCHAR(2) NULL DEFAULT NULL,
`AddressZIP` VARCHAR(5) NULL DEFAULT NULL,
`AddressZIP4` VARCHAR(4) NULL DEFAULT NULL,
`Meta1` VARCHAR(100) NULL DEFAULT NULL,
`Meta2` VARCHAR(100) NULL DEFAULT NULL,
`Meta3` VARCHAR(100) NULL DEFAULT NULL,
`Meta4` VARCHAR(100) NULL DEFAULT NULL,
.... 20+ other columns
`created` DATETIME NOT NULL DEFAULT 'utc_timestamp()',
`modified` DATETIME NOT NULL DEFAULT 'utc_timestamp()',
PRIMARY KEY (`ID`),
INDEX idx_addresses_zip(`AddressZIP`),
.... 5 other indexes
)
COLLATE='utf8mb4_0900_ai_ci'
ENGINE=InnoDB
PARTITION BY KEY(`AddressState`);
CREATE TABLE `AddressesEx` (
`ID` BIGINT UNSIGNED NOT NULL,
`AddressFull` VARCHAR(150) NULL DEFAULT NULL,
`AddressCity` VARCHAR(50) NULL DEFAULT NULL,
`AddressState` VARCHAR(2) NULL DEFAULT NULL,
`AddressZIP` VARCHAR(5) NULL DEFAULT NULL,
`AddressZIP4` VARCHAR(4) NULL DEFAULT NULL,
`Latitude` FLOAT NULL DEFAULT NULL,
`Longitude` FLOAT NULL DEFAULT NULL,
`Meta1` VARCHAR(100) NULL DEFAULT NULL,
`Meta2` VARCHAR(100) NULL DEFAULT NULL,
`geo` point GENERATED ALWAYS AS (st_srid(point(ifnull(`Longitude`,0),ifnull(`Latitude`, 0)),4326)) STORED NOT NULL,
PRIMARY KEY (`ID`),
SPATIAL INDEX spidx_addresses(`geo`),
FULLTEXT INDEX `txt_Meta1_Meta2` (`Meta1`, `Meta2`),
FULLTEXT INDEX `txt_addressSearch` (`AddressFull`, `AddressCity`, `AddressState`, `AddressZIP`)
)
COLLATE='utf8mb4_0900_ai_ci'
ENGINE=InnoDB;
*Update after loading all data
Outcome form SHOW TABLE STATUS LIKE 'Addresses';
Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Addresses | InnoDB | 10 | Dynamic | 149288514 | 1025 | 153118277632 | 0 | 206872297472 | 6291456 | (NULL) | 2021年07月09日 06:35:28 | 2021年09月27日 10:22:36 | (NULL) | utf8mb4_0900_ai_ci | (NULL) |
*Update, showing select & explain
SELECT `ID`,
`AddressFull`,
`AddressCity`,
`AddressState`,
`AddressZIP`,
MATCH (AddressFull,AddressCity,AddressState,AddressZIP) AGAINST ('+1234* +main*' IN BOOLEAN MODE) AS relevance
FROM `Addresses`
WHERE MATCH (AddressFull,AddressCity,AddressState,AddressZIP) AGAINST ('+1234* +main*' IN BOOLEAN MODE)
ORDER BY `relevance` DESC
LIMIT 50
Explain
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | Addresses | (NULL) | fulltext | txt_addressSearch | txt_addressSearch | 0 | const | 1 | 100.00 | Using where; Ft_hints: sorted, limit = 50 |
Explain as JSON
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "1.07"
},
"ordering_operation": {
"using_filesort": false,
"table": {
"table_name": "Addresses",
"access_type": "fulltext",
"possible_keys": [
"txt_addressSearch"
],
"key": "txt_addressSearch",
"used_key_parts": [
"AddressFull"
],
"key_length": "0",
"ref": [
"const"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 1,
"filtered": "100.00",
"ft_hints": "sorted, limit = 50",
"cost_info": {
"read_cost": "0.97",
"eval_cost": "0.10",
"prefix_cost": "1.07",
"data_read_per_join": "16K"
},
"used_columns": [
"ID",
"AddressFull",
"AddressCity",
"AddressState",
"AddressZIP"
],
"attached_condition": "(match `db`.`Addresses`.`AddressFull`,`db`.`Addresses`.`AddressCity`,`db`.`Addresses`.`AddressState`,`db`.`Addresses`.`AddressZIP` against ('+1234* +main*' in boolean mode))"
}
}
}
}
1 Answer 1
SELECT ...,
ST_Distance_Sphere(center_point, item_point) AS meters
FROM table
WHERE ... (any filtering)
AND MBRContains(GeomFromText(Polygon(...)), -- The bounding box is encoded as a 'square'
sp_point)
HAVING meters <= ...
ORDER BY distance ASC, id ASC
LIMIT ...
That also sorts them by "distance".
If you need to skip the polygon search, then build a separate SELECT
.
If a million houses match the MBRContains, performance will probably be terrible. That is, don't expect excellent performance for all requests.
More discussion of that code: http://mysql.rjweb.org/doc.php/find_nearest_in_mysql#spatial
Limitations of FT
innodb_ft_min_token_size
can be set to 1. This will let you search for one-digit street numbers or "S" (for South) or the word "a". But it comes at some cost. Note, for example, that the disk space for indexes is bigger than the data. Do you really need
MATCH ("+1 +Infinity +Dr") AGAINST (address IN BOOLEAN MODR)
I sometimes find that FT is not the full answer. Instead, I construct a FT query _when it is practical, but use FT + something else or simply something else without FT for certain values in the source.
-
Comments are not for extended discussion; this conversation has been moved to chat.2021年10月08日 19:54:57 +00:00Commented Oct 8, 2021 at 19:54
Explore related questions
See similar questions with these tags.
PARTITIONing
is useful only in rare situations. More on PARTITION and Find Nearest. It is unclear whether you need polygons for individual homes; it seems like a centroid or even a point on the curb would be sufficient.