0

I have a table table1 that contains several fields, some textual and other non-textual, on which I do search queries. Let's focus on some text fields, which I simply call field1, field2, field3. The table contains about 1.5 million records and the queries are not so fast, because when I have to search on these fields, I'm doing in fullscan mode.

I give you a concrete example:

SELECT *
FROM table1
WHERE [some conditions]
AND (field1 LIKE "%str1%" OR field2 LIKE "%str1%" or field3 LIKE "%str1%")
AND (field1 LIKE "%str2%" OR field2 LIKE "%str2%" or field3 LIKE "%str2%")
AND (field3 LIKE "%str3%")

Point out that str1, str2, etc ... can be a word or a part of a word (for example "Table" or "Tabl").

Considering how the query is build, I don't see any sense to define any index on field1, field2, field3: they would not bring any advantage, just an useless waste of resources in my opionion.

So, I have evaluated the FULLTEXT indexes of MySQL and the query MATCH() AGAINST(): this is not good, because in my case it works only when exactly field1=word1 or field2=word2, while often the fields contain concatenated words without delimiters (for example field1 contains the value "HOUSETABLERAIN" and not "HOUSE TABLE RAIN").

What can I do to improve the execution time? My question includes changes to the query and the edit of data structure as well.

I thought two solutions, but I'm not convinced of either.

Solution A) Introduce another table table2 (table1_id INT, field VARCHAR, string VARCHAR); for each record in table1, insert all possible substrings (confining itself to combinations with at least 4 characters). Practical example:

If in table1 there is the record

id | field1
45 | HOUSETABLERAIN

then in table2 there will be

table1_id | field | string
 45 | field1 | HOUS
 45 | field1 | HOUSE
 45 | field1 | HOUSET
 45 | field1 | HOUSETA
 45 | field1 | HOUSETAB
 45 | field1 | HOUSETABL
 45 | field1 | HOUSETABLE
 45 | field1 | HOUSETABLER
 45 | field1 | HOUSETABLERA
 45 | field1 | HOUSETABLERAI
 45 | field1 | HOUSETABLERAIN
 45 | field1 | OUSE
 45 | field1 | OUSET
 45 | field1 | OUSETA
...

Afterwards I could define an index on table2.string and I should have some advantages; furthermore, I could also eliminate a part of the fields in table1, because I would no longer search on them.

But I fear it's madness ... even before testing it.

Solution B) Add a TEXT field for the search and enter all possible combinations separated by the blank space " ", then define a FULLTEXT index. The problem are the stopwords that are not considered by the MATCH AGAINST query.

I'd like to have some opinions and maybe some better ideas. Thank you.

asked Jan 23, 2019 at 19:40
7
  • Dedicated full text implementations like solr, sphinx provide a variety of functions so you don't need to implement madness. MySQL isn't the greatest of full text engines. Commented Jan 24, 2019 at 0:01
  • Thanks for your tip, but honestly I can't change database system. Commented Jan 24, 2019 at 13:45
  • 1
    Then drop the requirement. Its a technical professionals job to provide recommendations to change the scope of what is allowed by presenting technical rational. Don't be belligerent and try to achieve this when MySQL is unsuitable for this task. If your client/boss is unconvinced they aren't doing their job and you should find another one. Using more suitable full text engines doesn't mean dropping MySQL for the things its good at. Commented Jan 24, 2019 at 21:06
  • How long are the fields? If they are not "too" long, I will present Solution C. Commented Jan 24, 2019 at 23:18
  • @RickJames: VARCHAR, from 70 chars to 90 chars, only one field is 255 chars. Thanks. Commented Jan 26, 2019 at 8:54

1 Answer 1

0

Digrams (or trigrams).

Take each pair (or triple) of letters in the text, hash it down to 0..63, set a bit in a BIGINT UNSIGNED. The "OR" of a few dozen bits together like that will usually lead to about half the bits being on. (If most of the bits are on, the method is less effective; if too few bits are on, it wastes space.

Do that as you store the data, and store the bigints in extra columns. When testing, do likewise to the data coming in -- get a bigint representing the query.

WHERE (big3 & test3) = test3

Where big3 is the bigint for field3 and test3 is the bigint for "str3", that clause will check that all of the letter pairs of "str3" are probably in field3.

For the combined test of field1/2/3 against "str1",

AND ((big1 | big2 | big3) & test1) = test1

Better yet is to precompute (big1 | big2 | big3) into a combined big123 initially.

I said "probably". That means you will get some false positives and need to double check the 'hard' way, as you already discussed.

Note: If there is some meaning you would like to apply for "word boundary", that can easily be done in the building of the bigints.

Hash

One such hash:

SELECT HEX(1 << (CONV(MID(MD5(LOWER('th')), 5,2), 16, 10) & 0x3F));
+-------------------------------------------------------------+
| HEX(1 << (CONV(MID(MD5(LOWER('th')), 5,2), 16, 10) & 0x3F)) |
+-------------------------------------------------------------+
| 8000 |
+-------------------------------------------------------------+

Translate this into your programming language; it would probably be easier there, not in SQL.

Dissecting it:

HEX(1 << (CONV(MID(MD5(LOWER('th')), 5,2), 16, 10) & 0x3F))
 ^^ sample digraph
 ^^^^^ fold case unless you need to differentiate
 ^^^ a readily available hashing function
 ^^^ ^^^ -- pick some hex digits out of it
 ^^^^ ^^^^^^ -- hex -> decimal
 ^^^^^^ -- only 6 bits
 ^^^^ -- shift "1" to the bit position given by the 6-bit number
^^^ -- (just for displaying the result; not part of the algorithm)
answered Jan 26, 2019 at 16:59
3
  • Thanks. "Hash it down to 0..63". I should calculate the hash encoding of each field (field1, field2, ...), convert this alphanumeric string into a BIGINT and save it in (big1, big2, ...). Am I right? Do you have an idea of which hash coding to choose? There are so many. Commented Jan 27, 2019 at 22:13
  • @boldfra23 - see if my addition is clear enough. Commented Jan 27, 2019 at 22:30
  • I want to thank you for your prompt reply. I will write a comment as soon as I have news, maybe in a few days. Give me time to think, apply and test. Commented Jan 27, 2019 at 22:56

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.