I apologize if this is a newbie question but I'm feeling overwhelmed.
I have 100+ million 2-grams (strings of two words), 100+ million 3-grams, and 100+ million 4-grams.
I was thinking of putting the 2-grams into one table, the 3-grams into another table, and so on, since there's quite a lot of data and I'd like to break it up at least a bit, and I don't know how else I could partition it. Plus, I'd like to avoid enterprise features of SQL Server such as actual table partitioning.
Apart from the text column, I'd have Id (bigint), AppearanceCount (int), and Hash (the hashed text column; it doesn't even need to be a computed column, as I don't predict many, or any, edits of the text column).
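Something like this is what I have in mind (all names and the hash width are just placeholders, not a final design):

CREATE TABLE dbo.TwoGrams
(
    Id              BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Ngram           NVARCHAR(200)        NOT NULL,  -- the two-word text
    AppearanceCount INT                  NOT NULL,
    Hash            BINARY(8)            NOT NULL   -- hash of Ngram, maintained by the application
);
CREATE INDEX IX_TwoGrams_Hash ON dbo.TwoGrams (Hash);  -- supports the delete-by-hash scenario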
Typical usage scenarios for the 2-gram table (and the others as well) would be:
- Search the table with a regex matching the text column.
- Delete a row with a specified hash.
- Insert a row.
In an average case, I believe searching the table with a regex should return around 5-10 results, but outliers are possible.
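Roughly, the three operations would look like this (table and column names are placeholders, and dbo.RegexMatch stands in for whatever regex mechanism ends up being used; it is not a built-in function):

-- 1) Regex search:
SELECT Id, Ngram FROM dbo.TwoGrams WHERE dbo.RegexMatch(Ngram, @pattern) = 1;
-- 2) Delete by hash:
DELETE FROM dbo.TwoGrams WHERE Hash = @hash;
-- 3) Insert:
INSERT INTO dbo.TwoGrams (Ngram, AppearanceCount, Hash) VALUES (@text, 1, @hash);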
So, what could I do to make sure all of this works fast? Fast meaning a cycle of the above operations completing in a few seconds for 2-grams and 3-grams, and in not more than a few minutes for 4-grams and 5-grams (the current non-database system for 4-grams takes an hour).
I know about Full-Text Search, but I haven't been able to find out how to use regexes with it, and I would really, really prefer to have regexes work here. If that is not at all possible, I can of course give more details, but I didn't want to overwhelm everyone with an even bigger wall of text.
I'm sorry if this is too general, but I'm in the planning phase of this and would like to know if SQL Server is viable for this. I could perhaps use something else, but I'm most familiar and comfortable with SQL Server.
They're all basically syntagms, aka parts of a sentence, that might or might not be grammatically wrong or contain a misspelling. The words in those syntagms also have variations, because the language I'm working with has cases and genders. https://en.wikipedia.org/wiki/Grammatical_case#Indo-European_languages https://en.wikipedia.org/wiki/Grammatical_gender
These words are not in English, in case that matters, but I'll put in some English examples so that it's understandable. This is a stupid example, but if English had cases and wanted to use the word "house" in various situations, we could have the following situation where we're answering some questions: What is this? "A house." Where are you going? "To houise." What are you buying? "Housua."
Basically, you have a root of a word that can change. So, if people are misspelling the word house or using it in an improper way, we want to delete entries from our database of correct expressions that match a regular expression that signifies how the word house is used incorrectly.
2-grams and 3-grams aren't different at all; it's literally just that 3-grams have one more word.
Let's imagine that the word "home" has different forms based on the type of situation the sentence it's in describes. Also let's imagine English has a word called "foohume" that has nothing to do with the word "home".
Examples of 2-grams where for some reason people spell home as hume:
- Went hume
- Gone hume
- Going hume
- Nice humes
- To huma
- About humas
- About foohume
Now, we want to delete all 2-grams which contain an incorrect use of the word home, so basically all of these except the last one. So, someone somewhere would tell the database to delete every 2-gram that matches the following conditions, which would be described with a regex:
- the root "hum" is at the beginning of the 2-gram or has a space in front of it;
- the root "hum" is followed by a suffix of "e", "es", "a", or "as".
So basically it'd be something like "[$ ]hum[e, es, a, as]". I know it's not really a valid regex, but you get the idea.
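(For reference, a valid version of that pattern, in the .NET regex syntax a SQLCLR function would use, might be something like the line below; the trailing ( |$) is my assumption that the suffix ends the word:)

(^| )hum(e|es|a|as)( |$)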
My fear is that Full-Text Search can't be used properly for this, and that we'll have a situation where we have to sequentially scan the entire table of 2-grams, check each row for a regex match, and return that row's Id (or the entire row, but Id only might be faster). If it were something more straightforward than a regex, I suppose a full-text search would return the results fairly quickly, since as I said there should be about 10 of them on average.
3 Answers
I recently did an SSIS package that went through a table with over 4 billion records; the first step in the package was a query that extracted about 50 million records, and the query took just a few seconds. Granted, my query was able to use an index, but it did have multiple conditions, and had to skip through the index to get at a subset of all the records and then test them with another couple of conditions. And the return set was in the same order of magnitude as one of your entire tables. So I don't think you will have to do much to make this happen.
Sounds like you may not know how to use the LIKE operator. You can find details for that here. That will let you use some regular expressions to match against your text.
I'd recommend that you load the 2-grams into a table and give it a try. My guess is that a simple query with a LIKE condition will do the job for you.
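For example, a minimal sketch (table and column names are assumed):

SELECT Id, Ngram FROM dbo.TwoGrams WHERE Ngram LIKE N'hum%';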
- I know about the LIKE operator, yes. However, I'm not sure about its expressiveness and speed. Wouldn't using a LIKE operator and no other criteria for the SELECT query mean that the entire table of 100 million rows would be scanned sequentially and the LIKE operator applied to each row's text column? Can the LIKE operator use an index of some kind? Anyway, I was planning to load them into a table and experiment ASAP, but I'll only have access to that test data of 100 million entries in a few weeks. I was hoping to get ahead of it a bit and plan some of my steps in advance. (Swift, Nov 13, 2015)
- @Swift As I mentioned in a comment on the question, using any function (built-in, SQLCLR, etc.) will disallow using an index on that field. So using a true RegEx SQLCLR function will scan the entire table and apply the pattern to every row's "text" column. The LIKE operator, as long as the pattern does not lead with a wildcard, actually can use an index (though indexes are limited to 900 bytes total, and INCLUDE columns are not searchable). (Solomon Rutzky, Nov 13, 2015)
- Yes, the LIKE operator will mean a full scan, most likely, but I expect this will not be anywhere near as onerous as you think it will be. And I don't think there is any way around it. A good database server can scan through 100 million records in seconds. The best optimizations you can make are to make sure the DB server has lots of memory and a strong CPU. (user67871, Nov 16, 2015)
Based on the given requirements and sample data, I would recommend taking another look at Full-Text Search. I think you are over-complicating the process by requiring full Regular Expressions. Full-Text Search (FTS) should be able to handle your queries even without RegEx.
FTS tokenizes each word in each indexed column. That addresses a large part of why you want to use RegEx: you don't know whether the string field contains "Going hume" or "Hume going". But if the words are tokenized, then you are only looking at one word at a time anyway.
The other reason for wanting RegEx is to more easily express the variations, such as:
- hume
- humes
- huma
- humas
You should be able to accomplish this easily using the CONTAINS predicate with a "prefix" term, as follows (yes, the double-quotes are required):
CONTAINS ([Ngrams], '"hum*"')
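In a full query, assuming a full-text index has already been created on the column, that might look like this (table name is an assumption):

SELECT Id, Ngrams
FROM dbo.TwoGrams
WHERE CONTAINS(Ngrams, N'"hum*"');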
Please also review the following documentation to get a better understanding of what FTS can do:
- Create and Manage Full-Text Indexes (shows how string fields are tokenized)
- Query with Full-Text Search
- CONTAINS
- CONTAINSTABLE
- Get Started with Full-Text Search
With respect to the idea of using LIKE or a SQLCLR RegEx function:
- Using any function (built-in, SQLCLR, etc.) will disallow using an index on that field. So using a true RegEx SQLCLR function will scan the entire table and apply the pattern to every row's "text" column.
- The LIKE operator, as long as the pattern does not lead with a wildcard (i.e. _, %, or []), actually can use an index.
- The LIKE operator, if the pattern does lead with a wildcard, will scan every row just like when using a function.
- Indexes are limited to 900 bytes total (across all key fields). INCLUDE columns do not contribute to that 900-byte limit, but also aren't searchable.
- If you have a BIGINT ID field, that is already 8 bytes, leaving you with 892. If the data is NVARCHAR, and it most likely is (or at least should be), that is a max of 446 two-byte characters (less than 446 if any four-byte supplementary characters are in there).
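To make the index points concrete, a sketch (names are assumptions; the key column has to fit within the 900-byte limit, so NVARCHAR(446) at most):

CREATE INDEX IX_TwoGrams_Ngram ON dbo.TwoGrams (Ngram) INCLUDE (AppearanceCount);

SELECT Id FROM dbo.TwoGrams WHERE Ngram LIKE N'hum%';   -- no leading wildcard: can seek the index
SELECT Id FROM dbo.TwoGrams WHERE Ngram LIKE N'%hum%';  -- leading wildcard: scans every row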
I am going to accept that you need regex. Regex does things you cannot do with LIKE and Full-Text Search.
Breaking it up into 3 tables will help. Or you could just have a word-count column and include that in the WHERE clause; filtering on word count will not be as fast as separate tables, but compared to regex it is super fast.
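For example (assuming a single combined table, dbo.Ngrams, with a WordCount column; all names are placeholders):

SELECT Id FROM dbo.Ngrams WHERE WordCount = 2 AND Ngram LIKE N'hum%';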
Use LIKE when you don't need regex.
If you do need regex, use LIKE to limit the number of rows the regex has to evaluate, if you can.
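For example (dbo.RegexMatch stands in for a SQLCLR regex function, such as the ones in SQL#; the exact name is an assumption):

-- The cheap LIKE discards most rows, so the expensive regex runs on far fewer of them:
DELETE FROM dbo.TwoGrams
WHERE Ngram LIKE N'%hum%'
  AND dbo.RegexMatch(Ngram, N'(^| )hum(e|es|a|as)( |$)') = 1;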
It might be worth splitting the n-grams up into separate words, but then you would need to reassemble the n-gram in the search (or in a view). LIKE 'hum%' can use an index, but LIKE '%hum%' cannot. You could also do things like find words within a small Levenshtein distance.
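A sketch of that word-level layout (all names are assumptions):

CREATE TABLE dbo.Words (WordId INT IDENTITY(1,1) PRIMARY KEY, Word NVARCHAR(100) NOT NULL UNIQUE);
CREATE TABLE dbo.NgramWords
(
    NgramId  BIGINT  NOT NULL,
    Position TINYINT NOT NULL,   -- 1 or 2 for a 2-gram
    WordId   INT     NOT NULL REFERENCES dbo.Words (WordId),
    PRIMARY KEY (NgramId, Position)
);
-- A word-level search can now seek the unique index on Words.Word:
SELECT DISTINCT nw.NgramId
FROM dbo.Words w
JOIN dbo.NgramWords nw ON nw.WordId = w.WordId
WHERE w.Word LIKE N'hum%';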
For n-grams I would not use Full-Text Search; I would break it up myself. Full-Text Search has overhead, and mixing regex with Full-Text Search could get messy.
- The LIKE operator does not support RegEx in any way. It only supports very simple patterns that most people mistakenly refer to as RegEx, or a limited RegEx: zero-or-more-of-any-character (%, same as RegEx .*), exactly-one-of-any-character (_, same as RegEx .), and exactly-one-character-within-this-range or not-in-this-range ([ ] or [^ ], same as RegEx). For actual RegEx you need to use SQLCLR. I wrote a library that has most RegEx functions in the Free version: SQL#.
- A BIGINT ID field is already 8 bytes. Assuming just these 2 fields, that gives you a max of 446 two-byte characters (assuming no supplementary characters). A function (RegEx or other) won't use an index and will scan all rows. LIKE can use an index IF not leading with a wildcard. Have you tried using LIKE 'hume %' OR LIKE 'humes %' OR LIKE 'huma %' OR LIKE 'humas %' ...? Listing out each variation as a separate OR condition? That covers half the options.
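A sketch of the full OR list for a 2-gram column (names are assumptions). The first four patterns are index-friendly on their own, but the last four lead with a wildcard, so the combined query will most likely still scan:

SELECT Id FROM dbo.TwoGrams
WHERE Ngram LIKE N'hume %'  OR Ngram LIKE N'humes %'
   OR Ngram LIKE N'huma %'  OR Ngram LIKE N'humas %'   -- misspelled word in first position
   OR Ngram LIKE N'% hume'  OR Ngram LIKE N'% humes'
   OR Ngram LIKE N'% huma'  OR Ngram LIKE N'% humas';  -- misspelled word in second position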