I apologize if this is a newbie question but I'm feeling overwhelmed.
I have 100+ million 2-grams (strings of two words), 100+ million 3-grams, and 100+ million 4-grams.
I was thinking of putting the 2-grams into one table, the 3-grams into another table, and so on, since there's quite a lot of data and I'd like to break it up at least a bit, and I don't know how else I could partition it. Plus, I'd like to avoid enterprise features of SQL Server such as actual table partitioning.
Apart from the text column, I'd have Id (bigint), AppearanceCount (int), and Hash (the hashed text column; it doesn't even need to be a computed column, as I don't predict many, or any, edits of the text column).
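Something like this is what I have in mind (all names and the hash width are just placeholders, not a final design):

CREATE TABLE dbo.TwoGrams
(
    Id              BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Ngram           NVARCHAR(200)        NOT NULL,  -- the two-word text
    AppearanceCount INT                  NOT NULL,
    Hash            BINARY(8)            NOT NULL   -- hash of Ngram, maintained by the application
);
CREATE INDEX IX_TwoGrams_Hash ON dbo.TwoGrams (Hash);  -- supports the delete-by-hash scenario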
Typical usage scenarios for the 2-gram table (and the others as well) would be:
- Search the table with a regex matching the text column.
- Delete a row with a specified hash.
- Insert a row.
In an average case, I believe searching the table with a regex should return around 5-10 results, but outliers are possible.
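Roughly, the three operations would look like this (table and column names are placeholders, and dbo.RegexMatch stands in for whatever regex mechanism ends up being used; it is not a built-in function):

-- 1) Regex search:
SELECT Id, Ngram FROM dbo.TwoGrams WHERE dbo.RegexMatch(Ngram, @pattern) = 1;
-- 2) Delete by hash:
DELETE FROM dbo.TwoGrams WHERE Hash = @hash;
-- 3) Insert:
INSERT INTO dbo.TwoGrams (Ngram, AppearanceCount, Hash) VALUES (@text, 1, @hash);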
So, what could I do to make sure all of this works fast? Fast meaning a cycle of the above operations completing in a few seconds for 2-grams and 3-grams, and in not more than a few minutes for 4-grams and 5-grams (the current non-database system for 4-grams takes an hour).
I know about Full-Text Search, but I haven't been able to find out how to use regexes with it, and I would really, really prefer to have regexes work here. If that is not at all possible, I can of course give more details, but I didn't want to overwhelm everyone with an even bigger wall of text.
I'm sorry if this is too general, but I'm in the planning phase of this and would like to know if SQL Server is viable for this. I could perhaps use something else, but I'm most familiar and comfortable with SQL Server.
They're all basically syntagms, aka parts of a sentence, that might or might not be grammatically wrong or contain a misspelling. The words in those syntagms also have variations, because the language I'm working with has cases and genders. https://en.wikipedia.org/wiki/Grammatical_case#Indo-European_languages https://en.wikipedia.org/wiki/Grammatical_gender
These words are not in English, in case that matters, but I'll put in some English examples so that it's understandable. This is a stupid example, but if English had cases and wanted to use the word "house" in various situations, we could have the following situation where we're answering some questions: What is this? "A house." Where are you going? "To houise." What are you buying? "Housua."
Basically, you have a root of a word that can change. So, if people are misspelling the word house or using it in an improper way, we want to delete entries from our database of correct expressions that match a regular expression that signifies how the word house is used incorrectly.
2-grams and 3-grams aren't different at all; it's literally just that 3-grams have one more word.
Let's imagine that the word "home" has different forms based on the type of situation the sentence it's in describes. Also let's imagine English has a word called "foohume" that has nothing to do with the word "home".
Examples of 2-grams where for some reason people spell home as hume:
- Went hume
- Gone hume
- Going hume
- Nice humes
- To huma
- About humas
- About foohume
Now, we want to delete all 2-grams which contain an incorrect use of the word home, so basically all of these except the last one. So, someone somewhere would tell the database to delete every 2-gram that matches the following conditions, which would be described with a regex:
- the root "hum" is at the beginning of the 2-gram or has a space in front of it;
- the root "hum" is followed by a suffix of "e", "es", "a", or "as".
So basically it'd be something like "[$ ]hum[e, es, a, as]". I know it's not really a valid regex, but you get the idea.
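(For reference, a valid version of that pattern, in the .NET regex syntax a SQLCLR function would use, might be something like the line below; the trailing ( |$) is my assumption that the suffix ends the word:)

(^| )hum(e|es|a|as)( |$)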
My fear is that Full-Text Search can't be used properly for this, and that we'll have a situation where we have to sequentially scan the entire table of 2-grams, check each row for a regex match, and return that row's Id (or the entire row, but Id only might be faster). If it were something more straightforward than a regex, I suppose a full-text search would return the results fairly quickly, since as I said there should be about 10 of them on average.
3 Answers
I recently did an SSIS package that went through a table with over 4 billion records; the first step in the package was a query that extracted about 50 million records, and the query took just a few seconds. Granted, my query was able to use an index, but it did have multiple conditions, and had to skip through the index to get at a subset of all the records and then test them with another couple of conditions. And the return set was in the same order of magnitude as one of your entire tables. So I don't think you will have to do much to make this happen.
Sounds like you may not know how to use the LIKE operator. You can find details for that here. That will let you use some regular expressions to match against your text.
I'd recommend that you load the 2-grams into a table and give it a try. My guess is that a simple query with a LIKE condition will do the job for you.
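For example, a minimal sketch (table and column names are assumed):

SELECT Id, Ngram FROM dbo.TwoGrams WHERE Ngram LIKE N'hum%';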
- I know about the LIKE operator, yes. However, I'm not sure about its expressiveness and speed. Wouldn't using a LIKE operator and no other criteria for the SELECT query mean that the entire table of 100 million rows would be scanned sequentially and the LIKE operator applied to each row's text column? Can the LIKE operator use an index of some kind? Anyway, I was planning to load them into a table and experiment ASAP, but I'll only have access to that test data of 100 million entries in a few weeks. I was hoping to get ahead of it a bit and plan some of my steps in advance. (Swift, Nov 13, 2015)
- @Swift As I mentioned in a comment on the question, using any function (built-in, SQLCLR, etc.) will disallow using an index on that field. So using a true RegEx SQLCLR function will scan the entire table and apply the pattern to every row's "text" column. The LIKE operator, as long as the pattern does not lead with a wildcard, actually can use an index (though indexes are limited to 900 bytes total, and INCLUDE columns are not searchable). (Solomon Rutzky, Nov 13, 2015)
- Yes, the LIKE operator will mean a full scan, most likely, but I expect this will not be anywhere near as onerous as you think it will be. And I don't think there is any way around it. A good database server can scan through 100 million records in seconds. The best optimizations you can make are to make sure the DB server has lots of memory and a strong CPU. (user67871, Nov 16, 2015)
Based on the given requirements and sample data, I would recommend taking another look at Full-Text Search. I think you are over-complicating the process by requiring full Regular Expressions. Full-Text Search (FTS) should be able to handle your queries even without RegEx.
FTS tokenizes each word in each indexed column. That addresses a large part of why you want to use RegEx: you don't know whether the string field contains "Going hume" or "Hume going". But if the words are tokenized, then you are only looking at one word at a time anyway.
The other reason for wanting RegEx is to more easily express the variations, such as:
- hume
- humes
- huma
- humas
You should be able to accomplish this easily using the CONTAINS predicate with a "prefix" term, as follows (yes, the double-quotes are required):
CONTAINS ([Ngrams], '"hum*"')
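In a full query, assuming a full-text index has already been created on the column, that might look like this (table name is an assumption):

SELECT Id, Ngrams
FROM dbo.TwoGrams
WHERE CONTAINS(Ngrams, N'"hum*"');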
Please also review the following documentation to get a better understanding of what FTS can do:
- Create and Manage Full-Text Indexes (shows how string fields are tokenized)
- Query with Full-Text Search
- CONTAINS
- CONTAINSTABLE
- Get Started with Full-Text Search
With respect to the idea of using LIKE or a SQLCLR RegEx function:
- Using any function (built-in, SQLCLR, etc.) will disallow using an index on that field. So using a true RegEx SQLCLR function will scan the entire table and apply the pattern to every row's "text" column.
- The LIKE operator, as long as the pattern does not lead with a wildcard (i.e. _, %, or []), actually can use an index.
- The LIKE operator, if the pattern does lead with a wildcard, will scan every row just like when using a function.
- Indexes are limited to 900 bytes total (across all key fields). INCLUDE columns do not contribute to that 900-byte limit, but also aren't searchable.
- If you have a BIGINT ID field, that is already 8 bytes, leaving you with 892. If the data is NVARCHAR, and it most likely is (or at least should be), that is a max of 446 two-byte characters (less than 446 if any four-byte supplementary characters are in there).
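To make the index points concrete, a sketch (names are assumptions; the key column has to fit within the 900-byte limit, so NVARCHAR(446) at most):

CREATE INDEX IX_TwoGrams_Ngram ON dbo.TwoGrams (Ngram) INCLUDE (AppearanceCount);

SELECT Id FROM dbo.TwoGrams WHERE Ngram LIKE N'hum%';   -- no leading wildcard: can seek the index
SELECT Id FROM dbo.TwoGrams WHERE Ngram LIKE N'%hum%';  -- leading wildcard: scans every row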
I am going to accept that you need regex. Regex does things you cannot do with LIKE and Full-Text Search.
Breaking it up into 3 tables will help. Or you could just have a word-count column and include that in the WHERE clause; filtering on word count will not be as fast as separate tables, but compared to regex it is super fast.
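For example (assuming a single combined table, dbo.Ngrams, with a WordCount column; all names are placeholders):

SELECT Id FROM dbo.Ngrams WHERE WordCount = 2 AND Ngram LIKE N'hum%';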
Use LIKE when you don't need regex.
If you do need regex, use LIKE to limit the number of rows the regex has to evaluate, if you can.
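For example (dbo.RegexMatch stands in for a SQLCLR regex function, such as the ones in SQL#; the exact name is an assumption):

-- The cheap LIKE discards most rows, so the expensive regex runs on far fewer of them:
DELETE FROM dbo.TwoGrams
WHERE Ngram LIKE N'%hum%'
  AND dbo.RegexMatch(Ngram, N'(^| )hum(e|es|a|as)( |$)') = 1;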
It might be worth splitting the n-grams up into separate words, but then you would need to reassemble the n-gram in the search (or in a view). LIKE 'hum%' can use an index, but LIKE '%hum%' cannot. You could also do things like find words within a small Levenshtein distance.
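A sketch of that word-level layout (all names are assumptions):

CREATE TABLE dbo.Words (WordId INT IDENTITY(1,1) PRIMARY KEY, Word NVARCHAR(100) NOT NULL UNIQUE);
CREATE TABLE dbo.NgramWords
(
    NgramId  BIGINT  NOT NULL,
    Position TINYINT NOT NULL,   -- 1 or 2 for a 2-gram
    WordId   INT     NOT NULL REFERENCES dbo.Words (WordId),
    PRIMARY KEY (NgramId, Position)
);
-- A word-level search can now seek the unique index on Words.Word:
SELECT DISTINCT nw.NgramId
FROM dbo.Words w
JOIN dbo.NgramWords nw ON nw.WordId = w.WordId
WHERE w.Word LIKE N'hum%';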
For n-grams I would not use Full-Text Search; I would break it up myself. Full-Text Search has overhead, and mixing regex with Full-Text Search could get messy.
- The LIKE operator does not support RegEx in any way. It only supports very simple patterns that most people mistakenly refer to as RegEx, or a limited RegEx: zero-or-more-of-any-character (%, same as RegEx .*), exactly-one-of-any-character (_, same as RegEx .), and exactly-one-character-within-this-range or not-in-this-range ([ ] or [^ ], same as RegEx). For actual RegEx you need to use SQLCLR. I wrote a library that has most RegEx functions in the Free version: SQL#.
- A BIGINT ID field is already 8 bytes. Assuming just these 2 fields, that gives you a max of 446 two-byte characters (assuming no supplementary characters). A function (RegEx or other) won't use an index and will scan all rows. LIKE can use an index IF not leading with a wildcard. Have you tried using LIKE 'hume %' OR LIKE 'humes %' OR LIKE 'huma %' OR LIKE 'humas %' ...? Listing out each variation as a separate OR condition? That covers half the options.
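A sketch of the full OR list for a 2-gram column (names are assumptions). The first four patterns are index-friendly on their own, but the last four lead with a wildcard, so the combined query will most likely still scan:

SELECT Id FROM dbo.TwoGrams
WHERE Ngram LIKE N'hume %'  OR Ngram LIKE N'humes %'
   OR Ngram LIKE N'huma %'  OR Ngram LIKE N'humas %'   -- misspelled word in first position
   OR Ngram LIKE N'% hume'  OR Ngram LIKE N'% humes'
   OR Ngram LIKE N'% huma'  OR Ngram LIKE N'% humas';  -- misspelled word in second position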