Given a GUID for an event, say logins, how might I search for similar or dissimilar events, utilizing the other columns data?
For example: Find me similar logins to [GUID] {Server:"s123", State="NY", Status="503", etc...}
OR
Find me logins the opposite of that example given?
I feel there must been a name for this, but Full-Text, or "%Like%" syntax I do not believe would be sufficient.
Is indexing the data with solr or elasticsearch the solution?
1 Answer 1
The simplistic method is to count the number of columns that match:
select
n.id as needle_id,
h.id as haystack_id,
case when n.name = h.name then 1 else 0 end
+ case when n.phone = h.phone then 1 else 0 end as relevance
from
needles n
join
haystack h
on
n.name = h.name
or n.phone = h.phone
order by
relevance desc;
where "name" and "phone" are columns in your tables. You can use the same table as both the needle and haystack if you want (to find duplicates in a table).
http://blog.databasepatterns.com/2014/12/fuzzy-record-matching-in-sql-p1.html
The more correct and much more complicated method is to use Probabilistic Record Linkage. There's an R package for it (recordlinkage)