[](https://github.com/SciML/ColPrac)

A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.
This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large collections of text). Currently, it supports both character-based and word-based N-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.

## Features

- [x] Fast algorithm for string matching
- [x] 100% exact retrieval
- [x] Unicode support
- [ ] Custom user-defined feature generation methods
- [ ] MeCab-based tokenizer support

## Supported String Similarity Measures

- [x] Dice coefficient
- [x] Jaccard coefficient
- [x] Cosine coefficient
- [x] Overlap coefficient

## Installation

You can install the latest stable version of this package from the Julia registries. Enter Julia's package manager by pressing `]`, then run:

```julia
pkg> add SimString
```

If you are feeling brave, you can try the current experimental features by adding the master branch to your environment, again from the package manager (`]`):

```julia
pkg> add SimString#master
```

You are now good to go with bleeding-edge features, and the occasional breakage!

To revert to a stable version, you can simply run:
docs/src/index.md: 70 additions & 0 deletions
@@ -6,6 +6,76 @@ CurrentModule = SimString

Documentation for [SimString](https://github.com/PyDataBlog/SimString.jl).

A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.
This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large collections of text). Currently, it supports both character-based and word-based N-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.

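To make the idea of character-based and word-based N-gram features concrete, here is a minimal sketch in plain Julia. The helpers `char_ngrams` and `word_ngrams` are hypothetical names used only for illustration; they are not part of SimString's API, and the package's own feature extraction may differ in details such as padding.

```julia
# Hypothetical helper (not SimString's API): overlapping character n-grams,
# with `pad` repeated n-1 times on both ends of the string.
function char_ngrams(s::AbstractString, n::Integer; pad::AbstractString = " ")
    # Collect into a Vector{Char} so that slicing is safe for Unicode text.
    padded = collect(pad^(n - 1) * s * pad^(n - 1))
    return [String(padded[i:i+n-1]) for i in 1:length(padded)-n+1]
end

# Hypothetical helper (not SimString's API): n-grams over whitespace-separated words.
function word_ngrams(s::AbstractString, n::Integer)
    words = split(s)
    return [join(words[i:i+n-1], " ") for i in 1:length(words)-n+1]
end

char_ngrams("foo", 2)                  # [" f", "fo", "oo", "o "]
word_ngrams("the quick brown fox", 2)  # ["the quick", "quick brown", "brown fox"]
```

Two strings that share many of these features are likely to be similar, which is what the similarity measures listed in this document quantify.
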
## Features

- [x] Fast algorithm for string matching
- [x] 100% exact retrieval
- [x] Unicode support
- [ ] Custom user-defined feature generation methods
- [ ] MeCab-based tokenizer support

## Supported String Similarity Measures

- [x] Dice coefficient
- [x] Jaccard coefficient
- [x] Cosine coefficient
- [x] Overlap coefficient

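For reference, these four measures are commonly defined over the feature sets of two strings. Writing $X$ and $Y$ for the feature sets (for example, the N-grams) of the two strings, the textbook set-based forms are given below. This is a sketch of the standard definitions; it is an assumption that the package computes them in exactly this form.

```math
\begin{aligned}
\mathrm{Dice}(X, Y)    &= \frac{2\,|X \cap Y|}{|X| + |Y|} \\
\mathrm{Jaccard}(X, Y) &= \frac{|X \cap Y|}{|X \cup Y|} \\
\mathrm{Cosine}(X, Y)  &= \frac{|X \cap Y|}{\sqrt{|X|\,|Y|}} \\
\mathrm{Overlap}(X, Y) &= \frac{|X \cap Y|}{\min(|X|, |Y|)}
\end{aligned}
```

Each measure ranges from 0 (no shared features) to 1 (for identical feature sets).
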
## Installation

You can install the latest stable version of this package from the Julia registries. Enter Julia's package manager by pressing `]`, then run:

```julia
pkg> add SimString
```

If you are feeling brave, you can try the current experimental features by adding the master branch to your environment, again from the package manager (`]`):

```julia
pkg> add SimString#master
```

You are now good to go with bleeding-edge features, and the occasional breakage!

To revert to a stable version, you can simply run:

```julia
pkg> free SimString
```

## Usage

```julia
using SimString

# Initialize the database and add some strings
db = DictDB(CharacterNGrams(2, ""));
push!(db, "foo");
push!(db, "bar");
push!(db, "fooo");

# A more convenient approach for multiple entries is to pass an array of strings: `append!(db, ["foo", "bar", "fooo"]);`

# Retrieve the closest match(es)
res = search(Dice(), db, "foo"; α=0.8, ranked=true)
```