Commit 6f71152

Initial alpha version (#1)
Initial alpha version
1 parent da14c81 commit 6f71152

18 files changed: +1041 -15 lines changed

.github/workflows/CI.yml

Lines changed: 3 additions & 10 deletions
@@ -1,10 +1,7 @@
 name: CI
 on:
-  push:
-    branches:
-      - main
-    tags: '*'
-  pull_request:
+  - push
+  - pull_request
 jobs:
   test:
     name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
@@ -13,19 +10,15 @@ jobs:
       fail-fast: false
       matrix:
         version:
-          - '1.0'
           - '1.6'
+          - '1.7'
           - 'nightly'
         os:
           - ubuntu-latest
           - macOS-latest
           - windows-latest
         arch:
           - x64
-          - x86
-        exclude:
-          - os: macOS-latest
-            arch: x86
     steps:
       - uses: actions/checkout@v2
       - uses: julia-actions/setup-julia@v1

‎.gitignore‎

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@
 *.jl.mem
 /Manifest.toml
 /docs/build/
+.vscode

‎Project.toml‎

Lines changed: 8 additions & 1 deletion
@@ -3,11 +3,18 @@ uuid = "2e3c4037-312d-4650-b9c0-fcd0fc09aae4"
 authors = ["Bernard Brenyah"]
 version = "0.1.0"

+[deps]
+CircularArrays = "7a955b69-7140-5f4e-a0ed-f168c5e2e749"
+DataStructures = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
+OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
+ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
+
 [compat]
 julia = "1"

 [extras]
+Faker = "0efc519c-db33-5916-ab87-703215c3906f"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

 [targets]
-test = ["Test"]
+test = ["Test", "Faker"]

‎README.md‎

Lines changed: 42 additions & 0 deletions
@@ -6,3 +6,45 @@
 [![Coverage](https://codecov.io/gh/PyDataBlog/SimString.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/PyDataBlog/SimString.jl)
 [![Code Style: Blue](https://img.shields.io/badge/code%20style-blue-4495d1.svg)](https://github.com/invenia/BlueStyle)
 [![ColPrac: Contributor's Guide on Collaborative Practices for Community Packages](https://img.shields.io/badge/ColPrac-Contributor's%20Guide-blueviolet)](https://github.com/SciML/ColPrac)
+
+A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.
+This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large amounts of text). Currently, this package supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.
+
+## Features
+
+- [X] Fast algorithm for string matching
+- [X] 100% exact retrieval
+- [X] Unicode support
+- [ ] Custom user-defined feature generation methods
+- [ ] Mecab-based tokenizer support
+
+## Supported String Similarity Measures
+
+- [X] Dice coefficient
+- [X] Jaccard coefficient
+- [X] Cosine coefficient
+- [X] Overlap coefficient
+
+## Installation
+
+You can grab the latest stable version of this package from the Julia registries by running:
+
+*NB:* Don't forget to invoke Julia's package manager with `]`
+
+```julia
+pkg> add SimString
+```
+
+The brave few can grab the current experimental features by adding the master branch to their development environment after invoking the package manager with `]`:
+
+```julia
+pkg> add SimString#master
+```
+
+You are good to go with bleeding edge features and breakages!
+
+To revert to a stable version, you can simply run:
+
+```julia
+pkg> free SimString
+```
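
The four similarity measures listed in the README can be illustrated with a minimal sketch over plain feature sets. The helper names below (`overlap_count`, `dice`, `jaccard`, `cosine`, `overlap`) are hypothetical and are not the package's API; they only spell out the usual set-based definitions, assuming each string has already been turned into a collection of features.

```julia
# Illustrative helpers only: these lowercase functions are NOT part of SimString.jl.
# `x` and `y` are collections of features (e.g. n-grams); duplicates are collapsed via Set.
overlap_count(x, y) = length(intersect(Set(x), Set(y)))

# Dice: twice the shared features over the sum of both sizes
dice(x, y) = 2 * overlap_count(x, y) / (length(Set(x)) + length(Set(y)))

# Jaccard: shared features over the size of the union
jaccard(x, y) = overlap_count(x, y) / length(union(Set(x), Set(y)))

# Cosine: shared features over the geometric mean of both sizes
cosine(x, y) = overlap_count(x, y) / sqrt(length(Set(x)) * length(Set(y)))

# Overlap: shared features over the size of the smaller set
overlap(x, y) = overlap_count(x, y) / min(length(Set(x)), length(Set(y)))

a = ["ab", "bc", "cd"]
b = ["bc", "cd", "de", "ef"]

println((dice(a, b), jaccard(a, b), cosine(a, b), overlap(a, b)))
```

Here `a` and `b` share two features, so the four scores differ only in how they normalise that shared count.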

docs/src/index.md

Lines changed: 70 additions & 0 deletions
@@ -6,6 +6,76 @@ CurrentModule = SimString

 Documentation for [SimString](https://github.com/PyDataBlog/SimString.jl).

+A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.
+This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large amounts of text). Currently, this package supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.
+
+## Features
+
+- [X] Fast algorithm for string matching
+- [X] 100% exact retrieval
+- [X] Unicode support
+- [ ] Custom user-defined feature generation methods
+- [ ] Mecab-based tokenizer support
+
+## Supported String Similarity Measures
+
+- [X] Dice coefficient
+- [X] Jaccard coefficient
+- [X] Cosine coefficient
+- [X] Overlap coefficient
+
+## Installation
+
+You can grab the latest stable version of this package from the Julia registries by running:
+
+*NB:* Don't forget to invoke Julia's package manager with `]`
+
+```julia
+pkg> add SimString
+```
+
+The brave few can grab the current experimental features by adding the master branch to their development environment after invoking the package manager with `]`:
+
+```julia
+pkg> add SimString#master
+```
+
+You are good to go with bleeding edge features and breakages!
+
+To revert to a stable version, you can simply run:
+
+```julia
+pkg> free SimString
+```
+
+## Usage
+
+```julia
+using SimString
+
+# Initialise the database and add some strings
+db = DictDB(CharacterNGrams(2, " "));
+push!(db, "foo");
+push!(db, "bar");
+push!(db, "fooo");
+
+# A convenient approach for multiple entries is to append an array of strings: `append!(db, ["foo", "bar", "fooo"]);`
+
+# Retrieve the closest match(es)
+res = search(Dice(), db, "foo"; α=0.8, ranked=true)
+# 2-element Vector{Tuple{String, Float64}}:
+#  ("foo", 1.0)
+#  ("fooo", 0.8888888888888888)
+
+
+```
+
+## TODO: Benchmarks
+
+## Release History
+
+- 0.1.0 Initial release.
+
 ```@index
 ```
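
The `0.8888888888888888` score in the usage example can be reproduced by hand with a small, self-contained sketch. It assumes that the string is padded with `n - 1` copies of the padder on each side and that repeated n-grams are kept distinct by an occurrence counter; both assumptions are consistent with the output shown above but are not confirmed against the package source. `char_ngrams` and `dice` are hypothetical names, not SimString functions.

```julia
# Hypothetical re-implementation for illustration; not the package's own code.
# Pads the string, slides a window of `n` characters, and tags each n-gram with
# how many times it has been seen so far, so duplicates remain separate features.
function char_ngrams(s::AbstractString, n::Int, padder::AbstractString)
    padded = collect(padder^(n - 1) * s * padder^(n - 1))  # Vector{Char}, safe for Unicode
    seen = Dict{String,Int}()
    features = Tuple{String,Int}[]
    for i in 1:(length(padded) - n + 1)
        gram = join(padded[i:i+n-1])
        seen[gram] = get(seen, gram, 0) + 1
        push!(features, (gram, seen[gram]))
    end
    return features
end

# Dice coefficient over two feature lists whose entries are already unique
dice(x, y) = 2 * length(intersect(x, y)) / (length(x) + length(y))

foo = char_ngrams("foo", 2, " ")    # 4 features: " f", "fo", "oo", "o "
fooo = char_ngrams("fooo", 2, " ")  # 5 features: "oo" occurs twice

println(dice(foo, fooo))
```

With 4 shared features out of 4 and 5, the Dice score is 2 * 4 / (4 + 5) = 8/9, which clears the `α=0.8` threshold used in the query.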

extras/examples.jl

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+using SimString
+using Faker
+using BenchmarkTools
+using DataStructures
+
+################################# Benchmark Bulk addition #####################
+db = DictDB(CharacterNGrams(3, " "));
+Faker.seed(2020)
+@time fake_names = [string(Faker.first_name(), " ", Faker.last_name()) for i in 1:100_000];
+
+
+f(d, x) = append!(d, x)
+@time f(db, fake_names)
+
+
+
+################################ Simple Addition ###############################
+
+db = DictDB(CharacterNGrams(2, " "));
+push!(db, "foo");
+push!(db, "bar");
+push!(db, "fooo");
+
+f(x, c, s) = search(x, c, s)
+test = "foo";
+col = db;
+sim = Cosine();
+
+f(Cosine(), db, "foo")
+
+@btime f($sim, $col, $test)
+@btime search(Cosine(), db, "foo"; α=0.8, ranked=true)
+
+
+
+db2 = DictDB(CharacterNGrams(3, " "));
+append!(db2, ["foo", "bar", "fooo", "foor"]) # also works via multiple dispatch on a vector
+
+results = search(Cosine(), db, "foo"; α=0.8, ranked=true) # yet to be implemented
+
+bs = ["foo", "bar", "foo", "foo", "bar"]
+SimString.extract_features(CharacterNGrams(3, " "), "prepress")
+SimString.extract_features(WordNGrams(2, " ", " "), "You are a really really really cool dude.")
+
+db = DictDB(WordNGrams(2, " ", " "))
+push!(db, "You are a really really really cool dude.")
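
For readers unfamiliar with word-level n-grams, the last lines of this script can be approximated by the following sketch. `word_ngrams` is a hypothetical stand-in and is not `SimString.extract_features`; the padding scheme (`n - 1` padder tokens on each side) is an assumption, and the package's actual output format may differ.

```julia
# Hypothetical helper for illustration; not the package's extract_features.
# Splits on `splitter`, pads the word list, and slides a window of `n` words.
function word_ngrams(s::AbstractString, n::Int, padder::AbstractString, splitter::AbstractString)
    words = String.(split(s, splitter; keepempty=false))
    padded = vcat(fill(String(padder), n - 1), words, fill(String(padder), n - 1))
    return [Tuple(padded[i:i+n-1]) for i in 1:(length(padded) - n + 1)]
end

grams = word_ngrams("You are a really really really cool dude.", 2, " ", " ")
foreach(println, grams)
```

Note that `("really", "really")` appears twice in the result, which is why a feature extractor feeding a set-based measure needs some way to keep repeated n-grams apart.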

extras/py_benchmarks.py

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
+from simstring.measure.cosine import CosineMeasure
+from simstring.database.dict import DictDatabase
+from simstring.searcher import Searcher
+from faker import Faker
+
+db = DictDatabase(CharacterNgramFeatureExtractor(3))
+
+fake = Faker()
+fake_names = [fake.name() for i in range(100_000)]
+
+def f(x):
+    for i in x:
+        db.add(i)
+
+# %time f(fake_names)

src/SimString.jl

Lines changed: 25 additions & 1 deletion
@@ -1,5 +1,29 @@
 module SimString

-# Write your package code here.
+import Base: push!, append!
+using DataStructures: DefaultOrderedDict, DefaultDict
+# using ProgressMeter
+# using CircularArrays
+# using OffsetArrays
+
+######### Import modules & utils ################
+include("db_collection.jl")
+include("dictdb.jl")
+include("features.jl")
+include("measures.jl")
+include("search.jl")
+
+
+
+####### Global export of user API #######
+export Dice, Jaccard, Cosine, Overlap,
+        AbstractSimStringDB, DictDB,
+        CharacterNGrams, WordNGrams,
+        search
+
+
+
+
+

 end

src/db_collection.jl

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# Custom Collections
+
+"""
+Base type for all custom db collections.
+"""
+abstract type AbstractSimStringDB end
+
+
+"""
+Abstract type for feature extraction structs
+"""
+abstract type FeatureExtractor end
+
+
+# Feature Extraction Definitions
+
+"""
+Feature extraction on character-level ngrams
+"""
+struct CharacterNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
+    n::T1 # number of n-grams to extract
+    padder::T2 # string to use to pad n-grams
+end
+
+
+"""
+Feature extraction based on word-level ngrams
+"""
+struct WordNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
+    n::T1 # number of n-grams to extract
+    padder::T2 # string to use to pad n-grams
+    splitter::T2 # string to use to split words
+end
+
+
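
To show how these parametric structs behave when constructed, here is a minimal standalone sketch. The type definitions are copied from the diff above so the snippet runs without the package; `describe` is a hypothetical helper added only to demonstrate dispatch on the `FeatureExtractor` subtypes and does not exist in SimString.

```julia
abstract type FeatureExtractor end

struct CharacterNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
    n::T1
    padder::T2
end

struct WordNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor
    n::T1
    padder::T2
    splitter::T2
end

# Hypothetical helper (not in the package): one method per extractor type
describe(fe::CharacterNGrams) = "character $(fe.n)-grams"
describe(fe::WordNGrams) = "word $(fe.n)-grams split on $(repr(fe.splitter))"

char_fe = CharacterNGrams(2, " ")   # type parameters are inferred from the arguments
word_fe = WordNGrams(2, " ", " ")

println(typeof(char_fe))
println(describe(word_fe))
```

Because `padder` and `splitter` share the single type parameter `T2`, the default `WordNGrams` constructor only accepts two strings of the same concrete type; passing a `String` together with a `SubString`, for instance, raises a `MethodError`.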
