Commit 6f71152

authored

Initial alpha version (#1)

Initial alpha version

1 parent da14c81 commit 6f71152

18 files changed: +1041 -15 lines changed

`.github/workflows/CI.yml`

Lines changed: 3 additions & 10 deletions

Original file line number	Diff line number	Diff line change
`@@ -1,10 +1,7 @@`
`1`	`1`	`name: CI`
`2`	`2`	`on:`
`3`		`- push:`
`4`		`- branches:`
`5`		`- - main`
`6`		`- tags: '*'`
`7`		`- pull_request:`
	`3`	`+ - push`
	`4`	`+ - pull_request`
`8`	`5`	`jobs:`
`9`	`6`	`test:`
`10`	`7`	`name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}`
`@@ -13,19 +10,15 @@ jobs:`
`13`	`10`	`fail-fast: false`
`14`	`11`	`matrix:`
`15`	`12`	`version:`
`16`		`- - '1.0'`
`17`	`13`	`- '1.6'`
	`14`	`+ - '1.7'`
`18`	`15`	`- 'nightly'`
`19`	`16`	`os:`
`20`	`17`	`- ubuntu-latest`
`21`	`18`	`- macOS-latest`
`22`	`19`	`- windows-latest`
`23`	`20`	`arch:`
`24`	`21`	`- x64`
`25`		`- - x86`
`26`		`- exclude:`
`27`		`- - os: macOS-latest`
`28`		`- arch: x86`
`29`	`22`	`steps:`
`30`	`23`	`- uses: actions/checkout@v2`
`31`	`24`	`- uses: julia-actions/setup-julia@v1`

`.gitignore`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -3,3 +3,4 @@`
`3`	`3`	`*.jl.mem`
`4`	`4`	`/Manifest.toml`
`5`	`5`	`/docs/build/`
	`6`	`+.vscode`

`Project.toml`

Lines changed: 8 additions & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -3,11 +3,18 @@ uuid = "2e3c4037-312d-4650-b9c0-fcd0fc09aae4"`
`3`	`3`	`authors = ["Bernard Brenyah"]`
`4`	`4`	`version = "0.1.0"`
`5`	`5`
	`6`	`+[deps]`
	`7`	`+CircularArrays = "7a955b69-7140-5f4e-a0ed-f168c5e2e749"`
	`8`	`+DataStructures = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"`
	`9`	`+OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"`
	`10`	`+ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"`
	`11`	`+`
`6`	`12`	`[compat]`
`7`	`13`	`julia = "1"`
`8`	`14`
`9`	`15`	`[extras]`
	`16`	`+Faker = "0efc519c-db33-5916-ab87-703215c3906f"`
`10`	`17`	`Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"`
`11`	`18`
`12`	`19`	`[targets]`
`13`		`-test = ["Test"]`
	`20`	`+test = ["Test", "Faker"]`

`README.md`

Lines changed: 42 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -6,3 +6,45 @@`
`6`	`6`	`[![Coverage](https://codecov.io/gh/PyDataBlog/SimString.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/PyDataBlog/SimString.jl)`
`7`	`7`	`[![Code Style: Blue](https://img.shields.io/badge/code%20style-blue-4495d1.svg)](https://github.com/invenia/BlueStyle)`
`8`	`8`	`[![ColPrac: Contributor's Guide on Collaborative Practices for Community Packages](https://img.shields.io/badge/ColPrac-Contributor's%20Guide-blueviolet)](https://github.com/SciML/ColPrac)`
	`9`	`+`
	`10`	`+A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.`
	`11`	`+This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large amounts of text). Currently, this package supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.`
	`12`	`+`
	`13`	`+## Features`
	`14`	`+`
	`15`	`+- [X] Fast algorithm for string matching`
	`16`	`+- [X] 100% exact retrieval`
	`17`	`+- [X] Unicode support`
	`18`	`+- [ ] Custom user defined feature generation methods`
	`19`	`+- [ ] Mecab-based tokenizer support`
	`20`	`+`
	`21`	`+## Supported String Similarity Measures`
	`22`	`+`
	`23`	`+- [X] Dice coefficient`
	`24`	`+- [X] Jaccard coefficient`
	`25`	`+- [X] Cosine coefficient`
	`26`	`+- [X] Overlap coefficient`
	`27`	`+`
	`28`	`+## Installation`
	`29`	`+`
	`30`	`+You can grab the latest stable version of this package from the Julia registries by running:`
	`31`	`+`
	`32`	+NB: Don't forget to invoke Julia's package manager with `]`
	`33`	`+`
	`34`	+```julia
	`35`	`+pkg> add SimString`
	`36`	+```
	`37`	`+`
	`38`	+The brave few can grab the current experimental features by adding the master branch to their development environment after invoking the package manager with `]`:
	`39`	`+`
	`40`	+```julia
	`41`	`+pkg> add SimString#master`
	`42`	+```
	`43`	`+`
	`44`	`+You are good to go with bleeding edge features and breakages!`
	`45`	`+`
	`46`	`+To revert to a stable version, you can simply run:`
	`47`	`+`
	`48`	+```julia
	`49`	`+pkg> free SimString`
	`50`	+```
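
For readers unfamiliar with the four measures listed in the README, the sketch below gives their textbook set-based definitions in plain Julia. The function names `dice_coeff`, `jaccard_coeff`, `cosine_coeff` and `overlap_coeff` are illustrative and are not part of the SimString.jl API; the package's own implementation may differ, for example in how it treats empty feature sets.

```julia
# Textbook set-based similarity coefficients (illustrative, not SimString.jl code).
# X and Y are collections of distinct features, e.g. sets of n-grams.
dice_coeff(X, Y)    = 2 * length(intersect(X, Y)) / (length(X) + length(Y))
jaccard_coeff(X, Y) = length(intersect(X, Y)) / length(union(X, Y))
cosine_coeff(X, Y)  = length(intersect(X, Y)) / sqrt(length(X) * length(Y))
overlap_coeff(X, Y) = length(intersect(X, Y)) / min(length(X), length(Y))

# Two small feature sets sharing the features "bc" and "cd".
X = Set(["ab", "bc", "cd"])
Y = Set(["bc", "cd", "de", "ef"])

println(dice_coeff(X, Y))     # 2*2 / (3 + 4)
println(jaccard_coeff(X, Y))  # 2 / 5
println(cosine_coeff(X, Y))   # 2 / sqrt(3 * 4)
println(overlap_coeff(X, Y))  # 2 / min(3, 4)
```

All four share the same numerator, the size of the intersection, and differ only in how they normalise it.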

`docs/src/index.md`

Lines changed: 70 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -6,6 +6,76 @@ CurrentModule = SimString`
`6`	`6`
`7`	`7`	`Documentation for [SimString](https://github.com/PyDataBlog/SimString.jl).`
`8`	`8`
	`9`	`+A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.`
	`10`	`+This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large amounts of text). Currently, this package supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.`
	`11`	`+`
	`12`	`+## Features`
	`13`	`+`
	`14`	`+- [X] Fast algorithm for string matching`
	`15`	`+- [X] 100% exact retrieval`
	`16`	`+- [X] Unicode support`
	`17`	`+- [ ] Custom user defined feature generation methods`
	`18`	`+- [ ] Mecab-based tokenizer support`
	`19`	`+`
	`20`	`+## Supported String Similarity Measures`
	`21`	`+`
	`22`	`+- [X] Dice coefficient`
	`23`	`+- [X] Jaccard coefficient`
	`24`	`+- [X] Cosine coefficient`
	`25`	`+- [X] Overlap coefficient`
	`26`	`+`
	`27`	`+## Installation`
	`28`	`+`
	`29`	`+You can grab the latest stable version of this package from the Julia registries by running:`
	`30`	`+`
	`31`	+NB: Don't forget to invoke Julia's package manager with `]`
	`32`	`+`
	`33`	+```julia
	`34`	`+pkg> add SimString`
	`35`	+```
	`36`	`+`
	`37`	+The brave few can grab the current experimental features by adding the master branch to their development environment after invoking the package manager with `]`:
	`38`	`+`
	`39`	+```julia
	`40`	`+pkg> add SimString#master`
	`41`	+```
	`42`	`+`
	`43`	`+You are good to go with bleeding edge features and breakages!`
	`44`	`+`
	`45`	`+To revert to a stable version, you can simply run:`
	`46`	`+`
	`47`	+```julia
	`48`	`+pkg> free SimString`
	`49`	+```
	`50`	`+`
	`51`	`+## Usage`
	`52`	`+`
	`53`	+```julia
	`54`	`+using SimString`
	`55`	`+`
	`56`	`+# Initialise the database and add some strings`
	`57`	`+db = DictDB(CharacterNGrams(2, " "));`
	`58`	`+push!(db, "foo");`
	`59`	`+push!(db, "bar");`
	`60`	`+push!(db, "fooo");`
	`61`	`+`
	`62`	+# A convenient approach is to use an array of strings for multiple entries: `append!(db, ["foo", "bar", "fooo"]);`
	`63`	`+`
	`64`	`+# Retrieve the closest match(es)`
	`65`	`+res = search(Dice(), db, "foo"; α=0.8, ranked=true)`
	`66`	`+# 2-element Vector{Tuple{String, Float64}}:`
	`67`	`+# ("foo", 1.0)`
	`68`	`+# ("fooo", 0.8888888888888888)`
	`69`	`+`
	`70`	`+`
	`71`	+```
	`72`	`+`
	`73`	`+## TODO: Benchmarks`
	`74`	`+`
	`75`	`+## Release History`
	`76`	`+`
	`77`	`+- 0.1.0 Initial release.`
	`78`	`+`
`9`	`79`	```@index
`10`	`80`	```
`11`	`81`
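
The Dice score reported for "fooo" in the usage example above can be reproduced by hand. The sketch below assumes character bigrams padded with one copy of the padder on each side, with repeated n-grams kept distinct by an occurrence counter. `char_ngrams` and `dice_coeff` are hypothetical helpers written for this illustration, not SimString.jl functions, and the package may extract features differently.

```julia
# Hypothetical helper (not SimString.jl's implementation): pad the string with
# n-1 copies of `padder` on each side, slide a window of n characters, and tag
# each n-gram with its occurrence count so that duplicates stay distinct.
function char_ngrams(s::AbstractString, n::Int, padder::AbstractString=" ")
    chars = collect(padder^(n - 1) * s * padder^(n - 1))
    counts = Dict{String,Int}()
    feats = Tuple{String,Int}[]
    for i in 1:(length(chars) - n + 1)
        gram = String(chars[i:(i + n - 1)])
        counts[gram] = get(counts, gram, 0) + 1
        push!(feats, (gram, counts[gram]))
    end
    return feats
end

# Textbook Dice coefficient over two feature collections.
dice_coeff(X, Y) = 2 * length(intersect(X, Y)) / (length(X) + length(Y))

query = char_ngrams("foo", 2)   # " f", "fo", "oo", "o "        -> 4 features
cand  = char_ngrams("fooo", 2)  # " f", "fo", "oo", "oo", "o "  -> 5 features

println(dice_coeff(query, cand))  # 2*4 / (4 + 5) = 8/9
```

Under these assumptions the two strings share four features, so the score is 8/9, which agrees with the value shown in the usage example.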

`extras/examples.jl`

Lines changed: 46 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,46 @@`
	`1`	`+using SimString`
	`2`	`+using Faker`
	`3`	`+using BenchmarkTools`
	`4`	`+using DataStructures`
	`5`	`+`
	`6`	`+################################# Benchmark Bulk addition #####################`
	`7`	`+db = DictDB(CharacterNGrams(3, " "));`
	`8`	`+Faker.seed(2020)`
	`9`	`+@time fake_names = [string(Faker.first_name(), " ", Faker.last_name()) for i in 1:100_000];`
	`10`	`+`
	`11`	`+`
	`12`	`+f(d, x) = append!(d, x)`
	`13`	`+@time f(db, fake_names)`
	`14`	`+`
	`15`	`+`
	`16`	`+`
	`17`	`+################################ Simple Addition ###############################`
	`18`	`+`
	`19`	`+db = DictDB(CharacterNGrams(2, " "));`
	`20`	`+push!(db, "foo");`
	`21`	`+push!(db, "bar");`
	`22`	`+push!(db, "fooo");`
	`23`	`+`
	`24`	`+f(x, c, s) = search(x, c, s)`
	`25`	`+test = "foo";`
	`26`	`+col = db;`
	`27`	`+sim = Cosine();`
	`28`	`+`
	`29`	`+f(Cosine(), db, "foo")`
	`30`	`+`
	`31`	`+@btime f($sim, $col, $test)`
	`32`	`+@btime search(Cosine(), db, "foo"; α=0.8, ranked=true)`
	`33`	`+`
	`34`	`+`
	`35`	`+`
	`36`	`+db2 = DictDB(CharacterNGrams(3, " "));`
	`37`	`+append!(db2, ["foo", "bar", "fooo", "foor"]) # also works via multiple dispatch on a vector`
	`38`	`+`
	`39`	`+results = search(Cosine(), db, "foo"; α=0.8, ranked=true) # yet to be implemented`
	`40`	`+`
	`41`	`+bs = ["foo", "bar", "foo", "foo", "bar"]`
	`42`	`+SimString.extract_features(CharacterNGrams(3, " "), "prepress")`
	`43`	`+SimString.extract_features(WordNGrams(2, " ", " "), "You are a really really really cool dude.")`
	`44`	`+`
	`45`	`+db = DictDB(WordNGrams(2, " ", " "))`
	`46`	`+push!(db, "You are a really really really cool dude.")`

`extras/py_benchmarks.py`

Lines changed: 16 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,16 @@`
	`1`	`+from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor`
	`2`	`+from simstring.measure.cosine import CosineMeasure`
	`3`	`+from simstring.database.dict import DictDatabase`
	`4`	`+from simstring.searcher import Searcher`
	`5`	`+from faker import Faker`
	`6`	`+`
	`7`	`+db = DictDatabase(CharacterNgramFeatureExtractor(3))`
	`8`	`+`
	`9`	`+fake = Faker()`
	`10`	`+fake_names = [fake.name() for i in range(100_000)]`
	`11`	`+`
	`12`	`+def f(x):`
	`13`	`+    for i in x:`
	`14`	`+        db.add(i)`
	`15`	`+`
	`16`	`+# %time f(fake_names)`

`src/SimString.jl`

Lines changed: 25 additions & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -1,5 +1,29 @@`
`1`	`1`	`module SimString`
`2`	`2`
`3`		`-# Write your package code here.`
	`3`	`+import Base: push!, append!`
	`4`	`+using DataStructures: DefaultOrderedDict, DefaultDict`
	`5`	`+# using ProgressMeter`
	`6`	`+# using CircularArrays`
	`7`	`+# using OffsetArrays`
	`8`	`+`
	`9`	`+######### Import modules & utils ################`
	`10`	`+include("db_collection.jl")`
	`11`	`+include("dictdb.jl")`
	`12`	`+include("features.jl")`
	`13`	`+include("measures.jl")`
	`14`	`+include("search.jl")`
	`15`	`+`
	`16`	`+`
	`17`	`+`
	`18`	`+####### Global export of user API #######`
	`19`	`+export Dice, Jaccard, Cosine, Overlap,`
	`20`	`+ AbstractSimStringDB, DictDB,`
	`21`	`+ CharacterNGrams, WordNGrams,`
	`22`	`+ search`
	`23`	`+`
	`24`	`+`
	`25`	`+`
	`26`	`+`
	`27`	`+`
`4`	`28`
`5`	`29`	`end`

`src/db_collection.jl`

Lines changed: 35 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,35 @@`
	`1`	`+# Custom Collections`
	`2`	`+`
	`3`	`+"""`
	`4`	`+Base type for all custom db collections.`
	`5`	`+"""`
	`6`	`+abstract type AbstractSimStringDB end`
	`7`	`+`
	`8`	`+`
	`9`	`+"""`
	`10`	`+Abstract type for feature extraction structs`
	`11`	`+"""`
	`12`	`+abstract type FeatureExtractor end`
	`13`	`+`
	`14`	`+`
	`15`	`+# Feature Extraction Definitions`
	`16`	`+`
	`17`	`+"""`
	`18`	`+Feature extraction on character-level ngrams`
	`19`	`+"""`
	`20`	`+struct CharacterNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor`
	`21`	`+    n::T1 # size of each n-gram (the n in n-gram)`
	`22`	`+    padder::T2 # string to use to pad n-grams`
	`23`	`+end`
	`24`	`+`
	`25`	`+`
	`26`	`+"""`
	`27`	`+Feature extraction based on word-level ngrams`
	`28`	`+"""`
	`29`	`+struct WordNGrams{T1<:Int, T2<:AbstractString} <: FeatureExtractor`
	`30`	`+    n::T1 # size of each n-gram (the n in n-gram)`
	`31`	`+    padder::T2 # string to use to pad n-grams`
	`32`	`+    splitter::T2 # string to use to split words`
	`33`	`+end`
	`34`	`+`
	`35`	`+`
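
To illustrate how the type parameters on these structs constrain construction, here is a small sketch. `WordNGramsSketch` is a renamed copy of the `WordNGrams` definition above, used only so the example runs on its own. Because `padder` and `splitter` share the parameter `T2`, the default constructor expects both to have the same string type, and because `T1<:Int` is bounded by a concrete type, `n` has to be an `Int`.

```julia
abstract type FeatureExtractor end

# Renamed copy of the struct from the diff, for illustration only.
struct WordNGramsSketch{T1<:Int, T2<:AbstractString} <: FeatureExtractor
    n::T1          # size of each n-gram
    padder::T2     # string used to pad n-grams
    splitter::T2   # string used to split words
end

# Both string arguments are `String`, so the default constructor applies.
ok = WordNGramsSketch(2, " ", " ")

# Mixing `String` and `SubString{String}` gives no matching method,
# since both fields must share the single type parameter T2.
mixed_fails = try
    WordNGramsSketch(2, " ", SubString("a b", 2, 2))
    false
catch err
    err isa MethodError
end

# A non-`Int` integer is rejected because T1 must be a subtype of `Int`.
wrong_int_fails = try
    WordNGramsSketch(UInt8(2), " ", " ")
    false
catch err
    err isa MethodError
end

println((ok.n, mixed_fails, wrong_int_fails))
```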
