Robust and Fast tokenizations alignment library for Rust and Python

- Demo: demo
- Rust documentation: docs.rs
- Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly

Usage (Python)

Installation

$ pip install -U pip # update pip
$ pip install pytokenizations

Or, install from source

This library uses maturin to build the wheel.

$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build

The wheel is now in the python/target/wheels directory, and you can install it with pip install *whl.

`get_alignments`

def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...

Returns alignment mappings for two different tokenizations:

>>> import tokenizations
>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"]  # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]

a2b[i] is the list of indices of the tokens in tokens_b that align with tokens_a[i]. b2a is the same mapping in the opposite direction, from tokens_b to tokens_a.
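A typical use of these mappings is carrying per-token annotations from one tokenization over to the other. The following is a minimal sketch of that idea, not part of the library: `project_labels` is a hypothetical helper, the label values are made up for illustration, and `a2b` is hard-coded to the value printed in the example above so the snippet runs without the package installed.

```python
from typing import List, Optional, Sequence


def project_labels(
    a2b: Sequence[Sequence[int]], labels_b: Sequence[str]
) -> List[Optional[str]]:
    """Give each token in a the label of its first aligned token in b.

    Tokens in a with no aligned token in b receive None.
    (Hypothetical helper for illustration; not provided by the library.)
    """
    projected: List[Optional[str]] = []
    for aligned in a2b:
        projected.append(labels_b[aligned[0]] if aligned else None)
    return projected


# a2b as printed above for tokens_a = ["å", "BC"] and tokens_b = ["abc"]
a2b = [[0], [0]]
labels_b = ["WORD"]  # one made-up label per token in tokens_b

labels_a = project_labels(a2b, labels_b)
print(labels_a)  # ['WORD', 'WORD']
```

Taking the first aligned index is only one possible policy. When a token in tokens_a aligns with several tokens in tokens_b, a majority vote or a span-based rule may suit the task better.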

Usage (Rust)

See here: docs.rs

Related

- Algorithm overview
- Blog post
- seqdiff is used for the diff process.
- textspan
- explosion/spacy-alignments: 💫 A spaCy package for Yohei Tamura's Rust tokenizations library
  - Python bindings for this library, maintained by Explosion, the makers of spaCy. If you have difficulty installing pytokenizations, try this package instead.
