This repository was archived by the owner on Dec 7, 2023. It is now read-only.

explosion/tokenizations


Robust and Fast tokenizations alignment library for Rust and Python

Demo: demo
Rust document: docs.rs
Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly

Usage (Python)

  • Installation
$ pip install -U pip # update pip
$ pip install pytokenizations
  • Or, install from source

This library uses maturin to build the wheel.

$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build

The wheel is created in the python/target/wheels directory, and you can install it with pip install *.whl.

get_alignments

def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...

Returns alignment mappings for two different tokenizations:

>>> import tokenizations
>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]

a2b[i] is the list of indices in tokens_b that align with tokens_a[i]. Likewise, b2a[j] is the list of indices in tokens_a that align with tokens_b[j].
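
A common use of these mappings is carrying per-token annotations from one tokenization over to the other. The sketch below is only an illustration: `project_labels` is a hypothetical helper, not part of this library, and the `a2b` and `b2a` values are hard-coded from the example above so that the snippet runs without the package installed.

```python
from typing import List, Optional, Sequence


def project_labels(
    labels_src: Sequence[str],
    src2dst: Sequence[Sequence[int]],
    n_dst: int,
) -> List[Optional[str]]:
    """Copy each source label onto the destination positions it aligns with.

    Hypothetical helper for illustration; not part of the tokenizations API.
    """
    labels_dst: List[Optional[str]] = [None] * n_dst
    for i, targets in enumerate(src2dst):
        for j in targets:
            # If several source tokens map to the same destination token,
            # keep the first label that reaches it.
            if labels_dst[j] is None:
                labels_dst[j] = labels_src[i]
    return labels_dst


# Values copied from the get_alignments example above.
tokens_a = ["å", "BC"]
tokens_b = ["abc"]
a2b = [[0], [0]]
b2a = [[0, 1]]

# Project labels from tokens_a onto tokens_b: both source tokens hit index 0.
print(project_labels(["X", "Y"], a2b, len(tokens_b)))  # ['X']

# Project in the other direction: the single token in tokens_b covers both.
print(project_labels(["Z"], b2a, len(tokens_a)))  # ['Z', 'Z']
```

Keeping the first label on a collision is an arbitrary choice made for this sketch. Destination tokens with no aligned source token stay `None`.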

Usage (Rust)

See here: docs.rs

Project website: https://tamuhey.github.io/tokenizations/