Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

phonyland/ngram

Repository files navigation

πŸ§ͺ
N-Gram Tools

This repository contains the N-Gram Tools for πŸ™ƒ Phony Language that includes features like sanitizing, tokenization, n-gram extraction, frequency mapping.

πŸš€ Installation

Requires PHP >= 8.0.

You can install the package via composer:

composer require phonyland/ngram

⌨️ Usage

Tokenizer

Word Tokenization

$tokenizer->tokenize($text);
⌨️ Usage
use Phonyland\NGram\Tokenizer;
use Phonyland\NGram\TokenizerFilter;
$tokenizer = new Tokenizer();
$tokenizer
 ->addWordSeparatorPattern(';')
 ->addWordSeparatorPattern('\s')
 ->addWordFilterRule(TokenizerFilterType::NO_SYMBOLS);
$text = 'sample text;sample;text';
$tokenizer->tokenize($text);
πŸ–₯ Output
[
 "sample",
 "text",
 "sample",
 "text",
];

Sentence Tokenization

$tokenizer->sentences($text);
⌨️ Usage
use Phonyland\NGram\Tokenizer;
$tokenizer = new Tokenizer();
$tokenizer
 ->addSentenceSeparatorPattern('.')
 ->addSentenceSeparatorPattern('!')
 ->addSentenceSeparatorPattern('?');
$text = 'Sample Sentence. Sample Sentence! Sample Sentence? Sample Sentence no. 4?! Sample sample sentence... End';
$tokenizer->sentences($text);
πŸ–₯ Output
[
 "Sample Sentence.",
 "Sample Sentence!",
 "Sample Sentence?",
 "Sample Sentence no.",
 "4?!",
 "Sample sample sentence...",
 "End",
];

Word Tokenization by Sentences

$tokenizer->tokenizeBySentences($text);
⌨️ Usage
use Phonyland\NGram\Tokenizer;
use Phonyland\NGram\TokenizerFilter;
$tokenizer = new Tokenizer();
$tokenizer
 ->addSentenceSeparatorPattern('.')
 ->addSentenceSeparatorPattern('!')
 ->addSentenceSeparatorPattern('?')
 ->addWordFilterRule(TokenizerFilterType::NO_SYMBOLS)
 ->addWordSeparatorPattern(TokenizerFilterType::WHITESPACE_SEPARATOR);
$text = 'Sample Sentence. Sample Sentence! Sample Sentence? Sample Sentence no. 4?! Sample sample sentence... End';
$tokenizer->tokenizeBySentences($text);
πŸ–₯ Output
[
 ["Sample", "Sentence"],
 ["Sample", "Sentence"],
 ["Sample", "Sentence"],
 ["Sample", "Sentence", "no"],
 ["Sample", "sample", "sentence"],
 ["End"],
];

N-Gram

N-Gram Sequence

NGramSequence::multigram($n, $tokens, $isUnique);
NGramSequence::trigram($tokens, $isUnique);
NGramSequence::bigram($tokens, $isUnique);
NGramSequence::unigram($tokens, $isUnique);
⌨️ Usage
use Phonyland\NGram\Tokenizer;
use Phonyland\NGram\NGramSequence;
use Phonyland\NGram\TokenizerFilter;
$tokenizer = new Tokenizer();
$tokenizer->addWordSeparatorPattern(TokenizerFilterType::WHITESPACE_SEPARATOR);
$tokens = $tokenizer->tokenize('sample text');
NGramSequence::multigram(4, $tokens);
// ['samp', 'ampl', 'mple', 'text'];
// Generate Unique N-Grams 
NGramSequence::unigram($tokens, true);
// ['s', 'a', 'm', 'p', 'l', 'e', 't', 'x'];

N-Gram Sequences with Count

NGramCount::multigram(4, $tokens);
NGramCount::trigram($tokens);
NGramCount::bigram($tokens);
NGramCount::unigram($tokens);
NGramCount::incrementElementCount($element, $elements);
⌨️ Usage
use Phonyland\NGram\Tokenizer;
use Phonyland\NGram\NGramCount;
$tokenizer = new Tokenizer();
$tokenizer->addWordSeparatorPattern(TokenizerFilterType::WHITESPACE_SEPARATOR);
$tokens = $tokenizer->tokenize('sample text');
NGramCount::multigram(4, $tokens);
// [
// 'samp' => 1,
// 'ampl' => 1,
// 'mple' => 1,
// 'text' => 1,
// ];

N-Gram Frequency

NGramFrequency::multigram(4, $tokens);
NGramFrequency::multigram($tokens);
NGramFrequency::bigram($tokens);
NGramFrequency::unigram($tokens);
NGramFrequency::frequencyFromCount($countArray);
⌨️ Usage
use Phonyland\NGram\Tokenizer;
use Phonyland\NGram\NGramFrequency;
use Phonyland\NGram\TokenizerFilter;
$tokenizer = new Tokenizer();
$tokenizer->addWordSeparatorPattern(TokenizerFilterType::WHITESPACE_SEPARATOR);
$tokenizer->addWordFilterRule(TokenizerFilterType::ALPHABETICAL);
$tokens = $tokenizer->tokenize('bombadil! bombadillo!');
NGramFrequency::multigram(4, $tokens);
//[
// 'bomb' => 0.16666666666666666,
// 'omba' => 0.16666666666666666,
// 'mbad' => 0.16666666666666666,
// 'badi' => 0.16666666666666666,
// 'adil' => 0.16666666666666666,
// 'dill' => 0.08333333333333333,
// 'illo' => 0.08333333333333333,
//]

πŸ™ƒ

Start generating fake data with πŸ™ƒ Phony Framework,
visit the main Phony Repository .

Explore the docs Β» https://phony.land
Follow us on Twitter Β» @phony_land

πŸ™ƒ Phony
Fake Data Generation Framework

was created by
Yunus Emre DeligΓΆz
under
MIT license .

About

πŸ§ͺ N-Gram Tools for πŸ™ƒ Phony Language that includes features like sanitizing, tokenization, n-gram extraction, frequency mapping.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Contributors

Languages

AltStyle γ«γ‚ˆγ£γ¦ε€‰ζ›γ•γ‚ŒγŸγƒšγƒΌγ‚Έ (->γ‚ͺγƒͺγ‚ΈγƒŠγƒ«) /