- Refactored text statistics functionality
  - Text stats functions now accept a `Doc` as their first positional arg, suitable for use as custom doc extensions (see below)
  - Deprecated the `TextStats` class, since other methods for accessing the underlying functionality were made more accessible and convenient, and there's no longer a need for a third method
- Standardized functionality for getting/setting/removing doc extensions (PR #352)
  - Custom extensions are now set/unset in bulk, for example:
    ```pycon
    >>> import textacy
    >>> from textacy import extract, text_stats
    >>> textacy.set_doc_extensions("extract")
    >>> textacy.set_doc_extensions("text_stats.readability")
    >>> textacy.remove_doc_extensions("extract.matches")
    >>> textacy.make_spacy_doc("This is a test.", "en_core_web_sm")._.flesch_reading_ease()
    118.17500000000001
    ```
  - Moved top-level extensions into `spacier.core` and `extract.bags`
  - Updated `extract` and `text_stats` subpackage extensions to use the new setup, and made them more customizable
- Improved package code, tests, and docs
  - Added a `pytest` conftest file to improve maintainability and consistency of the unit test suite (PR #353)
  - Removed the legacy `setup.py` and switched from `setuptools` to `build` for builds
  - Consolidated package configuration into `pyproject.toml`
  - Removed the extraneous `Makefile`
  - Fixed the `TextStats` docs (PR #331, Issue #334)
  - Fixed loading of ConceptNet data on Windows systems (Issue #345)

Thanks to @austinjp, @scarroll32, @MirkoLenz for their help!
- Refactored and extended text preprocessing (`textacy.preprocessing`)
  - Added functions for normalizing bullet points (`normalize.bullet_points()`), removing HTML tags (`remove.html_tags()`), and removing bracketed contents such as in-line citations (`remove.brackets()`).
  - Added a `make_pipeline()` function for combining multiple preprocessors applied sequentially to input text into a single callable (see the sketch below).
  - Renamed functions to match the new subpackage layout, e.g. `preprocessing.normalize_whitespace()` => `preprocessing.normalize.whitespace()`.
  - Renamed and standardized args, e.g. `replace_with` => `repl`, and `remove.punctuation(text, marks=".?!")` => `remove.punctuation(text, only=[".", "?", "!"])`.
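A minimal sketch of the new `make_pipeline()` usage, chaining a few of the preprocessors named above; the sample text and output are illustrative, not taken from the release notes:

```pycon
>>> from textacy import preprocessing
>>> preproc = preprocessing.make_pipeline(
...     preprocessing.remove.html_tags,
...     preprocessing.remove.brackets,
...     preprocessing.normalize.whitespace,
... )
>>> preproc("<p>Burton   loves cats [1], allegedly.</p>")
'Burton loves cats , allegedly.'
```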
- Refactored and extended information extraction (`textacy.extract`)
  - Consolidated functionality previously spread across the `extract.py` and `text_utils.py` modules and the `ke` subpackage. For the latter two, imports have changed:
    - `from textacy import ke; ke.textrank()` => `from textacy import extract; extract.keyterms.textrank()`
    - `from textacy import text_utils; text_utils.keywords_in_context()` => `from textacy import extract; extract.keywords_in_context()`
  - Added new extraction functions (see the extraction sketch below):
    - `extract.regex_matches()`: For matching regex patterns in a document's text that cross spaCy token boundaries, with various options for aligning matches back to tokens.
    - `extract.acronyms()`: For extracting acronym-like tokens, without looking around for related definitions.
    - `extract.terms()`: For flexibly combining n-grams, entities, and noun chunks into a single collection, with optional deduplication.
  - Improved the handling of "triples" such as Subject-Verb-Objects. Previously, each element had to be a contiguous span, so a sentence like "I did not like the movie." produced an SVO of `("I", "like", "movie")`, which is... misleading. The new approach uses lists of tokens that need not be adjacent; in this case, it produces `(["I"], ["did", "not", "like"], ["movie"])`. For convenience, triple results are all named tuples, so elements may be accessed by name or index (e.g. `svo.subject` == `svo[0]`).
  - Changed `extract.keywords_in_context()` to always yield results, with optional padding of contexts, leaving printing of contexts up to users; also extended it to accept `Doc` or `str` objects as input.
  - Removed the deprecated `extract.pos_regex_matches()` function, which is superseded by the more powerful `extract.token_matches()`.
- Refactored and extended similarity metrics (`textacy.similarity`)
  - Refactored the top-level `similarity.py` module into a subpackage, with metrics split out into categories: edit-, token-, and sequence-based approaches, as well as hybrid metrics.
  - Added several metrics (see the similarity sketch below):
    - edit-based Jaro (`similarity.jaro()`)
    - token-based Cosine (`similarity.cosine()`), Bag (`similarity.bag()`), and Tversky (`similarity.tversky()`)
    - sequence-based Matching Subsequences Ratio (`similarity.matching_subsequences_ratio()`)
    - hybrid Monge-Elkan (`similarity.monge_elkan()`)
  - Removed a couple of metrics in favor of functionality available elsewhere, e.g. Word2Vec+Cosine similarity via spaCy's `Doc.similarity`.
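A rough sketch of the relocated keyterm imports and a couple of the new extractors; the example document and arguments are illustrative, and return values (a list of `(keyterm, score)` pairs, iterables of `Span`s) are noted in comments rather than shown:

```pycon
>>> import textacy
>>> from textacy import extract
>>> doc = textacy.make_spacy_doc(
...     "Natural language processing lets computers analyze human language.",
...     "en_core_web_sm",
... )
>>> extract.keyterms.textrank(doc, topn=3)            # was ke.textrank(); returns (keyterm, score) pairs
>>> list(extract.terms(doc, ngs=2, ents=True, ncs=True))   # bigrams + entities + noun chunks as Spans
>>> list(extract.regex_matches(doc, r"human \w+"))         # regex matches aligned back to token Spans
```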
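And a sketch of the reorganized similarity metrics, one per category; the inputs are illustrative (edit-based metrics compare strings, while token-, sequence-based, and hybrid metrics compare sequences of tokens), and each call returns a score in [0.0, 1.0]:

```pycon
>>> from textacy import similarity
>>> similarity.jaro("language", "languages")                                          # edit-based
>>> similarity.cosine(["natural", "language"], ["language", "models"])                # token-based
>>> similarity.matching_subsequences_ratio(["to", "be", "or", "not"], ["to", "be"])   # sequence-based
>>> similarity.monge_elkan(["natural", "language"], ["language", "model"])            # hybrid
```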
- Refactored and extended document representations (`textacy.representations`)
  - Consolidated and reworked graph-based representations in the `representations.network` module (see the network sketch below)
    - Added a `build_cooccurrence_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes for each unique string and edges to other strings that co-occurred.
    - Added a `build_similarity_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes as top-level elements and edges to all others weighted by pairwise similarity.
    - Removed the obsolete `network.py` module and the duplicative `extract.keyterms.graph_base.py` module.
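A small sketch of the new co-occurrence network builder; the `window_size` arg and the data shown are illustrative assumptions, and the result is a `networkx` graph:

```pycon
>>> from textacy.representations import network
>>> tokenized_docs = [
...     ["natural", "language", "processing"],
...     ["natural", "language", "models"],
... ]
>>> graph = network.build_cooccurrence_network(tokenized_docs, window_size=2)
>>> sorted(graph.nodes)
['language', 'models', 'natural', 'processing']
```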
  - Moved vectorizers from `vsm.vectorizers` to the `representations.vectorizers` module (see the vectorizer sketch at the end of these notes).
    - For both `Vectorizer` and `GroupVectorizer`, applying global inverse document frequency weights is now handled by a single arg, `idf_type: Optional[str]`, rather than a combination of `apply_idf: bool, idf_type: str`; similarly, applying document-length weight normalizations is handled by `dl_type: Optional[str]` instead of `apply_dl: bool, dl_type: str`.
  - Added a `representations.sparse_vec` module for higher-level access to document vectorization via `build_doc_term_matrix()` and `build_grp_term_matrix()` functions, for cases when a single fit+transform is all you need.
- Refactored and improved automatic language identification (`textacy.lang_id`)
  - Moved functionality from the `lang_utils.py` module into a subpackage, and added the primary user interface (`identify_lang()` and `identify_topn_langs()`) as package-level imports.
  - Trained a new `thinc`-based language identification model that's closer to the original CLD3 inspiration, replacing the simpler sklearn-based pipeline.
- Updated the interface with spaCy for v3
  - Restricted `textacy.load_spacy_lang()` to only accept full spaCy language pipeline names or paths, in accordance with v3's removal of pipeline aliases and general tightening-up on this front. Unfortunately, textacy can no longer play fast and loose with automatic language identification => pipeline loading...
  - Extended `textacy.make_spacy_doc()` to accept a `chunk_size` arg that splits input text into chunks, processes each individually, then joins them into a single `Doc`; supersedes `spacier.utils.make_doc_from_text_chunks()`, which is now deprecated.
  - Moved core `Doc` extensions into a top-level `extensions.py` module, and improved/streamlined the collection
    - Reworked `Doc._.to_bag_of_words()` and `Doc._.to_bag_of_terms()`, leveraging related functionality in `extract.words()` and `extract.terms()`
    - Removed redundant extensions:
      - `Doc._.lang` => use `Doc.lang_`
      - `Doc._.tokens` => use `iter(Doc)`
      - `Doc._.n_tokens` => use `len(Doc)`
      - `Doc._.to_terms_list()` => use `extract.terms(doc)` or `Doc._.extract_terms()`
      - `Doc._.to_tagged_text()` => NA; this was an old holdover that's not used in practice anymore
      - `Doc._.to_semantic_network()` => NA; use a function in `textacy.representations.networks`
  - Added `Doc` extensions for `textacy.extract` functions (see above for details, and the doc-extensions sketch at the end of these notes), with most functions having direct analogues; for example, to extract acronyms, use either `textacy.extract.acronyms(doc)` or `doc._.extract_acronyms()`. Keyterm extraction functions share a single extension: `textacy.extract.keyterms.textrank(doc)` <> `doc._.extract_keyterms(method="textrank")`
  - Leveraged spaCy's new `DocBin` for efficiently saving/loading `Doc`s in binary format, with corresponding arg changes in `io.write_spacy_docs()` and `Corpus.save()` + `.load()`
- Updated dependencies
  - Removed the `pyemd` and `srsly` dependencies
  - Updated minimum versions of `numpy` and `scikit-learn`
  - Bumped minimum versions of `cytoolz`, `jellyfish`, `matplotlib`, `pyphen`, and `spacy` (v3.0+ only!)
- Removed the `textacy.export` module, which had functions for exporting spaCy docs into other external formats; this was a soft dependency on gensim and CONLL-U that wasn't enforced or guaranteed, so better to remove.
- Added a `types.py` module for shared types, and used them everywhere. Also added/fixed type annotations throughout the code base.

Many thanks to @timgates42, @datanizing, @8W9aG, @0x2b3bfa0, and @gryBox for submitting PRs, either merged or used as inspiration for my own rework-in-progress.
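As referenced above, a sketch of the reworked vectorizer args; the specific `tf_type`/`idf_type` values here are illustrative choices, not requirements:

```pycon
>>> from textacy.representations.vectorizers import Vectorizer
>>> tokenized_docs = [
...     ["natural", "language", "processing"],
...     ["natural", "language", "models"],
... ]
>>> vectorizer = Vectorizer(tf_type="linear", idf_type="smooth", dl_type=None)
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> doc_term_matrix.shape   # (n docs, n unique terms)
(2, 4)
```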
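And a sketch of the `chunk_size` arg plus the doc-extension analogues described above, combined with the bulk extension setup from the newer notes; the `topn` and bag-of-words args shown are illustrative assumptions:

```pycon
>>> import textacy
>>> textacy.set_doc_extensions("extract")   # registers the doc._.extract_* analogues
>>> doc = textacy.make_spacy_doc(
...     "Burton read the book, but he did not like the movie.",
...     "en_core_web_sm",
...     chunk_size=100_000,   # chunked processing only kicks in for very long texts
... )
>>> doc._.extract_keyterms(method="textrank", topn=3)   # == extract.keyterms.textrank(doc, topn=3)
>>> doc._.to_bag_of_words(by="lemma_", weighting="count")
```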