sd44/GlossaryGenerator


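Personal utilities

Glossary generator: generator.py. See the usage section below.
Generates an xlsx file containing each word's base form, phonetic transcription, Chinese definition, and so on from a word list. See genxlsx_from_wordstxt.py for details; it is thoroughly commented.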

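Glossary generator

Forked from https://github.com/rfg1024/GlossaryGenerator

With large AI models advancing so quickly, this script has almost no technical edge over a locally deployed large model or a paid (VIP) one, haha.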

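How it works

Processing now uses the spaCy library.

Read the text of a novel and strip plurals, tenses, and similar inflections (lemmatisation) to obtain the novel's vocabulary.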
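A minimal sketch of the lemmatisation step, not the code in generator.py. The function names `lemma_counts` and `lemmatize_text` are hypothetical; the spaCy calls (`spacy.load`, `token.lemma_`, `token.is_alpha`) are standard spaCy API, and the sketch assumes the `en_core_web_sm` model is installed.

```python
from collections import Counter


def lemma_counts(tokens):
    """Count lowercased lemmas from (lemma, is_alpha) pairs, skipping non-words."""
    counts = Counter()
    for lemma, is_alpha in tokens:
        if is_alpha:  # drop punctuation, numbers, and other non-alphabetic tokens
            counts[lemma.lower()] += 1
    return counts


def lemmatize_text(text, model="en_core_web_sm"):
    """Hypothetical helper: lemmatise `text` with spaCy and count each lemma."""
    import spacy  # imported lazily so lemma_counts works without spaCy installed

    # The parser and NER components are not needed for lemmas, so skip them.
    nlp = spacy.load(model, disable=["parser", "ner"])
    doc = nlp(text)
    return lemma_counts((tok.lemma_, tok.is_alpha) for tok in doc)
```

Calling `lemmatize_text("The cats were running.")` should return a `Counter` keyed by base forms rather than the inflected words in the text.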

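Compare that vocabulary with a common (high-frequency) word list from dict and drop the top num words of that list, which leaves a list of words you probably do not know.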
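The filtering step can be sketched as follows. This is an illustration rather than the script's actual implementation; the function name `unknown_words` is hypothetical, and the handling of `num` follows the special values documented in the command-line help (-1 excludes every word in the list, 0 excludes none).

```python
def unknown_words(book_lemmas, frequency_list, num=-1):
    """Return the book's lemmas that are not among the first `num` listed words.

    frequency_list is assumed to be ordered from most to least frequent.
    num = -1 excludes every word in the list; num = 0 excludes nothing.
    """
    if num < 0:
        known = set(frequency_list)
    else:
        known = set(frequency_list[:num])
    return sorted(word for word in set(book_lemmas) if word not in known)


freq = ["the", "be", "run", "cat"]
lemmas = ["the", "cat", "quixotic", "run"]
print(unknown_words(lemmas, freq, num=2))  # words outside the top 2 entries
```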

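The bundled word lists are:

 - `COCA20000.txt`: 20,000 words from the COCA corpus, sorted by frequency
 - `collins.txt`: 14,148 words from the Collins corpus, sorted by frequency
 - `common30k.txt`: 30,000 common words, sorted by frequency
 - `middleschool1600.txt`: 1,600 words for Chinese junior middle school, sorted alphabetically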
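A sketch of loading one of these lists in file order. The file layout is an assumption: the README does not describe it, so this takes the first whitespace-separated token on each non-empty line as the word. `load_frequency_list` is a hypothetical name, not a function from generator.py.

```python
from pathlib import Path


def load_frequency_list(path):
    """Read a word list, keeping file order and dropping duplicate words.

    Assumption: one entry per line, with the word as the first token.
    """
    words = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0].lower()
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words
```

Keeping the original order matters because the "first num words" cut-off only has a frequency meaning for the lists that are sorted by frequency.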

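Once generated, the glossary can be imported into apps such as GoldenDict or Eudic (欧陆词典) for a quick preview, which can greatly improve the experience of reading books in the original English.

Python dependencies

See pyproject.toml. Downloading en-core-web-sm may require a proxy or VPN.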

dependencies = [
 "textract-py3>=2.1.0",
 "spacy>=3.8.3",
 "en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl",
]
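A small sketch for checking whether these dependencies are importable before running the script. It uses only the standard library. The import names are assumptions on my part: `textract-py3` is assumed to install a module named `textract`, and the model wheel a module named `en_core_web_sm`. `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed in pyproject.toml.
REQUIRED_MODULES = ("textract", "spacy", "en_core_web_sm")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names of top-level modules that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```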

Supported formats

txt
pdf (text-based)
epub
doc/docx
csv
xls
xlsx

Non-txt files take somewhat longer to process, and support for the other formats may be unreliable; I have not tested many files.
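One plausible way to handle the formats above is to read txt files directly and hand everything else to textract, which is why the other formats are slower. This is a sketch under that assumption, not the script's actual code; `read_text` is a hypothetical name, while `textract.process` is the library's documented entry point and returns bytes.

```python
from pathlib import Path


def read_text(path):
    """Return the text of a file, using textract for non-txt formats."""
    path = Path(path)
    if path.suffix.lower() == ".txt":
        # Plain text needs no conversion, so this branch is the fast one.
        return path.read_text(encoding="utf-8", errors="replace")

    import textract  # imported lazily; only needed for pdf, epub, docx, etc.

    return textract.process(str(path)).decode("utf-8", errors="replace")
```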

Usage

Command line

generator.py -h

usage: generator [-h] -f FILENAME [-d DICT_EXCLUDE] [-n NUM]
generator text glossaries
options:
 -h, --help show this help message and exit
 -f FILENAME, --filename FILENAME
 The text filename (default: None)
 -d DICT_EXCLUDE, --dict-exclude DICT_EXCLUDE
 Exclude the words from the dictionary (default:
 dict/middleschool1600.txt)
 -n NUM, --num NUM Exclude the first n words from the dictionary。Special
 Value:-1, All the words; 0, None of all (default: -1)
https://github.com/sd44/generator
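For reference, an argparse parser that would produce help text like the above can be sketched as follows. It is reconstructed from the help output only and may differ from what generator.py actually contains; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Rebuild a parser matching the options shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="generator",
        description="generator text glossaries",
        epilog="https://github.com/sd44/generator",
        # This formatter appends "(default: ...)" to each help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-f", "--filename", required=True,
                        help="The text filename")
    parser.add_argument("-d", "--dict-exclude",
                        default="dict/middleschool1600.txt",
                        help="Exclude the words from the dictionary")
    parser.add_argument("-n", "--num", type=int, default=-1,
                        help="Exclude the first n words from the dictionary")
    return parser


# Example: exclude the 5000 most frequent COCA words from a book's vocabulary.
args = build_parser().parse_args(
    ["-f", "books/novel.txt", "-d", "dict/COCA20000.txt", "-n", "5000"]
)
```

The path `books/novel.txt` is a placeholder; substitute a real file when invoking the script.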

As a function

See generator.py, which is thoroughly commented.

About

A vocabulary notebook generator for books in the original English
