G2-star/docx-knowledge-builder

DOCX Knowledge Base Builder

English | 简体中文

A tool for converting DOCX technical documents into LLM training datasets. It parses document structure and generates Q&A pairs, so you can build a knowledge base from your technical documentation.

🌟 Features

📄 Smart DOCX document structure parsing
🤖 Automatic Q&A pair generation
🔄 Multiple output formats (Alpaca, Conversation)
📦 Batch processing support
✅ Data quality validation
📊 Document structure analysis

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/G2-star/docx-knowledge-builder.git
cd docx-knowledge-builder
# Install dependencies
pip install -r requirements.txt

Basic Usage

Place your DOCX files in the project root directory
Run the extraction script:

python run_extraction.py

Check the generated data in the training_data/ directory
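The check in the last step can also be done by hand. The following is a minimal sketch of such a pass over the Alpaca output, not a copy of what check_data.py does: the helper names `load_records` and `find_invalid_records` are hypothetical, and the rules (all three keys present, non-empty instruction and output) are assumptions based on the Alpaca example in this README.

```python
import json
from pathlib import Path

# Keys shown in the Alpaca example in this README.
REQUIRED_KEYS = ("instruction", "input", "output")


def load_records(path):
    """Load a JSON array of records, e.g. from
    training_data/combined_training_data_alpaca.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_invalid_records(records):
    """Return (index, reason) pairs for records that fail basic checks.

    The checks are illustrative assumptions, not the rules in check_data.py.
    """
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not an object"))
            continue
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            problems.append((i, "missing keys: " + ", ".join(missing)))
            continue
        if not str(rec["instruction"]).strip():
            problems.append((i, "empty instruction"))
        if not str(rec["output"]).strip():
            problems.append((i, "empty output"))
    return problems
```

To use it, pass the result of `load_records("training_data/combined_training_data_alpaca.json")` to `find_invalid_records` and review any pairs it returns.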

📁 Project Structure

.
├── docx_knowledge_extractor.py   # Core extractor
├── run_extraction.py             # Main script
├── check_data.py                 # Data quality checker
├── requirements.txt              # Dependencies
├── README.md                     # This file
└── training_data/                # Output directory
    ├── combined_training_data_alpaca.json
    ├── combined_training_data_conversation.json
    └── *_structure.json

🔧 Advanced Usage

Single File Processing

python docx_knowledge_extractor.py -i "document.docx" -o output_dir

Batch Processing

python docx_knowledge_extractor.py -i documents_folder -o training_data --batch
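Before a batch run, it can help to preview which files in a folder would be picked up. The sketch below is an assumption about a reasonable selection rule, not the extractor's actual logic: `find_docx_files` and `build_commands` are hypothetical helpers, and skipping `~$` files reflects the temporary lock files Word creates while a document is open. The generated command lines reuse only the `-i` and `-o` flags from the single-file form shown above.

```python
from pathlib import Path


def find_docx_files(folder):
    """Return the .docx files directly inside `folder`, sorted by path.

    Files whose names start with "~$" are skipped: Word creates these as
    temporary lock files, and they are not valid DOCX documents.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")
    )


def build_commands(folder, out_dir):
    """Build one single-file command line per document (nothing is executed)."""
    return [
        ["python", "docx_knowledge_extractor.py", "-i", str(p), "-o", str(out_dir)]
        for p in find_docx_files(folder)
    ]
```

Printing the result of `build_commands("documents_folder", "training_data")` gives a dry-run view of the work, one command per document.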

📊 Output Formats

Alpaca Format

[
  {
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is..."
  }
]

Conversation Format

[
  {
    "conversations": [
      {"from": "human", "value": "What is the main content?"},
      {"from": "gpt", "value": "The main content is..."}
    ]
  }
]
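The two formats carry the same question and answer, so a record in one can be mapped to the other. Here is a minimal sketch of that mapping, assuming the field names shown in the examples. `alpaca_to_conversation` is a hypothetical helper, not part of the project, and appending a non-empty "input" to the instruction on a new line is an assumption about how the two fields would be combined.

```python
def alpaca_to_conversation(record):
    """Convert one Alpaca-style record into the conversation layout."""
    prompt = record["instruction"]
    # Assumption: a non-empty "input" is appended to the instruction on a new line.
    if record.get("input"):
        prompt = prompt + "\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }


# The Alpaca example from this README.
alpaca = [
    {
        "instruction": "What is the main content?",
        "input": "",
        "output": "The main content is...",
    }
]
converted = [alpaca_to_conversation(r) for r in alpaca]
```

Applied to the Alpaca example, this yields the same structure as the Conversation Format example above.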

📚 Supported Document Types

Technical Specifications
Construction Plans
Quality Management Plans
Safety Management Plans
Other Technical Documentation

🤝 Contributing

We welcome contributions! Please see the Contributing Guidelines (CONTRIBUTING.md) for details.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Thanks to all contributors
Inspired by the need for high-quality LLM training data
Built with ❤️ for the open-source community

📞 Contact

GitHub Issues: Create an issue
Email: agaid1mnjh45@gmail.com
