A powerful tool for converting DOCX technical documents into LLM training datasets

G2-star/docx-knowledge-builder

DOCX Knowledge Base Builder

A powerful tool for converting DOCX technical documents into LLM training datasets. This project helps you build high-quality knowledge bases from technical documentation.

🌟 Features

  • 📄 Smart DOCX document structure parsing
  • 🤖 Automatic Q&A pair generation
  • 🔄 Multiple output formats (Alpaca, Conversation)
  • 📦 Batch processing support
  • ✅ Data quality validation
  • 📊 Document structure analysis
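
To make the parsing and Q&A features above more concrete, here is a minimal sketch of one way heading-based sectioning and question generation could work. It is not the project's actual implementation: the function names `group_sections` and `sections_to_qa`, the question template, and the assumption that headings use style names starting with "Heading" are all illustrative. The input is a plain list of (style name, text) pairs, so the sketch runs without any DOCX library.

```python
from typing import Dict, Iterable, List, Tuple


def group_sections(paragraphs: Iterable[Tuple[str, str]]) -> List[Dict]:
    """Group (style_name, text) pairs into sections under the latest heading.

    Assumption: headings carry a style name that starts with "Heading".
    Text that appears before the first heading is ignored.
    """
    sections: List[Dict] = []
    current = None
    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip blank paragraphs
        if style.startswith("Heading"):
            current = {"title": text, "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(text)
    return sections


def sections_to_qa(sections: List[Dict]) -> List[Dict]:
    """Turn each non-empty section into one Alpaca-style record.

    The question template is hypothetical, chosen only for illustration.
    """
    pairs = []
    for section in sections:
        if not section["body"]:
            continue  # a heading with no text yields no Q&A pair
        pairs.append({
            "instruction": f"What does the section '{section['title']}' cover?",
            "input": "",
            "output": " ".join(section["body"]),
        })
    return pairs


sample = [
    ("Normal", "Preamble before any heading"),
    ("Heading 1", "Scope"),
    ("Normal", "This plan covers site safety."),
    ("Heading 1", "Empty"),
    ("Heading 2", "Inspections"),
    ("Normal", "Weekly."),
    ("Normal", "Logged."),
]
qa_pairs = sections_to_qa(group_sections(sample))
```

In this sketch a heading with no body text produces no pair, which keeps empty answers out of the dataset.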

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/G2-star/docx-knowledge-builder.git
cd docx-knowledge-builder
# Install dependencies
pip install -r requirements.txt

Basic Usage

  1. Place your DOCX files in the project root directory
  2. Run the extraction script:
     python run_extraction.py
  3. Check the generated data in the training_data/ directory
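
For step 3, a quick way to inspect the result is to load the combined Alpaca file and print a few summary numbers. The sketch below is an assumption-based helper, not part of the repository (the project ships its own `check_data.py`). The function name `summarize_alpaca_file` and the chosen statistics are illustrative. The file path comes from the project structure described in this README.

```python
import json
from pathlib import Path


def summarize_alpaca_file(path):
    """Return basic counts for a JSON file holding a list of Alpaca records."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    lengths = [len(rec.get("output", "")) for rec in records]
    return {
        "records": len(records),
        "empty_outputs": sum(1 for n in lengths if n == 0),
        "avg_output_chars": sum(lengths) / len(lengths) if lengths else 0.0,
    }


if __name__ == "__main__":
    target = Path("training_data/combined_training_data_alpaca.json")
    if target.exists():
        print(summarize_alpaca_file(target))
    else:
        print(f"{target} not found; run the extraction script first")
```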

📁 Project Structure

.
├── docx_knowledge_extractor.py   # Core extractor
├── run_extraction.py             # Main script
├── check_data.py                 # Data quality checker
├── requirements.txt              # Dependencies
├── README.md                     # This file
└── training_data/                # Output directory
    ├── combined_training_data_alpaca.json
    ├── combined_training_data_conversation.json
    └── *_structure.json

🔧 Advanced Usage

Single File Processing

python docx_knowledge_extractor.py -i "document.docx" -o output_dir

Batch Processing

python docx_knowledge_extractor.py -i documents_folder -o training_data --batch
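
Before a batch run it can help to preview which files a folder contains. The following is a rough sketch under stated assumptions: `find_docx_files` is a hypothetical helper, and whether the tool's `--batch` mode recurses into subfolders or skips Word's `~$` temporary files is not documented here, so both behaviors are choices made for this example only.

```python
from pathlib import Path


def find_docx_files(folder):
    """List .docx files under `folder`, recursively, in sorted order.

    Files whose names start with "~$" are skipped; Word creates these
    temporary owner files while a document is open.
    """
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*.docx")
        if p.is_file() and not p.name.startswith("~$")
    )


if __name__ == "__main__":
    for path in find_docx_files("documents_folder"):
        print(path)
```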

📊 Output Formats

Alpaca Format

[
  {
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is..."
  }
]
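
A simple structural check for records in this format might look like the sketch below. It is not the logic of the project's `check_data.py`, whose rules are not described in this README. The function name `find_invalid_records` and the specific rules (all three keys present, string values, non-empty instruction and output) are assumptions for illustration.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def find_invalid_records(records):
    """Return (index, reason) tuples for records that fail basic checks."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not an object"))
            continue
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            problems.append((i, "missing keys: " + ", ".join(missing)))
            continue
        if not all(isinstance(rec[k], str) for k in REQUIRED_KEYS):
            problems.append((i, "non-string value"))
            continue
        # "input" may legitimately be empty; the other two may not.
        if not rec["instruction"].strip() or not rec["output"].strip():
            problems.append((i, "empty instruction or output"))
    return problems


sample_records = [
    {"instruction": "What is the main content?", "input": "",
     "output": "The main content is..."},
    {"instruction": "Incomplete record", "input": ""},
]
issues = find_invalid_records(sample_records)
```

Here the second sample record is flagged because it lacks an `output` key, while the first passes.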

Conversation Format

[
  {
    "conversations": [
      {"from": "human", "value": "What is the main content?"},
      {"from": "gpt", "value": "The main content is..."}
    ]
  }
]
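
The two formats carry the same information, so one can be derived from the other. The sketch below shows one possible mapping from an Alpaca record to a conversation record. The function name `alpaca_to_conversation` is hypothetical, and appending a non-empty `input` to the human turn is an assumption, since this README does not say how the tool handles that field.

```python
def alpaca_to_conversation(record):
    """Map one Alpaca record to a two-turn conversation record."""
    prompt = record["instruction"]
    if record.get("input"):
        # Assumption: extra context is appended to the human turn.
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }


example = alpaca_to_conversation({
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is...",
})
```

With the Alpaca example from this README as input, the result matches the conversation example shown above.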

📚 Supported Document Types

  • Technical Specifications
  • Construction Plans
  • Quality Management Plans
  • Safety Management Plans
  • Other Technical Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Thanks to all contributors
  • Inspired by the need for high-quality LLM training data
  • Built with ❤️ for the open-source community
