A tool for converting DOCX technical documents into LLM training datasets. It parses each document's structure, generates Q&A pairs from the content, and writes them in Alpaca or Conversation format, so you can build a knowledge base from technical documentation.
- 📄 Smart DOCX document structure parsing
- 🤖 Automatic Q&A pair generation
- 🔄 Multiple output formats (Alpaca, Conversation)
- 📦 Batch processing support
- ✅ Data quality validation
- 📊 Document structure analysis
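As a rough illustration of how structure parsing can feed Q&A generation, the sketch below groups paragraphs under their nearest heading and turns each non-empty section into an Alpaca-style record. It is not the project's actual implementation: the `(style, text)` input pairs, the `group_sections` and `make_qa_pairs` helpers, and the question template are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def group_sections(paragraphs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group body paragraphs under the most recent heading.

    Each item is a hypothetical (style_name, text) pair, such as one might
    read from a DOCX paragraph's style name and text.
    """
    sections: Dict[str, List[str]] = {}
    current = None
    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # ignore empty paragraphs
        if style.startswith("Heading"):
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(text)
    return sections


def make_qa_pairs(sections: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Build one Alpaca-style record per section that has body text."""
    pairs = []
    for heading, body in sections.items():
        if not body:
            continue  # a heading with no content yields no training example
        pairs.append({
            "instruction": f"What does the section '{heading}' cover?",  # assumed template
            "input": "",
            "output": " ".join(body),
        })
    return pairs


paragraphs = [
    ("Heading 1", "Safety Management"),
    ("Normal", "All workers must wear helmets on site."),
    ("Normal", "Inspections take place weekly."),
    ("Heading 1", "Empty Section"),
]
pairs = make_qa_pairs(group_sections(paragraphs))
print(pairs)
```

Skipping headings without body text keeps empty answers out of the dataset, which is one simple way to protect data quality at generation time.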
# Clone the repository
git clone https://github.com/yourusername/docx-knowledge-builder.git
cd docx-knowledge-builder

# Install dependencies
pip install -r requirements.txt
- Place your DOCX files in the project root directory
- Run the extraction script:
python run_extraction.py
- Check the generated data in the training_data/ directory
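After a run, a quick sanity check on the generated records might look like the sketch below. It does not reproduce the logic of check_data.py, which is not shown here; the `find_invalid_records` helper and its rule (every record needs a non-empty instruction and output) are assumptions for illustration.

```python
from typing import Dict, List


def find_invalid_records(records: List[Dict[str, str]]) -> List[int]:
    """Return the indexes of records lacking a non-empty instruction or output."""
    bad = []
    for i, rec in enumerate(records):
        instruction = str(rec.get("instruction", "")).strip()
        output = str(rec.get("output", "")).strip()
        if not instruction or not output:
            bad.append(i)
    return bad


sample = [
    {"instruction": "What is the main content?", "input": "", "output": "The main content is..."},
    {"instruction": "", "input": "", "output": "Orphan answer"},
    {"instruction": "Unanswered question?", "input": ""},
]
invalid = find_invalid_records(sample)
print(invalid)  # [1, 2]
```

To check real output, the same function could be applied to the list loaded with `json.load` from training_data/combined_training_data_alpaca.json.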
.
├── docx_knowledge_extractor.py # Core extractor
├── run_extraction.py # Main script
├── check_data.py # Data quality checker
├── requirements.txt # Dependencies
├── README.md # This file
└── training_data/ # Output directory
    ├── combined_training_data_alpaca.json
    ├── combined_training_data_conversation.json
    └── *_structure.json
python docx_knowledge_extractor.py -i "document.docx" -o output_dir
python docx_knowledge_extractor.py -i documents_folder -o training_data --batch
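For batch mode, the first step is typically to collect the DOCX files in the input folder. The sketch below shows one way this could be done; how the extractor itself discovers files is not documented here, so `find_docx_files` and its filtering rules are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List


def find_docx_files(folder: str) -> List[Path]:
    """Return the .docx files directly inside folder, sorted by path.

    Names starting with "~$" are skipped: Word creates such temporary
    owner files while a document is open, and they are not real documents.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")
    )


# Demonstrate on a throwaway folder containing a mix of files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.docx", "b.DOCX", "~$a.docx", "notes.txt"]:
        (Path(tmp) / name).touch()
    names = [p.name for p in find_docx_files(tmp)]

print(names)
```

Matching the extension case-insensitively avoids silently skipping files saved as `.DOCX`.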
[
  {
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is..."
  }
]

[
  {
    "conversations": [
      {"from": "human", "value": "What is the main content?"},
      {"from": "gpt", "value": "The main content is..."}
    ]
  }
]

- Technical Specifications
- Construction Plans
- Quality Management Plans
- Safety Management Plans
- Other Technical Documentation
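The Alpaca and Conversation records shown in the output examples carry the same question and answer in different shapes, so one can be derived from the other. The sketch below converts a single Alpaca record into a Conversation record. The `alpaca_to_conversation` helper is hypothetical, and appending a non-empty `input` to the question after a blank line is an assumption, not documented project behavior.

```python
from typing import Dict, List


def alpaca_to_conversation(record: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Convert one Alpaca-style record into a Conversation-style record."""
    human = record["instruction"]
    if record.get("input"):
        # Assumption: extra context is appended to the question after a blank line.
        human = f"{human}\n\n{record['input']}"
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": record["output"]},
        ]
    }


alpaca = {
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is...",
}
converted = alpaca_to_conversation(alpaca)
print(converted)
```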
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to all contributors
- Inspired by the need for high-quality LLM training data
- Built with ❤️ for the open-source community
- GitHub Issues: Create an issue
- Email: agaid1mnjh45@gmail.com