shallowManica/doc-layout-parser

doc-layout-parser

This project develops a layout parsing pipeline to extract key components (e.g., abstract, context, table, reference) from academic PDFs using a Detectron2-based model trained on annotations from Label Studio.

🔍 Purpose

The goal is to identify and segment document elements such as titles, authors, abstracts, tables, figures, and references with object detection, which improves downstream analysis and semantic classification with LLMs.

⚙️ Features

	•	Fast R-CNN architecture (Detectron2) for layout detection
	•	Layout categories: Abstract, Author, Context, Header, Image, Reference, Sub-title, Table, Title
	•	Integration-ready with LLMs for content-based filtering or labeling
	•	Configuration through config.yaml
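
The LLM integration point above can be pictured as a small filtering step between detection and prompting. The following is a minimal sketch, not the project's code: the `Block` dataclass, `select_for_llm`, `build_prompt`, and the sample values are all hypothetical names and data introduced for illustration. It keeps confident blocks of the wanted categories, orders them top-to-bottom, and joins them into a labeled prompt.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Block:
    """A detected region; field names are illustrative, not the project's schema."""
    label: str
    score: float
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""


def select_for_llm(blocks: List[Block], keep: Set[str], min_score: float = 0.8) -> List[Block]:
    """Keep confident blocks of the wanted categories, ordered top-to-bottom, then left-to-right."""
    chosen = [b for b in blocks if b.label in keep and b.score >= min_score]
    return sorted(chosen, key=lambda b: (b.y1, b.x1))


def build_prompt(blocks: List[Block]) -> str:
    """Join the selected blocks into a labeled plain-text prompt."""
    return "\n".join(f"[{b.label}] {b.text}" for b in blocks)


# Made-up detections for a single page (assumption: pixel coordinates, origin at top-left)
blocks = [
    Block("Reference", 0.95, 50, 700, 550, 780, "[1] Some cited work"),
    Block("Title", 0.99, 50, 40, 550, 80, "An Example Paper"),
    Block("Abstract", 0.91, 50, 120, 550, 300, "We study ..."),
    Block("Table", 0.55, 50, 320, 550, 500, ""),
]
selected = select_for_llm(blocks, keep={"Title", "Abstract", "Table"})
prompt = build_prompt(selected)
print(prompt)
```

Sorting by the top-left corner is a simple reading-order heuristic that suits single-column pages; multi-column papers would likely need a column-aware ordering.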

🗃 File Structure

	•	config.yaml - Detectron2 configuration for the layout model
	•	result.json - Output annotations from model inference
	•	parsing.ipynb - Sample notebook to run detection and visualize results
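
The README does not document the schema of result.json. As a rough sketch only, assuming it follows the COCO annotation layout (top-level `categories` and `annotations` lists, which Label Studio can export), the snippet below counts annotations per category. The helper name `count_by_category` and the sample data are hypothetical.

```python
import json
from collections import Counter
from typing import Dict


def count_by_category(coco: Dict) -> Dict[str, int]:
    """Count annotations per category name in a COCO-style dict (assumed schema)."""
    id_to_name = {c["id"]: c["name"] for c in coco.get("categories", [])}
    counts = Counter(
        id_to_name.get(a["category_id"], "unknown") for a in coco.get("annotations", [])
    )
    return dict(counts)


# Tiny in-memory stand-in for result.json; every value here is made up
sample = json.loads("""
{
  "categories": [{"id": 0, "name": "Abstract"}, {"id": 1, "name": "Author"}],
  "annotations": [
    {"id": 1, "image_id": 0, "category_id": 0, "bbox": [50, 120, 500, 180]},
    {"id": 2, "image_id": 0, "category_id": 1, "bbox": [50, 90, 500, 20]},
    {"id": 3, "image_id": 1, "category_id": 1, "bbox": [60, 95, 480, 20]}
  ]
}
""")
counts = count_by_category(sample)
print(counts)
```

For a real file, the same function could be called on `json.load(open("result.json"))`, provided the file actually uses this layout.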

📦 Dependencies

Install via pip:

# Notebook cell: the leading "!" runs each line as a shell command (drop it in a terminal)
!pip install pycocotools
!pip install layoutparser torchvision
!pip install "layoutparser[effdet]"
!pip install "layoutparser[paddledetection]"
!pip install "layoutparser[ocr]"
# Both lines below install Detectron2 from source (latest main, then the pinned v0.5 tag); one is normally enough
!python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
!python -m pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"

Install via Conda:

conda install detectron2 pytorch opencv omegaconf hydra-core -c conda-forge
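
After installing, a quick check of which packages are importable can save a failed notebook run. This is a small sketch using only the standard library. The helper name `check_modules` is hypothetical, and the module list is an assumption based on the packages named above (note that `opencv` imports as `cv2` and `pytorch` as `torch`).

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Report whether each top-level module can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names assumed from the dependency list above
REQUIRED = ["layoutparser", "detectron2", "torch", "torchvision", "pycocotools", "cv2"]

status = check_modules(REQUIRED)
for name, ok in status.items():
    print(f"{name:<14} {'ok' if ok else 'MISSING'}")
```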

🚀 How to Run

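# Inside parsing.ipynb
from layoutparser.models import Detectron2LayoutModel

model = Detectron2LayoutModel(
    config_path='config.yaml',
    # Keep only detections with a confidence score of at least 0.8
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    # Maps class IDs to category names; the remaining entries are elided here
    label_map={0: "Abstract", 1: "Author", ...}
)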
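
Once a model is loaded, detection is a call to `model.detect(image)`, which in layoutparser returns a layout of blocks carrying a `type`, a `score`, and `coordinates`. The sketch below flattens such a result into plain dicts. The helper `detect_blocks` and the `FakeModel` stand-in are hypothetical and exist only so the example runs without Detectron2 installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def detect_blocks(model: Any, image: Any) -> List[Dict[str, Any]]:
    """Run model.detect(image) and flatten each detected block into a plain dict."""
    rows = []
    for block in model.detect(image):
        x1, y1, x2, y2 = block.coordinates
        rows.append({"label": block.type, "score": block.score, "box": (x1, y1, x2, y2)})
    return rows


class FakeModel:
    """Stand-in that mimics the shape of a detection result (made-up values)."""

    def detect(self, image: Any) -> List[SimpleNamespace]:
        return [
            SimpleNamespace(type="Title", score=0.97, coordinates=(50, 40, 550, 80)),
            SimpleNamespace(type="Abstract", score=0.88, coordinates=(50, 120, 550, 300)),
        ]


rows = detect_blocks(FakeModel(), image=None)
for row in rows:
    print(row)
```

With the real model from the snippet above, the image would typically be loaded as an RGB array first, for example with `cv2.imread(...)` followed by `image[..., ::-1]` to convert BGR to RGB, and `layoutparser.draw_box(image, layout)` can be used to visualize the detected boxes.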

📄 Annotation Categories
	•	Abstract
	•	Author
	•	Context
	•	Header
	•	Image
	•	Reference
	•	Sub-title
	•	Table
	•	Title

About

A layout analysis pipeline based on Fast R-CNN that uses Detectron2 and Label Studio to extract and annotate sections such as abstracts, tables, figures, and references from academic PDFs.
