A layout analysis pipeline using Detectron2 (Fast R-CNN) and Label Studio for extracting and annotating sections such as abstract, table, figure, and references from academic PDFs

shallowManica/doc-layout-parser

doc-layout-parser

This project develops a layout parsing pipeline to extract key components (e.g., abstract, context, table, reference) from academic PDFs using a Detectron2-based model trained on annotations from Label Studio.

πŸ” Purpose

The pipeline identifies and segments document elements such as titles, authors, abstracts, tables, figures, and references using object detection. The detected regions then feed downstream analysis and semantic classification with LLMs.

βš™οΈ Features

  • Fast R-CNN architecture (Detectron2) for layout detection
  • Layout categories: Abstract, Author, Context, Header, Image, Reference, Sub-title, Table, Title
  • Integration-ready with LLMs for content-based filtering or labeling
  • Configuration through config.yaml
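
As a rough illustration of the LLM integration point, the sketch below turns labeled blocks into a plain-text prompt. It assumes each detected region has already been paired with its text (for example through OCR); the build_prompt helper, the instruction wording, and the sample blocks are hypothetical and not part of this repository.

```python
def build_prompt(blocks, keep=("Title", "Abstract")):
    """Assemble a plain-text prompt from (label, text) pairs.

    Only blocks whose label is listed in `keep` are included, which is one
    simple form of content-based filtering before calling an LLM.
    """
    lines = [f"[{label}] {text.strip()}" for label, text in blocks if label in keep]
    return "Classify the research area of this paper.\n\n" + "\n".join(lines)


# Hypothetical blocks; in practice the labels would come from the layout
# model and the text from an OCR step.
blocks = [
    ("Title", "A Study of Layout Parsing"),
    ("Author", "J. Doe"),
    ("Abstract", "We detect page regions in academic PDFs. "),
    ("Reference", "[1] ..."),
]

print(build_prompt(blocks))
```

Filtering by label before prompting keeps the prompt short and leaves out regions, such as reference lists, that rarely help with classification.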

🗃 File Structure

  • config.yaml - Detectron2 configuration for the layout model
  • result.json - Output annotations from model inference
  • parsing.ipynb - Sample notebook to run detection and visualize results
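
The exact schema of result.json is not documented here. The sketch below assumes a COCO-style layout (top-level "categories" and "annotations" lists), which is one common annotation format; if the real file differs, the key names need adjusting. It derives an id-to-name label map from the file and counts annotations per category, using a small in-memory stand-in rather than the real file.

```python
import json
from collections import Counter


def load_label_map(coco):
    """Map category id -> category name from a COCO-style dict (assumed format)."""
    return {cat["id"]: cat["name"] for cat in coco.get("categories", [])}


def count_by_category(coco):
    """Count annotations per category name."""
    label_map = load_label_map(coco)
    return Counter(label_map[ann["category_id"]] for ann in coco.get("annotations", []))


# Tiny stand-in for result.json; the real file may be structured differently.
sample = json.loads("""
{
  "categories": [{"id": 0, "name": "Abstract"}, {"id": 1, "name": "Author"}],
  "annotations": [
    {"id": 1, "image_id": 0, "category_id": 0, "bbox": [50, 120, 500, 180]},
    {"id": 2, "image_id": 0, "category_id": 1, "bbox": [50, 80, 300, 30]},
    {"id": 3, "image_id": 1, "category_id": 0, "bbox": [40, 100, 520, 200]}
  ]
}
""")

label_map = load_label_map(sample)
counts = count_by_category(sample)
print(label_map)
print(dict(counts))
```

To try this on the real file, replace the stand-in with json.load applied to an open handle on result.json. Reading the label map from the file avoids hard-coding category ids that may not match the trained model.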

📦 Dependencies

Install via pip (the leading ! runs each command from a notebook cell; drop it in a regular shell):

!pip install pycocotools
!pip install layoutparser
!pip install "layoutparser[effdet]"
!pip install layoutparser torchvision
!python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
!pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
!pip install "layoutparser[paddledetection]"
!pip install "layoutparser[ocr]"

Install via Conda:

conda install detectron2 pytorch opencv omegaconf hydra-core -c conda-forge

🚀 How to Run

# Inside parsing.ipynb
from layoutparser.models import Detectron2LayoutModel
model = Detectron2LayoutModel(
    config_path='config.yaml',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Abstract", 1: "Author", ...}
)
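
The SCORE_THRESH_TEST setting asks Detectron2 to discard detections scoring below 0.8 at inference time, and label_map turns numeric class ids into names. The sketch below reproduces those two steps on plain Python data so the effect is easy to inspect without a trained model. The Detection tuple, the postprocess helper, the reading-order sort, and the sample boxes are illustrative assumptions, not part of layoutparser or this repository.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    label_id: int
    score: float
    box: tuple  # (x1, y1, x2, y2) in pixels


def postprocess(detections, label_map, score_thresh=0.8):
    """Drop low-confidence boxes, name the rest, and sort in reading order.

    Reading order here is a simple top-to-bottom, then left-to-right sort,
    which is only a reasonable approximation for single-column pages.
    """
    kept = [d for d in detections if d.score > score_thresh]
    kept.sort(key=lambda d: (d.box[1], d.box[0]))
    return [(label_map.get(d.label_id, "Unknown"), d.score, d.box) for d in kept]


# Only the two label_map entries shown in the snippet above are used here.
label_map = {0: "Abstract", 1: "Author"}

# Hypothetical raw detections for one page.
raw = [
    Detection(0, 0.95, (50, 200, 550, 380)),
    Detection(1, 0.91, (50, 120, 350, 150)),
    Detection(0, 0.42, (60, 600, 540, 700)),  # below threshold, dropped
]

for name, score, box in postprocess(raw, label_map):
    print(f"{name:<10} {score:.2f} {box}")
```

Raising the threshold trades recall for precision: fewer spurious regions survive, but faint or unusual layouts may be missed.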

📄 Annotation Categories

  • Abstract
  • Author
  • Context
  • Header
  • Image
  • Reference
  • Sub-title
  • Table
  • Title
