I am converting a PDF to DOCX with Python and tagging the content as I go: paragraphs (for example [P20], [B44]) and emphasis (<EM>, <EMB>, and so on). What I cannot do is capture the superscripts (footnote markers) and tag them as <FOOTNOTE 1> and so on. During conversion the superscripts are automatically turned into subscripts, so I am unable to recognize them.

1> Since footnote descriptions are usually found at the end of the page with their corresponding number, I tried capturing those first, then searching the same page for the matching number and tagging it when the two agree. However, this tagged each and every number on that page, so it was unsuccessful.
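The matching idea above can be narrowed so that a number is tagged only when it is both a superscript and defined in the footnote area of the same page. The following is a minimal sketch, not a tested solution: it works on plain dicts shaped like the lines and spans of PyMuPDF's `page.get_text("dict")`, it assumes MuPDF has set the superscript bit (bit 0 of `flags`) on the markers, and the function names and the `bottom_fraction` cutoff are hypothetical choices for illustration.

```python
import re

SUPERSCRIPT_FLAG = 1  # bit 0 of a PyMuPDF span's "flags"


def footnote_numbers(lines, page_height, bottom_fraction=0.25):
    """Collect numbers that start a line in the bottom part of the page.

    bottom_fraction is an assumed size for the footnote area, not a PDF concept.
    """
    cutoff = page_height * (1 - bottom_fraction)
    numbers = set()
    for line in lines:
        if line["bbox"][1] < cutoff:
            continue  # line starts above the assumed footnote area
        text = "".join(span["text"] for span in line["spans"]).strip()
        match = re.match(r"\d+", text)
        if match:
            numbers.add(match.group(0))
    return numbers


def tag_footnote_markers(lines, page_height, bottom_fraction=0.25):
    """Tag only superscript digits in the body that have a footnote definition."""
    cutoff = page_height * (1 - bottom_fraction)
    valid = footnote_numbers(lines, page_height, bottom_fraction)
    tagged_lines = []
    for line in lines:
        in_body = line["bbox"][1] < cutoff
        parts = []
        for span in line["spans"]:
            text = span["text"].strip()
            if not text:
                continue
            is_marker = (
                in_body
                and text.isdigit()
                and text in valid
                and bool(span.get("flags", 0) & SUPERSCRIPT_FLAG)
            )
            parts.append(f"<FOOTNOTE {text}>" if is_marker else text)
        tagged_lines.append(" ".join(parts))
    return tagged_lines
```

With this filter, a plain "3" in a sentence or a superscript "2" used as an exponent stays untagged, because neither is both flagged as superscript and listed at the bottom of the page.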

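import fitz  # PyMuPDF


def tag_emphasis(span, text, is_underlined=False):
    """Wrap text in an emphasis tag based on the span's font properties."""
    font = span.get("font", "").lower()
    flags = span.get("flags", 0)
    # PyMuPDF span flags: bit 1 (2) = italic, bit 4 (16) = bold.
    is_bold = "bold" in font or bool(flags & 16)
    is_italic = "italic" in font or "oblique" in font or bool(flags & 2)
    # Span flags carry no underline bit (bit 2 means "serifed"), so underline
    # has to be detected separately and passed in by the caller.
    if is_bold and is_underlined:
        return f"<EMBU>{text}</EMBU>"
    elif is_bold and is_italic:
        return f"<EMBI>{text}</EMBI>"
    elif is_bold:
        return f"<EMB>{text}</EMB>"
    elif is_italic:
        return f"<EM>{text}</EM>"
    return text

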
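def determine_indent_from_bbox(span):
    """Round the span's left edge down to a multiple of 10 points."""
    left = span["bbox"][0]
    return int(left // 10) * 10


def count_leading_indent_chars(text):
    """Count leading spaces, treating each tab as one space."""
    text = text.replace("\t", " ")
    return len(text) - len(text.lstrip(" "))


def is_superscript(span, line_y0):
    """Return True if PyMuPDF flagged the span as superscript (flags bit 0)."""
    # A line's bbox normally encloses all of its spans, so comparing the span
    # top against line_y0 cannot detect a raised span; line_y0 is kept only so
    # that existing callers still work.
    return bool(span.get("flags", 0) & 1)

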
def tag_entire_pdf(pdf_path):
    """Return the PDF's text lines with paragraph, emphasis and superscript tags."""
    tagged_lines = []
    is_first_page_header_tagged = False
    previous_indent = None
    previous_blank = True
    with fitz.open(pdf_path) as doc:  # closes the document when done
        for page_index, page in enumerate(doc):
            blocks = page.get_text("dict")["blocks"]
            for block in blocks:
                if block["type"] != 0:  # skip image blocks
                    continue
                for line in block["lines"]:
                    raw_line_text = ""
                    tagged_line_text = ""
                    line_indent = None
                    line_y0 = line["bbox"][1]
                    for span in line["spans"]:
                        raw = span["text"]
                        if not raw.strip():
                            continue
                        raw_line_text += raw
                        clean = raw.strip()

                        if is_superscript(span, line_y0) and clean.isdigit():
                            tagged = f"<sup>{clean}</sup>"
                        else:
                            tagged = tag_emphasis(span, clean)
                        # The first non-empty span sets the line's indent.
                        if line_indent is None:
                            line_indent = determine_indent_from_bbox(span)
                        tagged_line_text += tagged + " "
                    if not tagged_line_text.strip():
                        # Blank line: the next text line starts a paragraph.
                        tagged_lines.append("")
                        previous_blank = True
                        previous_indent = None
                        continue
                    if not is_first_page_header_tagged and page_index == 0:
                        # The first text line of page 1 is tagged by its
                        # leading-space count rather than its bbox indent.
                        space_count = count_leading_indent_chars(raw_line_text)
                        tagged_lines.append(f"<P{space_count}>{tagged_line_text.strip()}")
                        is_first_page_header_tagged = True
                        previous_blank = False
                        previous_indent = line_indent
                        continue
                    # A new paragraph starts after a blank line or an indent change.
                    if previous_blank or (previous_indent is not None and line_indent != previous_indent):
                        tagged_lines.append(f"<P{line_indent}>{tagged_line_text.strip()}")
                    else:
                        tagged_lines.append(tagged_line_text.strip())
                    previous_blank = False
                    previous_indent = line_indent
    return tagged_lines
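
As a possible alternative to the `is_superscript` check above, a span can be compared with the main text of its own line instead of with the line's top edge. This is only a rough sketch under stated assumptions: the span dicts mirror PyMuPDF's `"size"`, `"flags"` and `"bbox"` keys, the longest span is assumed to be the body text, and `is_superscript_span`, `size_ratio` and `raise_ratio` are hypothetical names and thresholds that would need tuning per document.

```python
def is_superscript_span(span, line_spans, size_ratio=0.8, raise_ratio=0.15):
    """Heuristically decide whether a span is raised above its line's body text."""
    # Trust the superscript bit (bit 0) when MuPDF has set it.
    if span.get("flags", 0) & 1:
        return True
    # Assumption: the span with the most text is the line's body text.
    main = max(line_spans, key=lambda s: len(s["text"].strip()))
    if span is main:
        return False
    # A superscript is clearly smaller than the body text ...
    if span["size"] >= size_ratio * main["size"]:
        return False
    # ... and its bottom edge sits noticeably above the body text's bottom edge
    # (y grows downward on the page, so "above" means a smaller y value).
    return span["bbox"][3] < main["bbox"][3] - raise_ratio * main["size"]
```

A subscript is also small, but its bottom edge lies at or below that of the body text, so the last comparison separates the two cases.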
  • Please specify what you have got and what you expect to get. Commented Nov 14 at 11:54
  • Adding tags to a PDF is called remediation. A native PDF has no real concept of what the source contained, so adding tags requires human-driven tools; that is a critical part of PDF tagging, and it cannot be fully automatic. There are many commercial products that handle different parts, such as continualengine.com/blog/… or allyant.com/blog/how-to-tag-footnotes-and-endnotes-in-pdf, but in all cases PDF/UA must be partly manual. Logically, it is probably easier to do that task in the DOCX editor. Commented Nov 14 at 13:06
  • Basically, there is often no difference between raised and lowered text, since all text sits higher or lower than other glyphs. You need to show how your human intelligence decides whether a number is a footnote, an endnote, a subscript, etc. All numbers are equally just page-content binary tokens between 00 and FF; there is no area called footnotes, headnotes or sidenotes. Commented Nov 14 at 13:13
