Convert PDF to DOCX in Python with tagging, but superscripts (FN) are misrecognized as subscripts during the process

Question 1

I am converting a PDF to DOCX using Python while tagging paragraphs (e.g. [P20], [B44]) and emphasis (e.g. <EM>, <EMB>, ...). However, I am unable to capture the superscripts (footnote markers) and tag them as well. Superscripts are automatically converted into subscripts during conversion, so I cannot recognize them. They also have to be tagged, as <FOOTNOTE 1>, ... .

1> Since footnote descriptions are usually found at the end of the page with the corresponding number, I tried capturing those first, then searching for the corresponding number on the same page and comparing the two; if they match, the number can be tagged. However, this tagged every number on that page, so it was unsuccessful.

import fitz  # PyMuPDF


def tag_emphasis(span, text):
    """Wrap text in an emphasis tag based on the span's font name and flags."""
    font = span.get("font", "").lower()
    flags = span.get("flags", 0)
    is_bold = "bold" in font
    is_italic = "italic" in font or "oblique" in font
    # Caution: in PyMuPDF span flags, bit 2 (value 4) means "serifed", not
    # underlined, so this test does not actually detect underlining.
    is_underlined = bool(flags & 4)
    if is_bold and is_underlined:
        return f"<EMBU>{text}</EMBU>"
    elif is_bold and is_italic:
        return f"<EMBI>{text}</EMBI>"
    elif is_bold:
        return f"<EMB>{text}</EMB>"
    elif is_italic:
        # Italic-only text gets <EM>; a separate <EMI> tag would need its own condition.
        return f"<EM>{text}</EM>"
    return text


def determine_indent_from_bbox(span):
    """Round the span's left edge down to a multiple of 10 points."""
    left = span["bbox"][0]
    return int(left // 10) * 10


def count_leading_indent_chars(text):
    """Count leading spaces, treating each tab as one space."""
    text = text.replace("\t", " ")
    return len(text) - len(text.lstrip(" "))


def is_superscript(span, line_bbox):
    """Detect a raised span via PyMuPDF's superscript flag or its geometry."""
    # Bit 0 (value 1) of span["flags"] is set when MuPDF detects superscript.
    if span.get("flags", 0) & 1:
        return True
    # Fallback: small text whose bottom edge sits clearly above the line's bottom.
    # (A span's top can never lie above the top of the line that contains it,
    # so comparing against the line's top would never match.)
    return span["size"] < 9 and span["bbox"][3] < line_bbox[3] - 2


def tag_entire_pdf(pdf_path):
    """Return the document's text lines with paragraph, emphasis and <sup> tags."""
    doc = fitz.open(pdf_path)
    tagged_lines = []
    is_first_page_header_tagged = False
    previous_indent = None
    previous_blank = True
    for page_index, page in enumerate(doc):
        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block["type"] != 0:  # skip image blocks
                continue
            for line in block["lines"]:
                raw_line_text = ""
                tagged_line_text = ""
                line_indent = None
                for span in line["spans"]:
                    raw = span["text"]
                    if not raw.strip():
                        continue
                    raw_line_text += raw
                    clean = raw.strip()
                    if is_superscript(span, line["bbox"]) and clean.isdigit():
                        tagged = f"<sup>{clean}</sup>"
                    else:
                        tagged = tag_emphasis(span, clean)
                    if line_indent is None:
                        line_indent = determine_indent_from_bbox(span)
                    tagged_line_text += tagged + " "
                if not tagged_line_text.strip():
                    tagged_lines.append("")
                    previous_blank = True
                    previous_indent = None
                    continue
                # The first non-empty line of the first page is treated as the header.
                if not is_first_page_header_tagged and page_index == 0:
                    space_count = count_leading_indent_chars(raw_line_text)
                    tagged_lines.append(f"<P{space_count}>{tagged_line_text.strip()}")
                    is_first_page_header_tagged = True
                    previous_blank = False
                    previous_indent = line_indent
                    continue
                # Start a new paragraph after a blank line or when the indent changes.
                if previous_blank or (previous_indent is not None and line_indent != previous_indent):
                    tagged_lines.append(f"<P{line_indent}>{tagged_line_text.strip()}")
                else:
                    tagged_lines.append(tagged_line_text.strip())
                previous_blank = False
                previous_indent = line_indent
    return tagged_lines
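
The number-matching attempt described above tagged every number on the page. One possible way to narrow it down, sketched here under assumptions, is to tag a number only if (a) PyMuPDF has set the superscript bit (bit 0 of `span["flags"]`) on its span, (b) it follows other text on its line, and (c) the same number also starts a line somewhere on the page, which is taken as a footnote description. The helper names below are hypothetical, the input is plain dicts shaped like the lines from `page.get_text("dict")`, and the "description starts with its number" rule is an assumption that will not hold for every layout.

```python
import re

SUPERSCRIPT_FLAG = 1  # bit 0 of span["flags"] in PyMuPDF's "dict" output


def is_flagged_superscript(span):
    """True if the span carries PyMuPDF's superscript flag."""
    return bool(span.get("flags", 0) & SUPERSCRIPT_FLAG)


def footnote_numbers_on_page(lines):
    """Collect numbers that start a line (assumed to be footnote descriptions)."""
    numbers = set()
    for line in lines:
        text = "".join(span["text"] for span in line["spans"])
        # A leading number followed by some non-digit text.
        match = re.match(r"\s*(\d+)\s*[^\d\s]", text)
        if match:
            numbers.add(match.group(1))
    return numbers


def tag_footnote_markers(lines):
    """Return one string per line; only matched footnote markers are tagged."""
    known = footnote_numbers_on_page(lines)
    tagged_lines = []
    for line in lines:
        parts = []
        for span in line["spans"]:
            text = span["text"].strip()
            if not text:
                continue
            is_marker = (
                bool(parts)  # a marker follows body text; it does not start the line
                and text.isdigit()
                and text in known
                and is_flagged_superscript(span)
            )
            parts.append(f"<FOOTNOTE {text}>" if is_marker else text)
        tagged_lines.append(" ".join(parts))
    return tagged_lines
```

With real data, `lines` would be gathered per page from every text block's `"lines"` list, so that markers and descriptions on the same page are compared with each other. Ordinary numbers in running text are left alone because they either lack the superscript flag or share a span with other words.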

Question 2

Please specify what you have got and what you expect to get.

Question 3

Adding tags to a PDF is called remediation. A native PDF has no real concept of what the source contained, so adding tags to it requires tools driven by humans; that is a critical part of PDF tagging, which cannot be fully automatic. There are many commercial products that handle different parts of the task, such as continualengine.com/blog/… or allyant.com/blog/how-to-tag-footnotes-and-endnotes-in-pdf, but in all cases PDF/UA must be in part manual. Logically, it is probably easier to do that task in the DOCX editor.

Question 4

Basically, there is often no difference between raised and lowered text, as all text is higher or lower than other glyphs. You need to show how your human intelligence decides whether a number is a footnote, an endnote, a subscript, etc. All numbers are equally just page content, binary tokens between 00 and FF; there is no area called footnotes, headnotes, or sidenotes.
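
One way to write such a decision down explicitly is a purely geometric rule: call a span raised or lowered only relative to the largest span on the same line. The sketch below is an illustration under assumptions, not a reliable classifier. The function name and the `size_ratio` and `tolerance` thresholds are made up for the example, and the spans are plain dicts carrying the `"size"` and `"bbox"` keys that PyMuPDF's `get_text("dict")` provides (with y growing downward).

```python
def classify_vertical_position(span, line_spans, size_ratio=0.8, tolerance=0.5):
    """Label a span 'superscript', 'subscript' or 'normal' relative to its line."""
    # The largest span on the line is taken as the body-text reference.
    reference = max(line_spans, key=lambda s: s["size"])
    if span is reference or span["size"] >= size_ratio * reference["size"]:
        return "normal"  # not noticeably smaller than the body text
    # Compare vertical midpoints; a smaller y means higher on the page.
    span_mid = (span["bbox"][1] + span["bbox"][3]) / 2
    ref_mid = (reference["bbox"][1] + reference["bbox"][3]) / 2
    if span_mid < ref_mid - tolerance:
        return "superscript"
    if span_mid > ref_mid + tolerance:
        return "subscript"
    return "normal"
```

On a line set entirely in small type, such as a footnote description, every span is close to the reference size, so nothing is labelled as raised or lowered. Whether a raised number is a footnote marker, an exponent or something else still needs a further rule or human review.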
