I was converting pdf to docx using python along with tagging the Paragraphs(for e.g. [P20],[B44] like this ), emphasis( which are <EM>, <EMB>... like this has to be tagged) but I'm unable to capture the superscripts (footnotes) and tag them as well, superscripts are automatically converted into subscripts during conversion so unable to recognize them, also they have to be tagged as <FOOTNOTE 1>... .
1>Since footnotes descriptions are usually found at the end of the Page with corresponding number i tried capturing that 1st and searing for the corresponding number in the same page and compare them if they are same then we can tag but it tagged each and every number in that page. so it's unsuccessful.
def tag_emphasis(span, text):
font = span.get("font", "").lower()
flags = span.get("flags", 0)
is_bold = "bold" in font
is_italic = "italic" in font or "oblique" in font
is_underlined = bool(flags & 4)
if is_bold and is_underlined:
return f"<EMBU>{text}</EMBU>"
elif is_bold and is_italic:
return f"<EMBI>{text}</EMBI>"
elif is_bold:
return f"<EMB>{text}</EMB>"
elif is_italic:
return f"<EM>{text}</EM>"
elif is_italic:
return f"<EMI>{text}</EMI>"
return text
def determine_indent_from_bbox(span):
left = span["bbox"][0]
return int(left // 10) * 10
def count_leading_indent_chars(text):
text = text.replace("\t", " ")
return len(text) - len(text.lstrip(" "))
def is_superscript(span, line_y0):
return span["size"] < 9 and span["bbox"][1] < line_y0 - 2
def tag_entire_pdf(pdf_path):
doc = fitz.open(pdf_path)
tagged_lines = []
is_first_page_header_tagged = False
previous_indent = None
previous_blank = True
for page_index, page in enumerate(doc):
blocks = page.get_text("dict")["blocks"]
for block in blocks:
if block["type"] != 0:
continue
for line in block["lines"]:
raw_line_text = ""
tagged_line_text = ""
line_indent = None
line_y0 = line["bbox"][1]
for span in line["spans"]:
raw = span["text"]
if not raw.strip():
continue
raw_line_text += raw
clean = span["text"].strip()
if is_superscript(span, line_y0) and clean.isdigit():
tagged = f"<sup>{clean}</sup>"
else:
tagged = tag_emphasis(span, clean)
if line_indent is None:
line_indent = determine_indent_from_bbox(span)
tagged_line_text += tagged + " "
if not tagged_line_text.strip():
tagged_lines.append("")
previous_blank = True
previous_indent = None
continue
if not is_first_page_header_tagged and page_index == 0:
space_count = count_leading_indent_chars(raw_line_text)
tagged_lines.append(f"<P{space_count}>{tagged_line_text.strip()}")
is_first_page_header_tagged = True
previous_blank = False
previous_indent = line_indent
continue
if previous_blank or (previous_indent is not None and line_indent != previous_indent):
tagged_lines.append(f"<P{line_indent}>{tagged_line_text.strip()}")
else:
tagged_lines.append(tagged_line_text.strip())
previous_blank = False
previous_indent = line_indent
return tagged_lines
-
Please specify what have you got and what do you expect to get?Vladimir– Vladimir2025年11月14日 11:54:58 +00:00Commented Nov 14 at 11:54
-
adding tags to PDF is called remediation as native PDF has no real concepts of what the source contained thus for adding them into the PDF you need tools driven by Humans as that is a critical part of PDF Tagging (it cannot be fully automatic) thus there are many commercial products to do different parts like continualengine.com/blog/… OR allyant.com/blog/how-to-tag-footnotes-and-endnotes-in-pdf but in all cases PDFUA MUST be in part manual. Logically it is probably easier to do that task in the DocX EditorK J– K J2025年11月14日 13:06:43 +00:00Commented Nov 14 at 13:06
-
Basically, there are often no difference between raised or lowered text as all text is higher or lower than other glyphs. You need to show how your Human Intelligence decides if a number is a footnote, end note or subscript etc. as all numbers are equally just page content binary tokens between 00 and FF there is no area called footnotes or headnotes or sidenotesK J– K J2025年11月14日 13:13:11 +00:00Commented Nov 14 at 13:13