-
-
Notifications
You must be signed in to change notification settings - Fork 8k
Consistently use realpaths to build XObject names #15629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Before this comment, the PDF backend used two inconsistent ways to build the name for XObjects required for Unicode glyphs above code point 255. The first method is based on the file name of the font returned by `font_manager.findfont()`. This method is used to refer to the XObject glyphs when drawing text. When the XObjects are created inside `writeFonts()` the realpath of the font returned by `cbook.get_realpath_and_stat()` is used instead. The realpath differs from the file name of the font if the font is a symbolic link. In this case, the produced PDF contains invalid references to XObjects. This commit adds a safe-guard in `PdfFile._get_xobject_symbol_name()`. The returned XObject names are always based on the realpath, even if the supplied path is a link. This achieves consistent XObject names and prevents invalid PDF files.
This does fix it, but do you know if all callers of _get_xobject_symbol_name
process filenames in this way?
The method _get_xobject_symbol_name
is used in three places:
-
Within
embedTTF()/embedTTFType3()
:embedTTF()
gets the filename from its caller,writeFonts()
, which uses the realpath. So,_get_xobject_symbol_name
is called with the realpath already. The PR has not effect here. -
Within
draw_text()
: Here, the_get_xobject_symbol_name
is called withfont.fname
, which could be a link. With the fix, the_get_xobject_symbol_name
, would resolve the link to its realpath. -
Within
draw_mathtext()
: Here, it is harder to trace.MathTextParser
creates a list of glyphs and their file names. As far as I can tell this is also based onfont.fname
, which again could be a link. With the fix, the_get_xobject_symbol_name
, would resolve the link to its realpath.
Since the _get_xobject_symbol_name
doesn't modify its arguments nor returns the path, it should not affect anything else besides the XObject names.
While the fix seems reasonable, do we really need to resolve font realpaths? Can't we just use fonts under whatever path they are found by findfont(), without ever resolving them? Note that for a given request, findfont() should not ever return the same font under two different (symlinked) paths, so the only case we can end up with the same font embedded twice is if the user specifies two symlinks to the same font by explicit paths -- which should be rare enough, and just results in a file with the same font embedded under two different names, i.e. just a missed optimization on the file size.
Looking at the history, two symlinks pointing to the same file was exactly why this link-resolving code was added in the first place: c8ae4e2. Has findfont
changed since then to make this redundant?
- findfont could well call realpath() itself on its output, saving the need for the callers to do so (but this would fail with hardlinks -- in fact the use of stat() rather than realpath() in the linked commit suggests that it intended to handle hardlinks as well).
- the linked commit does not explain why this is an issue.
Actually I just realized that I already implemented my suggestion as #15320; @sauerburger can you check whether that PR fixes your issue?
the linked commit does not explain why this is an issue.
Well, it came in around the same time as subsetting itself, so I assume the point is to make the subset as minimal as possible
I simply don't believe the same font file being accessed from two different paths occurs frequently enough (or at all...) to make the path canonicalization worthwhile. One can for example check that luatex/xetex (for which an important selling point is indeed better font support than pdftex...) don't bother with this optimization: compiling
\documentclass{article} \usepackage{fontspec} \begin{document} \setmainfont{DejaVuSans.ttf} Hello \setmainfont{times.ttf} world! \setmainfont{altdejavu.ttf} And hello \setmainfont{times.ttf} again! \end{document}
(where DejaVuSans.ttf and times.ttf are in the current directory, and altdejavu.ttf is either a hardlink or a symlink to DejaVuSans.ttf) with lualatex/xelatex and inspecting the resulting pdf using pdffonts
shows that luatex embedded the two "copies" of DejaVu separately.
I've tested the implementation of #15320. There seems to be another issue. The PDF contains errors, and now, no character is shown correctly. example.pdf, CI run.
Correct me if I'm wrong, #15320 tracks fonts by their full path (without resolving links) and uses the filename (basename) of the font to build XObject names. Doesn't this lead to problems if two fonts have the same filename, such as Arial/truetype.ttf
and OpenSans/truetype.ttf
?
I can only think of two truely unique identifiers for the fonts:
- the (resolved or unresolved) full path of the font and
- the index of the font tracked in
self.fontNames
.
(The inode isn't unique if there are multiple mounted file systems.)
What do you think about using stub font names for XObjects such as font1-minus
, font2-sum
, ... based on the internal index of the font? If this is desired, I could prepare another PR/update this PR based on #15320.
Fair point re duplicate basenames; I guess using e.g. {index}-{basename}
would work?
PR Summary
This PR addresses an issue in the PDF backend which lead to invalid PDFs. The PDF backend used two inconsistent ways to build the name for XObjects required for Unicode glyphs above code point 255.
writeFonts()
the realpath of the font returned bycbook.get_realpath_and_stat()
is used instead.The realpath differs from the file name of the font if the font is a symbolic link. In this case, the produced PDF contains invalid references to XObjects.
This commit adds a safe-guard in
PdfFile._get_xobject_symbol_name()
. The returned XObject names are always based on the realpath, even if the supplied path is a link. This achieves consistent XObject names and prevents invalid PDF files.This PR resolves #15628.
PR Checklist