Search code, repositories, users, issues, pull requests...

Copy link

Contributor Author

pawamoy commented Feb 21, 2025

Hmm, the list->set change could be seen as breaking. We can instead create a new Markdown._block_level_elements attribute, and use that in isblocklevel(). Let me know if you think that's best.

This comment was marked as resolved.

pawamoy

pawamoy commented

Feb 22, 2025

markdown/postprocessors.py Outdated Show resolved Hide resolved

markdown/postprocessors.py Show resolved Hide resolved

pawamoy

pawamoy commented

Feb 23, 2025

markdown/postprocessors.py Outdated Show resolved Hide resolved

waylan

waylan requested changes

Feb 26, 2025

Copy link

Member

@waylan waylan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good generally. However, I haven't tested it or done any performance comparisons of my own. From a quick reading of the code only the few things you pointed out stand out to me. I'll need to take a closer look when I have more time.

markdown/postprocessors.py Outdated Show resolved Hide resolved

markdown/postprocessors.py Show resolved Hide resolved

Copy link

Collaborator

facelessuser commented Feb 26, 2025

I'll likely try this branch on some projects to inspect for regressions, just to be sure

@waylan

Copy link

Member

waylan commented Feb 26, 2025

Hmm, the list->set change could be seen as breaking. We can instead create a new Markdown._block_level_elements attribute, and use that in isblocklevel(). Let me know if you think that's best.

This is a valid concern. I'm not sure what to think about it.

@waylan waylan added needs-review requires-changes labels

Feb 26, 2025

Copy link

Collaborator

facelessuser commented Feb 26, 2025

Yes, I saw the change of blocks using a list to set and meant to comment on it. I'm pretty sure it'll break some of my plugins, so keeping the original but deprecating it in favor of a new structure is probably the way to go.

@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from ce408be to 0841396 Compare

February 26, 2025 17:17

Copy link

Contributor Author

pawamoy commented Feb 26, 2025

Thanks for confirming @facelessuser! I'll change it.

@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from 0841396 to e05cce7 Compare

February 26, 2025 17:26

Copy link

Contributor Author

pawamoy commented Feb 26, 2025 •

edited

Loading

I just added a new private _block_level_elements attribute for now, but we should probably deprecate the old one and make the new one public. I'll wait for your input on this @waylan.

facelessuser

facelessuser reviewed

Feb 26, 2025

markdown/core.py Outdated Show resolved Hide resolved

@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from e05cce7 to 1bc8c54 Compare

February 26, 2025 20:20

pawamoy

pawamoy commented

Feb 27, 2025

markdown/util.py Outdated Show resolved Hide resolved

waylan

waylan reviewed

Feb 27, 2025

markdown/util.py Outdated

BLOCK_LEVEL_ELEMENTS: list[str] = [

# TODO: Raise errors from list methods in the future.

# Later, remove this class entirely and use a regular set.

class _BlockLevelElements:

Copy link

Member

@waylan waylan Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere the type is specified as set. However, as this is not a subclass of set, that would be incorrect. Also, later when we remove this, then we will have a set. Perhaps this class should properly reflect that now. Unfortunately, there are no specialized container datatypes for sets in collections. And sets are not a Sequence Type either. Not really sure how best to address this.

Copy link

Contributor Author

@pawamoy pawamoy Feb 27, 2025 •

edited

Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling was that it was OK to lie about the type, so that users actually get type warnings when they use it as a list instead of a set. That's one more (very effective) way of communicating the change. And since we implement all set functionality, the lie is not too big 🙂

Copy link

Member

@waylan waylan Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, but some users run type validation on their code and this could result in bug reports. Either we need to fully document our reasoning for the inconsistency in the code comments or we need to make a change to the code.

Copy link

Contributor Author

@pawamoy pawamoy Mar 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users running type checkers on their code is exactly what I expect, yes 😄 So, my opinion is, yes, lets keep lying about the type, and document that prominently (both with code comments and release/changelog notes).

Type correctness would otherwise mean that:

we type the variables as _BlockLevelElements
which in turn means we'd have to document this class properly
which in turn (probably) means we'd have to make it public

(also we'd likely want to use Self from typing_extensions as return annotation of methods retuning self, which means an additional dependency depending on the Python version)

I know it's far from being exhaustive but a search on GitHub returns only one result for use of md.block_level_elements (apart from @facelessuser's pymdownx which creates a new set from it, and discarding forks and venvs), and it's using remove, which is both a list and set method. That's IMO a strong indicator that md.block_level_elements has very few uses in the wild, and so immediately advertising its type as set should have a very small impact. I could be wrong though 🤷

Copy link

Member

@waylan waylan Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just type validators that I'm thinking about here. What about someone who is using isinstance (or similar) on the object in their code? They could reasonably expect isinstance(set) to return True, which is does not. For that matter existing code could reasonable expect isisntance(list) to return True. My objection is that neither return True. However, if _BlockLevelElements was a subclass of something, then it could return True for one of either set or list. That's why I pointed out above that set is not a Sequence Type. I was hoping it was and then _BlockLevelElements could just subclass the relevant baseclass and return True for both. I think we need to return True for at least one. The question is: which one? Using set is a breaking change but what it will end up being when the transition is complete. Using list avoids the breaking change, but it not forward looking. And I don't think there is any way to raise a deprecation warning if a user calls isinstance on the objection.

My point is that I think _BlockLevelElements needs to be either a set or a list at a minimum. Ideally it would be both, but that is probably not sensible. Any solution is going to be a compromise. I'm saying that having _BlockLevelElements not be either a set or a list is not the compromise I want to make.

Copy link

Contributor Author

@pawamoy pawamoy Mar 13, 2025 •

edited

Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK __instancecheck__ works the other way unfortunately, it's not meant for this use-case. Inheriting from both set and list is not possible: CPython doesn't allow it. Inheriting from set breaks a lot of tests (121 of them), not sure exactly why. Inheriting from list works, but as you pointed, isinstance(ref, set) will return False.

Copy link

Member

@waylan waylan Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheriting from both set and list is not possible: CPython doesn't allow it.

That is what I thought. Not sure why, but I had a vague recollection that that was the case.

Inheriting from set breaks a lot of tests (121 of them), not sure exactly why.

Do these tests fail if we use set directly rather than out custom class? If so, we really need to track this down. We don't want to encounter this issue for the first time in a future step of the deprecation...

That is if we ever make the deprecation. I'm thinking that this may be a blocker for this change moving forward. Can you provide some clear date on the improvements this one change makes?

Copy link

Contributor Author

@pawamoy pawamoy Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these tests fail if we use set directly rather than out custom class? If so, we really need to track this down. We don't want to encounter this issue for the first time in a future step of the deprecation...

Tests pass when block level elements variable are sets. I ran a debugging session and it turns out set(md.block_level_elements) returns an empty set for some reason (while list(md.block_level_elements) returns a valid list with all elements). I think it's because by inheriting from set, _BlockLevelElements indeed becomes a set, but an empty one, since we store elements in self._set (and self._list). CPython probably then takes a shortcut and doesn't use __iter__. I believe we can solve this by getting rid of self._set and calling parent methods instead of self._set.METHOD.

That is if we ever make the deprecation. I'm thinking that this may be a blocker for this change moving forward. Can you provide some clear date on the improvements this one change makes?

Yep, totally understandable, 700 lines for changing a variable from list to set is not the most elegant backward compatible change. I don't have clear data, but I can tell you this:

without the post-processor change, the list->set would have been very important
but now that the post-processor is much more efficient, the list->set change brings a very, very tiny perf gain (at least in the context of the raw HTML post-processor)

To be honest I wouldn't mind if we dropped the list->set change entirely, to make this PR easier to merge. We can always revisit later.

Copy link

Member

@waylan waylan Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but now that the post-processor is much more efficient, the list->set change brings a very, very tiny perf gain (at least in the context of the raw HTML post-processor)

That is pretty significant.

To be honest I wouldn't mind if we dropped the list->set change entirely, to make this PR easier to merge. We can always revisit later.

I agree. Let's do that.

Copy link

Contributor Author

@pawamoy pawamoy Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I squashed all fixups, and added a revert commit, so that we keep a trace of it in this PR (while you just have to squash when merging to get rid of both).

pawamoy

pawamoy commented

Copy link

Contributor Author

@pawamoy pawamoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First batch of tests added. It's pretty basic. A second batch should be added to test lists with duplicate elements within them. I updated the code to try and reduce the possibility of duplicates but it's not thorough.

pawamoy

pawamoy commented

markdown/util.py Outdated Show resolved Hide resolved

pawamoy

pawamoy commented

markdown/util.py Outdated Show resolved Hide resolved

pawamoy

pawamoy commented

markdown/util.py Outdated Show resolved Hide resolved

waylan

waylan reviewed

markdown/extensions/md_in_html.py Outdated Show resolved Hide resolved

waylan

waylan reviewed

tests/test_block_level_elements.py Outdated Show resolved Hide resolved

Copy link

Contributor Author

pawamoy commented Mar 1, 2025 •

edited

Loading

About duplicates: I don't prevent their insertion, otherwise we lose backwards compatibility again. We'll just have to add test cases with duplicates 🙂

A few other remarks/suggestions:

we could use warnings.deprecated to decorate deprecated methods (instead of calling warnings.warn ourselves), so that type checkers will warn about their use, and IDEs will strike them through (visual indication of deprecation). For Python less than 3.13 we'd have to depend on typing-extensions.
in any case, we could pass stacklevel=2 to warnings, so that they show the user line calling these deprecated methods.
I've developed a small tool called Yore that aims at helping manage legacy code. Instead of adding various # TODO comments throughout the code, you instead add # YORE structured comments and can run yore check to check if legacy code should be removed/transformed (Python version just reached its end of life / you're ready to bump the major version) and yore fix to automatically apply prepared transforms. Just sharing, can help setting it up if you're interested 😄 (it's just a matter of adding the dev-dep and maybe calling it in tox/CI).

Copy link

Contributor Author

pawamoy commented Mar 6, 2025

I added tests with duplicates.

pawamoy

pawamoy commented

Mar 6, 2025

feature: Back links! mkdocstrings/mkdocstrings#723

markdown/util.py Outdated Show resolved Hide resolved

@pawamoy pawamoy mentioned this pull request

Mar 10, 2025

Closed

@waylan

Copy link

Member

waylan commented Mar 12, 2025

Sorry, but I was out for a couple of weeks. Back now and trying to catch up.

we could use warnings.deprecated to decorate deprecated methods (instead of calling warnings.warn ourselves

Yes, let's do that. I would have mentioned this earlier but missed that we weren't using warn.deprecated already.

we could pass stacklevel=2 to warnings

We have not traditionally done this in the past. I have always taken the position that we are not responsible for Python's default behavior. As Python by default hides deprecation warnings, then users should expect that and act accordingly. But I'm not opposed to this either.

I've developed a small tool called Yore that aims at helping manage legacy code. Instead of adding various # TODO comments throughout the code, you instead add # YORE structured comments and can run yore check

It is easy enough to search for #TODO. Adding dependencies just creates more complexity.

Copy link

Contributor Author

pawamoy commented Mar 12, 2025

Sorry, but I was out for a couple of weeks. Back now and trying to catch up.

Absolutely no worries, and no need to apologize, thank you very much for your answers 🙂 I'll address the various points you've answered.

Copy link

Contributor Author

pawamoy commented Mar 13, 2025 •

edited

Loading

Oh, I noticed again that you have your own deprecated implementation. I'll use it instead of warnings.deprecated or typing_extensions.deprecated under 3.13, and we can later remove this custom implementation in favor of the official one, to reduce the amount of changes in this PR. I'll just have to move the function definition up, since we'll have to use it at the module-level.

pawamoy

pawamoy commented

Mar 13, 2025

markdown/util.py Outdated Show resolved Hide resolved

pawamoy added 4 commits

March 26, 2025 13:49


 Use set instead of list for block level elements

eae8169

Using a set allows for better performances when checking for membership of a tag within block level elements.
Issue-1507: Python-Markdown#1507


 Don't precompute placeholder replacements in raw HTML post-processor

84e3b2a

Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash.
It would then apply a regular expression substitution using these replacements.
Finally, if the text changed, it would recurse, and do all that again.
This was inefficient because placeholders were re-computed each time it recursed, and because only a few replacements would be used anyway.
This change moves the recursion into the regular expression substitution, so that:
1. the regular expression does minimal work on the text (contrary to re-scanning text already scanned in previous frames);
2. but more importantly, replacements aren't computed ahead of time anymore (and even less *several times*), and only fetched from the HTML stash as placeholders are found in the text.
The substitution function relies on the regular expression groups ordering: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap again the substitution result, or not (also depending on whether the substituted HTML is a block-level tag).
Issue-1507: Python-Markdown#1507


 Add changelog entry for improved raw HTML post-processor perfs

b1fad82


 Revert "Use set instead of list for block level elements"

e6b086b

This reverts commit eae8169.

@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from 0fbc19d to e6b086b Compare

March 26, 2025 12:50

waylan

waylan approved these changes

Mar 26, 2025

Copy link

Member

@waylan waylan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That greatly simplifies this PR. Looks good to me. @facelessuser could you take another look at this. I think it is probably ready to go but would like another set of eyes on it.

Copy link

Collaborator

facelessuser commented Mar 26, 2025

@waylan I'll take a look this evening and update with my review.

facelessuser

facelessuser approved these changes

Mar 26, 2025

Copy link

Collaborator

@facelessuser facelessuser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@waylan waylan merged commit 1caf028 into Python-Markdown:master

Mar 27, 2025

14 checks passed

@waylan waylan added approved and removed needs-review requires-changes labels

Mar 27, 2025

bug: Performance issue 0.29 mkdocstrings/mkdocstrings#756

Copy link

Contributor Author

pawamoy commented Mar 27, 2025

Thank you both for your patience and reviews 🙏

@pawamoy pawamoy mentioned this pull request

Apr 7, 2025

Closed

@ofek

Copy link

ofek commented Apr 9, 2025

When might this be released?