Changing every link on page, in fastest way possible

Question 1

I have created a script (mediawiki gadget), which iterates over every redlink in a rendered wiki-page (that is, over every href with class new, signifying an internal link to an article that does not exist), and checks if the title exists in a set of known pages. If it does, a few attributes are changed. The execution is extremely simple, but I need it to be as fast as possible.

The set of known pages contains nearly 20K strings, here omitted:

const rawPages=new Set([...])
$("#mw-content-text a.new").each(function() {
 var b=$(this),a=b.attr("title");
 !a||
 a.startsWith("קאַטעגאָריע:") ||
 a.startsWith("טעקע:") ||
 a.startsWith("שמועס:") ||
 a.startsWith("באניצער:") ||
 a.startsWith("המכלול:") ||
 a.startsWith("הילף:") ||
 a.startsWith("מעדיעװיקי:") ||
 a.startsWith("מוסטער:") ||
 a.startsWith("יחידה:") ||
 a.startsWith("דרעפט:") ||
 a.startsWith("טעמע:") ||
 (a=a.replace(" (בלאַט עקזיסטירט נאכנישט)", ""),
 rawPages.has(a)&&
 b.attr({
 "class": "raw",
 "title": a + " (רוי)",
 "href": "/"+encodeURIComponent("רוי")+":"+encodeURIComponent(a)
 }))});

At first, the script checks if the title begins with some prefixes, in which case we are not interested in modifying the link. But the rawPages set does not contain any such title anyway; would it be faster to check directly for membership, instead of trying to short-circuit with multiple startsWith calls?

Anything else I can do, to speed it up?

Thanks!

Question 2

What is the reason that this is being done on the frontend and not the backend? Wikimedia has an API for this kind of query.

Question 3

@Reinderien, can you elaborate or point me to what API you have in mind? Thanks!

Question 4

Please do not edit the question, especially the code, after an answer has been posted. Changing the question may cause answer invalidation. Everyone needs to be able to see what the reviewer was referring to. What to do after the question has been answered.

Question 5

api.wikimedia.org/wiki/Core_REST_API/Reference

Question 6

@Reinderien, the set contains a list of pages that do exist, just not in the main namespace (and in the processes of being replaced with main-namespace articles) , and they should therefore be handled as semi-existant, so to speak - including changing the href. I don't see how this can be done more efficiently via API calls; surely sending a request for every redlink is not faster than a prepared list...

Question 7

The easiest way to get a performance boost in your code is to not use jQuery:

const rawPages = new Set([...]);
const anchors = document.querySelectorAll("#mw-content-text a.new");
for (let i = 0; i < anchors.length; i++) {
 const a = anchors[i];
 const title = a.title;
 if (!rawPages.has(title)) {
 continue;
 }
 a.title = title + " (רוי)";
 a.classList.add("raw");
 // concatenation might be faster than this interpolation, test it.
 a.href = `/${encodeURIComponent("רוי"):${encodeURIComponent(title)}`;
}

I've also removed the startsWith calls as I'm not sure whether that will be saving computation on average.

Question 8

+1 for removal of jQuery... Minor points, not a performance boost, the modern version would use for of loop rather than the noisey for ;; loop where possible. Eg for (const a of anchors) {, and generally it's always best to avoid continue especially if the alternative 'if (rawPages.has(title)) { .... a.href = ...; }`code is simpler

Question 9

Agreed that for of is generally preferred, modern or not. However, it's hard to beat a for loop for performance-critical code. I think the continue vs nesting is a personal preference but I agree with you that nesting is probably clearer here.

Question 10

You must match T titles against a dictionary of W words, and your current code runs in O(T ×ばつ W) time.

For at least some of your thousands of strings (and for all eleven that appeared in your example) the spec is to find each title that .startsWith this single-word regex: ^(\w+):

So create a Set() lexicon containing all those words. Then for each title, use that regex to parse out the leading word (if any), and look it up in the the lexicon. If found, perform the .replace operation. That's O(1) constant time per title, total of O(T) time for the page.

If you need a leading phrase rather than leading word, use ^([ \w]+): or similar expression.

Question 11

Perhaps I have not expressed myself clearly (or I don't get what you are saying); English is not my native language. The spec is not to find those that startsWith the provided examples - quite the opposite. The spec is to find those which exist in the rawPages set. I check those that startsWith before, to avoid searching the big set when possible.

Question 12

I have now added the first few strings of the set to be searched. The task is to find redlinks pointing to articles existing in this set, and modify said redlinks.

Question 13

The code in the question already uses a set for lookups so I'd say it's O(T) already.

Question 14

@RobH I took W in the example to be eleven, and in the described problem to be closer to twenty thousand. Subsequent comments suggest that was not the original intent.

RobH RobH 17.1k6 gold badges38 silver badges73 bronze badges · Answer 1 · 2023-07-20 07:27:37Z

The easiest way to get a performance boost in your code is to not use jQuery:

const rawPages = new Set([...]);
const anchors = document.querySelectorAll("#mw-content-text a.new");
for (let i = 0; i < anchors.length; i++) {
 const a = anchors[i];
 const title = a.title;
 if (!rawPages.has(title)) {
 continue;
 }
 a.title = title + " (רוי)";
 a.classList.add("raw");
 // concatenation might be faster than this interpolation, test it.
 a.href = `/${encodeURIComponent("רוי"):${encodeURIComponent(title)}`;
}

I've also removed the startsWith calls as I'm not sure whether that will be saving computation on average.

+1 for removal of jQuery... Minor points, not a performance boost, the modern version would use for of loop rather than the noisey for ;; loop where possible. Eg for (const a of anchors) {, and generally it's always best to avoid continue especially if the alternative 'if (rawPages.has(title)) { .... a.href = ...; }`code is simpler
Agreed that for of is generally preferred, modern or not. However, it's hard to beat a for loop for performance-critical code. I think the continue vs nesting is a personal preference but I agree with you that nesting is probably clearer here.

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Answer 2 · 2023-07-19 14:38:16Z

1

\$\begingroup\$

You must match T titles against a dictionary of W words, and your current code runs in O(T ×ばつ W) time.

For at least some of your thousands of strings (and for all eleven that appeared in your example) the spec is to find each title that .startsWith this single-word regex: ^(\w+):

So create a Set() lexicon containing all those words. Then for each title, use that regex to parse out the leading word (if any), and look it up in the the lexicon. If found, perform the .replace operation. That's O(1) constant time per title, total of O(T) time for the page.

If you need a leading phrase rather than leading word, use ^([ \w]+): or similar expression.

Share

edited Jul 19, 2023 at 15:23

answered Jul 19, 2023 at 14:38

J_H's user avatar

J_H J_H

41.4k3 gold badges38 silver badges157 bronze badges

\$\endgroup\$

4

\$\begingroup\$ Perhaps I have not expressed myself clearly (or I don't get what you are saying); English is not my native language. The spec is not to find those that startsWith the provided examples - quite the opposite. The spec is to find those which exist in the rawPages set. I check those that startsWith before, to avoid searching the big set when possible. \$\endgroup\$

philo math
– philo math

2023年07月19日 15:21:29 +00:00
Commented Jul 19, 2023 at 15:21
\$\begingroup\$ I have now added the first few strings of the set to be searched. The task is to find redlinks pointing to articles existing in this set, and modify said redlinks. \$\endgroup\$

philo math
– philo math

2023年07月19日 15:31:19 +00:00
Commented Jul 19, 2023 at 15:31
\$\begingroup\$ The code in the question already uses a set for lookups so I'd say it's O(T) already. \$\endgroup\$

RobH
– RobH

2023年07月20日 07:19:35 +00:00
Commented Jul 20, 2023 at 7:19
\$\begingroup\$ @RobH I took W in the example to be eleven, and in the described problem to be closer to twenty thousand. Subsequent comments suggest that was not the original intent. \$\endgroup\$

J_H
– J_H

2023年07月20日 15:08:40 +00:00
Commented Jul 20, 2023 at 15:08

Add a comment |

Stack Exchange Network

Changing every link on page, in fastest way possible

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Changing every link on page, in fastest way possible

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions