Module:Wikitext Parsing/sandbox
See also the companion subpage for test cases.
To avoid major disruption and server load, any changes should be tested in the module's /sandbox or /testcases subpages, or in your own module sandbox. The tested changes can be added to this page in a single edit. Consider discussing changes on the talk page before implementing them.
This module provides functions to help with the complex edge cases faced by modules such as Module:Template parameter value, which need to process the raw wikitext of a page reliably while respecting nowiki tags and similar content. It is designed to be called by other modules and does not support being invoked from wikitext.
PrepareText
PrepareText(text, keepComments) will run any content within certain tags that normally disable processing (<nowiki>, <pre>, <syntaxhighlight>, <source>, <math>) through mw.text.nowiki and remove HTML comments. This allows tricky syntax to be parsed through more basic means such as %b{} by other modules without worrying about edge cases.
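To illustrate why this preparation matters, the following is a minimal sketch in plain Lua, not the module's implementation. The helper escapeBraces is a hypothetical stand-in for mw.text.nowiki that escapes only {, } and |, and the tag handling covers a plain <nowiki> pair only.

```lua
-- Hypothetical stand-in for mw.text.nowiki: it only escapes {, } and | as
-- numeric HTML entities, which is enough for this demonstration.
local function escapeBraces(s)
	return (s:gsub("[{}|]", function(c)
		return "&#" .. c:byte() .. ";"
	end))
end

local raw = "{{Text|<nowiki>}}</nowiki>|B}}"

-- Without preparation, %b{} balances against the braces inside <nowiki>.
local naive = raw:match("%b{}")

-- Escape the content of each <nowiki> pair (simplified: attributes,
-- unclosed tags and the other tag names are not handled here).
local prepared = raw:gsub("(<nowiki>)(.-)(</nowiki>)", function(open, body, close)
	return open .. escapeBraces(body) .. close
end)
local careful = prepared:match("%b{}")

print(naive)   -- {{Text|<nowiki>}}
print(careful) -- {{Text|<nowiki>&#125;&#125;</nowiki>|B}}
```

Once the braces inside the tag are escaped, the simple balanced match covers the whole template rather than stopping early.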
If the second parameter, keepComments, is set to true, the content of HTML comments will be passed through mw.text.nowiki instead of being removed entirely.
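The difference between the two modes can be sketched in plain Lua as follows. This is an illustration only: escape is a hypothetical stand-in for mw.text.nowiki, handleComments is a made-up name, and the sketch handles closed comments only, whereas the module also treats an unterminated comment as consuming the rest of the text.

```lua
-- Hypothetical stand-in for mw.text.nowiki (escapes only {, } and |).
local function escape(s)
	return (s:gsub("[{}|]", function(c)
		return "&#" .. c:byte() .. ";"
	end))
end

-- Simplified comment handling: closed comments only.
local function handleComments(text, keepComments)
	return (text:gsub("<!%-%-(.-)%-%->", function(body)
		if keepComments then
			-- Keep the comment, but escape what is inside it
			return "<!--" .. escape(body) .. "-->"
		end
		-- Default behaviour: drop the comment entirely
		return ""
	end))
end

local input = "{{Text|A<!-- |hidden -->|B}}"
print(handleComments(input, false)) -- {{Text|A|B}}
print(handleComments(input, true))  -- {{Text|A<!-- &#124;hidden -->|B}}
```

In both modes the pipe inside the comment can no longer be mistaken for a parameter separator.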
Any code that uses this function directly and returns part of the processed text should consider running mw.text.decode on its output at the end. Note that this also decodes any input that was already encoded outside a no-processing tag. This is unlikely to be a significant issue, but it is worth noting.
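The decoding caveat can be seen in a small plain-Lua sketch. Here decodeNumeric is a hypothetical stand-in for mw.text.decode that handles decimal numeric entities only, and the prepared string is an assumed result of escaping, written by hand rather than produced by the module.

```lua
-- Hypothetical stand-in for mw.text.decode: decimal numeric entities only.
local function decodeNumeric(s)
	return (s:gsub("&#(%d+);", function(n)
		return string.char(tonumber(n))
	end))
end

-- The input already contains an encoded pipe outside the <nowiki> tag.
local original = "A &#124; B <nowiki>|</nowiki>"
-- Assumed result of escaping the <nowiki> content of `original`.
local prepared = "A &#124; B <nowiki>&#124;</nowiki>"

local decoded = decodeNumeric(prepared)
print(decoded)             -- A | B <nowiki>|</nowiki>
print(decoded == original) -- false
```

The pipe inside the tag is restored as intended, but the entity that was present in the input from the start is decoded as well, so the round trip is not exact.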
ParseTemplates
ParseTemplates(InputText, dontEscape) will attempt to parse all {{Templates}} on a page, handling multiple factors such as [[Wikilinks]] and {{{Variables}}} among other complex syntax. Because of this complexity, the function is considerably slow and should be used carefully. The function returns a list of template objects in order of appearance, which have the following properties:
- Args: A key-value set of arguments, not in order
- ArgOrder: A list of keys in the order they appear in the template
- Children: A list of template objects that are contained within the existing template, in order of appearance. Only immediate children are listed
- Name: The name of the template
- Text: The raw text of the template
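As an illustration of these properties, the tables below are written by hand to show what the result for {{Foo|a|k=v|{{Bar}}}} might look like, based on a reading of the module source; they are not captured output. The helper argsInOrder is hypothetical. Note that positional arguments appear to be stored under string keys such as "1", since the source builds them with tostring.

```lua
-- Hand-written illustration of the template objects for:
--   {{Foo|a|k=v|{{Bar}}}}
local bar = {
	Name = "Bar",
	Text = "{{Bar}}",
	Args = {},
	ArgOrder = {},
	Children = {},
}
local foo = {
	Name = "Foo",
	Text = "{{Foo|a|k=v|{{Bar}}}}",
	Args = { ["1"] = "a", k = "v", ["2"] = "{{Bar}}" },
	ArgOrder = { "1", "k", "2" },
	Children = { bar }, -- immediate children only
}
-- The returned list holds every template, in order of appearance.
local templates = { foo, bar }

-- Args is unordered, so walk ArgOrder to recover the original order.
local function argsInOrder(template)
	local out = {}
	for _, key in ipairs(template.ArgOrder) do
		out[#out + 1] = key .. "=" .. template.Args[key]
	end
	return table.concat(out, "|")
end

print(argsInOrder(foo)) -- 1=a|k=v|2={{Bar}}
```

Iterating over Args with pairs gives no ordering guarantee, which is why ArgOrder is provided alongside it.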
If the second parameter, dontEscape, is set to true, the input text won't be run through the PrepareText function.
require("strict")

-- Helper functions for PrepareText --
local function startswith(text, subtext)
	return string.sub(text, 1, #subtext) == subtext
end
local function endswith(text, subtext)
	return string.sub(text, -#subtext, -1) == subtext
end
local function allcases(s)
	return s:gsub("%a", function(c)
		return "[" .. c:upper() .. c:lower() .. "]"
	end)
end

--[=[ Implementation notes
---- NORMAL HTML TAGS ----
Tags are very strict on how they want to start, but loose on how they end.
The start must strictly follow <[tAgNaMe](%s|>) with no room for whitespace in
the tag's name, but may then flow as they want afterwards, making
<div\nclass\n=\n"\nerror\n"\n> valid.
If a tag has no end, it will consume all text instead of not processing.

There's no sense of escaping < or >
E.g.
 <div class="error\>"> will end at \> despite it being inside a quote
 <div class="<span class="error">error</span>"> will not process the larger div

---- NOPROCESSING TAGS (nowiki, pre, syntaxhighlight, source, etc.) ----
(Note: <source> is the deprecated version of <syntaxhighlight>)
No-processing tags have some differences to the above rules.
Specifically, their syntax is a lot stricter. While an opening tag
follows the same set of rules, A closing tag can't have any sort of extra
formatting. </div a/a> is valid, </nowiki a/a> is not. Only newlines and
spaces/tabs are allowed in closing tags.
Note that, even though tags may cause a visual change without an ending tag
like <pre>, one is required for the no-processing effects.
Both the content inside the tag pair and the text in the tags will not be
processed.
E.g. <nowiki |}}>|}}</nowiki> would have both of the |}} escaped.

Since we only care about these no-processing tags, we can ignore the idea of an
intercepting tag messing us up, and just go for the first ending we can find.
If there is no ending, the tag will NOT consume the rest of the text.
Even if there is no ending tag, the content inside the opening tag will still be
unprocessed, meaning {{X20|<nowiki }}>}} wouldn't end at the first }} despite
there being no ending tag.

There are some tags, like <math>, which also function like <nowiki> for our
purposes, and are handled here. Some other tags, like <ref>, have more complex
behaviour that can't be reasonably implemented in this, and so are ignored.
I suspect that every tag listed in [[Special:Version]] may behave somewhat like
this, but that's far too many cases worth checking for rarely used tags that may
not even have a good reason to contain {{ or }} anyways, so we leave them alone.

---- INCLUDEONLY ----
While includeonly tags do technically serve the same purpose as a nowiki tag for
what this module does, contextually they don't always make sense to escape, and
as such are ignored by this module entirely.
--]=]
-- This function expects the string to start with the tag
local validtags = {nowiki = 1, pre = 1, syntaxhighlight = 1, source = 1, math = 1}
local function TestForNowikiTag(text, scanPosition)
	local tagName = string.match(text, "^<([^\n />]+)", scanPosition)
	if not tagName or not validtags[string.lower(tagName)] then
		return nil
	end
	local nextOpener = string.find(text, "<", scanPosition + 1) or -1
	local nextCloser = string.find(text, ">", scanPosition + 1)
	if nextCloser and (nextOpener == -1 or nextCloser < nextOpener) then
		local startingTag = string.sub(text, scanPosition, nextCloser)
		-- We have our starting tag (E.g. '<pre style="color:red">')
		-- Now find our ending...
		if endswith(startingTag, "/>") then -- self-closing tag (we are our own ending)
			return {
				Tag = tagName,
				Start = startingTag,
				Content = "", End = "",
				Length = #startingTag
			}
		else
			local endingTagStart, endingTagEnd = string.find(text, "</" .. allcases(tagName) .. "[ \t\n]*>", scanPosition)
			if endingTagStart then -- Regular tag formation
				local endingTag = string.sub(text, endingTagStart, endingTagEnd)
				local tagContent = string.sub(text, nextCloser + 1, endingTagStart - 1)
				return {
					Tag = tagName,
					Start = startingTag,
					Content = tagContent,
					End = endingTag,
					Length = #startingTag + #tagContent + #endingTag
				}
			else -- Content inside still needs escaping (also linter error!)
				return {
					Tag = tagName,
					Start = startingTag,
					Content = "", End = "",
					Length = #startingTag
				}
			end
		end
	end
	return nil
end

--[=[ Implementation Notes
HTML Comments are about as basic as it gets. Start at <!--, end at -->, no extra
conditions.
If a comment has no end, it will eat all the text ahead.
--]=]
local function TestForComment(text, scanPosition)
	if string.match(text, "^<!%-%-", scanPosition) then
		local commentEnd = string.find(text, "-->", scanPosition + 4, true)
		if commentEnd then
			return {
				Start = "<!--", End = "-->",
				Content = string.sub(text, scanPosition + 4, commentEnd - 1),
				Length = commentEnd - scanPosition + 3
			}
		else -- Consumes all text if not given an ending
			return {
				Start = "<!--", End = "",
				Content = string.sub(text, scanPosition + 4),
				Length = #text - scanPosition + 1
			}
		end
	end
	return nil
end

--[[ Implementation notes
The goal of this function is to escape all text that wouldn't be parsed if it
was preprocessed (see above implementation notes).

Using keepComments will keep all HTML comments instead of removing them. They
will still be escaped regardless.
--]]
local function PrepareText(text, keepComments)
	local newtext = {}
	local scanPosition = 1
	while true do
		local NextCheck = string.find(text, "<[NnSsPpMm!]", scanPosition) -- Advance to the next potential tag we care about
		if not NextCheck then -- Done
			newtext[#newtext + 1] = string.sub(text, scanPosition)
			break
		end
		newtext[#newtext + 1] = string.sub(text, scanPosition, NextCheck - 1)
		scanPosition = NextCheck
		local Comment = TestForComment(text, scanPosition)
		if Comment then
			if keepComments then
				newtext[#newtext + 1] = Comment.Start .. mw.text.nowiki(Comment.Content) .. Comment.End
			end
			scanPosition = scanPosition + Comment.Length
		else
			local Tag = TestForNowikiTag(text, scanPosition)
			if Tag then
				local newTagStart = "<" .. mw.text.nowiki(string.sub(Tag.Start, 2, -2)) .. ">"
				local newTagEnd =
					Tag.End == "" and "" or -- Respect no tag ending
					"</" .. mw.text.nowiki(string.sub(Tag.End, 3, -2)) .. ">"
				local newContent = mw.text.nowiki(Tag.Content)
				newtext[#newtext + 1] = newTagStart .. newContent .. newTagEnd
				scanPosition = scanPosition + Tag.Length
			else -- Nothing special, move on...
				newtext[#newtext + 1] = string.sub(text, scanPosition, scanPosition)
				scanPosition = scanPosition + 1
			end
		end
	end
	return table.concat(newtext, "")
end

-- Helper functions for ParseTemplates --
local function boundlen(pair)
	return pair.End - pair.Start + 1
end

--mw.text.trim is expensive due to utf-8 support, so we use a cheap version
local trimcache = {}
local whitespace = {[" "] = 1, ["\n"] = 1, ["\t"] = 1, ["\r"] = 1}
local function cheaptrim(str)
	local quick = trimcache[str]
	if quick then
		return quick
	else
		-- local out = string.gsub(str, "^%s*(.-)%s*$", "%1")
		local lowEnd
		local strlen = #str
		for i = 1, strlen do
			if not whitespace[string.sub(str, i, i)] then
				lowEnd = i
				break
			end
		end
		if not lowEnd then
			trimcache[str] = ""
			return ""
		end
		for i = strlen, 1, -1 do
			if not whitespace[string.sub(str, i, i)] then
				local out = string.sub(str, lowEnd, i)
				trimcache[str] = out
				return out
			end
		end
	end
end

--[=[ ParseTemplates Implementation notes
WARNING: This is not perfect (far from it), and likely overkill. Stick to
PrepareText above, as it should be enough in most cases.

This function is an alternative to Transcluder's getParameters which considers
the potential for a singular { or } or other odd syntax that %b doesn't like to
be in a parameter's value.

When handling the difference between {{ and {{{, mediawiki will attempt to match
as many sequences of {{{ as possible before matching a {{
E.g.
 {{{{A}}}} -> { {{{A}}} }
 {{{{{{{{Text|A}}}}}}}} -> {{ {{{ {{{Text|A}}} }}} }}
If there aren't enough triple braces on both sides, the parser will compromise
for a template interpretation.
E.g.
 {{{{A}} }} -> {{ {{ A }} }}

While there are technically concerns about things such as wikilinks breaking
template processing (E.g. {{[[}}]]}} doesn't stop at the first }}), it shouldn't
be our job to process inputs perfectly when the input has garbage ({ / } isn't
legal in titles anyways, so if something's unmatched in a wikilink, it's
guaranteed GIGO)

Setting dontEscape will prevent running the input text through PrepareText.
Avoid setting this to true if you don't have to set it.

Returned values:
A table of all templates.
Template data goes as follows:
 Text: The raw text of the template
 Name: The name of the template
 Args: A key-value set of arguments
 ArgOrder: A list of argument keys in order of appearance
 Children: A list of immediate template children
--]=]
--Main function
local function ParseTemplates(InputText, dontEscape)
	--Setup
	if not dontEscape then
		InputText = PrepareText(InputText)
	end
	local function finalise(text)
		if not dontEscape then
			return mw.text.decode(text)
		else
			return text
		end
	end
	local function CreateContainerObj(Container)
		Container.Text = {}
		Container.Args = {}
		Container.ArgOrder = {}
		Container.Children = {}
		-- Container.Name = nil
		-- Container.Value = nil
		-- Container.Key = nil
		Container.BeyondStart = false
		Container.LastIndex = 1
		Container.finalise = finalise
		function Container:HandleArgInput(character, internalcall)
			if not internalcall then
				self.Text[#self.Text + 1] = character
			end
			if character == "=" then
				if self.Key then
					self.Value[#self.Value + 1] = character
				else
					self.Key = cheaptrim(self.Value and table.concat(self.Value, "") or "")
					self.Value = {}
				end
			else --"|" or "}"
				if not self.Name then
					self.Name = cheaptrim(self.Value and table.concat(self.Value, "") or "")
					self.Value = nil
				else
					self.Value = self.finalise(self.Value and table.concat(self.Value, "") or "")
					if self.Key then
						self.Key = self.finalise(self.Key)
						self.Args[self.Key] = cheaptrim(self.Value)
						self.ArgOrder[#self.ArgOrder + 1] = self.Key
					else
						local Key = tostring(self.LastIndex)
						self.Args[Key] = self.Value
						self.ArgOrder[#self.ArgOrder + 1] = Key
						self.LastIndex = self.LastIndex + 1
					end
					self.Key = nil
					self.Value = nil
				end
			end
		end
		function Container:AppendText(text, ftext)
			self.Text[#self.Text + 1] = (ftext or text)
			if not self.Value then
				self.Value = {}
			end
			self.BeyondStart = self.BeyondStart or (#table.concat(self.Text, "") > 2)
			if self.BeyondStart then
				self.Value[#self.Value + 1] = text
			end
		end
		function Container:Clean(IsTemplate)
			self.Text = table.concat(self.Text, "")
			if self.Value and IsTemplate then
				self.Value = {string.sub(table.concat(self.Value, ""), 1, -3)} --Trim ending }}
				self:HandleArgInput("|", true) --Simulate ending
			end
			self.Value = nil
			self.Key = nil
			self.BeyondStart = nil
			self.LastIndex = nil
			self.finalise = nil
			self.HandleArgInput = nil
			self.AppendText = nil
			self.Clean = nil
		end
		return Container
	end

	--Step 1: Find and escape the content of all wikilinks on the page, which are stronger than templates (see implementation notes)
	local scannerPosition = 1
	local wikilinks = {}
	local openWikilinks = {}
	while true do
		local Position, _, Character = string.find(InputText, "([%[%]])%1", scannerPosition)
		if not Position then --Done
			break
		end
		scannerPosition = Position + 2 --+2 to pass the [[ / ]]
		if Character == "[" then --Add a [[ to the pending wikilink queue
			openWikilinks[#openWikilinks + 1] = Position
		else --Pair up the ]] to any available [[
			if #openWikilinks >= 1 then
				local start = table.remove(openWikilinks) --Pop the latest [[
				wikilinks[start] = {Start = start, End = Position + 1, Type = "Wikilink"} --Note the pair
			end
		end
	end

	--Step 2: Find the bounds of every valid template and variable ({{ and {{{)
	local scannerPosition = 1
	local templates = {}
	local variables = {}
	local openBrackets = {}
	while true do
		local Start, _, Character = string.find(InputText, "([{}])%1", scannerPosition)
		if not Start then --Done (both 9e9)
			break
		end
		local _, End = string.find(InputText, "^" .. Character .. "+", Start)
		scannerPosition = Start --Get to the {{ / }} set
		if Character == "{" then --Add the {{+ set to the queue
			openBrackets[#openBrackets + 1] = {Start = Start, End = End}
		else --Pair up the }} to any available {{, accounting for {{{ / }}}
			local BracketCount = End - Start + 1
			while BracketCount >= 2 and #openBrackets >= 1 do
				local OpenSet = table.remove(openBrackets)
				if boundlen(OpenSet) >= 3 and BracketCount >= 3 then --We have a {{{variable}}} (both sides have 3 spare)
					variables[OpenSet.End - 2] = {Start = OpenSet.End - 2, End = scannerPosition + 2, Type = "Variable"} --Done like this to ensure chronological order
					BracketCount = BracketCount - 3
					OpenSet.End = OpenSet.End - 3
					scannerPosition = scannerPosition + 3
				else --We have a {{template}} (both sides have 2 spare, but at least one side doesn't have 3 spare)
					templates[OpenSet.End - 1] = {Start = OpenSet.End - 1, End = scannerPosition + 1, Type = "Template"} --Done like this to ensure chronological order
					BracketCount = BracketCount - 2
					OpenSet.End = OpenSet.End - 2
					scannerPosition = scannerPosition + 2
				end
				if boundlen(OpenSet) >= 2 then --Still has enough data left, leave it in
					openBrackets[#openBrackets + 1] = OpenSet
				end
			end
		end
		scannerPosition = End --Now move past the bracket set
	end

	--Step 3: Re-trace every object using their known bounds, collecting our parameters with (slight) ease
	local scannerPosition = 1
	local activeObjects = {}
	local finalObjects = {}
	while true do
		local LatestObject = activeObjects[#activeObjects] --Commonly needed object
		local NNC, _, Character --NNC = NextNotableCharacter
		if LatestObject then
			NNC, _, Character = string.find(InputText, "([{}%[%]|=])", scannerPosition)
		else
			NNC, _, Character = string.find(InputText, "([{}])", scannerPosition) --We are only after templates right now
		end
		if not NNC then
			break
		end
		if NNC > scannerPosition and LatestObject then
			local scannedContent = string.sub(InputText, scannerPosition, NNC - 1)
			LatestObject:AppendText(scannedContent, finalise(scannedContent))
		end
		scannerPosition = NNC + 1
		if Character == "{" or Character == "[" then
			local Container = templates[NNC] or variables[NNC] or wikilinks[NNC]
			if Container then
				CreateContainerObj(Container)
				if Container.Type == "Template" then
					Container:AppendText("{{")
					scannerPosition = NNC + 2
				elseif Container.Type == "Variable" then
					Container:AppendText("{{{")
					scannerPosition = NNC + 3
				else --Wikilink
					Container:AppendText("[[")
					scannerPosition = NNC + 2
				end
				if LatestObject and Container.Type == "Template" then --Only templates count as children
					LatestObject.Children[#LatestObject.Children + 1] = Container
				end
				activeObjects[#activeObjects + 1] = Container
			elseif LatestObject then
				LatestObject:AppendText(Character)
			end
		elseif Character == "}" or Character == "]" then
			if LatestObject then
				LatestObject:AppendText(Character)
				if LatestObject.End == NNC then
					if LatestObject.Type == "Template" then
						LatestObject:Clean(true)
						finalObjects[#finalObjects + 1] = LatestObject
					else
						LatestObject:Clean(false)
					end
					activeObjects[#activeObjects] = nil
					local NewLatest = activeObjects[#activeObjects]
					if NewLatest then
						NewLatest:AppendText(LatestObject.Text) --Append to new latest
					end
				end
			end
		else --| or =
			if LatestObject then
				LatestObject:HandleArgInput(Character)
			end
		end
	end

	--Step 4: Fix the order
	local FixedOrder = {}
	local SortableReference = {}
	for _, Object in next, finalObjects do
		SortableReference[#SortableReference + 1] = Object.Start
	end
	table.sort(SortableReference)
	for i = 1, #SortableReference do
		local start = SortableReference[i]
		for n, Object in next, finalObjects do
			if Object.Start == start then
				finalObjects[n] = nil
				Object.Start = nil --Final cleanup
				Object.End = nil
				Object.Type = nil
				FixedOrder[#FixedOrder + 1] = Object
				break
			end
		end
	end

	--Finished, return
	return FixedOrder
end

local p = {}
--Main entry points
p.PrepareText = PrepareText
p.ParseTemplates = ParseTemplates
--Extra entry points, not really required
p.TestForNowikiTag = TestForNowikiTag
p.TestForComment = TestForComment

return p

--[==[ console tests

local s = [=[Hey!{{Text|<nowiki | ||> Hey! }} A</nowiki>|<!--AAAAA|AAA-->Should see|Shouldn't see}}]=]
local out = p.PrepareText(s)
mw.logObject(out)

local s = [=[B<!-- Hey! -->A]=]
local out = p.TestForComment(s, 2)
mw.logObject(out); mw.log(string.sub(s, 2, out.Length))

local a = p.ParseTemplates([=[ {{User:Aidan9382/templates/dummy |A|B|C {{{A|B}}} { } } { |<nowiki>D</nowiki> |<pre>E |F</pre> |G|=|a=|A = [[{{PAGENAME}}|A=B]]{{Text|1==<nowiki>}}</nowiki>}}|A B=Success}} ]=])
mw.logObject(a)

]==]