
Module:Wikitext Parsing

From Wikipedia, the free encyclopedia
Module documentation
Warning: This Lua module is used on approximately 18,200,000 pages, or roughly 28% of all pages.
To avoid major disruption and server load, any changes should be tested in the module's /sandbox or /testcases subpages, or in your own module sandbox. The tested changes can be added to this page in a single edit. Consider discussing changes on the talk page before implementing them.
This module can only be edited by administrators because it is transcluded onto one or more cascade-protected pages.

This module provides functions that help with the complex edge cases faced by modules such as Module:Template parameter value, which process the raw wikitext of a page and need to reliably respect nowiki tags and similar content. It is designed to be called by other modules and does not support being invoked directly.

PrepareText

This module is rated as ready for general use. It has reached a mature state, is considered relatively stable and bug-free, and may be used wherever appropriate. It can be mentioned on help pages and other Wikipedia resources as an option for new users. To minimise server load and avoid disruptive output, improvements should be developed through sandbox testing rather than repeated trial-and-error editing.
Page protected This module is currently protected from editing.
See the protection policy and protection log for more details. Please discuss any changes on the talk page; you may submit an edit request to ask an administrator to make an edit if it is uncontroversial or supported by consensus. You may also request that this page be unprotected.

PrepareText(text, keepComments) will run any content within certain tags that normally disable processing (<nowiki>, <pre>, <syntaxhighlight>, <source>, <math>) through mw.text.nowiki and remove HTML comments. This allows for tricky syntax to be parsed through more basic means such as %b{} by other modules without worrying about edge cases.
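
To illustrate why this escaping matters, the sketch below shows %b{} stopping early on braces inside a <nowiki> pair, then matching the whole template once that content has been escaped. The escape and escapeNowiki helpers are hypothetical stand-ins written for this example: escape only handles {, } and |, and escapeNowiki only recognises a plain lowercase <nowiki> pair, whereas the real function covers far more cases.

```lua
-- Hypothetical stand-in for mw.text.nowiki: turns {, } and | into numeric entities.
local function escape(s)
    return (s:gsub("[{}|]", function(c)
        return "&#" .. c:byte() .. ";"
    end))
end

-- Escape only the content of plain <nowiki>...</nowiki> pairs (a simplification).
local function escapeNowiki(text)
    return (text:gsub("(<nowiki>)(.-)(</nowiki>)", function(open, content, close)
        return open .. escape(content) .. close
    end))
end

local raw = "{{Text|<nowiki>}}</nowiki>|done}}"

-- Without escaping, %b{} balances against the }} inside the nowiki pair.
local naive = raw:match("%b{}")

-- With the nowiki content escaped, %b{} reaches the real end of the template.
local prepared = escapeNowiki(raw)
local safe = prepared:match("%b{}")

print(naive)
print(safe)
```

In this sketch the naive match ends inside the <nowiki> pair, while the match on the prepared text spans the full template.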

If the second parameter, keepComments, is set to true, the content of HTML comments will be passed through mw.text.nowiki instead of being removed entirely.
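
The two comment modes can be sketched roughly as follows. This is a simplified illustration rather than the module's code: handleComments and escape are hypothetical names, escape only covers {, } and | instead of everything mw.text.nowiki handles, and no-processing tags are ignored entirely.

```lua
-- Hypothetical stand-in for mw.text.nowiki: turns {, } and | into numeric entities.
local function escape(s)
    return (s:gsub("[{}|]", function(c)
        return "&#" .. c:byte() .. ";"
    end))
end

-- Remove HTML comments, or keep them with escaped content when keepComments is true.
local function handleComments(text, keepComments)
    local out, pos = {}, 1
    while true do
        local open = text:find("<!--", pos, true)
        if not open then
            out[#out + 1] = text:sub(pos)
            break
        end
        out[#out + 1] = text:sub(pos, open - 1)
        local close = text:find("-->", open + 4, true)
        -- An unterminated comment consumes the rest of the text.
        local content = text:sub(open + 4, close and close - 1 or -1)
        if keepComments then
            out[#out + 1] = "<!--" .. escape(content) .. (close and "-->" or "")
        end
        if not close then
            break
        end
        pos = close + 3
    end
    return table.concat(out)
end

print(handleComments("a<!--x|y-->b"))
print(handleComments("a<!--x|y-->b", true))
```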

Code that calls this function directly and returns part of the processed text should consider running mw.text.decode on the output at the end. Note that this also decodes any input that was already encoded but not inside a no-processing tag; this is unlikely to be a significant issue, but is worth keeping in mind.
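
The decoding caveat can be seen in a small sketch. The decode helper here is a hypothetical stand-in for mw.text.decode that only handles decimal numeric entities, and the prepared string is written by hand on the assumption that a { inside a no-processing tag is escaped to &#123;.

```lua
-- Hypothetical stand-in for mw.text.decode, handling decimal numeric entities only.
local function decode(s)
    return (s:gsub("&#(%d+);", function(n)
        return string.char(tonumber(n))
    end))
end

-- The input already contains encoded braces outside any no-processing tag.
local original = "&#123;literal&#125; <nowiki>{</nowiki>"

-- Assumed result of escaping: only the nowiki content changes.
local prepared = "&#123;literal&#125; <nowiki>&#123;</nowiki>"

-- Decoding restores the nowiki content, but also decodes the braces that were
-- deliberately encoded in the input.
local decoded = decode(prepared)
print(decoded)
```

The decoded text therefore differs from the original input wherever the input carried its own entities.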

ParseTemplates

This module is rated as alpha. It is ready for limited use and third-party feedback. It may be used on a small number of pages, but should be monitored closely. Suggestions for new features or adjustments to input and output are welcome.

ParseTemplates(InputText, dontEscape) will attempt to parse all {{Templates}} on a page, handling [[Wikilinks]], {{{Variables}}} and other complex syntax. Because of its complexity, the function is fairly slow and should be used sparingly. It returns a list of template objects in order of appearance, each with the following properties:

  • Args: A key-value set of arguments, not in order
  • ArgOrder: A list of keys in the order they appear in the template
  • Children: A list of template objects that are contained within the existing template, in order of appearance. Only immediate children are listed
  • Name: The name of the template
  • Text: The raw text of the template

If the second parameter, dontEscape, is set to true, the input text won't be run through the PrepareText function.
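
As a rough illustration of how a caller might consume the returned list, the sketch below walks template objects using ArgOrder to read Args in their original order. The summarise function is a hypothetical helper, and the sample data is hand-written to follow the documented properties; it is not captured output from ParseTemplates.

```lua
-- Build one summary line per template object, reading arguments in source order.
local function summarise(templates)
    local lines = {}
    for _, template in ipairs(templates) do
        local args = {}
        for _, key in ipairs(template.ArgOrder) do
            args[#args + 1] = key .. "=" .. template.Args[key]
        end
        lines[#lines + 1] = template.Name
            .. " [" .. table.concat(args, ", ") .. "]"
            .. " children: " .. #template.Children
    end
    return lines
end

-- Hand-written sample data shaped like the documented template objects (assumed).
local inner = {
    Name = "Inner", Text = "{{Inner|x}}",
    Args = {["1"] = "x"}, ArgOrder = {"1"}, Children = {},
}
local outer = {
    Name = "Outer", Text = "{{Outer|a=1|{{Inner|x}}}}",
    Args = {a = "1", ["1"] = "{{Inner|x}}"}, ArgOrder = {"a", "1"},
    Children = {inner},
}

local lines = summarise({outer, inner})
for _, line in ipairs(lines) do
    print(line)
end
```

Args is an unordered key-value set, so iterating it with pairs gives no guaranteed order; going through ArgOrder is what preserves the order in which arguments appeared.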

The above documentation is transcluded from Module:Wikitext Parsing/doc.

 require("strict")

 --Helper functions
 local function startswith(text, subtext)
     return string.sub(text, 1, #subtext) == subtext
 end
 local function endswith(text, subtext)
     return string.sub(text, -#subtext, -1) == subtext
 end
 local function allcases(s)
     return s:gsub("%a", function(c)
         return "[" .. c:upper() .. c:lower() .. "]"
     end)
 end
 local trimcache = {}
 local whitespace = {[" "] = 1, ["\n"] = 1, ["\t"] = 1, ["\r"] = 1}
 local function cheaptrim(str) --mw.text.trim is surprisingly expensive, so here's an alternative approach
     local quick = trimcache[str]
     if quick then
         return quick
     else
         -- local out = string.gsub(str, "^%s*(.-)%s*$", "%1")
         local lowEnd
         local strlen = #str
         for i = 1, strlen do
             if not whitespace[string.sub(str, i, i)] then
                 lowEnd = i
                 break
             end
         end
         if not lowEnd then
             trimcache[str] = ""
             return ""
         end
         for i = strlen, 1, -1 do
             if not whitespace[string.sub(str, i, i)] then
                 local out = string.sub(str, lowEnd, i)
                 trimcache[str] = out
                 return out
             end
         end
     end
 end

 --[=[ Implementation notes
 ---- NORMAL HTML TAGS ----
 Tags are very strict on how they want to start, but loose on how they end.
 The start must strictly follow <[tAgNaMe](%s|>) with no room for whitespace in
 the tag's name, but may then flow as they want afterwards, making
 <div\nclass\n=\n"\nerror\n"\n> valid

 There's no sense of escaping < or >
 E.g.
  <div class="error\>"> will end at \> despite it being inside a quote
  <div class="<span class="error">error</span>"> will not process the larger div

 If a tag has no end, it will consume all text instead of not processing

 ---- NOPROCESSING TAGS (nowiki, pre, syntaxhighlight, source, etc.) ----
 (In most comments, <source> will not be mentioned. This is because it is the
 deprecated version of <syntaxhighlight>)

 No-Processing tags have some interesting differences to the above rules.
 For example, their syntax is a lot stricter. While an opening tag appears to
 follow the same set of rules, a closing tag can't have any sort of extra
 formatting, period. While </div a/a> is valid, </nowiki a/a> isn't - only
 newlines and spaces/tabs are allowed in closing tags.
 Note that, even though <pre> tags cause a visual change when the ending tag has
 extra formatting, it won't cause the no-processing effects. For some reason, the
 format must be strict for that to apply.

 Both the content inside the tag pair and the content inside each side of the
 pair is not processed. E.g. <nowiki |}}>|}}</nowiki> would have both of the |}}
 escaped in practice.

 When something in the code is referred to as a "Nowiki Tag", it means a tag
 which causes wiki text to not be processed, which includes <nowiki>, <pre>,
 and <syntaxhighlight>

 Since we only care about these tags, we can ignore the idea of an intercepting
 tag preventing processing, and just go straight for the first ending we can find
 If there is no ending to find, the tag will NOT consume the rest of the text in
 terms of processing behaviour (though <pre> will appear to have an effect).
 Even if there is no end of the tag, the content inside the opening half will
 still be unprocessed, meaning {{X20|<nowiki }}>}} wouldn't end at the first }}
 despite there being no ending to the tag.

 Note that there are some tags, like <math>, which also function like <nowiki>
 and are included in this as well. Some other tags, like <ref>, have far too
 unpredictable behaviour to be handled currently (they'd have to be split and
 processed as something separate - it's complicated, but maybe not impossible.)
 I suspect that every tag listed in [[Special:Version]] may behave somewhat like
 this, but that's far too many cases worth checking for rarely used tags that may
 not even have a good reason to contain {{ or }} anyways, so we leave them alone.

 ---- HTML COMMENTS AND INCLUDEONLY ----
 HTML Comments are about as basic as it could get for this
 Start at <!--, end at -->, no extra conditions. Simple enough
 If a comment has no end, it will eat all text instead of not being processed

 includeonly tags function mostly like a regular nowiki tag, with the exception
 that the tag will actually consume all future text if not given an ending as
 opposed to simply giving up and not changing anything. Due to complications and
 the fact that this is far less likely to be present on a page, as well as being
 something that may not want to be escaped, includeonly tags are ignored during
 our processing
 --]=]
 local validtags = {nowiki = 1, pre = 1, syntaxhighlight = 1, source = 1, math = 1}
 --This function expects the string to start with the tag
 local function TestForNowikiTag(text, scanPosition)
     local tagName = (string.match(text, "^<([^\n />]+)", scanPosition) or ""):lower()
     if not validtags[tagName] then
         return nil
     end
     local nextOpener = string.find(text, "<", scanPosition + 1) or -1
     local nextCloser = string.find(text, ">", scanPosition + 1) or -1
     if nextCloser > -1 and (nextOpener == -1 or nextCloser < nextOpener) then
         local startingTag = string.sub(text, scanPosition, nextCloser)
         --We have our starting tag (E.g. '<pre style="color:red">')
         --Now find our ending...
         if endswith(startingTag, "/>") then --self-closing tag (we are our own ending)
             return {
                 Tag = tagName,
                 Start = startingTag,
                 Content = "", End = "",
                 Length = #startingTag
             }

         else
             local endingTagStart, endingTagEnd = string.find(text, "</" .. allcases(tagName) .. "[ \t\n]*>", scanPosition)
             if endingTagStart then --Regular tag formation
                 local endingTag = string.sub(text, endingTagStart, endingTagEnd)
                 local tagContent = string.sub(text, nextCloser + 1, endingTagStart - 1)
                 return {
                     Tag = tagName,
                     Start = startingTag,
                     Content = tagContent,
                     End = endingTag,
                     Length = #startingTag + #tagContent + #endingTag
                 }

             else --Content inside still needs escaping (also linter error!)
                 return {
                     Tag = tagName,
                     Start = startingTag,
                     Content = "", End = "",
                     Length = #startingTag
                 }
             end
         end
     end
     return nil
 end
 local function TestForComment(text, scanPosition) --Like TestForNowikiTag but for <!-- -->
     if string.match(text, "^<!%-%-", scanPosition) then
         local commentEnd = string.find(text, "-->", scanPosition + 4, true)
         if commentEnd then
             return {
                 Start = "<!--", End = "-->",
                 Content = string.sub(text, scanPosition + 4, commentEnd - 1),
                 Length = commentEnd - scanPosition + 3
             }
         else --Consumes all text if not given an ending
             return {
                 Start = "<!--", End = "",
                 Content = string.sub(text, scanPosition + 4),
                 Length = #text - scanPosition + 1
             }
         end
     end
     return nil
 end

 --[[ Implementation notes
 The goal of this function is to escape all text that wouldn't be parsed if it
 was preprocessed (see above implementation notes).

 Using keepComments will keep all HTML comments instead of removing them. They
 will still be escaped regardless to avoid processing errors
 --]]
 local function PrepareText(text, keepComments)
     local newtext = {}
     local scanPosition = 1
     while true do
         local NextCheck = string.find(text, "<[NnSsPpMm!]", scanPosition) --Advance to the next potential tag we care about
         if not NextCheck then --Done
             newtext[#newtext + 1] = string.sub(text, scanPosition)
             break
         end
         newtext[#newtext + 1] = string.sub(text, scanPosition, NextCheck - 1)
         scanPosition = NextCheck
         local Comment = TestForComment(text, scanPosition)
         if Comment then
             if keepComments then
                 newtext[#newtext + 1] = Comment.Start .. mw.text.nowiki(Comment.Content) .. Comment.End
             end
             scanPosition = scanPosition + Comment.Length
         else
             local Tag = TestForNowikiTag(text, scanPosition)
             if Tag then
                 local newTagStart = "<" .. mw.text.nowiki(string.sub(Tag.Start, 2, -2)) .. ">"
                 local newTagEnd =
                     Tag.End == "" and "" or --Respect no tag ending
                     "</" .. mw.text.nowiki(string.sub(Tag.End, 3, -2)) .. ">"
                 local newContent = mw.text.nowiki(Tag.Content)
                 newtext[#newtext + 1] = newTagStart .. newContent .. newTagEnd
                 scanPosition = scanPosition + Tag.Length
             else --Nothing special, move on...
                 newtext[#newtext + 1] = string.sub(text, scanPosition, scanPosition)
                 scanPosition = scanPosition + 1
             end
         end
     end
     return table.concat(newtext, "")
 end

 --[=[ Implementation notes
 This function is an alternative to Transcluder's getParameters which considers
 the potential for a singular { or } or other odd syntax that %b doesn't like to
 be in a parameter's value.

 When handling the difference between {{ and {{{, mediawiki will attempt to match
 as many sequences of {{{ as possible before matching a {{
 E.g.
  {{{{A}}}} -> { {{{A}}} }
  {{{{{{{{Text|A}}}}}}}} -> {{ {{{ {{{Text|A}}} }}} }}
 If there aren't enough triple braces on both sides, the parser will compromise
 for a template interpretation.
 E.g.
  {{{{A}} }} -> {{ {{ A }} }}

 While there are technically concerns about things such as wikilinks breaking
 template processing (E.g. {{[[}}]]}} doesn't stop at the first }}), it shouldn't
 be our job to process inputs perfectly when the input has garbage ({ / } isn't
 legal in titles anyways, so if something's unmatched in a wikilink, it's
 guaranteed GIGO)

 Setting dontEscape will prevent running the input text through PrepareText.
 Avoid setting this to true if you don't have to set it.

 Returned values:
 A table of all templates. Template data goes as follows:
  Text: The raw text of the template
  Name: The name of the template
  Args: A key-value set of arguments
  ArgOrder: A list of argument keys in order of appearance
  Children: A list of immediate template children
 --]=]
 --Helper functions
 local function boundlen(pair)
     return pair.End - pair.Start + 1
 end

 --Main function
 local function ParseTemplates(InputText, dontEscape)
     --Setup
     if not dontEscape then
         InputText = PrepareText(InputText)
     end
     local function finalise(text)
         if not dontEscape then
             return mw.text.decode(text)
         else
             return text
         end
     end
     local function CreateContainerObj(Container)
         Container.Text = {}
         Container.Args = {}
         Container.ArgOrder = {}
         Container.Children = {}
         -- Container.Name = nil
         -- Container.Value = nil
         -- Container.Key = nil
         Container.BeyondStart = false
         Container.LastIndex = 1
         Container.finalise = finalise
         function Container:HandleArgInput(character, internalcall)
             if not internalcall then
                 self.Text[#self.Text + 1] = character
             end
             if character == "=" then
                 if self.Key then
                     self.Value[#self.Value + 1] = character
                 else
                     self.Key = cheaptrim(self.Value and table.concat(self.Value, "") or "")
                     self.Value = {}
                 end
             else --"|" or "}"
                 if not self.Name then
                     self.Name = cheaptrim(self.Value and table.concat(self.Value, "") or "")
                     self.Value = nil
                 else
                     self.Value = self.finalise(self.Value and table.concat(self.Value, "") or "")
                     if self.Key then
                         self.Key = self.finalise(self.Key)
                         self.Args[self.Key] = cheaptrim(self.Value)
                         self.ArgOrder[#self.ArgOrder + 1] = self.Key
                     else
                         local Key = tostring(self.LastIndex)
                         self.Args[Key] = self.Value
                         self.ArgOrder[#self.ArgOrder + 1] = Key
                         self.LastIndex = self.LastIndex + 1
                     end
                     self.Key = nil
                     self.Value = nil
                 end
             end
         end
         function Container:AppendText(text, ftext)
             self.Text[#self.Text + 1] = (ftext or text)
             if not self.Value then
                 self.Value = {}
             end
             self.BeyondStart = self.BeyondStart or (#table.concat(self.Text, "") > 2)
             if self.BeyondStart then
                 self.Value[#self.Value + 1] = text
             end
         end
         function Container:Clean(IsTemplate)
             self.Text = table.concat(self.Text, "")
             if self.Value and IsTemplate then
                 self.Value = {string.sub(table.concat(self.Value, ""), 1, -3)} --Trim ending }}
                 self:HandleArgInput("|", true) --Simulate ending
             end
             self.Value = nil
             self.Key = nil
             self.BeyondStart = nil
             self.LastIndex = nil
             self.finalise = nil
             self.HandleArgInput = nil
             self.AppendText = nil
             self.Clean = nil
         end
         return Container
     end

     --Step 1: Find and escape the content of all wikilinks on the page, which are stronger than templates (see implementation notes)
     local scannerPosition = 1
     local wikilinks = {}
     local openWikilinks = {}
     while true do
         local Position, _, Character = string.find(InputText, "([%[%]])%1", scannerPosition)
         if not Position then --Done
             break
         end

         scannerPosition = Position + 2 --+2 to pass the [[ / ]]
         if Character == "[" then --Add a [[ to the pending wikilink queue
             openWikilinks[#openWikilinks + 1] = Position
         else --Pair up the ]] to any available [[
             if #openWikilinks >= 1 then
                 local start = table.remove(openWikilinks) --Pop the latest [[
                 wikilinks[start] = {Start = start, End = Position + 1, Type = "Wikilink"} --Note the pair
             end
         end
     end

     --Step 2: Find the bounds of every valid template and variable ({{ and {{{)
     local scannerPosition = 1
     local templates = {}
     local variables = {}
     local openBrackets = {}
     while true do
         local Start, _, Character = string.find(InputText, "([{}])%1", scannerPosition)
         if not Start then --Done (both 9e9)
             break
         end
         local _, End = string.find(InputText, "^" .. Character .. "+", Start)

         scannerPosition = Start --Get to the {{ / }} set
         if Character == "{" then --Add the {{+ set to the queue
             openBrackets[#openBrackets + 1] = {Start = Start, End = End}

         else --Pair up the }} to any available {{, accounting for {{{ / }}}
             local BracketCount = End - Start + 1
             while BracketCount >= 2 and #openBrackets >= 1 do
                 local OpenSet = table.remove(openBrackets)
                 if boundlen(OpenSet) >= 3 and BracketCount >= 3 then --We have a {{{variable}}} (both sides have 3 spare)
                     variables[OpenSet.End - 2] = {Start = OpenSet.End - 2, End = scannerPosition + 2, Type = "Variable"} --Done like this to ensure chronological order
                     BracketCount = BracketCount - 3
                     OpenSet.End = OpenSet.End - 3
                     scannerPosition = scannerPosition + 3

                 else --We have a {{template}} (both sides have 2 spare, but at least one side doesn't have 3 spare)
                     templates[OpenSet.End - 1] = {Start = OpenSet.End - 1, End = scannerPosition + 1, Type = "Template"} --Done like this to ensure chronological order
                     BracketCount = BracketCount - 2
                     OpenSet.End = OpenSet.End - 2
                     scannerPosition = scannerPosition + 2
                 end

                 if boundlen(OpenSet) >= 2 then --Still has enough data left, leave it in
                     openBrackets[#openBrackets + 1] = OpenSet
                 end
             end
         end
         scannerPosition = End --Now move past the bracket set
     end

     --Step 3: Re-trace every object using their known bounds, collecting our parameters with (slight) ease
     local scannerPosition = 1
     local activeObjects = {}
     local finalObjects = {}
     while true do
         local LatestObject = activeObjects[#activeObjects] --Commonly needed object
         local NNC, _, Character --NNC = NextNotableCharacter
         if LatestObject then
             NNC, _, Character = string.find(InputText, "([{}%[%]|=])", scannerPosition)
         else
             NNC, _, Character = string.find(InputText, "([{}])", scannerPosition) --We are only after templates right now
         end
         if not NNC then
             break
         end
         if NNC > scannerPosition and LatestObject then
             local scannedContent = string.sub(InputText, scannerPosition, NNC - 1)
             LatestObject:AppendText(scannedContent, finalise(scannedContent))
         end

         scannerPosition = NNC + 1
         if Character == "{" or Character == "[" then
             local Container = templates[NNC] or variables[NNC] or wikilinks[NNC]
             if Container then
                 CreateContainerObj(Container)
                 if Container.Type == "Template" then
                     Container:AppendText("{{")
                     scannerPosition = NNC + 2
                 elseif Container.Type == "Variable" then
                     Container:AppendText("{{{")
                     scannerPosition = NNC + 3
                 else --Wikilink
                     Container:AppendText("[[")
                     scannerPosition = NNC + 2
                 end
                 if LatestObject and Container.Type == "Template" then --Only templates count as children
                     LatestObject.Children[#LatestObject.Children + 1] = Container
                 end
                 activeObjects[#activeObjects + 1] = Container
             elseif LatestObject then
                 LatestObject:AppendText(Character)
             end

         elseif Character == "}" or Character == "]" then
             if LatestObject then
                 LatestObject:AppendText(Character)
                 if LatestObject.End == NNC then
                     if LatestObject.Type == "Template" then
                         LatestObject:Clean(true)
                         finalObjects[#finalObjects + 1] = LatestObject
                     else
                         LatestObject:Clean(false)
                     end
                     activeObjects[#activeObjects] = nil
                     local NewLatest = activeObjects[#activeObjects]
                     if NewLatest then
                         NewLatest:AppendText(LatestObject.Text) --Append to new latest
                     end
                 end
             end

         else --| or =
             if LatestObject then
                 LatestObject:HandleArgInput(Character)
             end
         end
     end

     --Step 4: Fix the order
     local FixedOrder = {}
     local SortableReference = {}
     for _, Object in next, finalObjects do
         SortableReference[#SortableReference + 1] = Object.Start
     end
     table.sort(SortableReference)
     for i = 1, #SortableReference do
         local start = SortableReference[i]
         for n, Object in next, finalObjects do
             if Object.Start == start then
                 finalObjects[n] = nil
                 Object.Start = nil --Final cleanup
                 Object.End = nil
                 Object.Type = nil
                 FixedOrder[#FixedOrder + 1] = Object
                 break
             end
         end
     end

     --Finished, return
     return FixedOrder
 end

 local p = {}
 --Main entry points
 p.PrepareText = PrepareText
 p.ParseTemplates = ParseTemplates
 --Extra entry points, not really required
 p.TestForNowikiTag = TestForNowikiTag
 p.TestForComment = TestForComment

 return p

 --[==[ console tests

 local s = [=[Hey!{{Text|<nowiki | ||>
 Hey! }}
 A</nowiki>|<!--AAAAA|AAA-->Should see|Shouldn't see}}]=]
 local out = p.PrepareText(s)
 mw.logObject(out)

 local s = [=[B<!--
 Hey!
 -->A]=]
 local out = p.TestForComment(s, 2)
 mw.logObject(out); mw.log(string.sub(s, 2, 1 + out.Length))

 local a = p.ParseTemplates([=[
 {{User:Aidan9382/templates/dummy
 |A|B|C {{{A|B}}} { } } {
 |<nowiki>D</nowiki>
 |<pre>E
 |F</pre>
 |G|=|a=|A = [[{{PAGENAME}}|A=B]]{{Text|1==<nowiki>}}</nowiki>}}|A B=Success}}
 ]=])
 mw.logObject(a)

 ]==]
