Wikipedia:Scripts/mwlink
This Ruby program has two modes. It can run as a daemon or text processor (daemon mode is preferred, since it's more efficient).
In text-scanning mode, it interprets its command-line arguments (or stdin if none are given) as text possibly containing [[wikilinks]]. It preserves the original text and adds a plain-text hyperlink after each link (the http: address enclosed in <> angle brackets).
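As a rough illustration of this mode, the sketch below finds each [[wikilink]] with a regular expression and appends a URL in angle brackets. It is not the script itself: the helper names `simple_url` and `expand_wikilinks` are hypothetical, and it assumes every link points at the English Wikipedia, ignoring namespaces, language prefixes and character set conversion.

```ruby
require 'cgi'

# Hypothetical helper: build an English Wikipedia URL for a page title.
# Assumption: no namespace, wiki or language prefixes are interpreted here.
def simple_url(title)
  # Drop any "|label" part, trim, collapse runs of spaces, use underscores.
  page = title.sub(/\|.*\z/, '').strip.squeeze(' ').tr(' ', '_')
  # Capitalize only the first character of the title.
  page = page[0].upcase + page[1..-1] unless page.empty?
  # Escape the title, but keep ':', '/' and '#' readable.
  escaped = CGI.escape(page).gsub(/%3A/i, ':').gsub(/%2F/i, '/').gsub(/%23/, '#')
  "http://en.wikipedia.org/wiki/#{escaped}"
end

# Keep the original text and append " <url>" after every [[wikilink]].
def expand_wikilinks(text)
  text.gsub(/\[\[([^\]]+)\]\]/) { |link| "#{link} <#{simple_url($1)}>" }
end

puts expand_wikilinks('[[Ashland (disambiguation)]] is an example.')
```

The original text is left untouched; only the bracketed URL is added after each match.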
In daemon mode, it receives HTTP requests like http://localhost:4242/mwlink?page=wiki-page-name and redirects to the appropriate Wikimedia page. It's convenient for scripts to just use that URL rather than constructing one themselves: all they have to do is URL-escape the text between [[ and ]].
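For example, a client script might build the request URL as sketched below. The helper name `mwlink_url` is hypothetical, and the host and port are assumptions taken from the example URL above.

```ruby
require 'cgi'

# Hypothetical helper: build the daemon URL for the text found between [[ and ]].
# Assumption: the daemon listens on localhost:4242, as in the example URL.
def mwlink_url(link_text, host: 'localhost', port: 4242)
  "http://#{host}:#{port}/mwlink?page=#{CGI.escape(link_text)}"
end

puts mwlink_url('Ashland (disambiguation)')
# prints http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
```

The client does no other processing of the link text; namespaces, language prefixes and capitalization are left to the daemon.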
#!/usr/bin/ruby

# This script is dual-licensed under the GPL version 2 or any later
# version, at your option. See http://www.gnu.org/licenses/gpl.txt for more
# details.

=begin

= NAME

mwlink - Linkify mediawiki-style wikilinks in plain text

= SYNOPSIS

    mwlink [options] [text-to-wikilink]

      --daemon[=port]      Run as HTTP daemon
      --encoding           Default character set encoding (utf-8)
      --default-wiki       Default wiki (wikipedia)
      --default-language   Default language (en)

= DESCRIPTION

In text-scanning mode (without the --daemon argument), the mwlink program scans
its arguments (or its standard input, in the event of no arguments) for
wikilinks of the form [[link]]. It expands such links into URLs and inserts
them into the original text after the [[link]] in sharp braces ((({<})) and
(({>}))). Options are provided for specifying a default wiki (the wiki to link
to if no qualifier is given in the link) and a default language (the language
to assume if no qualifier is given) as well as the character set encoding in
use. The built-in defaults are ((*wikipedia*)), ((*en*)) and ((*utf-8*)),
respectively.

In daemon mode (now preferred), it receives HTTP requests of the form
"http://.../page=((*wikipedia page*))" (the ((*wikipedia page*)) name is what
would appear within a [[wikilink]]). URL-escaping is required but no other
processing, making it convenient to use from scripts.

== Initialization File

The names of namespaces vary from language to language. For example, "User:"
in English is "Benutzer:" in German. You can specify lists of namespaces to
use for particular languages in an initialization file (({~/.mwlinkrc})).

Each entry is simply a line with the language, a colon, and a space-separated
list of namespaces in that language. When interpreting links for that language
(either because ((*--default-language*)) was specified or there is a language
qualifier in the link), mwlink will recognize each listed name as a namespace.
All the namespaces must appear on one line--line continuation is not
supported. Comment lines (introduced with (({#})), the pound sign) are
ignored, along with blank lines.

Here is an example configuration containing (only) some namespaces from the
German Wikipedia. ((*Note*)): To be kind to the wiki when this script is
uploaded, I have broken the line, but it ((*may not be broken*)) in order to
work with mwlink.

    de: Spezial Spezial_diskussion Diskussion Benutzer Benutzer_diskussion
        Bild Bild_diskussion Einordnung Einordnung_diskussion Wikipedia
        Wikipedia_talk WP Hilf Hilf_diskussion

= WARNINGS

* The program (like mediawiki) assumes links are not broken across line
  boundaries.
* The mechanism for providing an alternate list of namespaces only works
  per-language; other wikis could have different namespaces, too.
* The list of wikis and their abbreviations is doubtlessly incomplete.
* The initialization file mechanism is not that useful for a shared daemon.
* In command-line mode, it's very difficult to process ASCII em-dashes (--)
  correctly and still honor command-line options. mwlink gets it wrong, and
  that's one reason daemon mode is preferred.

= AUTHOR

Demi @ Wikipedia - http://en.wikipedia.org/wiki/User:Demi

=end

require 'cgi'
require 'iconv'
require 'getoptlong'
require 'webrick'

include WEBrick

# Global option table, pre-filled with the built-in defaults.
$opt = {
  'default-wiki' => 'wikipedia',
  'default-language' => 'en',
  'encoding' => 'utf-8'
}

class String
  # Return a copy with the first character upper-cased.
  def initcap()
    new = self.dup
    # Okay, I consider it dumb that a string subscripted produces an
    # integer --Demi
    new[0] = new[0].chr.upcase
    return new
  end

  # Upper-case the first character in place.
  def initcap!()
    self[0] = self[0].chr.upcase
    return self
  end
end

class Canon
  def initialize()
    # Namespace tables, keyed by language ('default' is the built-in list).
    @ns = {}
    @ns_array = %w(Media Special Talk User User_talk Project Project_talk
      Image Image_talk MediaWiki MediaWiki_talk Template Template_talk
      Help Help_talk Category Category_talk Wikipedia Wikipedia_talk WP)
    @ns['default'] = {}
    @ns_array.each { |nspc| @ns['default'][nspc] = nspc }

    # Read per-language namespace lists from ~/.mwlinkrc, if present.
    if File::readable?(ENV['HOME'] + '/.mwlinkrc')
      IO::foreach(ENV['HOME'] + '/.mwlinkrc') { |line|
        next if line =~ /^\s*\#/
        next if line =~ /^\s*$/
        line.chomp!
        if m = line.match(/^(\w+)\:(.*)$/)
          lang = m[1]
          nslist = m[2].split
          @ns[lang] = {}
          nslist.each { |nspc| @ns[lang][nspc] = nspc }
        end
      }
    end

    # Wiki prefixes and the internal wiki names they map to.
    @wiki = {
      'Wiktionary' => 'wiktionary',
      'Wikt' => 'wiktionary',
      'W' => 'wikipedia',
      'M' => 'meta',
      'N' => 'news',
      'Q' => 'quote',
      'B' => 'books',
      'Meta' => 'meta',
      'Wikibooks' => 'books',
      'Commons' => 'commons',
      'Wikisource' => 'source'
    }

    # Domain for each wiki; 'lang' => 1 means the language is a subdomain.
    @wikispec = {
      'wikipedia' => { 'domain' => 'wikipedia.org', 'lang' => 1 },
      'wiktionary' => { 'domain' => 'wiktionary.org', 'lang' => 1 },
      'meta' => { 'domain' => 'meta.wikimedia.org', 'lang' => 0 },
      'books' => { 'domain' => 'wikibooks.org', 'lang' => 1 },
      'commons' => { 'domain' => 'commons.wikimedia.org', 'lang' => 0 },
      'source' => { 'domain' => 'sources.wikimedia.org', 'lang' => 0 },
      'news' => { 'domain' => 'wikinews.org', 'lang' => 1 },
    }

    @cs = Iconv.new("iso-8859-1", $opt['encoding'])
  end

  # TODO The % part of the # section of the URL should become a dot.
  def urlencode(s)
    CGI::escape(s).gsub(/%3[Aa]/, ':').gsub(/%2[Ff]/, '/').gsub(/%23/, '#')
  end

  # Normalize one link component: trim, collapse spaces, use underscores,
  # capitalize the first letter and convert the character set.
  def canonword(word)
    s = word.strip.squeeze(' ').tr(' ', '_').initcap
    begin
      @cs.iconv(s)
    rescue Iconv::IllegalSequence
      s
    end
  end

  # Split a link on colons and classify each prefix as a namespace, a wiki
  # or (failing both) a language.
  def parselink(link)
    l = {
      'namespace' => '',
      'language' => $opt['default-language'],
      'wiki' => $opt['default-wiki'],
      'title' => ''
    }
    terms = link.split(':')
    l['title'] = canonword(terms.pop)
    terms.each { |term|
      next if term.nil? or term.empty?
      t = canonword(term)
      if @ns[l['language']] then
        ns = @ns[l['language']]
      else
        ns = @ns['default']
      end
      if ns.key?(t)
        l['namespace'] = ns[t]
      elsif @wiki.key?(t)
        l['wiki'] = @wiki[t]
      else
        l['language'] = t.downcase
      end
    }
    l
  end

  # Turn the text of a wikilink (any "|label" part is dropped) into a URL.
  def canonicalize(link)
    linkdesc = parselink(link.sub(/\|.*$/, ''))
    if @wikispec.key?(linkdesc['wiki'])
      ws = @wikispec[linkdesc['wiki']]
      host = ws['domain']
      if ws['lang'] != 0
        host = linkdesc['language'] + '.' + host
      end
    else
      host = linkdesc['wiki'] + '.' + 'wikimedia.org'
    end
    uri =
      if linkdesc['namespace'].length > 0
        linkdesc['namespace'] + ':' + linkdesc['title']
      else
        linkdesc['title']
      end
    r = urlencode('http://' + host + '/wiki/' + uri)
    r
  end

  def to_s()
    "Namespace sets: " + @ns.keys.join(', ') +
      "; Wikis: " + @wiki.to_a.join(', ')
  end
end

# Append " <url>" to a bracketed link.
def linkexpand(c, bracketlink)
  linktext =
    if m = /\[\[([^\]]+)\]\]/.match(bracketlink)
      m[1]
    else
      bracketlink
    end
  bracketlink + " <" + c.canonicalize(linktext) + ">"
end

re = /\[\[\s*[^\s\\][^\]]+\]\]/

class MwlinkServlet < HTTPServlet::AbstractServlet
  def initialize(server, canonicalizer)
    super(server)
    @c = canonicalizer
  end

  # Redirect to the canonical URL of the page named by the "page" parameter.
  def do_GET(rq, rs)
    p = CGI.parse(rq.query_string)
    # Just for testing
    l = @c.canonicalize(p['page'][0])
    rs.status = 302
    rs['Location'] = l
    rs.body = "<html><body>\n" +
      "<a href=\"#{l}\">#{p['page'][0]}</a>\n" +
      "</body></html>\n"
  end
end

begin
  GetoptLong::new(
    ['--default-wiki', GetoptLong::REQUIRED_ARGUMENT],
    ['--default-language', GetoptLong::REQUIRED_ARGUMENT],
    ['--encoding', GetoptLong::REQUIRED_ARGUMENT],
    ['--daemon', GetoptLong::OPTIONAL_ARGUMENT]
  ).each do |k, v|
    k = k.sub(/^--/, '')
    case k
    when 'default-wiki', 'default-language', 'encoding'
      $opt[k] = v
    when 'daemon'
      $opt['daemon'] = true
      if v.empty?
        $opt['port'] = 4242
      else
        $opt['port'] = v
      end
    end
  end
rescue GetoptLong::InvalidOption
  true
end

# Create the canonicalizer only after option parsing, so that --encoding
# is in effect when the Iconv converter is built.
c = Canon.new()

if $opt['daemon']
  port = $opt['port'].to_i
  puts "Starting daemon on port #{port}"
  s = HTTPServer.new(:Port => port)
  s.mount("/mwlink", MwlinkServlet, c)
  trap('INT') { s.shutdown }
  s.start
else
  # Note, there are various combinations of -- appearing in normal text that
  # will break this. --daemon is the recommended method.
  if ARGV.empty?
    STDIN.each_line { |line|
      puts line.chomp.gsub(re) { |expr| linkexpand(c, expr) }
    }
  else
    puts ARGV.join(' ').gsub(re) { |expr| linkexpand(c, expr) }
  end
end
Example output:
[[Ashland (disambiguation)]] is an example of a [[Wikipedia:Disambiguation]] page.
[[Ashland (disambiguation)]] <http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29> is an example of a [[Wikipedia:Disambiguation]] <http://en.wikipedia.org/wiki/Wikipedia:Disambiguation> page.
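The second link in the output above carries a namespace prefix. The sketch below shows, in simplified form, how colon-separated prefixes can be sorted into namespace, wiki and language, loosely following the script's `parselink` method. The lookup tables are deliberately tiny, hypothetical subsets, and `normalize` and `classify_prefixes` are illustrative names, not part of the script.

```ruby
# Tiny, illustrative subsets of the lookup tables (assumptions, not the full lists).
NAMESPACES = %w[Talk User Wikipedia Category].freeze
WIKIS = { 'Wikt' => 'wiktionary', 'Meta' => 'meta' }.freeze

# Normalize a link component: trim, underscores for spaces, capital first letter.
def normalize(word)
  s = word.strip.squeeze(' ').tr(' ', '_')
  s.empty? ? s : s[0].upcase + s[1..-1]
end

# Split "prefix:prefix:Title" and classify each prefix in turn.
def classify_prefixes(link, default_wiki: 'wikipedia', default_language: 'en')
  result = { namespace: '', wiki: default_wiki, language: default_language, title: '' }
  terms = link.split(':')
  result[:title] = normalize(terms.pop.to_s)
  terms.each do |term|
    next if term.strip.empty?
    t = normalize(term)
    if NAMESPACES.include?(t)
      result[:namespace] = t
    elsif WIKIS.key?(t)
      result[:wiki] = WIKIS[t]
    else
      # Anything unrecognized is assumed to be a language code.
      result[:language] = t.downcase
    end
  end
  result
end

p classify_prefixes('Wikipedia:Disambiguation')
p classify_prefixes('wikt:fr:chat')
```

Because any unrecognized prefix is treated as a language code, a namespace missing from the table (such as a German one) would be misread as a language, which is the problem the initialization file is meant to address.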
GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29 --> 302 Found
GET http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29 --> ...(page content)
The GET program is a utility distributed with Perl's libwww-perl (LWP) library. Note also that Wikimedia servers forbid scripts based on the LWP Perl module.