
User:Micke/WikiFind

From Meta, a Wikimedia project coordination wiki

WikiFind is a simple program written in C++ for analysing database dumps from a MediaWiki site such as Wikipedia. It searches for a user-specified keyword (regular expressions can be used) and writes a text file with wiki-formatted links to every page that contains it. The program can therefore be used to find templates, bits of code, misspelled words and the like, for example to build a list for use with a bot.

Source code is available below under the conditions of the GNU General Public License (GPL) version 3 or later http://www.gnu.org/licenses/gpl.html.

Notice


This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

If you decide to test the program, I would much appreciate feedback and comments. Leave a message on my talk page or send me an e-mail (and if you know C++, don't be afraid to start fixing things on the todo list :-) ).

How to


I will not supply executables at this time, which only means you'll have to compile the program yourself.

Start by putting the source code below (copy/paste) in an ordinary text file and name it "wikifind.cpp", then:

  • For Unix/Linux users I recommend the GNU Compiler Collection (open source) in which case you can use these commands (more than likely g++ will be pre-installed on your system):

cd /to/directory/where/file/is
g++ wikifind.cpp -o wikifind -lboost_regex
./wikifind

Next time you want to run the program just:

cd /to/directory/where/file/is
./wikifind

  • For Windows users I recommend Dev C++ (open source):

Open Dev C++ and open the file, then press the "Compile and run" button; this will compile and start the program. Next time you want to run the program, simply double-click the file "wikifind.exe", which you'll now find in the same folder as "wikifind.cpp".

As you can see below, there are two versions of the program. The second version is intended for use with a bzcat pipe, like this:

bzcat swwiki-20080607-pages-articles.xml.bz2 | ./wikifind "output.txt" "[Kk]eyword"

This means that you don't have to uncompress the database dump prior to searching it.
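To illustrate what a pattern such as "[Kk]eyword" matches, here is a minimal sketch. It uses std::regex from the C++11 standard library as a stand-in for Boost regex so that it compiles without extra libraries; the helper name lineMatches is hypothetical and not part of WikiFind.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of WikiFind): returns true if the pattern
// matches anywhere in the given line. std::regex is used here instead of
// boost::regex purely to keep the sketch free of external dependencies.
bool lineMatches(const std::string& line, const std::string& pattern)
{
    const std::regex rexp(pattern);
    return std::regex_search(line, rexp);
}
```

With this helper, lineMatches("A Keyword here", "[Kk]eyword") and lineMatches("a keyword here", "[Kk]eyword") are both true, while lineMatches("A KEYWORD here", "[Kk]eyword") is false, because the character class only covers the first letter.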

Regex capability


The regex function requires the Boost regex library to be installed (without it the program won't compile). If you don't want to (or can't) install the Boost regex library, you can use the old version found here [1].

Linux

First download the library from here (the file you want is called boost_1_34_1.tar.bz2):

tar --bzip2 -xf /path/to/boost_1_34_1.tar.bz2

And then follow the instructions here:

Ubuntu (or other Debian?)

Install the boost library with sudo apt-get install libboost-regex-dev

Compile it with g++ wikifind.cpp -o wikifind -l boost_regex

Windows

First download the library from here (the file you want is called boost_1_33_1.exe):

  • Boost on sourceforge
  • Then download Bjam (the file is called boost-jam-3.1.16-1-ntx86.zip)
  • Double click boost_1_33_1.exe and install to C:\Boost
  • Move everything from C:\Boost\boost_1_33_1 to C:\Boost and delete C:\Boost\boost_1_33_1
  • Extract bjam.exe from boost-jam-3.1.16-1-ntx86.zip to C:\Boost
  • Open a command line and execute bjam.exe "-sMINGW_ROOT_DIRECTORY=C:\Dev-Cpp" "-sTOOLS=mingw" install [1]. This assumes that you have Dev-C++ installed in this location.
  • Open Dev-C++ and add C:\Boost\include\boost-1_33_1 to "Tools", "Compiler Options", "Directories", "C++ includes" ("Verktyg", "Kompilatoralternativ", "Kataloger", "C++inkluderingsfiler" if you have a Swedish installation)
  • Set "Tools", "Compiler Options", "Directories", "Libraries" ("Bibliotek") to C:\Boost\lib and press OK.
  • Everything should work! Note: I didn't get Boost 1.34.1 to work, so instead you should be careful to download Boost 1.33.1.

See official guide for more info:

Reference

  1. Thanks to Jozef Wagner

A few pointers


Since the xml-dumps use < and > for tagging, all occurrences of < and > in the wiki-code are changed to:

  • <→&lt;
    
  • >→&gt;
    

respectively, which means that you have to search for the escaped form in order to find wiki tags such as <nowiki>:

  • &lt;nowiki&gt;
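
If you would rather not rewrite the search string by hand, the substitution can be sketched as a small helper. The function name escapeAngleBrackets is hypothetical and not part of WikiFind, and the sketch only handles the two characters mentioned above.

```cpp
#include <string>

// Hypothetical helper (not part of WikiFind): rewrite a search string so that
// '<' and '>' appear the way they are stored in the XML dump.
std::string escapeAngleBrackets(const std::string& keyword)
{
    std::string escaped;
    for (char c : keyword)
    {
        if (c == '<')
            escaped += "&lt;";   // '<' is stored as &lt;
        else if (c == '>')
            escaped += "&gt;";   // '>' is stored as &gt;
        else
            escaped += c;        // everything else is left untouched
    }
    return escaped;
}
```

For example, escapeAngleBrackets("<nowiki>") gives the string &lt;nowiki&gt;, which can then be passed to the program as the keyword.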
    

TODO (and list of things that don't work yet):

  • Translate into English (done)
  • Add regex capabilities (done)
  1. Add capabilities to look for a string split across two or more lines
  2. Add an option to search for more than one string at a time
    • Add a counter of articles found (done)
  3. Automatically sort page hits alphabetically
  4. Add option to disregard redirects
  5. Make it so that the program only searches within <text> and </text>-tags
  6. Enable specific namespace searches


Source version 1


Copy/paste below or download from here.

////////////////////////////////////////////////////////////
//	WikiFind is a program used for reading database dumps
//	from MediaWiki, written by Mikael Nordin, licensed under
//	the GNU General Public License (GPL) version 3,
//	or any later version.
//	Copyright Mikael Nordin 2008.
#include <iostream> // for cin and cout
#include <string> // for strings
#include <fstream> // for ifstream and ofstream
#include <boost/regex.hpp> // for regex
using namespace std;

string keyword, line, filenamein, filenameout, title, problem, found, looking; // Global variables
string title2 = "qqqqqqxxxxwpppppzzzzzwwwwqqqq"; // title not likely to exist
int nooftitles = 0;

void Lang(); // Sub-routines
void Search();

int main() // main function
{
    Lang(); // select language

    Search(); // searching file
    return 0;
}
void Lang() // Localization and input/query/output function
{
    string lang, q1, q2, q3; // variables
    int lang2 = 1;

    while (lang2 != 0) // selecting language
    {
        cout << "Välj språk / Please choose language:\n";
        cout << "1. Svenska (sv)\n";
        cout << "2. English (en)\n";
        if (!(cin >> lang)) // stop asking if input has ended
        {
            return;
        }

        if (lang == "sv" || lang == "1") // Swedish localization
        {
            q1 = "Vilken fil vill du genomsöka: ";
            q2 = "Var vill du spara resultatet: ";
            q3 = "Vilket sökord vill du hitta: ";
            problem = "Filen kunde inte öppnas\n";
            found = " träffar gjordes\n";
            looking = "Letar efter: ";
            lang2 = 0;
        }
        else if (lang == "en" || lang == "2") // English localization
        {
            q1 = "Which file do you want to search: ";
            q2 = "Where do you want to store results: ";
            q3 = "Which string do you want to search for: ";
            problem = "Could not open file\n";
            found = " titles found\n";
            looking = "Looking for: ";
            lang2 = 0;
        }
        else // incorrect lang choice
        {
            cout << "Fel val / Wrong choice\n";
        }
    }

    cin.get(); // discard the newline left behind by cin >> lang
    cout << q1;
    getline(cin, filenamein);

    cout << q2;
    getline(cin, filenameout);

    cout << q3;
    getline(cin, keyword);
}
void Search() // searching database dump
{
    ifstream FileIn(filenamein.c_str()); // Open dump
    if (!FileIn) // if something goes wrong with file opening
    {
        cout << problem;
        return; // nothing to search
    }

    ofstream FileOut(filenameout.c_str(), ios::app); // Open output file (append)

    FileOut << "== " << keyword << " ==\n"; // headline to file
    cout << looking << keyword << endl; // what we are doing

    boost::regex rexp(keyword); // compile the pattern once, not once per line
    boost::smatch tokens;

    while (getline(FileIn, line)) // reading file line by line
    {
        // checking to see if it's a pagename: "    <title>Name</title>"
        if (line.compare(0, 11, "    <title>") == 0)
        {
            title = line; // saving any pagenames
        }

        // if keyword is found on a page whose name is not already stored
        if (boost::regex_search(line, tokens, rexp) && title.length() >= 19 && title2 != title)
        {
            // removing xml tags: 11 characters before the name, 8 after
            string pagename = title.substr(11, title.length() - 19);

            FileOut << "* [[" << pagename << "]]\n"; // wikiformatting, to file
            cout << "* [[" << pagename << "]]\n"; // wikiformatting, to screen

            nooftitles = nooftitles + 1; // counting articles
            title2 = title; // saving new title
        }
    }

    FileOut << endl << nooftitles << found << endl; // printing number of articles to file
    cout << endl << nooftitles << found; // printing number of articles to screen
}
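
The numbers 11 and 19 in the listing above come from the layout of a title line in the dump: four spaces plus "<title>" make 11 characters before the page name, and "</title>" adds 8 more after it. A minimal sketch of that extraction as a standalone function follows; the name extractTitle is hypothetical and not part of WikiFind, and the sketch assumes the title element sits alone on a line with exactly four spaces of indentation.

```cpp
#include <string>

// Hypothetical helper (not part of WikiFind): return the page name from a dump
// line of the form "    <title>Name</title>", or an empty string otherwise.
std::string extractTitle(const std::string& line)
{
    const std::string open = "    <title>"; // 11 characters
    const std::string close = "</title>";   // 8 characters

    if (line.size() < open.size() + close.size())
        return "";
    if (line.compare(0, open.size(), open) != 0)
        return "";
    if (line.compare(line.size() - close.size(), close.size(), close) != 0)
        return "";

    // Keep only what lies between the opening and closing tags.
    return line.substr(open.size(), line.size() - open.size() - close.size());
}
```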


Source version 2


Copy/paste below or download from here.

////////////////////////////////////////////////////////////
//	WikiFind is a program used for reading database dumps
//	from MediaWiki, written by Mikael Nordin, licensed under
//	the GNU General Public License (GPL) version 3,
//	or any later version.
//	Copyright Mikael Nordin 2008.
#include <iostream> // for cin, cout and cerr
#include <string> // for strings
#include <fstream> // for ofstream
#include <boost/regex.hpp> // for regex
using namespace std;

string line, title; // Global variables
string title2 = "qqqqqqxxxxwpppppzzzzzwwwwqqqq"; // title not likely to exist
int nooftitles = 0;

void Search(string filenameout, string keyword);

int main(int argc, char *argv[]) // main function
{
    if (argc < 3) // both arguments are required
    {
        cerr << "Usage: " << argv[0] << " <output file> <keyword or regex>\n";
        return 1;
    }

    string filenameout = argv[1];
    string keyword = argv[2];
    Search(filenameout, keyword); // searching standard input
    return 0;
}
void Search(string filenameout, string keyword) // searching database dump
{
    ofstream FileOut(filenameout.c_str(), ios::app); // Open output file (append)

    FileOut << "== " << keyword << " ==\n"; // headline to file
    cout << "Looking for: " << keyword << endl; // what we are doing

    boost::regex rexp(keyword); // compile the pattern once, not once per line
    boost::smatch tokens;

    while (getline(cin, line)) // reading standard input line by line
    {
        // checking to see if it's a pagename: "    <title>Name</title>"
        if (line.compare(0, 11, "    <title>") == 0)
        {
            title = line; // saving any pagenames
        }

        // if keyword is found on a page whose name is not already stored
        if (boost::regex_search(line, tokens, rexp) && title.length() >= 19 && title2 != title)
        {
            // removing xml tags: 11 characters before the name, 8 after
            string pagename = title.substr(11, title.length() - 19);

            FileOut << "* [[" << pagename << "]]\n"; // wikiformatting, to file
            cout << "* [[" << pagename << "]]\n"; // wikiformatting, to screen

            nooftitles = nooftitles + 1; // counting articles
            title2 = title; // saving new title
        }
    }

    FileOut << endl << nooftitles << " pages found" << endl; // printing number of articles to file
    cout << endl << nooftitles << " pages found" << endl; // printing number of articles to screen
}
