Data dumps/mwimport
Description
mwimport is a Perl script to import MediaWiki XML dumps. Its purpose in life is to be faster than mwdumper. It is, however, much less general than mwdumper, specifically:
- It is not a general XML parser; it simply parses the XML dumps available for download (tested with pages-articles.xml of enwiki/2006-11-30, enwikibooks/2006-12-23, enwiktionary/2006-12-24, and dewiki/2007-01-24).
- Minor variations in the dump format will break it.
- It cannot read compressed data: a decompression tool must be placed before it in the pipe.
- It can only output SQL statements for MediaWiki database schema version 1.5 (which dates from 2005).
- It is written in Perl, which you may have to install first.
- So far it has mainly been tested on Linux. (Update: it runs on Windows 2000 as well; see the discussion page.) Also tested and working on Mac OS X 10.5.2.
- It has been unmaintained since 2007.
Secondary benefits over mwdumper:
- You can interrupt it, and it will report the number of successfully written pages.
- In a subsequent run you can instruct it to skip that many pages.
- It uses much less memory (about one ninth as much), which may well be a major contribution to its speed.
If the limitations do not bother you, here is some backing for the "it's faster" claim. All results are elapsed seconds (smaller is better), averaged over five runs on the same machine on an uncompressed XML dump. mwdumper was started with --quiet --format=sql:1.5. The first two result columns come from throwing the generated SQL away (>/dev/null), while the "& mysql" columns reflect "real-world" usage, i.e. piping the output into a mysql process, which is obviously slower.
dump | size [MB] | pages | mwdumper [s] | mwimport [s] | mwdumper & mysql [s] | mwimport & mysql [s]
---|---|---|---|---|---|---
enwikibooks-20061223 | 153 | 50255 | 207.113 | 37.879 | 232.358 | 55.012
enwiktionary-20061224 | 317 | 401519 | 623.063 | 203.191 | 863.120 | 342.529
dewiki-20070124 | 2832 | 1090612 | 4163.042 | 735.641 | 7408.999 | 2974.708
So mwimport alone appears to be about 5.5 times faster on dumps with large pages and about 3.2 times faster on dumps with small pages. Adding mysql to the mix closes the gap somewhat, but mwimport is still 2.5 times faster in the worst case (dewiki), and mwimport piped into mysql is always faster than mwdumper alone.
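For orientation, a single measurement like those in the first two result columns can be reproduced along these lines (the mwdumper jar name and the dump file are placeholders):
time java -jar mwdumper.jar --quiet --format=sql:1.5 < dewiki-20070124-pages-articles.xml > /dev/null
time perl mwimport.pl < dewiki-20070124-pages-articles.xml > /dev/null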
Example usage:
bzcat enwiki-20120104-pages-articles.xml.bz2 | perl mwimport.pl | gzip -c > /media/wikipedia/enwiki.sql.gz
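The compressed SQL produced this way can later be loaded into the wiki database with something along these lines (credentials and database name are placeholders):
zcat /media/wikipedia/enwiki.sql.gz | mysql -u<admin name> -p<admin password> --default-character-set=utf8 <database name>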
NOTE: On some MediaWiki installations you may need to pass the -f switch to mysql when loading mwimport output. mwimport works very well with most XML dumps provided by the Foundation, but if mysql complains about duplicate index keys (key 1 is typically the one reported), use the following syntax to get around the problem:
cat enwiki-<date>.xml | perl mwimport.pl | mysql -f -u<admin name> -p<admin password> --default-character-set=utf8 <database name>
The -f (force) flag does not overwrite the duplicate record in the database, but it allows the remaining articles to be imported without aborting the import process.
To skip a particular number of pages, use -s number_of_pages_to_skip, as documented at the head of the code.
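For example, if an interrupted run reported that 500000 pages had been committed, the import could be resumed roughly as follows (the page count is only illustrative; credentials and database name are placeholders as above):
bzcat enwiki-20120104-pages-articles.xml.bz2 | perl mwimport.pl -s 500000 | mysql -f -u<admin name> -p<admin password> --default-character-set=utf8 <database name>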
Source
NOTE: Copy the rendered text of this program when cutting and pasting; do not copy the wikitext. In the wikitext version the &quot;, &lt;, and &gt; XML entities are formatted differently and will cause Perl processing errors.
If you are importing content in a particular language, you will want to add the localized forms of the word "REDIRECT" to the two lines that begin
$page{redirect} = $page{latest_start}...
On both lines add both the upper-case and lower-case forms of the word, because this script is not UTF-8 aware, so case-insensitive matching will not work properly on the second of the two lines.
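For instance, for a German-language dump, where redirects typically begin with "#WEITERLEITUNG", the two lines could be adapted roughly as follows (a sketch; substitute the redirect keyword of your wiki's language):
$page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|redirect|WEITERLEITUNG|weiterleitung) / ? 1 : 0;
$page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|WEITERLEITUNG|weiterleitung) /i ? 1 : 0;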
#!/usr/bin/perl -w

=head1 NAME

mwimport -- quick and dirty mediawiki importer

=head1 SYNOPSIS

 cat pages.xml | mwimport [-s N|--skip=N]

=cut

use strict;
use Getopt::Long;
use Pod::Usage;

my ($cnt_page, $cnt_rev, %namespace, $ns_pattern);
my $committed = 0;
my $skip = 0;

## set this to 1 to match "mwdumper --format=sql:1.5" as close as possible
sub Compat() {0}

# 512kB is what mwdumper uses, but 4MB gives much better performance here
my $Buffer_Size = Compat ? 512*1024 : 4*1024*1024;

sub textify($) {
  my $l;
  for ($_[0]) {
    if (defined $_) {
      s/&quot;/"/ig;
      s/&lt;/</ig;
      s/&gt;/>/ig;
      /&(?!amp;)(.*?;)/ and die "textify: does not know &$1";
      s/&amp;/&/ig;
      $l = length $_;
      s/\\/\\\\/g;
      s/\n/\\n/g;
      s/'/\\'/ig;
      Compat and s/"/\\"/ig;
      $_ = "'$_'";
    } else {
      $l = 0;
      $_ = "''";
    }
  }
  return $l;
}

sub getline() {
  $_ = <>;
  defined $_ or die "eof at line $.\n";
}

sub ignore_elt($) {
  m|^\s*<$_[0]>.*?</$_[0]>\n$| or die "expected $_[0] element in line $.\n";
  getline;
}

sub simple_elt($$) {
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    die "expected $_[0] element in line $.\n";
  }
  getline;
}

sub simple_opt_elt($$) {
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    return;
  }
  getline;
}

sub redirect_elt($) {
  if (m|^\s*<redirect\s*title="([^"]*)"\s*/>\n$|) { # " -- GeSHI syntax highlighting breaks on this line
    $_[0]{redirect} = $1;
  } else {
    simple_opt_elt redirect => $_[0];
    return;
  }
  getline;
}

sub opening_tag($) {
  m|^\s*<$_[0]>\n$| or die "expected $_[0] element in line $.\n";
  getline;
}

sub closing_tag($) {
  m|^\s*</$_[0]>\n$| or die "$_[0]: expected closing tag in line $.\n";
  getline;
}

sub si_nss_namespace() {
  m|^\s*<namespace key="(-?\d+)"[^/]*?/>()\n|
    or m|^\s*<namespace key="(-?\d+)"[^>]*?>(.*?)</namespace>\n|
    or die "expected namespace element in line $.\n";
  $namespace{$2} = $1;
  getline;
}

sub si_namespaces() {
  opening_tag("namespaces");
  eval {
    while (1) {
      si_nss_namespace;
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected namespace element / or die "namespaces: $@";
  $ns_pattern = '^('.join('|', map {quotemeta} keys %namespace).'):';
  closing_tag("namespaces");
}

sub siteinfo() {
  opening_tag("siteinfo");
  eval {
    my %site;
    simple_elt sitename => \%site;
    simple_elt dbname => \%site;
    simple_elt base => \%site;
    simple_elt generator => \%site;
    $site{generator} =~ /^MediaWiki 1.20wmf1$/
      or warn("siteinfo: untested generator '$site{generator}',",
              " expect trouble ahead\n");
    simple_elt case => \%site;
    si_namespaces;
    print "-- MediaWiki XML dump converted to SQL by mwimport
BEGIN;
-- Site: $site{sitename}
-- DBName: $site{dbname}
-- URL: $site{base}
-- Generator: $site{generator}
-- Case: $site{case}
--
-- Namespaces:
", map {"-- $namespace{$_}: $_\n"}
     sort {$namespace{$a} <=> $namespace{$b}} keys %namespace;
  };
  $@ and die "siteinfo: $@";
  closing_tag("siteinfo");
}

sub pg_rv_contributor($) {
  if (m|^\s*<contributor deleted="deleted"\s*/>\s*\n|) {
    getline;
  } else {
    opening_tag "contributor";
    my %c;
    eval {
      simple_elt username => \%c;
      simple_elt id => \%c;
      $_[0]{contrib_user} = $c{username};
      $_[0]{contrib_id} = $c{id};
    };
    if ($@) {
      $@ =~ /^expected username element / or die "contributor: $@";
      eval {
        simple_elt ip => \%c;
        $_[0]{contrib_user} = $c{ip};
      };
      $@ and die "contributor: $@";
    }
    closing_tag "contributor";
  }
}

sub pg_rv_comment($) {
  if (m|^\s*<comment\s*/>\s*\n|) {
    getline;
  } elsif (m|^\s*<comment deleted="deleted"\s*/>\s*\n|) {
    getline;
  } elsif (s|^\s*<comment>([^<]*)||g) {
    while (1) {
      $_[0]{comment} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "comment";
  } else {
    return;
  }
}

sub pg_rv_text($) {
  if (m|^\s*<text xml:space="preserve"\s*/>\s*\n|) {
    $_[0]{text} = '';
    getline;
  } elsif (m|^\s*<text deleted="deleted"\s*/>\s*\n|) {
    $_[0]{text} = '';
    getline;
  } elsif (s|^\s*<text xml:space="preserve">([^<]*)||g) {
    while (1) {
      $_[0]{text} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "text";
  } else {
    die "expected text element in line $.\n";
  }
}

my $start = time;

sub stats() {
  my $s = time-$start;
  $s ||= 1;
  printf STDERR "%9d pages (%7.3f/s), %9d revisions (%7.3f/s) in %d seconds\n",
    $cnt_page, $cnt_page/$s, $cnt_rev, $cnt_rev/$s, $s;
}

### flush_rev($text, $rev, $page)
sub flush_rev($$$) {
  $_[0] or return;
  for my $i (0, 1, 2) {
    $_[$i] =~ s/,\n?$//;
  }
  print "INSERT INTO text(old_id,old_text,old_flags) VALUES $_[0];\n";
  $_[2] and print "INSERT INTO page(page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES $_[2];\n";
  print "INSERT INTO revision(rev_id,rev_page,rev_text_id,rev_comment,rev_user,rev_user_text,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id) VALUES $_[1];\n";
  for my $i (0, 1, 2) {
    $_[$i] = '';
  }
}

### flush($text, $rev, $page)
sub flush($$$) {
  flush_rev $_[0], $_[1], $_[2];
  print "COMMIT;\n";
  $committed = $cnt_page;
}

### pg_revision(\%page, $skip, $text, $rev, $page)
sub pg_revision($$$$$) {
  my $rev = {};
  opening_tag "revision";
  eval {
    my %revision;
    simple_elt id => $rev;
    simple_opt_elt parentid => $rev;
    simple_elt timestamp => $rev;
    pg_rv_contributor $rev;
    simple_opt_elt minor => $rev;
    pg_rv_comment $rev;
    simple_opt_elt model => $rev;
    simple_opt_elt format => $rev;
    pg_rv_text $rev;
    simple_opt_elt sha1 => $rev;
  };
  $@ and die "revision: $@";
  closing_tag "revision";
  $_[1] and return;
  $$rev{id} =~ /^\d+$/ or return
    warn("page '$_[0]{title}': ignoring bogus revision id '$$rev{id}'\n");
  $_[0]{latest_len} = textify $$rev{text};
  for my $f (qw(comment contrib_user)) {
    textify $$rev{$f};
  }
  $$rev{timestamp} =~ s/^(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/'$1$2$3$4$5$6'/
    or return warn("page '$_[0]{title}' rev $$rev{id}: ",
                   "bogus timestamp '$$rev{timestamp}'\n");
  $_[2] .= "($$rev{id},$$rev{text},'utf-8'),\n";
  $$rev{minor} = defined $$rev{minor} ? 1 : 0;
  $_[3] .= "($$rev{id},$_[0]{id},$$rev{id},$$rev{comment},"
    .($$rev{contrib_id}||0)
    .",$$rev{contrib_user},$$rev{timestamp},$$rev{minor},0,$_[0]{latest_len},$_[0]{latest}),\n";
  $_[0]{latest} = $$rev{id};
  $_[0]{latest_start} = substr $$rev{text}, 0, 60;
  if (length $_[2] > $Buffer_Size) {
    flush_rev $_[2], $_[3], $_[4];
    $_[0]{do_commit} = 1;
  }
  ++$cnt_rev % 1000 == 0 and stats;
}

### page($text, $rev, $page)
sub page($$$) {
  opening_tag "page";
  my %page;
  ++$cnt_page;
  eval {
    simple_elt title => \%page;
    simple_opt_elt ns => \%page;
    simple_elt id => \%page;
    redirect_elt \%page;
    simple_opt_elt restrictions => \%page;
    $page{latest} = 0;
    while (1) {
      pg_revision \%page, $skip, $_[0], $_[1], $_[2];
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected revision element / or die "page: $@";
  closing_tag "page";
  if ($skip) {
    --$skip;
  } else {
    $page{title} or return;
    $page{id} =~ /^\d+$/
      or warn("page '$page{title}': bogus id '$page{id}'\n");
    my $ns;
    if ($page{title} =~ s/$ns_pattern//o) {
      $ns = $namespace{$1};
    } else {
      $ns = 0;
    }
    for my $f (qw(title restrictions)) {
      textify $page{$f};
    }
    if (Compat) {
      $page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|redirect) / ? 1 : 0;
    } else {
      $page{redirect} = $page{latest_start} =~ /^'#REDIRECT /i ? 1 : 0;
    }
    $page{title} =~ y/ /_/;
    if (Compat) {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),"
        ."DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,"
        ."$page{latest},$page{latest_len}),\n";
    } else {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),NOW()+0,$page{latest},$page{latest_len}),\n";
    }
    if ($page{do_commit}) {
      flush $_[0], $_[1], $_[2];
      print "BEGIN;\n";
    }
  }
}

sub terminate {
  die "terminated by SIG$_[0]\n";
}

my $SchemaVer = '0.10';
my $SchemaLoc = "http://www.mediawiki.org/xml/export-$SchemaVer/";
my $Schema = "http://www.mediawiki.org/xml/export-$SchemaVer.xsd";
my $help;
GetOptions("skip=i" => \$skip,
           "help" => \$help) or pod2usage(2);
$help and pod2usage(1);

getline;
m|^<mediawiki \Qxmlns="$SchemaLoc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="$SchemaLoc $Schema" version="$SchemaVer"\E xml:lang="..">$|
  or die "unknown schema or invalid first line\n";
getline;
$SIG{TERM} = $SIG{INT} = \&terminate;
siteinfo;
my ($text, $rev, $page) = ('', '', '');
eval {
  while (1) {
    page $text, $rev, $page;
  }
};
$@ =~ /^expected page element / or die "$@ (committed $committed pages)\n";
flush $text, $rev, $page;
stats;
m|</mediawiki>| or die "mediawiki: expected closing tag in line $.\n";

=head1 COPYRIGHT

Copyright 2007 by Robert Bihlmeyer

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

You may also redistribute and/or modify this software under the terms
of the GNU Free Documentation License without invariant sections, and
without front-cover or back-cover texts.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.