Data dumps/mwimport
Description
mwimport is a Perl script to import MediaWiki XML dumps. Its purpose in life is to be faster than mwdumper. It is, however, much less general than mwdumper, specifically:
- It is not a general XML parser; it simply parses the XML dumps available for download (tested with pages-articles.xml of enwiki/2006-11-30, enwikibooks/2006-12-23, enwiktionary/2006-12-24, and dewiki/2007-01-24).
- Minor variations in the dump format will break it.
- It cannot read compressed data: a decompression tool must be placed before it in the pipe.
- It can only output SQL statements for MediaWiki database schema version 1.5 (which dates from 2005).
- It is written in Perl, which you may have to install first.
- So far it has mainly been tested on Linux. (Update: it runs on Windows 2000 as well; see the discussion page.) Also tested and working on Mac OS X 10.5.2.
- It has been unmaintained since 2007.
Secondary benefits over mwdumper:
- You can interrupt it, and it will report the number of successfully written pages.
- In a subsequent run you can instruct it to skip that many pages.
- It uses much less memory (about one ninth as much), which may well be a major contribution to its speed.
If the limitations do not bother you, here is some backing for the "it's faster" claim. All results are elapsed seconds (smaller is better), averaged over five runs on the same machine on an uncompressed XML dump. mwdumper was started with --quiet --format=sql:1.5. The first two result columns come from throwing the generated SQL away (>/dev/null), while the "& mysql" columns reflect "real-world" usage, i.e. piping the output into a mysql process, which is obviously slower.
dump | size [MB] | pages | mwdumper [s] | mwimport [s] | mwdumper & mysql [s] | mwimport & mysql [s]
---|---|---|---|---|---|---
enwikibooks-20061223 | 153 | 50255 | 207.113 | 37.879 | 232.358 | 55.012
enwiktionary-20061224 | 317 | 401519 | 623.063 | 203.191 | 863.120 | 342.529
dewiki-20070124 | 2832 | 1090612 | 4163.042 | 735.641 | 7408.999 | 2974.708
So mwimport alone appears to be about 5.5 times faster on dumps with large pages and about 3.2 times faster on dumps with small pages. Adding mysql to the mix closes the gap somewhat, but mwimport is still 2.5 times faster in the worst case (dewiki), and mwimport piped into mysql is always faster than mwdumper alone.
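For orientation, a single measurement like those in the first two result columns can be reproduced along these lines (the mwdumper jar name and the dump file are placeholders):
time java -jar mwdumper.jar --quiet --format=sql:1.5 < dewiki-20070124-pages-articles.xml > /dev/null
time perl mwimport.pl < dewiki-20070124-pages-articles.xml > /dev/null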
Example usage:
bzcat enwiki-20120104-pages-articles.xml.bz2 | perl mwimport.pl | gzip -c > /media/wikipedia/enwiki.sql.gz
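The compressed SQL produced this way can later be loaded into the wiki database with something along these lines (credentials and database name are placeholders):
zcat /media/wikipedia/enwiki.sql.gz | mysql -u<admin name> -p<admin password> --default-character-set=utf8 <database name>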
NOTE: On some MediaWiki installations you may need to pass the -f switch to mysql when loading mwimport output. mwimport works very well with most XML dumps provided by the Foundation, but if mysql complains about duplicate index keys (key 1 is typically the one reported), use the following syntax to get around the problem:
cat enwiki-<date>.xml | perl mwimport.pl | mysql -f -u<admin name> -p<admin password> --default-character-set=utf8 <database name>
The -f (force) flag does not overwrite the duplicate record in the database, but it allows the remaining articles to be imported without aborting the import process.
To skip a particular number of pages, use -s number_of_pages_to_skip, as documented at the head of the code.
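For example, if an interrupted run reported that 500000 pages had been committed, the import could be resumed roughly as follows (the page count is only illustrative; credentials and database name are placeholders as above):
bzcat enwiki-20120104-pages-articles.xml.bz2 | perl mwimport.pl -s 500000 | mysql -f -u<admin name> -p<admin password> --default-character-set=utf8 <database name>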
Source
NOTE: Copy the rendered text of this program when cutting and pasting; do not copy the wikitext. In the wikitext version the &quot;, &lt;, and &gt; XML entities are formatted differently and will cause Perl processing errors.
If you are importing content in a particular language, you will want to add the localized forms of the word "REDIRECT" to the two lines that begin
$page{redirect} = $page{latest_start}...
On both lines add both the upper-case and lower-case forms of the word, because this script is not UTF-8 aware, so case-insensitive matching will not work properly on the second of the two lines.
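For instance, for a German-language dump, where redirects typically begin with "#WEITERLEITUNG", the two lines could be adapted roughly as follows (a sketch; substitute the redirect keyword of your wiki's language):
$page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|redirect|WEITERLEITUNG|weiterleitung) / ? 1 : 0;
$page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|WEITERLEITUNG|weiterleitung) /i ? 1 : 0;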
#!/usr/bin/perl -w

=head1 NAME

mwimport -- quick and dirty mediawiki importer

=head1 SYNOPSIS

 cat pages.xml | mwimport [-s N|--skip=N]

=cut

use strict;
use Getopt::Long;
use Pod::Usage;

my ($cnt_page, $cnt_rev, %namespace, $ns_pattern);
my $committed = 0;
my $skip = 0;

## set this to 1 to match "mwdumper --format=sql:1.5" as close as possible
sub Compat() {0}

# 512kB is what mwdumper uses, but 4MB gives much better performance here
my $Buffer_Size = Compat ? 512*1024 : 4*1024*1024;

sub textify($) {
  my $l;
  for ($_[0]) {
    if (defined $_) {
      s/&quot;/"/ig;
      s/&lt;/</ig;
      s/&gt;/>/ig;
      /&(?!amp;)(.*?;)/ and die "textify: does not know &$1";
      s/&amp;/&/ig;
      $l = length $_;
      s/\\/\\\\/g;
      s/\n/\\n/g;
      s/'/\\'/ig;
      Compat and s/"/\\"/ig;
      $_ = "'$_'";
    } else {
      $l = 0;
      $_ = "''";
    }
  }
  return $l;
}

sub getline() {
  $_ = <>;
  defined $_ or die "eof at line $.\n";
}

sub ignore_elt($) {
  m|^\s*<$_[0]>.*?</$_[0]>\n$| or die "expected $_[0] element in line $.\n";
  getline;
}

sub simple_elt($$) {
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    die "expected $_[0] element in line $.\n";
  }
  getline;
}

sub simple_opt_elt($$) {
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    return;
  }
  getline;
}

sub redirect_elt($) {
  if (m|^\s*<redirect\s*title="([^"]*)"\s*/>\n$|) { # " -- GeSHI syntax highlighting breaks on this line
    $_[0]{redirect} = $1;
  } else {
    simple_opt_elt redirect => $_[0];
    return;
  }
  getline;
}

sub opening_tag($) {
  m|^\s*<$_[0]>\n$| or die "expected $_[0] element in line $.\n";
  getline;
}

sub closing_tag($) {
  m|^\s*</$_[0]>\n$| or die "$_[0]: expected closing tag in line $.\n";
  getline;
}

sub si_nss_namespace() {
  m|^\s*<namespace key="(-?\d+)"[^/]*?/>()\n|
    or m|^\s*<namespace key="(-?\d+)"[^>]*?>(.*?)</namespace>\n|
    or die "expected namespace element in line $.\n";
  $namespace{$2} = $1;
  getline;
}

sub si_namespaces() {
  opening_tag("namespaces");
  eval {
    while (1) {
      si_nss_namespace;
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected namespace element / or die "namespaces: $@";
  $ns_pattern = '^('.join('|', map {quotemeta} keys %namespace).'):';
  closing_tag("namespaces");
}

sub siteinfo() {
  opening_tag("siteinfo");
  eval {
    my %site;
    simple_elt sitename => \%site;
    simple_elt dbname => \%site;
    simple_elt base => \%site;
    simple_elt generator => \%site;
    $site{generator} =~ /^MediaWiki 1.20wmf1$/
      or warn("siteinfo: untested generator '$site{generator}',",
              " expect trouble ahead\n");
    simple_elt case => \%site;
    si_namespaces;
    print "-- MediaWiki XML dump converted to SQL by mwimport
BEGIN;
-- Site: $site{sitename}
-- DBName: $site{dbname}
-- URL: $site{base}
-- Generator: $site{generator}
-- Case: $site{case}
--
-- Namespaces:
", map {"-- $namespace{$_}: $_\n"}
     sort {$namespace{$a} <=> $namespace{$b}} keys %namespace;
  };
  $@ and die "siteinfo: $@";
  closing_tag("siteinfo");
}

sub pg_rv_contributor($) {
  if (m|^\s*<contributor deleted="deleted"\s*/>\s*\n|) {
    getline;
  } else {
    opening_tag "contributor";
    my %c;
    eval {
      simple_elt username => \%c;
      simple_elt id => \%c;
      $_[0]{contrib_user} = $c{username};
      $_[0]{contrib_id} = $c{id};
    };
    if ($@) {
      $@ =~ /^expected username element / or die "contributor: $@";
      eval {
        simple_elt ip => \%c;
        $_[0]{contrib_user} = $c{ip};
      };
      $@ and die "contributor: $@";
    }
    closing_tag "contributor";
  }
}

sub pg_rv_comment($) {
  if (m|^\s*<comment\s*/>\s*\n|) {
    getline;
  } elsif (m|^\s*<comment deleted="deleted"\s*/>\s*\n|) {
    getline;
  } elsif (s|^\s*<comment>([^<]*)||g) {
    while (1) {
      $_[0]{comment} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "comment";
  } else {
    return;
  }
}

sub pg_rv_text($) {
  if (m|^\s*<text xml:space="preserve"\s*/>\s*\n|) {
    $_[0]{text} = '';
    getline;
  } elsif (m|^\s*<text deleted="deleted"\s*/>\s*\n|) {
    $_[0]{text} = '';
    getline;
  } elsif (s|^\s*<text xml:space="preserve">([^<]*)||g) {
    while (1) {
      $_[0]{text} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "text";
  } else {
    die "expected text element in line $.\n";
  }
}

my $start = time;

sub stats() {
  my $s = time-$start;
  $s ||= 1;
  printf STDERR "%9d pages (%7.3f/s), %9d revisions (%7.3f/s) in %d seconds\n",
    $cnt_page, $cnt_page/$s, $cnt_rev, $cnt_rev/$s, $s;
}

### flush_rev($text, $rev, $page)
sub flush_rev($$$) {
  $_[0] or return;
  for my $i (0, 1, 2) {
    $_[$i] =~ s/,\n?$//;
  }
  print "INSERT INTO text(old_id,old_text,old_flags) VALUES $_[0];\n";
  $_[2] and print "INSERT INTO page(page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES $_[2];\n";
  print "INSERT INTO revision(rev_id,rev_page,rev_text_id,rev_comment,rev_user,rev_user_text,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id) VALUES $_[1];\n";
  for my $i (0, 1, 2) {
    $_[$i] = '';
  }
}

### flush($text, $rev, $page)
sub flush($$$) {
  flush_rev $_[0], $_[1], $_[2];
  print "COMMIT;\n";
  $committed = $cnt_page;
}

### pg_revision(\%page, $skip, $text, $rev, $page)
sub pg_revision($$$$$) {
  my $rev = {};
  opening_tag "revision";
  eval {
    my %revision;
    simple_elt id => $rev;
    simple_opt_elt parentid => $rev;
    simple_elt timestamp => $rev;
    pg_rv_contributor $rev;
    simple_opt_elt minor => $rev;
    pg_rv_comment $rev;
    simple_opt_elt model => $rev;
    simple_opt_elt format => $rev;
    pg_rv_text $rev;
    simple_opt_elt sha1 => $rev;
  };
  $@ and die "revision: $@";
  closing_tag "revision";
  $_[1] and return;
  $$rev{id} =~ /^\d+$/ or return
    warn("page '$_[0]{title}': ignoring bogus revision id '$$rev{id}'\n");
  $_[0]{latest_len} = textify $$rev{text};
  for my $f (qw(comment contrib_user)) {
    textify $$rev{$f};
  }
  $$rev{timestamp} =~ s/^(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/'$1$2$3$4$5$6'/
    or return warn("page '$_[0]{title}' rev $$rev{id}: ",
                   "bogus timestamp '$$rev{timestamp}'\n");
  $_[2] .= "($$rev{id},$$rev{text},'utf-8'),\n";
  $$rev{minor} = defined $$rev{minor} ? 1 : 0;
  $_[3] .= "($$rev{id},$_[0]{id},$$rev{id},$$rev{comment},"
    .($$rev{contrib_id}||0)
    .",$$rev{contrib_user},$$rev{timestamp},$$rev{minor},0,$_[0]{latest_len},$_[0]{latest}),\n";
  $_[0]{latest} = $$rev{id};
  $_[0]{latest_start} = substr $$rev{text}, 0, 60;
  if (length $_[2] > $Buffer_Size) {
    flush_rev $_[2], $_[3], $_[4];
    $_[0]{do_commit} = 1;
  }
  ++$cnt_rev % 1000 == 0 and stats;
}

### page($text, $rev, $page)
sub page($$$) {
  opening_tag "page";
  my %page;
  ++$cnt_page;
  eval {
    simple_elt title => \%page;
    simple_opt_elt ns => \%page;
    simple_elt id => \%page;
    redirect_elt \%page;
    simple_opt_elt restrictions => \%page;
    $page{latest} = 0;
    while (1) {
      pg_revision \%page, $skip, $_[0], $_[1], $_[2];
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected revision element / or die "page: $@";
  closing_tag "page";
  if ($skip) {
    --$skip;
  } else {
    $page{title} or return;
    $page{id} =~ /^\d+$/
      or warn("page '$page{title}': bogus id '$page{id}'\n");
    my $ns;
    if ($page{title} =~ s/$ns_pattern//o) {
      $ns = $namespace{$1};
    } else {
      $ns = 0;
    }
    for my $f (qw(title restrictions)) {
      textify $page{$f};
    }
    if (Compat) {
      $page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|redirect) / ? 1 : 0;
    } else {
      $page{redirect} = $page{latest_start} =~ /^'#REDIRECT /i ? 1 : 0;
    }
    $page{title} =~ y/ /_/;
    if (Compat) {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),"
        ."DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,"
        ."$page{latest},$page{latest_len}),\n";
    } else {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),NOW()+0,$page{latest},$page{latest_len}),\n";
    }
    if ($page{do_commit}) {
      flush $_[0], $_[1], $_[2];
      print "BEGIN;\n";
    }
  }
}

sub terminate {
  die "terminated by SIG$_[0]\n";
}

my $SchemaVer = '0.10';
my $SchemaLoc = "http://www.mediawiki.org/xml/export-$SchemaVer/";
my $Schema = "http://www.mediawiki.org/xml/export-$SchemaVer.xsd";
my $help;
GetOptions("skip=i" => \$skip,
           "help" => \$help) or pod2usage(2);
$help and pod2usage(1);

getline;
m|^<mediawiki \Qxmlns="$SchemaLoc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="$SchemaLoc $Schema" version="$SchemaVer"\E xml:lang="..">$|
  or die "unknown schema or invalid first line\n";
getline;
$SIG{TERM} = $SIG{INT} = \&terminate;
siteinfo;
my ($text, $rev, $page) = ('', '', '');
eval {
  while (1) {
    page $text, $rev, $page;
  }
};
$@ =~ /^expected page element / or die "$@ (committed $committed pages)\n";
flush $text, $rev, $page;
stats;
m|</mediawiki>| or die "mediawiki: expected closing tag in line $.\n";

=head1 COPYRIGHT

Copyright 2007 by Robert Bihlmeyer

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

You may also redistribute and/or modify this software under the terms
of the GNU Free Documentation License without invariant sections, and
without front-cover or back-cover texts.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.