Data dumps
Wikimedia provides public dumps of our wiki's content:
- for archival/backup purposes
- for offline use
- for academic research
- for republishing (don't forget to follow the license terms)
- for fun!
The timezone of the file dates is UTC.
The dump files can be downloaded from http://download.wikimedia.org/
If you cannot download the dump you want because it does not exist (it has been discontinued, or it hasn't completed successfully in the past few months), you may be able to request a new dump: see Requests for dumps. You can also use that page to request a dump via snail mail if your connection is too slow or intermittent to download the dump you want.
Schedule
Starting January 23, 2006, dumps will be run approximately once a week. Since the whole process takes more than a week for all databases, not all databases become available at the same time.
- Note that the larger databases such as enwiki, dewiki, and frwiki can take a long time to run, especially when compressing the full edit history. If you see it stuck on one of these for a few hours, or up to nine days, don't worry -- it's not dead, it's just a lot of data.
The download site at http://download.wikimedia.org/ shows the status of each dump, if it's in progress, when it was last dumped, etc.
What's available?
- Page content
- Page-to-page link lists (pagelinks, categorylinks, imagelinks tables)
- Image metadata (image, oldimage tables)
- Misc bits (interwiki, site_stats tables)
What else?
The following may or may not be available, but is public data:
- Log data (protection, deletion, uploads) -- see logging.sql.gz
- Dump metadata (availability, schedule)
- Multi-language dumps (clusters of languages in one file)
What's not available?
- User data: passwords, e-mail addresses, preferences, watchlists, etc
- Deleted page content
At the moment, uploaded files are dealt with separately and somewhat less regularly, but we intend to produce upload dumps more regularly again in the future.
Format
The main page data is provided in the same XML wrapper format that Special:Export produces for individual pages. It's fairly self-explanatory to look at, but there is some documentation at Help:Export.
Three sets of page data are produced for each dump, depending on what you need:
- pages-articles.xml
- Contains current version of all article pages, templates, and other pages
- Excludes discussion pages ('Talk:') and user "home" pages ('User:')
- Recommended for republishing of content.
- pages-meta-current.xml
- Contains current version of all pages, including discussion and user "home" pages.
- pages-meta-history.xml
- Contains complete text of every revision of every page (can be very large!)
- Recommended for research and archives.
The XML itself contains complete, raw text of every revision, so in particular the full history files can be extremely large; en.wikipedia.org would run upwards of six hundred gigabytes raw. Currently we are compressing these XML streams with bzip2 (.bz2 files) and additionally for the full history dump SevenZip (.7z files).
SevenZip's LZMA compression produces significantly smaller files for the full-history dumps, but doesn't do better than bzip2 for our other files.
Several of the tables are also dumped with mysqldump should anyone find them useful (for the database definition, see the documentation [1]); the gzip-compressed SQL dumps (.sql.gz) can be read directly into a MySQL database but may be less convenient for other database formats.
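For example, one of these gzip-compressed table dumps can be loaded into an existing MySQL database from the shell. This is only a sketch; the dump file name and the database name "wikidb" are placeholders for whatever you actually downloaded and created:
zcat enwiki-latest-pagelinks.sql.gz | mysql -u root -p wikidb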
What happened to the SQL dumps?
In mid-2005 we upgraded the Wikimedia sites to MediaWiki 1.5, which uses a very different database layout than earlier versions. SQL dumps of the 'cur' and 'old' tables are no longer available because those tables no longer exist.
We don't provide direct dumps of the new 'page', 'revision', and 'text' tables either because aggressive changes to the backend storage make this extra difficult: much data is in fact indirection pointing to another database cluster, and deleted pages which we cannot reproduce may still be present in the raw internal database blobs. The XML dump format provides forward and backward compatibility without requiring authors of third-party dump processing or statistics tools to reproduce our every internal hack. If required, you can use the mwdumper tool (see below) to produce SQL statements compatible with the version 1.4 schema from an XML dump.
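As a hedged sketch of that conversion (the input file name is a placeholder; mwdumper is described in the Tools section below):
java -jar mwdumper.jar --format=sql:1.4 pages_current.xml.bz2 > dump-1.4.sql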
Tools
Note:
In MediaWiki versions prior to 1.7.x, the page import methods mentioned below do not automatically rebuild the auxiliary tables such as the link tables; later versions of MediaWiki rebuild these links via importDump.php. The non-private auxiliary tables are provided as gzipped SQL dumps which can be imported directly into MySQL.
See also Meta's notes on rebuilding link tables
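If your MediaWiki version ships maintenance/rebuildall.php, it can be used to rebuild the link tables after an import. Treat this as a sketch rather than a guaranteed recipe, since the script's availability and runtime vary by version and wiki size:
php maintenance/rebuildall.php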
importDump.php
MediaWiki 1.5 and above includes a command-line script, importDump.php, which can be used to import an XML page dump into the database. This requires first configuring and installing MediaWiki. It is also relatively slow; to import a large Wikipedia data dump into a fresh database you may wish to use mwdumper instead. However, mwdumper has compatibility problems with some Linux distributions and does not provide robust error reporting. Please see the installation notes concerning mwdumper for a list of known problems and the systems affected by them.
As an example invocation, when you have an XML file called temp.xml:
php maintenance/importDump.php < maintenance/temp.xml
As an example invocation, when you have an XML file called temp.xml, with an output log and running in the background:
php maintenance/importDump.php < maintenance/temp.xml >& progress.log &
NOTE: importDump.php has not worked correctly in all cases since MediaWiki 1.6.9, due to changes which were necessary in the base code that updates article links and image rendering. It will report NULL titles and crash if the dumps contain poorly formatted articles, articles which contain NULL titles, or articles which require URL expansion. Since it is impossible for the Wikipedia community to fully police all of the content placed on the site, many articles and talk pages will contain large tables of Unicode data, errors in the content itself, and articles in various states of completeness.
Common Problems and Solutions
I get NULL title errors and importDump.php crashes with an exception. What can I do to get it working?
One solution to get around these problems is to insert a "+" character into $wgLegalTitleChars in the LocalSettings.php file located in the root MediaWiki directory, to deal with NULL titles in articles and articles which require URL expansion:
$wgLegalTitleChars = " %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+"; <-- add the "+" character like this
Also, if you are seeing problems with importDump.php, it is helpful to get the proper output from MediaWiki about where the error is occurring. Add this setting to LocalSettings.php and restart importDump.php after the error to get more helpful error information for reporting or analyzing the problem:
$wgShowExceptionDetails = true;
I am still getting NULL title crashes with importDump.php after adding the '+' character. Are there any other solutions to this problem?
The articles contained in the XML dumps, and some of the templates used to create the Wikipedia: namespace entries, are particularly troublesome in some of the dumps. Many titles reference deleted articles from the Wikipedia: namespace which are the residuals of spamming and other attacks to deface Wikipedia. The PHP-based XML parser libraries on most Linux systems have built-in checks to reject garbage text strings with inconsistent XML tags. The best solution is to enable debugging and attempt to determine which specific article is causing problems. Enable the debug log and run importDump.php again; it should record the last article successfully imported in the debug.log file:
$wgDebugLogFile = "debug.log";
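With $wgDebugLogFile set as above, a quick way to see how far the import got is to watch the end of the log while importDump.php runs (this assumes debug.log is written in the directory from which you run the script):
tail -f debug.log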
importDump.php is very slow and will render images and TeX formulas while the articles are importing. On large dumps, such as the English Wikipedia (enwiki), it can take over a week to update an active wiki. As of MediaWiki 1.9.3 it still has severe problems with published dumps provided by the Wikimedia Foundation. In those cases where importDump.php does not work, it is advised to write your own utility (with lex or a C-based program, for example) to strip out articles with NULL titles from the dumps.
I am running importDump.php and it's really slow, and I am seeing a large number of what appear to be error messages in the logs, as if "TeX" were running, along with references to "radicaleye.com dvips". What can I do?
Some tips for increasing the performance of importDump.php when attempting to import enwiki dumps are:
- set $wgUseTeX = false; in LocalSettings.php before running importDump.php. This delays TeX formatting until someone actually tries to read an article from the web. Be certain to set this variable back to true after running importDump.php to re-enable TeX formatting for formulas and other content.
- do not run maintenance/rebuildImages.php until after the dump has been imported on a new MediaWiki installation if you have downloaded the image files. importDump.php will attempt to render the images if templates insert thumbnail images into articles with ImageMagick or convert, and this will slow down importing by several orders of magnitude.
- use a system with a minimum of 8GB of physical memory and at least 4GB of configured swap space to run the enwiki dumps.
- in the MySQL /etc/my.cnf file, make certain max_allowed_packet is set above 20M if possible. This variable limits the size an SQL request plus its data can be during import operations. If it is set too low, mysqld will abort any database connection that sends requests larger than this value and halt article importing from mwdumper or importDump.php. (You can check the current value with the command shown after the sample configuration below.)
- tune the MySQL my.cnf settings to maximize performance. A sample is provided below:
/etc/my.cnf

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1
set-variable = key_buffer_size=2GB
# during XML import, article size is directly related to this setting;
# max_allowed_packet=1GB is MAX for mysql 4.0 and above, 20M for 3.x
set-variable = max_allowed_packet=20M
set-variable = table_cache=256
set-variable = max_connections=500
innodb_data_file_path = ibdata1:10M:autoextend
# Set buffer pool size to 50-80% of your computer's memory
innodb_buffer_pool_size=2G
innodb_additional_mem_pool_size=40M
# Set the log file size to about 25% of the buffer pool size
innodb_log_file_size=250M
innodb_log_buffer_size=8M
#
innodb_flush_log_at_trx_commit=1

[mysql.server]
user=mysql
basedir=/var/lib

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
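To confirm that a max_allowed_packet change has actually taken effect (as recommended in the tips above), you can ask the running server directly; this is a generic MySQL command, not specific to MediaWiki:
mysql -u root -p -e "SHOW VARIABLES LIKE 'max_allowed_packet';"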
After I run importDump.php or mwdumper, the articles are messed up. Templates are skewed and do not display properly. What can I do?
The Wikimedia Foundation's websites use the tidy library, which corrects improperly formatted HTML produced by the MediaWiki software. Many of these problems are not necessarily the fault of the MediaWiki software package or its supporting logic; they are caused by users of Wikipedia inputting poorly designed templates and other errors in the content itself. Since it is almost impossible to correct these data elements in the XML dumps, a program which corrects the HTML output rendered by MediaWiki solves a lot of these problems. Make certain you have tidy installed on your system, then add the following settings to the LocalSettings.php file for your MediaWiki installation:
$wgUseTidy = true;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/extensions/tidy/tidy.conf';
Set up your tidy.conf file to match these settings:
<MediaWiki root>/extensions/tidy/tidy.conf

show-body-only: yes
force-output: yes
tidy-mark: no
wrap: 0
wrap-attributes: no
literal-attributes: yes
output-xhtml: yes
numeric-entities: yes
enclose-text: yes
enclose-block-text: yes
quiet: yes
quote-nbsp: yes
fix-backslash: no
fix-uri: no
Enabling tidy will clean up a lot of these errors. When enabled, MediaWiki will pipe its HTML output through tidy and correct most of the formatting errors in the HTML produced from the XML dumps.
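If you want to verify your tidy configuration before wiring it into MediaWiki, you can run tidy by hand against a saved page from your wiki; the input and output file names here are just examples:
tidy -config extensions/tidy/tidy.conf saved_article.html > cleaned_article.html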
Some useful links for using tidy lib are located here:
- html tidy (http://tidy.sf.net) configuration
- tidy - validate, correct, and pretty-print HTML files
- see: man 1 tidy, http://tidy.sourceforge.net/docs/quickref.html
My images are not rendering properly or do not show up at all in the article. SVG files don't render at all and after I run rebuildImages.php, my png files were misidentified as "text/plain" MIME types and now they appear incomplete and have what appear to be image errors. What can I do to fix it?
You must run rebuildImages.php if you have externally copied images into the /images directory under your MediaWiki installation. This is required in order for MediaWiki to place database entries for the images into the master database and resync them with articles. There are two methods of invoking this script:
Rebuild the entire image table for MediaWiki:
php maintenance/rebuildImages.php >& images.log &
Rebuild the missing images for the Image Table for MediaWiki:
php maintenance/rebuildImages.php --missing >& images.log &
You should run the program in both modes after copying images into the /images directory to sync up the database. MediaWiki 1.9.3 has some problems with identifying MIME types properly on some Linux distributions. If you run into these problems, add this line to your LocalSettings.php to direct MediaWiki to use an alternate method of MIME detection for your system:
$wgMimeDetectorCommand = "file -bi";
Please note that if the Unix file -bi command encounters corrupted images it will report "very short file" errors and cause rebuildImages.php to halt execution. Please check your images.log file if you encounter this error, delete the file in question, and then restart the rebuildImages.php script. The file -bi command does not have a silent mode, and MediaWiki will halt on errors where it cannot determine the MIME type for a particular image file.
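To see what MIME type the file command reports for a suspect image (useful when diagnosing the misidentification described above), you can run it directly; the path is just an example from a typical /images tree:
file -bi images/a/ab/Example_image.png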
At present, SVG rendering with rsvg does not work for all of the SVG files referenced by the Wikipedia dumps in MediaWiki 1.9.3 on several Linux distributions, and there is no known fix for these problems. SVG+XML image files appear to render correctly in most cases.
How do I get the system statistics to show up properly after I import the enwiki XML files with importDump.php or mwdumper? The import completed successfully, but now the site says it only has 330 articles. What can I do?
Run the initStats.php program after importDump.php completes. This will clear out the stats and update them with correct information:
php maintenance/initStats.php --update
There are several options to initStats.php depending on which MediaWiki version you are currently running. You can open the file initStats.php with a text editor and go to the bottom of the file for usage information.
I want to improve performance for clients reading from my wiki. When I set up the site and allow people on the Web to access articles, performance is very slow. What can I do to get better performance?
- Install eAccelerator or another PHP accelerator to compile the PHP scripts in real time
- Enable MediaWiki's file caching to store generated HTML files on disk, so clients accessing your site do not always have to execute the MediaWiki PHP code to recreate what are essentially static articles.
MediaWiki will use a local directory on your server to create a caching system very similar to that provided by the squid web proxy cache: articles are cached after they have been rendered into HTML, and clients read the HTML files directly rather than always invoking PHP code to recreate the article from wikitext (which is what the articles are written in). To enable MediaWiki caching of HTML, create a directory and set its permissions to 777 (rwxrwxrwx), then point MediaWiki to this directory and enable the caching. You may also want to delay SQL updates of articles during editing and updates from remote users to increase system performance. A sample entry in the LocalSettings.php file which enables this capability would be:
$wgUseFileCache = true;
$wgFileCacheDirectory = "$IP/cache";                      <- point to directory to cache
$wgAntiLockFlags = ALF_NO_LINK_LOCK | ALF_NO_BLOCK_LOCK;  <- delay writes to improve parallelism
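Before enabling the cache, the directory named in $wgFileCacheDirectory must exist and be writable by the web server. A minimal sketch, assuming your wiki root is /var/www/wiki (substitute your own path):
mkdir -p /var/www/wiki/cache
chmod 777 /var/www/wiki/cache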
The local file cache in MediaWiki will not verify how much disk space you have available. Make certain you have plenty of free space if you enable this: rendered HTML files for 1,400,000 English articles can be very large and consume a large amount of storage.
Common Rendering Problems Related to MediaWiki Extensions
I get these strange <math> tags all over the articles, with raw text formatting strings where mathematical formulas should be, after I run importDump.php or mwdumper. What have I done wrong?
MediaWiki uses the texvc program to render math expressions by dynamically creating PNG images for them. Make certain you have downloaded the texvc packages, or have downloaded OCaml so you can rebuild the program; it is located in the /math directory off your main MediaWiki root directory. Also make certain you have created a /tmp directory off the root and set its permissions to 777 (rwxrwxrwx) so the texvc program has workspace to render the images from articles.
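If you need to rebuild texvc yourself (OCaml required), the usual sequence from the MediaWiki root directory is roughly the following sketch; the tmp directory step mirrors the advice above:
cd math
make
cd ..
mkdir tmp
chmod 777 tmp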
I get these strange tags like expr, ref, and cite when my articles are displayed and they are filled with unsightly looking text. How do I fix this?
You need to include a minimal set of parser extensions to parse citation and reference tags. Most of the templates in Wikipedia behave a lot like actual code and processing instructions. Two extensions that are essential to render 99% of the Wikipedia articles are Cite.php and ParserFunctions.php; Wikipedia relies heavily on them. They can be downloaded from Meta or the MediaWiki site. Place them in your extensions directory and add the following lines to the end of your MediaWiki LocalSettings.php file:
require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
require_once( "$IP/extensions/Cite.php" );
require_once( "$IP/extensions/ImageMap/ImageMap.php" );
There are at present two different ImageMap extensions, one for MediaWiki 1.5 and another for MediaWiki 1.9. This example refers to the MediaWiki 1.9 version, which is required with recent Wikipedia dumps. If you include the ImageMap.php extension, you may be required to enable the php-dom module. On Fedora Core 5 and Fedora Core 6, use the YUM utility to update your system:
yum install php-dom <enter>
You may be required to upgrade both your mysql database and PHP modules to 5.1.6 in order to properly support php-dom and the ImageMap.php extension. If you have been using eaccelerator or another PHP compiler or accelerator, you will also need to update this module as well. On FC5, the command to update is:
yum install php-eaccelerator <enter>
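To confirm that the php-dom module is actually available to PHP after installing it, you can list the loaded modules; this is a generic PHP command:
php -m | grep -i dom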
The directory structure of your /extensions directory for MediaWiki should contain the files listed below for minimal rendering of a MediaWiki site with the enwiki dumps installed via importDump.php:
[root@gadugi chr]# find extensions/
extensions/
extensions/README
extensions/Cite.php
extensions/ParserFunctions
extensions/ParserFunctions/Expr.php
extensions/ParserFunctions/ParserFunctions.php
extensions/Chr2Syl.php
extensions/Cite.i18n.php
extensions/Citation.php
extensions/HTTPRedirect.php
extensions/Purge.php
extensions/Tidy.php
extensions/tidy
extensions/tidy/tidy.conf
extensions/ImageMap/ImageMap_body.php
extensions/ImageMap/ImageMap.i18n.php
extensions/ImageMap/ImageMap.php
[root@gadugi chr]#
InterWiki Language Issues with Importing Dumps
I have imported the enwiki XML dump and enabled all of the extensions, but I keep getting red link tags for interwiki links back into Wikipedia, and they do not display correctly in the panel on the left side of my display. How do I get the interwiki links to format properly and work?
You have to update the MySQL database with the wikipedia-interwiki.sql file. The steps to do this are very simple, but you must have already installed MediaWiki successfully in order to apply the SQL script. Start mysql and issue the "use" command to select the database. In the example below, the database is called "endb"; replace this name with whatever you have called your database, and adjust the absolute path to the /maintenance directory to wherever you installed the MediaWiki software:
[root@gadugi chr]# mysql -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 48635 to server version: 5.0.18

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> use endb
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> \. /wikidump/en/maintenance/wikipedia-interwiki.sql
Query OK, 215 rows affected (0.04 sec)
Records: 215  Duplicates: 0  Warnings: 0

mysql>
This will enable the interwiki links contained in the dumps. At this point, your installed enwiki dumps should closely match the Wikipedia main site provided you have followed all steps.
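Equivalently, you can load the file non-interactively from the shell; the database name and path are the same examples used above:
mysql -u root -p endb < /wikidump/en/maintenance/wikipedia-interwiki.sql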
I ran the wikipedia-interwiki.sql command that came with MediaWiki 1.9.3 and most of the interwiki language links are now rendering properly, but I am still getting a small number of unresolved interwiki links at the bottom of some of the articles. How do I fix this?
Not all of the interwiki link definitions are contained in the default wikipedia-interwiki.sql file that is provided with MediaWiki 1.9.3. New Wikipedia language sites are added to the project continuously, and you will need to obtain a more recent version of wikipedia-interwiki.sql if you encounter unresolved interwiki language codes. As of the enwiki-20070206.xml dumps, the following new prefixes have been added. Cut and paste this wikipedia-interwiki.sql file and use it to update your site if you are still seeing unresolved interwiki links.
file wikipedia-interwiki.sql
-- For convenience, here are the *in-project* interwiki prefixes
-- for Wikipedia.
REPLACE INTO /*$wgDBprefix*/interwiki (iw_prefix,iw_url,iw_local) VALUES
('aa','http://aa.wikipedia.org/wiki/$1',1), ('ab','http://ab.wikipedia.org/wiki/$1',1), ('af','http://af.wikipedia.org/wiki/$1',1), ('ak','http://ak.wikipedia.org/wiki/$1',1), ('als','http://als.wikipedia.org/wiki/$1',1),
('am','http://am.wikipedia.org/wiki/$1',1), ('ang','http://ang.wikipedia.org/wiki/$1',1), ('an','http://an.wikipedia.org/wiki/$1',1), ('arc','http://arc.wikipedia.org/wiki/$1',1), ('ar','http://ar.wikipedia.org/wiki/$1',1),
('as','http://as.wikipedia.org/wiki/$1',1), ('ast','http://ast.wikipedia.org/wiki/$1',1), ('av','http://av.wikipedia.org/wiki/$1',1), ('ay','http://ay.wikipedia.org/wiki/$1',1), ('az','http://az.wikipedia.org/wiki/$1',1),
('ba','http://ba.wikipedia.org/wiki/$1',1), ('bar','http://bar.wikipedia.org/wiki/$1',1), ('bat-smg','http://bat-smg.wikipedia.org/wiki/$1',1), ('be','http://be.wikipedia.org/wiki/$1',1), ('bg','http://bg.wikipedia.org/wiki/$1',1),
('bh','http://bh.wikipedia.org/wiki/$1',1), ('b','http://en.wikibooks.org/wiki/$1',1), ('bi','http://bi.wikipedia.org/wiki/$1',1), ('bm','http://bm.wikipedia.org/wiki/$1',1), ('bn','http://bn.wikipedia.org/wiki/$1',1),
('bo','http://bo.wikipedia.org/wiki/$1',1), ('br','http://br.wikipedia.org/wiki/$1',1), ('bs','http://bs.wikipedia.org/wiki/$1',1), ('ca','http://ca.wikipedia.org/wiki/$1',1), ('ce','http://ce.wikipedia.org/wiki/$1',1),
('ch','http://ch.wikipedia.org/wiki/$1',1), ('cho','http://cho.wikipedia.org/wiki/$1',1), ('chr','http://chr.wikipedia.org/wiki/$1',1), ('chy','http://chy.wikipedia.org/wiki/$1',1), ('co','http://co.wikipedia.org/wiki/$1',1),
('cr','http://cr.wikipedia.org/wiki/$1',1), ('csb','http://csb.wikipedia.org/wiki/$1',1), ('cs','http://cs.wikipedia.org/wiki/$1',1), ('cu','http://cu.wikipedia.org/wiki/$1',1), ('cv','http://cv.wikipedia.org/wiki/$1',1),
('cy','http://cy.wikipedia.org/wiki/$1',1), ('da','http://da.wikipedia.org/wiki/$1',1), ('de','http://de.wikipedia.org/wiki/$1',1), ('dk','http://da.wikipedia.org/wiki/$1',1), ('dv','http://dv.wikipedia.org/wiki/$1',1),
('dz','http://dz.wikipedia.org/wiki/$1',1), ('ee','http://ee.wikipedia.org/wiki/$1',1), ('el','http://el.wikipedia.org/wiki/$1',1), ('en','http://en.wikipedia.org/wiki/$1',1), ('eo','http://eo.wikipedia.org/wiki/$1',1),
('es','http://es.wikipedia.org/wiki/$1',1), ('et','http://et.wikipedia.org/wiki/$1',1), ('eu','http://eu.wikipedia.org/wiki/$1',1), ('fa','http://fa.wikipedia.org/wiki/$1',1), ('ff','http://ff.wikipedia.org/wiki/$1',1),
('fi','http://fi.wikipedia.org/wiki/$1',1), ('fj','http://fj.wikipedia.org/wiki/$1',1), ('fo','http://fo.wikipedia.org/wiki/$1',1), ('fr','http://fr.wikipedia.org/wiki/$1',1), ('frp','http://frp.wikipedia.org/wiki/$1',1),
('fur','http://fur.wikipedia.org/wiki/$1',1), ('fy','http://fy.wikipedia.org/wiki/$1',1), ('ga','http://ga.wikipedia.org/wiki/$1',1), ('gd','http://gd.wikipedia.org/wiki/$1',1), ('gl','http://gl.wikipedia.org/wiki/$1',1),
('gn','http://gn.wikipedia.org/wiki/$1',1), ('got','http://got.wikipedia.org/wiki/$1',1), ('gu','http://gu.wikipedia.org/wiki/$1',1), ('gv','http://gv.wikipedia.org/wiki/$1',1), ('ha','http://ha.wikipedia.org/wiki/$1',1),
('haw','http://haw.wikipedia.org/wiki/$1',1), ('he','http://he.wikipedia.org/wiki/$1',1), ('hi','http://hi.wikipedia.org/wiki/$1',1), ('ho','http://ho.wikipedia.org/wiki/$1',1),
('hr','http://hr.wikipedia.org/wiki/$1',1), ('ht','http://ht.wikipedia.org/wiki/$1',1), ('hu','http://hu.wikipedia.org/wiki/$1',1), ('hy','http://hy.wikipedia.org/wiki/$1',1), ('hz','http://hz.wikipedia.org/wiki/$1',1),
('ia','http://ia.wikipedia.org/wiki/$1',1), ('id','http://id.wikipedia.org/wiki/$1',1), ('ie','http://ie.wikipedia.org/wiki/$1',1), ('ig','http://ig.wikipedia.org/wiki/$1',1), ('ii','http://ii.wikipedia.org/wiki/$1',1),
('ik','http://ik.wikipedia.org/wiki/$1',1), ('ilo','http://ilo.wikipedia.org/wiki/$1',1), ('io','http://io.wikipedia.org/wiki/$1',1), ('is','http://is.wikipedia.org/wiki/$1',1), ('it','http://it.wikipedia.org/wiki/$1',1),
('iu','http://iu.wikipedia.org/wiki/$1',1), ('ja','http://ja.wikipedia.org/wiki/$1',1), ('jbo','http://jbo.wikipedia.org/wiki/$1',1), ('jv','http://jv.wikipedia.org/wiki/$1',1), ('ka','http://ka.wikipedia.org/wiki/$1',1),
('kg','http://kg.wikipedia.org/wiki/$1',1), ('ki','http://ki.wikipedia.org/wiki/$1',1), ('kj','http://kj.wikipedia.org/wiki/$1',1), ('kk','http://kk.wikipedia.org/wiki/$1',1), ('kl','http://kl.wikipedia.org/wiki/$1',1),
('km','http://km.wikipedia.org/wiki/$1',1), ('kn','http://kn.wikipedia.org/wiki/$1',1), ('ko','http://ko.wikipedia.org/wiki/$1',1), ('kr','http://kr.wikipedia.org/wiki/$1',1), ('ks','http://ks.wikipedia.org/wiki/$1',1),
('ku','http://ku.wikipedia.org/wiki/$1',1), ('kv','http://kv.wikipedia.org/wiki/$1',1), ('kw','http://kw.wikipedia.org/wiki/$1',1), ('ky','http://ky.wikipedia.org/wiki/$1',1), ('lad','http://lad.wikipedia.org/wiki/$1',1),
('la','http://la.wikipedia.org/wiki/$1',1), ('lb','http://lb.wikipedia.org/wiki/$1',1), ('lg','http://lg.wikipedia.org/wiki/$1',1), ('li','http://li.wikipedia.org/wiki/$1',1), ('lmo','http://lmo.wikipedia.org/wiki/$1',1),
('ln','http://ln.wikipedia.org/wiki/$1',1), ('lo','http://lo.wikipedia.org/wiki/$1',1), ('lt','http://lt.wikipedia.org/wiki/$1',1), ('lv','http://lv.wikipedia.org/wiki/$1',1), ('meta','http://meta.wikimedia.org/wiki/$1',1),
('mg','http://mg.wikipedia.org/wiki/$1',1), ('mh','http://mh.wikipedia.org/wiki/$1',1), ('m','http://meta.wikimedia.org/wiki/$1',1), ('mi','http://mi.wikipedia.org/wiki/$1',1), ('minnan','http://zh-min-nan.wikipedia.org/wiki/$1',1),
('mk','http://mk.wikipedia.org/wiki/$1',1), ('ml','http://ml.wikipedia.org/wiki/$1',1), ('mn','http://mn.wikipedia.org/wiki/$1',1), ('mo','http://mo.wikipedia.org/wiki/$1',1), ('mr','http://mr.wikipedia.org/wiki/$1',1),
('ms','http://ms.wikipedia.org/wiki/$1',1), ('mt','http://mt.wikipedia.org/wiki/$1',1), ('mus','http://mus.wikipedia.org/wiki/$1',1), ('my','http://my.wikipedia.org/wiki/$1',1), ('nah','http://nah.wikipedia.org/wiki/$1',1),
('na','http://na.wikipedia.org/wiki/$1',1), ('nap','http://nap.wikipedia.org/wiki/$1',1), ('nb','http://nb.wikipedia.org/wiki/$1',1), ('nds','http://nds.wikipedia.org/wiki/$1',1), ('nds-nl','http://nds-nl.wikipedia.org/wiki/$1',1),
('ne','http://ne.wikipedia.org/wiki/$1',1), ('ng','http://ng.wikipedia.org/wiki/$1',1), ('n','http://en.wikinews.org/wiki/$1',1), ('nl','http://nl.wikipedia.org/wiki/$1',1), ('nn','http://nn.wikipedia.org/wiki/$1',1),
('no','http://no.wikipedia.org/wiki/$1',1), ('nrm','http://nrm.wikipedia.org/wiki/$1',1), ('nv','http://nv.wikipedia.org/wiki/$1',1), ('ny','http://ny.wikipedia.org/wiki/$1',1), ('oc','http://oc.wikipedia.org/wiki/$1',1),
('om','http://om.wikipedia.org/wiki/$1',1), ('or','http://or.wikipedia.org/wiki/$1',1), ('os','http://os.wikipedia.org/wiki/$1',1),
('pa','http://pa.wikipedia.org/wiki/$1',1), ('pam','http://pam.wikipedia.org/wiki/$1',1), ('pdc','http://pdc.wikipedia.org/wiki/$1',1), ('pi','http://pi.wikipedia.org/wiki/$1',1), ('pl','http://pl.wikipedia.org/wiki/$1',1),
('ps','http://ps.wikipedia.org/wiki/$1',1), ('pt','http://pt.wikipedia.org/wiki/$1',1), ('q','http://en.wikiquote.org/wiki/$1',1), ('qu','http://qu.wikipedia.org/wiki/$1',1), ('rm','http://rm.wikipedia.org/wiki/$1',1),
('rmy','http://rmy.wikipedia.org/wiki/$1',1), ('rn','http://rn.wikipedia.org/wiki/$1',1), ('roa-rup','http://roa-rup.wikipedia.org/wiki/$1',1), ('ro','http://ro.wikipedia.org/wiki/$1',1), ('ru','http://ru.wikipedia.org/wiki/$1',1),
('rw','http://rw.wikipedia.org/wiki/$1',1), ('sa','http://sa.wikipedia.org/wiki/$1',1), ('sc','http://sc.wikipedia.org/wiki/$1',1), ('scn','http://scn.wikipedia.org/wiki/$1',1), ('sco','http://sco.wikipedia.org/wiki/$1',1),
('sd','http://sd.wikipedia.org/wiki/$1',1), ('se','http://se.wikipedia.org/wiki/$1',1), ('sep11','http://sep11.wikipedia.org/wiki/$1',1), ('sg','http://sg.wikipedia.org/wiki/$1',1), ('sh','http://sh.wikipedia.org/wiki/$1',1),
('si','http://si.wikipedia.org/wiki/$1',1), ('simple','http://simple.wikipedia.org/wiki/$1',1), ('sk','http://sk.wikipedia.org/wiki/$1',1), ('sl','http://sl.wikipedia.org/wiki/$1',1), ('sm','http://sm.wikipedia.org/wiki/$1',1),
('sn','http://sn.wikipedia.org/wiki/$1',1), ('so','http://so.wikipedia.org/wiki/$1',1), ('sq','http://sq.wikipedia.org/wiki/$1',1), ('sr','http://sr.wikipedia.org/wiki/$1',1), ('ss','http://ss.wikipedia.org/wiki/$1',1),
('st','http://st.wikipedia.org/wiki/$1',1), ('su','http://su.wikipedia.org/wiki/$1',1), ('sv','http://sv.wikipedia.org/wiki/$1',1), ('sw','http://sw.wikipedia.org/wiki/$1',1), ('ta','http://ta.wikipedia.org/wiki/$1',1),
('te','http://te.wikipedia.org/wiki/$1',1), ('tg','http://tg.wikipedia.org/wiki/$1',1), ('th','http://th.wikipedia.org/wiki/$1',1), ('ti','http://ti.wikipedia.org/wiki/$1',1), ('tk','http://tk.wikipedia.org/wiki/$1',1),
('tlh','http://tlh.wikipedia.org/wiki/$1',1), ('tl','http://tl.wikipedia.org/wiki/$1',1), ('tn','http://tn.wikipedia.org/wiki/$1',1), ('to','http://to.wikipedia.org/wiki/$1',1), ('tokipona','http://tokipona.wikipedia.org/wiki/$1',1),
('tpi','http://tpi.wikipedia.org/wiki/$1',1), ('tr','http://tr.wikipedia.org/wiki/$1',1), ('ts','http://ts.wikipedia.org/wiki/$1',1), ('tt','http://tt.wikipedia.org/wiki/$1',1), ('tum','http://tum.wikipedia.org/wiki/$1',1),
('tw','http://tw.wikipedia.org/wiki/$1',1), ('ty','http://ty.wikipedia.org/wiki/$1',1), ('ug','http://ug.wikipedia.org/wiki/$1',1), ('uk','http://uk.wikipedia.org/wiki/$1',1), ('ur','http://ur.wikipedia.org/wiki/$1',1),
('uz','http://uz.wikipedia.org/wiki/$1',1), ('ve','http://ve.wikipedia.org/wiki/$1',1), ('vi','http://vi.wikipedia.org/wiki/$1',1), ('vo','http://vo.wikipedia.org/wiki/$1',1), ('wa','http://wa.wikipedia.org/wiki/$1',1),
('w','http://en.wikipedia.org/wiki/$1',1), ('wo','http://wo.wikipedia.org/wiki/$1',1), ('xh','http://xh.wikipedia.org/wiki/$1',1), ('yi','http://yi.wikipedia.org/wiki/$1',1), ('yo','http://yo.wikipedia.org/wiki/$1',1),
('za','http://za.wikipedia.org/wiki/$1',1), ('zh-cfr','http://zh-min-nan.wikipedia.org/wiki/$1',1), ('zh-classical','http://pam.wikipedia.org/wiki/$1',1), ('zh-cn','http://zh.wikipedia.org/wiki/$1',1), ('zh','http://zh.wikipedia.org/wiki/$1',1),
('zh-min-nan','http://zh-min-nan.wikipedia.org/wiki/$1',1), ('zh-tw','http://zh.wikipedia.org/wiki/$1',1), ('zh-yue','http://zh-yue.wikipedia.org/wiki/$1',1), ('zu','http://zu.wikipedia.org/wiki/$1',1);
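To verify that the prefixes were loaded, you can count the rows in the interwiki table. This assumes you are not using a table prefix; adjust the table name if $wgDBprefix is set, and use your own database name in place of endb:
mysql -u root -p endb -e "SELECT COUNT(*) FROM interwiki;"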
I imported the dumps from the English Wikipedia (the enwiki dumps) into my Spanish Language MediaWiki installation so I can use the articles as a basis to translate English Articles into Spanish, but now Recent Changes and several other Menu Items from the Navigation Bar point to non-existent pages. What did I do wrong?
By default, the enwiki dumps from the English Wikipedia (this also applies to just about any language-specific XML dump provided by the Foundation) contain MediaWiki namespace entries which define such things as the navigation toolbar, site messages, and other MediaWiki-specific configuration settings. These entries are included in the XML dumps, by language, so that someone can import the XML dump for a target language and essentially replicate the Wikipedia site, provided the dump is imported properly into a target MediaWiki installation.
Many of the MediaWiki namespace entries are very specific to the precise language used on a particular Wikipedia project, and can cause problems if the XML dump is imported into a MediaWiki installation configured to a target language different from the source language from which the XML dump originated.
It is possible to import XML dumps compiled for one particular language into a MediaWiki site configured for an entirely different language; however, it is recommended that you use mwdumper to convert the XML file into another target XML dump with the MediaWiki-specific namespace entries stripped out. This can be accomplished with the mwdumper syntax below. In the example, enwiki-<date>.xml is the source XML file and enwiki-no_mediawiki.xml is the output XML dump with the MediaWiki namespace entries removed:
java -jar mwdumper.jar --output=file:enwiki-no_mediawiki.xml --format=xml --filter=namespace:\!NS_MEDIAWIKI enwiki-<date>.xml
Downloading Images from Wikipedia and Wikimedia Commons
Where can I download all of Wikipedia's images from?
Downloading the entire image collection from the main Wikipedia site is not a good idea. Wikipedia is one of the busiest sites on the web in terms of traffic, and mass image downloads are not encouraged due to bandwidth concerns. There can also be serious copyright issues and implications in doing so: the Wikimedia Foundation has permission to use certain images, and many of the fair-use images are borderline in terms of whether they can be used off Wikipedia. As of February 2007, the entire collection of images produces a compressed tar.gz file of over 213 GB (gigabytes).
Several organizations which support the Wikimedia Foundation's Wikipedia projects provide BitTorrent downloads of the enwiki-allimages-<date>.tar file; however, anyone using these images assumes full liability for misuse of the images. The Wikipedia community vigorously polices the site and removes infringing images daily, but it is always possible that some images escape this extraordinary level of vigilance and end up on the site for a short time. Torrent tracker links are listed in the "What about bittorrent?" section below.
mwdumper
mwdumper is a standalone program for filtering and converting compressed XML dumps. It can produce output as another XML dump as well as SQL statements for inserting data directly into a database in MediaWiki's 1.4 or 1.5 schema.
Future versions of mwdumper will include support for creating a database and configuring a MediaWiki installation directly, but currently it just produces raw SQL which can be piped to MySQL.
The program is written in Java and has been tested with Sun's 1.5 JRE and GNU's GCJ 4. Source is in our CVS; a precompiled .jar is available at http://download.wikimedia.org/tools/
Be sure to review the README.txt file, which is also provided; it explains the required invocation options. A friendly wiki version of the README with a few additional hints is available at http://www.mediawiki.org/wiki/MWDumper
NOTE: mwdumper is an unsuitable choice for an already-installed MediaWiki with an active database on any of the Red Hat or Fedora Core Linux distributions. Running mwdumper against an installed MediaWiki site can result in the program inaccurately reporting that it is importing pages when in fact the MySQL database is rejecting the records. Typically, the cause of the rejected MySQL commands seems to be the set-variable max_allowed_packet= setting in the MySQL /etc/my.cnf file. In order to use mwdumper, follow the steps below if you are running any Linux distribution with MySQL 5.x or above. Make certain you have first installed the php-mysql extensions for MySQL before installing MediaWiki. On Fedora Core 5 and later platforms, use the YUM utility to update your system.
mwdumper installation notes
Steps to get mwdumper to work with Ubuntu, Debian and FreeBSD 6.2
- 1. Drop the relevant tables (page, revision, text) via the mysql shell, e.g. mysql> truncate table page;. This will drop and recreate the table, and is MUCH faster than deleting the data from the tables row by row (DELETE FROM page; for example). There is no need to drop the entire database.
- 2. Install MediaWiki as you would normally
- 3. Start mwdumper to import the data, i.e.: java -jar mwdumper.jar --format=sql:1.5 <xml dump file name> | mysql -u root -p <dbname>
- 4. Get some coffee, it could take awhile, depending on the size of your imported data.
Another approach which also works on Ubuntu, Debian and FreeBSD 6.2
- 1. Drop relevant tables as in step 1 in the previous example
- 2. Install MediaWiki as you would normally
- 3. Start mwdumper to redirect the data to a flat SQL file, i.e.: java -jar mwdumper.jar --format=sql:1.5 <xml dump file name> > output_filename.sql
- 4. When that completes, pull it in with MySQL directly, i.e.: mysql <dbname> < output_filename.sql
- 5. Fill a second cup of coffee, this will take longer, and will not give you any progress output like it would if you were using mwdumper directly
You can check the progress of the import as follows, from within the mysql shell: mysql> select count(*) from page;
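If you want the page count refreshed automatically while the import runs, something like the following works on most Linux systems; the password and database name are placeholders:
watch -n 60 "mysql -u root -p<password> <dbname> -e 'SELECT COUNT(*) FROM page;'"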
Some Java optimizations that might help
There are some Java switches which you can use that may help with your import speed and performance, if you're using mwdumper (google them for details on what they do):
java -Xmx512m -Xms128m -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 -server -jar mwdumper.jar [...]
Steps to get mwdumper to work on Redhat and Fedora Core Linux Distros with mysql 5.x
- 1. Destroy any existing database: i.e. mysqladmin drop <dbname> -p (enter root db password)
- 2. Recreate the database: i.e. mysqladmin create <dbname> -p (enter root db password)
- 3. DO NOT install MediaWiki until the database is loaded. If MediaWiki is installed, mwdumper may not work.
- 4. Run the tables.sql script supplied with MediaWiki from the MediaWiki root directory: i.e. mysql -u root -p <dbname> < maintenance/tables.sql
- 5. Start mwdumper: i.e. java -jar mwdumper.jar --format=sql:1.5 <xml dump file name> | mysql -u root -p <dbname>
- 6. You will have to manually grant admin status to the WikiSysop account and run the upgrade.php script on the MediaWiki database in order to obtain WikiSysop access, or install over the previous MediaWiki installation and import the databases in order to activate the WikiSysop account. Add the following MediaWiki PHP file to your /maintenance directory by cutting and pasting the text, name the file createBcrat.php, and then recreate the WikiSysop account by executing the example commands provided. The attached PHP file is for the MediaWiki 1.9.3 release.
file createBcrat.php
<?php
/**
 * Maintenance script to create an account and grant it administrator and bureaucrat group membership
 *
 * @package MediaWiki
 * @subpackage Maintenance
 * @author Rob Church <robchurch@gmail.com>
 * @author Jeff Merkey <jmerkey@wolfmountaingroup.com>
 */

require_once( 'commandLine.inc' );

if( count( $args ) != 2 ) {
	echo( "Please provide a username and password for the new account.\n" );
	die( 1 );
}

$username = $args[0];
$password = $args[1];

echo( wfWikiID() . ": Creating and promoting User:{$username}..." );

# Validate username and check it doesn't exist
$user = User::newFromName( $username );
if( !is_object( $user ) ) {
	echo( "invalid username.\n" );
	die( 1 );
} elseif( 0 != $user->idForName() ) {
	# Account already exists; just make sure it is in the right groups
	echo( "account exists.\n" );
	$user->addGroup( 'sysop' );
	$user->addGroup( 'bureaucrat' );
	echo( "done.\n" );
	die( 1 );
}

# Insert the account into the database
$user->addToDatabase();
$user->setPassword( $password );
$user->setToken();

# Promote user
$user->addGroup( 'sysop' );
$user->addGroup( 'bureaucrat' );

# Increment site_stats.ss_users
$ssu = new SiteStatsUpdate( 0, 0, 0, 0, 1 );
$ssu->doUpdate();

echo( "done.\n" );
?>
Steps to recreate WikiSysop and add the account to groups "sysop" and "bureaucrat"
From your MediaWiki root directory, enter the following commands:
php maintenance/createBcrat.php WikiSysop <password>
php maintenance/changePassword.php --user=WikiSysop --password=<password>
You may have to set the password twice in order for the account to work properly, which is why there is a call to changePassword.php after the account has been recreated and assigned sysop and bureaucrat status.
mwdumper is not the correct tool if you want to maintain an existing wiki: it may not always work correctly if the MediaWiki databases have already been installed on the Fedora Core releases, and it may not provide useful output about any errors that occur. Most of these problems are related to rejection of SQL insert requests by the underlying MySQL version you are running. You may wish to test mwdumper on your particular OS distribution with a trial run to see if you encounter any of these problems. There are several fixes for some of these issues.
Known Problems
- mwdumper will fail with ERROR 1153 (08S01) at line 2187: Got a packet bigger than 'max_allowed_packet' bytes if your dumps contain large sections of Unicode characters such as Cherokee Unicode and other Unicode texts; in most cases it does not work at all with these dumps, even with the MySQL defaults set to utf8. One solution is to increase the maximum packet (request) size which can be sent to MySQL via INSERT commands. Try changing set-variable = max_allowed_packet=20M in your /etc/my.cnf file and restart the mysqld program.
- mwdumper does not report errors when uploading to a system with a database that is not freshly created.
- mwdumper may not always complete the import, even though it reports that it has, and even if you have followed all the procedures listed here. Due to the lack of proper error handling in the program, it may be better to just run importDump.php if you encounter problems with this tool.
- if you run into problems using mwdumper to load directly into MySQL on a particular Linux distribution or version of the MySQL database, consider setting up mwdumper to convert the XML dumps into an intermediate .sql file, then import that output file directly into MySQL rather than allowing mwdumper to do so.
- try passing the '-f' (force) switch to mysql to force record insertions into your MySQL database if mysql starts rejecting updates from the mwdumper program or reports duplicate key errors, as shown in the example after this list.
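Combining the notes above, a typical forced import on these systems looks like the following; the dump file and database name are placeholders:
java -jar mwdumper.jar --format=sql:1.5 <xml dump file name> | mysql -f -u root -p <dbname>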
mwimport
mwimport is a faster alternative to mwdumper (written in Perl), but may not work in all cases.
- tested with enwiki-20070206 XML dumps. Averages 890.00 pages/second during XML to SQL processing.
NOTE: You may be required on some MediaWiki installations to use the -f switch with mysql when using mwimport. mwimport works very well with most XML dumps provided by the Foundation. If you get an error from mysql complaining about duplicate index keys (key 1 is typically the one encountered), then use the following syntax with mwimport to get around the problem:
cat enwiki-<date>.xml | mwimport | mysql -f -u <admin name> -p <database name>
The -f (force) flag will not overwrite the duplicate record in the database, but will allow the remaining articles to be imported without interrupting the import process.
bzip2
For the .bz2 files, use bzip2 to decompress. bzip2 comes standard with most Linux/Unix/Mac OS X systems these days. For Windows you may need to obtain it separately from the link below.
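For example, to decompress a dump to a plain XML file while keeping the original archive (the file name matches the piping example below):
bzip2 -dk pages_current.xml.bz2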
mwdumper can read the .bz2 files directly, but importDump.php requires piping like so:
bzip2 -dc pages_current.xml.bz2 | php importDump.php
7-Zip
For the .7z files, you can use 7-Zip or p7zip to decompress. These are available as free software:
- Windows: http://www.7-zip.org/
- Unix/Linux/Mac OS X: http://p7zip.sourceforge.net/
Something like:
7za e -so pages_current.xml.7z | php importDump.php
will expand the current pages and pipe them to the importDump.php PHP script.
Perl importing script
This is a script Tbsmith made to import only pages in certain categories. It works for MediaWiki 1.5.
Producing your own dumps
MediaWiki 1.5 and above includes a command-line maintenance script dumpBackup.php which can be used to produce XML dumps directly, with or without page history. mwdumper can be used to make filtered dumps (like pages_articles.xml); this is also built into dumpBackup.php in latest CVS.
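For example, assuming you run it from your MediaWiki root directory, dumpBackup.php can produce a current-revisions dump or a full-history dump; the output file names are placeholders:
php maintenance/dumpBackup.php --current > wiki-current.xml
php maintenance/dumpBackup.php --full > wiki-history.xml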
The program which manages our multi-database dump process is available in our source repository, but likely would require customization for use outside Wikimedia's cluster setup.
Where to go for help
If you have trouble with the dump files, you can:
- Ask in #mediawiki on irc.freenode.net (although help is not always available)
- Ask on wikitech-l on http://mail.wikimedia.org/
Alternatively, if you have a specific bug to report:
- File a bug at http://bugzilla.wikimedia.org/
For French-speaking people, see also fr:Wikipédia:Requêtes XML
What about bittorrent?
bittorrent is not currently used to distribute Wikimedia dumps... at least not officially. Of course some torrents of dumps exist. If you have started torrenting dumps, leave a note here.
- Torrentspy search -- currently showing one wikipedia and one wikipedia-fr... 00:00, 2 June 2006 (UTC)
- TorrentPortal Search -- currently showing 11 wikipedia related dumps. 00:00 3 November 2006
bittorrent is used to distribute the image archives from Wikipedia and Wikimedia Commons. Torrent access is available from the links listed below.
Wikipedia BitTorrent Image TAR's (Created and Hosted with BitTorrent)
Tracker URL
- http://www.wikigadugi.org:6969 Tracking Information
- http://www.wikigadugi.org:6969/announce Announce
- ed2k://|file|enwiki-20060925-pages-articles.xml.bz2|1827812154|22043A77B602F65CC01A33D6CE13496D|/