Talk:Data dumps/Archive 2
Archives This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

PAR files

I just downloaded the 5G enwiki-20070802-pages-meta-current.xml.bz2, but the md5 check (plus extraction) failed, meaning my download manager isn't working as expected with huge files. Now I'm stuck with an almost perfectly good file with probably a few bytes (up to a few KB of consecutive clusters) corrupted. Would it be possible to start adding some par2 files in the future (1% redundancy or something) so we don't have to download the huge files again? For recovering corrupted downloads under Windows I recommend QuickPar.
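For what it's worth, creating and checking such recovery data with the par2cmdline tool is straightforward; a rough sketch (filename taken from the post above, 1% redundancy) would be:

 par2 create -r1 enwiki-20070802-pages-meta-current.xml.bz2.par2 enwiki-20070802-pages-meta-current.xml.bz2   # build recovery files with 1% redundancy
 par2 verify enwiki-20070802-pages-meta-current.xml.bz2.par2                                                  # check a downloaded copy against them
 par2 repair enwiki-20070802-pages-meta-current.xml.bz2.par2                                                  # rewrite damaged blocks if enough recovery data exists

The server side would only need the create step; downloaders would grab the (comparatively tiny) .par2 files alongside the dump.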

Incorrect sizes


The dumps pages [1] say that the full history enwiki dump is very very small. They dropped from 5.1 gigs to less than 100 bytes, and have climbed up to a few dozen megabytes. What's wrong? Is it safe to try and download these for use? -- 71.231.198.197 14:18, 6 October 2006 (UTC)

The last complete dump is from 16 August 2006; all subsequent ones (including last week's) have failed.
James F. (talk) 15:26, 7 October 2006 (UTC)
Has anyone successfully extracted a full edit history dump (b7) from any wiki recently? If so, what platform/app? 213.48.182.7

Strange characters in XML dump


Hi, this could be a beginner question, so please bear with me. I got the XML dump for the English wiki, but when I looked at the file, there are strange characters like "département". When I imported the content into my site, they still show up in the article; you can see it here. When I use the Export special page on en.wikipedia.org/Special:Export and export a page, it appears OK and I do not see the strange characters. Why do the characters show up like that in the dump, and how can I get rid of them?

Thanks for your help in advance. Feel free to email me @ Admin At Wiki Quick Facts Dot Com.

I haven't worked with the full-text dumps, so this is just inference on my part, but I think what you're seeing here is a character encoding issue. Wikipedia uses UTF-8 (a Unicode mapping), so you'd want to make sure your web server (and database, and all the bits and bobs in between) are set up to use that too. ("é", your problem in the case you mention, is encoded by latin-1, but you'll still have problems if you use that setting.) Alai 00:26, 30 October 2006 (UTC)
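To make the suggestion above a little more concrete, this is the sort of thing to check on the MySQL side (the database name here is just a placeholder; the web server should also be sending a UTF-8 Content-Type header):

 -- make sure the client connection itself talks UTF-8
 SET NAMES 'utf8';
 -- and that the database default character set matches
 ALTER DATABASE wikidb CHARACTER SET utf8;

If the data was already imported through a latin-1 connection, it may need to be re-imported after fixing the settings.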

old images


Hello, why are the image-dumps from last year? I think it is an important motivation factor.

Would it be possible to get the small thumbnails as an extra file? It should be possible to separate the fair-use images from the dump. Kolossos 19:12, 29 October 2006 (UTC)

7Zip


Has anyone succeeded in unzipping a .7z file on a Windows machine? I get the following:

>7za e -so angwiki-20060803-pages-meta-history.xml.7z>angwiki-20060803-pages-meta-history1.xml
7-Zip (A) 4.20 Copyright (c) 1999-2005 Igor Pavlov 2005-05-30
Processing archive: angwiki-20060803-pages-meta-history.xml.7z
Error: angwiki-20060803-pages-meta-history.xml.7z is not supported archive

I've yet to un7zip any file from the data dumps. Bz2 seems not to be a problem. Rich Farmbrough 11:27 4 November 2006 (UTC).

Does anyone have any insight to offer on why we're producing (or trying to produce) both .7z and .bz complete history dumps in the first place? Isn't this basically just compounding the difficulty and delay in creating all the dumps? Alai 03:28, 7 November 2006 (UTC)

I suspect that .bz does not slow down the dump process, and saves a reasonable amount of disc. Then transferring to .7z makes downloading feasible. 194.72.246.245
I heard from a developer that bz slows down the whole process big time. A special distributed version of bzip was hacked by Brion Vibber to get it working at all. 7z is much more efficient in speed and file size. bz is still done because not all platforms support 7z (Windows, Linux and Mac do). Erik Zachte 12:36, 14 November 2006 (UTC)
7z dumps don't seem to have been any faster (though perhaps the server load is less?). But clearly, the slowest possibility of all is to do both, as at present. Is the gist that the longer-term masterplan is to do 7z, only? If someone were to volunteer to host bz offsite, could that be skipped, making the whole process about twice as fast at a stroke? Perhaps someone might be able to do this on the toolserver? Alai 13:27, 14 November 2006 (UTC)

I notice that the .7z dump of de: seems to be seriously stalled (12 days on, while the .bz finished in about six), and one wonders if the en: .7z isn't going the same way. I don't notice much in the way of dev or siteop input at this page: perhaps we should move this discussion to the wikitech list. Alai 13:31, 24 November 2006 (UTC)

I mailed Brion a week ago about it. en: is still running. The de: job has aborted and the file is gone. Erik Zachte 19:54, 24 November 2006 (UTC)
OK, thanks: hopefully we might yet see another dump cycle in 2006! Alai 14:58, 25 November 2006 (UTC)

Has anybody tried using 7zip PPMd compression? I find it gives better results with text. 201.213.37.39 22:07, 25 August 2007 (UTC)

Yes, bz2 to 7zip PPMd recompression on the enwiki-XXXXX-pages-meta-current files results in significant size reduction (5G to 3.5G), so I've been using this trick to burn the oversized 5G pages-meta-current images to DVD for months. I have no idea why PPMd isn't the default method and they prefer bz2 over "the text/XML master" PPMd (7z portability shouldn't be an issue for the presumably smart people who care to download these images).
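For anyone wanting to try the recompression trick described above, the 7-Zip command line accepts a compression-method switch; something along these lines (filenames assumed) selects PPMd:

 bzcat enwiki-20070802-pages-meta-current.xml.bz2 > pages-meta-current.xml   # decompress the official bz2 dump
 7za a -m0=PPMd pages-meta-current.7z pages-meta-current.xml                 # recompress with the PPMd method instead of the default LZMA

PPMd tends to do well on plain text/XML, which is presumably why the size drops from roughly 5G to 3.5G as reported above.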

Working?


It's been a couple of months since a full dump has worked. This page [2] says that the *.7z file of "all pages with complete history" is 36.2 megabytes long. How can that possibly be? -- Mikeblas 14:41, 28 October 2006 (UTC)

Not being an insider at all, I would bet that it compressed whatever garbage got stuffed into the bzip version that failed. -- ForteTuba 20:17, 10 November 2006 (UTC)


List of historical dumps by language?


Any chance that someone could modify what's going on at the download site to produce, for each language, a list of links that jump straight to the date on which that language was dumped? Something like

etc...? Walking back in time one dump at a time is fun, until one of the links is broken, and I am interested in getting a relatively early, smaller full dump for research fun. -- ForteTuba 20:21, 10 November 2006 (UTC)

-> http://download.wikimedia.org/enwiki/

Dump do-over?


I notice the dumps have started again (yay), but the en: dump seems to be stalled on pagelinks for over two days now (boo). Does anyone have an idea of the likely timescale of these being mulliganised? Perhaps an email to the tech list or to BV might be in order... Alai 14:21, 2 December 2006 (UTC)

They've been resumed now (rather than restarted, seemingly), as you all were. Alai 05:37, 3 December 2006 (UTC)

Static HTML dumps/db dump interleaving


On a slight variation of the usual "how are the dumps doing?" theme: does anyone know how the static dumps and the db dumps are being parallelised, or interleaved? Alai 21:57, 15 December 2006 (UTC)

I notice the idle period is now over a week, which I haven't noticed before, and the static progress is at "1817 of 3659", several days later. So I'm sensing we may have a bit more of a wait yet... Alai 01:37, 19 December 2006 (UTC)

enwiki


When is enwiki going to be dumped again? 66.41.167.64 22:25, 28 December 2006 (UTC)

I'd estimate that'll happen as soon as the es: dump becomes "unstuck", plus however long it takes to dump the other largish dumps that are also pending. As that includes de:, that'll be at least a week or so before en: starts, and hence several weeks before it's entirely complete, I'd think. Alai 00:28, 5 January 2007 (UTC)

eswiki dump stuck


The eswiki dump seems to be stuck. --84.239.165.234 15:21, 30 December 2006 (UTC)

importDump problem


When piping bzcat output of the 20061130 pages-articles.xml.bz2 into importDump.php, an error stopped processing:

63400 (1.49277246383 pages/sec 1.49277246383 revs/sec)
XML import parse failure at line 6033397, col 39 (byte 515710976; ""): Invalid document end

The MD5sum for the downloaded file is correct. (SEWilco 21:30, 5 January 2007 (UTC))
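A couple of quick checks (filename as above) can at least tell whether the archive itself is intact and what the importer choked on:

 bzip2 -tvv enwiki-20061130-pages-articles.xml.bz2                        # test every compressed block for CRC errors
 bzcat enwiki-20061130-pages-articles.xml.bz2 | sed -n '6033390,6033400p' # show the XML around the line the parser reported

If the archive tests clean and the XML around that line looks well formed, the problem is more likely in the importer (or in the pipe being cut short) than in the download.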

Dump server capacity increase


Wiki dumps are very useful for maintenance. Does anyone know if there are plans to increase server capacity to do dumps at least once a month? The English dump has not happened since November 30, 2006, and it seems it will not be done for several more weeks per Alai above. --- Skapur 04:19, 8 January 2007 (UTC)

I get the feeling that we're talking to ourselves a little on this page, and probably we need to contact a dev or two, via other channels. (Maybe on the techie mailing list?) It appears to me, though, that if the dumps happened "without incident", or if problems were addressed with any necessary human intervention on the order of days, rather than weeks, the whole cycle would indeed be about a month, give or take. Now of course, if there happens to be the means of reducing the hardware bottleneck, and making them inherently faster and more frequent, even better! Alai 04:48, 8 January 2007 (UTC)
Well, I asked, and here's a reply from Brion Vibber. Looks like things should be looking up in the medium term, but we still have some waiting to do in the meantime. Alai 10:15, 9 January 2007 (UTC)

pagelinks.sql


I cannot seem to get pagelinks.sql to successfully import. Has anyone successfully imported the pagelinks table into MySQL? page.sql and categorylinks.sql import quite quickly (~30 minutes), but with pagelinks.sql the disk starts thrashing and CPU use drops to about 1 or 2%. Coming up on 9 hours now...

Importing with:

mysql -u username -p schema < enwiki-20061130-pagelinks.sql

My my.ini looks like:

[mysqld]

# Port number to use for connections.
port=3306

# Path to installation directory. All paths are usually resolved relative to this.
basedir=C:/Program Files/MySQL/MySQL Server 5.0

# Path to the database root
datadir=D:/MySQL

# If no specific storage engine/table type is defined in an SQL-Create statement the default type will be used.
default-storage-engine=innodb

# The bigger you set this the less disk I/O is needed to access data in tables. On a dedicated database server you may set this parameter up to 80% of the machine physical memory size. Do not set it too large, though, because competition of the physical memory may cause paging in the operating system.
innodb_buffer_pool_size=1000M

# Size of a memory pool InnoDB uses to store data dictionary information and other internal data structures. A sensible value for this might be 2M, but the more tables you have in your application the more you will need to allocate here. If InnoDB runs out of memory in this pool, it will start to allocate memory from the operating system, and write warning messages to the MySQL error log.
innodb_additional_mem_pool_size=8M

# Paths to individual datafiles and their sizes.
innodb_data_file_path=ibdata1:1000M:autoextend:max:10000M

# How much to increase datafile size by when full.
innodb_autoextend_increment=512M

# The common part of the directory path for all InnoDB datafiles. Leave this empty if you want to split the data files onto different drives.
innodb_data_home_dir=D:/MySQL/InnoDB

# Directory path to InnoDB log files.
innodb_log_group_home_dir=D:/MySQL/InnoDB

# Size of each log file in a log group in megabytes. Sensible values range from 1M to 1/n-th of the size of the buffer pool specified below, where n is the number of log files in the group. The larger the value, the less checkpoint flush activity is needed in the buffer pool, saving disk I/O. But larger log files also mean that recovery will be slower in case of a crash. The combined size of log files must be less than 4 GB on 32-bit computers. The default is 5M.
innodb_log_file_size=200M

# The size of the buffer which InnoDB uses to write log to the log files on disk. Sensible values range from 1M to 8M. A big log buffer allows large transactions to run without a need to write the log to disk until the transaction commit. Thus, if you have big transactions, making the log buffer big will save disk I/O.
innodb_log_buffer_size=8M

# Specifies when log files are flushed to disk.
innodb_flush_log_at_trx_commit=0

# Helps in performance tuning in heavily concurrent environments.
innodb_thread_concurrency=4

# Max packet length to send/receive from the server.
max_allowed_packet=128M

[mysql]

# Max packet length to send/receive from the client.
max_allowed_packet=128M

Any ideas?

I had huge problems with this too; eventually I gave up trying to build it as a single table, and instead split it up into several dozen (if memory serves) separate tables. I think I also changed the storage engine from InnoDB to MyISAM. That way I found the performance didn't degrade as quickly, though I admit it was all entirely ad hoc, and it still took a horrendously long time in total. Alai 05:35, 14 January 2007 (UTC)

HOWTO quickly import pagelinks.sql

This and this led me to this procedure to quickly import the pagelinks table. It forces MySQL to think it has a corrupt database and "re"creates the indices much quicker than the regular process.

  1. Make a copy of the table declaration at the top of pagelinks.sql (but change the table type to MyISAM and change the table name to "pagelinks_alt") and save it as pagelinks_alt.sql
  2. Edit pagelinks.sql to change the table type to MyISAM and comment out the two keys in the table declaration
  3. Execute pagelinks.sql
  4. Execute pagelinks_alt.sql
  5. Where your database is stored, copy pagelinks_alt.MYI to pagelinks.MYI and pagelinks_alt.frm to pagelinks.frm (DON'T copy the .MYD file or you will delete all the data)
  6. Execute "FLUSH TABLES;"
  7. Execute "REPAIR TABLE 'pagelinks';"
  8. Execute "DROP TABLE 'pagelinks_alt';"

On a Pentium 4 2.8 GHz machine with 1 GB of RAM, myisam_sort_buffer_size = 500M, myisam_max_sort_file_size = 20G, and key_buffer = 400M (I'm not sure if these are the best settings):

  • Time to create table with index: hours and hours and hours

vs

  • Time to create table: 38 minutes
  • Time to "repair" table to add the index: 82 minutes.


If someone know of a better place to put this, by all means, go ahead! --Bkkbrad 21:53, 4 February 2007 (UTC)

A slightly quicker method is to first create the table structures and drop all unnecessary indices, then create a user with permission only for data manipulation, i.e.:
 CREATE USER dataentry;
 REVOKE ALL PRIVILEGES, GRANT OPTION FROM dataentry;
 GRANT SELECT,INSERT ON *.* TO dataentry;
Now log on as this user and run the downloaded .sql files in as per usual; the DDL instructions to drop tables and create indices are all denied, and the data loads quickly into your pre-created indexless tables. When done, log back in as your usual self and add back any of the indices you want.
- Topbanana 17:32, 18 June 2007 (UTC)
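A minimal example of loading one of the dump files through such a restricted account (database name assumed) would be:

 mysql --force -u dataentry wikidb < enwiki-20061130-pagelinks.sql   # --force keeps the client going past the denied DROP/CREATE statements

Without --force, the mysql client stops at the first permission error instead of carrying on to the INSERTs.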

Request for clarification

"and deleted pages which we cannot reproduce may still be present in the raw internal database blobs." -- I didn't understand this sentence. "cannot reproduce" because the data is missing/corrupt or because it cannot be legally reproduced/transmitted?


As it is not very clear from the current article, I'll post a piece of one of the references: "The workaround which I found so far is really ugly, however I've seen users using it with good success. You can create table of the same structure without keys, load data into it to get correct .MYD, Create table with all keys defined and copy over .frm and .MYI files from it, followed by FLUSH TABLES. Now you can use REPAIR TABLE to rebuild all keys by sort, including UNIQUE keys."

Official BitTorrent?

Why no torrents? It would really help to transfer 74.5 GB without loading the server bandwidth...

Redirect table?


Is the redirect table going to be dumped? Or do I have to import all the pages and rebuild the links?--Bkkbrad 16:26, 16 March 2007 (UTC)

enwiki dumps failing?


There is an enwiki dump currently processing at the moment, but the previous one failed a number of jobs: [3]

In particular:

2007-04-02 14:05:27 failed Articles, templates, image descriptions, and primary meta-pages. This contains current versions of article content, and is the archive most mirror sites will probably want. pages-articles.xml.bz2

I can't seem to find any older dumps either.

67.183.26.86 05:47, 4 April 2007 (UTC)

There was a complete dump at [4], but it seems to have been moved.

I'm concerned, since there is now no pagelinks.sql.gz available for enwiki, since it failed on the latest two dumps. --Bkkbrad 17:16, 18 April 2007 (UTC)

I also find that odd -- if something fails, shouldn't it be repaired? There do not seem to be any dumps available for quite long stretches of time.

Any info from admins on why the dumps are failing? 68.197.164.179 15:17, 4 June 2007 (UTC)

Unfortunately it requires a (shell-access) dev, not just an admin, and they don't seem to hang out here much. You might do better asking at wikitech-l. Alai 15:16, 9 June 2007 (UTC)

Namespaces not recognized in import


Hi, I am trying to import dawiki-20070417-pages-articles.xml into MediaWiki version 1.9.3 on a PC running Linux. After configuration of MediaWiki, I use these commands:

$ mysql -u username -p dawiki200704 < dawiki-20070417-interwiki.sql
$ php maintenance/importDump.php < dawiki-20070417-pages-articles.xml

Everything seems to be OK except for one thing. The Wikipedia and Portal namespaces (and the corresponding talk namespaces) are not recognized, so pages using these namespaces are placed in the article namespace with titles starting with "Wikipedia:" and "Portal:". The page database table looks like this:

mysql> select page_id, page_namespace, page_title from page where page_title like '%:%' limit 10;
+---------+----------------+-----------------------------------+
| page_id | page_namespace | page_title |
+---------+----------------+-----------------------------------+
| 339 | 0 | Portal:Økologi |
| 416 | 0 | Wikipedia:Administratorer |
| 7 | 0 | Wikipedia:Alfabetisk_liste |
| 194 | 0 | Wikipedia:Ambassaden |
| 400 | 0 | Wikipedia:Bekendtgørelser |
| 478 | 0 | Wikipedia:Busholdepladsen |
| 394 | 0 | Wikipedia:Fejlrapporter |
| 191 | 0 | Wikipedia:Hvad_bruges_en_wiki_til |
| 450 | 0 | Wikipedia:Landsbybrønden |
| 387 | 0 | Wikipedia:LanguageDa.php |
+---------+----------------+-----------------------------------+
10 rows in set (0.00 sec)

How do I tell MediaWiki about these namespaces? Thank you in advance for any help. Byrial 14:57, 19 April 2007 (UTC)

Update: I found this manual page with descriptions of the variables $wgMetaNamespace, $wgMetaNamespaceTalk and $wgExtraNamespaces. Therefore I have put the following code into my LocalSettings.php file:

/**
 * Name of the project namespace. If left set to false, $wgSitename will be
 * used instead.
 */
$wgMetaNamespace = 'Wikipedia';
/**
 * Name of the project talk namespace. If left set to false, a name derived
 * from the name of the project namespace will be used.
 */
$wgMetaNamespaceTalk ='Wikipedia-diskussion';
/**
 * Additional namespaces. If the namespaces defined in Language.php and
 * Namespace.php are insufficient, you can create new ones here, for example,
 * to import Help files in other languages.
 * PLEASE NOTE: Once you delete a namespace, the pages in that namespace will
 * no longer be accessible. If you rename it, then you can access them through
 * the new namespace name.
 *
 * Custom namespaces should start at 100 to avoid conflicting with standard
 * namespaces, and should always follow the even/odd main/talk pattern.
 */
$wgExtraNamespaces =
 array(100 => "Portal",
 101 => "Portaldiskussion",
 );
However it does not seem to change anything. The Wikipedia and Portal namespaces are still not recognized. Any ideas what is wrong? Thanks, Byrial 16:50, 19 April 2007 (UTC)
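If it helps anyone in the same situation: defining the namespaces only affects how titles are parsed from then on, so pages already imported into the main namespace with a "Wikipedia:" or "Portal:" prefix need to be reassigned afterwards. MediaWiki ships a maintenance script for this case (check that your version includes it):

 php maintenance/namespaceDupes.php --fix   # moves existing pages whose titles collide with a now-defined namespace

Running the import again after the namespaces are defined in LocalSettings.php should also place new pages correctly.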

Mysql error 1064

 When I import a .sql edition of enwiki-20060219-pages-meta-history.xml.7z into MySQL, I get the following error:
 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'odo]] 
','utf-8'); INSERT INTO text (old_id,old_text,old_fla (1064) - odo]]
','utf-8');
Does anyone know how to solve it?

enwiki dump truncated?


The uncompressed size of enwiki-20070402-pages-meta-history.xml.bz2 is 1,763,048,493,749 bytes (about 1.8 TB), whereas the uncompressed size of enwiki-20070716-pages-meta-history.xml.bz2 is 96,311,265,169 bytes (about 96 GB). Roughly 95% of the dump seems to be missing.
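For anyone wanting to reproduce the numbers above without writing over a terabyte to disk, the uncompressed size can be measured by streaming the archive straight into wc:

 bzcat enwiki-20070716-pages-meta-history.xml.bz2 | wc -c   # prints the uncompressed byte count without extracting to disk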

Hey gang. enwiki-20070716-pages-meta-history.xml.bz2 is still the most recent complete dump. Any chance someone can look into this? This query is months old.
HELLO!? It has been three months since the original query. enwiki-20070716-pages-meta-history.xml.bz2 is still the most recent dump, therefore the question of its completeness still stands. Can someone PLEASE answer: Is enwiki-20070716-pages-meta-history.xml.bz2 truncated (yes/no/we don't know)???
What the hell is the point of this forum if absolutely no one reads it? Is Wikipedia to be taken seriously? Or is it just a joke??
See notice at top of this page: "Note that this page is not at all monitored by anyone who can solve your problems. Attempts to get any issue with the wiki dumps resolved through this forum is futile." You just did not get the purpose of a talk page. There are different communication channels for direct Q&A. Anyway, despite your swearing, a short answer: no valid English dump has been produced for a long time. The situation has escalated to foundation level, and corrective actions are planned. Erik Zachte 12:07, 10 October 2007 (UTC)
I am very familiar with the notice at the top of the page. downloads.wikimedia.org says this is the forum to turn to for issues. As for my "swearing," none of this would have been necessary had the powers that be monitored the forum they established as the place to discuss these issues. The fact that it took several months to get even this answer is very frustrating.
The notice at the top of the page, which I've just reverted to what I think is a more useful and temperately-worded form that I edited it to earlier, is part of the same exercise in complaint by the above user. It does look as if the devs don't monitor this page (or else are silently indifferent to our ant-like struggles here), but if that's the case it merely makes upping the rhetorical ante all the more pointless. If you need dev assistance, you'll probably do better at wikitech-l. It's clearly incorrect to claim that all queries and requests for assistance here are "futile": witness the ones that have already been answered. Alai 19:25, 12 October 2007 (UTC)
There are still pending, unanswered queries on this page. You can remove the notice that I put at the top of this page if and only if you answer all pending questions. INSTEAD OF COVERING UP THE PROBLEMS, ADDRESS THEM!
I can remove the notice you quite inappropriately placed any time I like. In fact, what I did was not to remove it, but make it more civil, accurate, and helpful. Seems like a win-win-win to me. Alai 19:29, 22 October 2007 (UTC)
Are you paying us? No, then shut up. Erik Zachte 23:04, 14 October 2007 (UTC)
Well, strictly speaking, there is an argument that the many-months-long lack of any viable database dump is a GFDL violation. I'm sure you don't care, and even more certain that nothing will be done about it, but there it is. -- 66.80.15.66 22:09, 6 November 2007 (UTC)
I don't see how that could be argued at all, but then again, IANAL. In any case, bear in mind that "distributions" still exist in the form of a) the Wiki, and b) every other dump up to and including pages-meta-current.xml.bz2. Is there any licensing obligation to do more than that? Anyhoo, fingers crossed for a full and successful dump on or before 20th December... Alai 15:20, 8 November 2007 (UTC)
Oh well. So much for that idea. It seems a little pointless that on the one hand, these don't succeed, and on the other, that they spend about a month just to fail, making the rest of the cycle that much slower. I'll ask on wikitech-l if there's likely to be any improvement, in either direction. Alai 20:01, 18 November 2007 (UTC)
The December one looks like it is heading for failure too. Pages-articles has already failed. I attempted to post two messages on wikitech-l asking if this is the only viable method used for backing up enwiki/Wikipedia in general, but they both got deleted. Perhaps someone would like to state what the problem is, and what is being done to fix it, as I'd like to help out if at all possible. I think this process needs to be more transparent rather than refusing to talk about backup failures.--Alun Liggins

Wikipedia is a collection of GFDL documents. Each revision of an article is a separate GFDL document, licensed by its creator (editor). Failure to provide a machine-readable copy of the database, including revisions, would seem to violate both the spirit and letter of the GFDL, especially since there appears to have been deliberate removal of old database dumps. Wikipedia itself is not a valid copy of the collection, since you would surely shut down anyone who tried to make a copy of each article revision through the normal Wikipedia interface. While the failure of complete database dumps since -- when was it, April? -- can perhaps be explained by technical problems or incompetence, the removal of old dumps is surely a violation of the "maintain a transparent copy" requirement. -- 207.118.8.27 23:52, 22 December 2007 (UTC)

Whoa! Those previous comments aren't overly helpful (whoever you are...). I would just like to see this issue fixed, and would like to help if I can as I personally use the dump files and wouldn't want to see wikipedia fail. Remember it is a funded venture rather than a business. Still, on a lighter note, Citizendium has got its XML dumps sorted out, they seem much more open about things like this. Liggs42 --77.96.111.181 00:39, 23 December 2007 (UTC)

Deltas


For people who download new dumps every month as they are produced, it would be useful to produce diffs, so that they only need to download a small file with the differences instead of the whole 5GB thing again. I'm now trying to make some xdelta3 diffs, to see how much smaller it gets. Normal text-based diff doesn't work on the 12GB file (I think it's trying to allocate insane amounts of RAM).
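For reference, an xdelta3 round trip of the kind described would look roughly like this (filenames are placeholders):

 xdelta3 -e -s pages-articles-old.xml pages-articles-new.xml pages-articles.xd3    # encode a delta of the new dump against the previous one
 xdelta3 -d -s pages-articles-old.xml pages-articles.xd3 pages-articles-new.xml    # downloaders apply the delta to their old copy to regenerate the new dump

xdelta3 works on binary streams, so unlike a text diff it doesn't need to hold the whole file in RAM.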

It may be more useful seeing if articles that have been updated since the last dump can be detected, I'm guessing that the majority of the database doesn't change that much.--77.96.111.181 17:52, 23 November 2007 (UTC)

How to make the connection between deletion logs and revision for deleted pages?


Hi,

I am trying to collect some statistics about deleted pages (on ptwiki dumps) to find out how many of the deleted pages were created by unregistered users. But the logging table only provides log_title and log_namespace, while the revision table (where I can figure out who first created the page) only has rev_page (page_id) information. So to join the two tables I need information that was in the page table, but is no longer there since the pages were deleted.
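For clarity, the join being described would look something like the sketch below (standard MediaWiki schema column names); the problem is exactly that the page row disappears on deletion, so this finds nothing for deleted pages:

 SELECT l.log_title, l.log_namespace, r.rev_user_text
 FROM logging l
 JOIN page p ON p.page_namespace = l.log_namespace AND p.page_title = l.log_title
 JOIN revision r ON r.rev_page = p.page_id
 WHERE l.log_type = 'delete'
 ORDER BY r.rev_timestamp
 LIMIT 10;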

How can I figure out this info (i.e: list revisions of deleted pages)?

Is this info available in the dumps?

thanks, pt:usuário:girino. --201.24.48.174 02:01, 14 September 2007 (UTC)

Stalled again?


Stalled since the 6th, on fr: and es:? Alai 19:25, 10 December 2007 (UTC)

Reformatting this page


Section 8 of the content page, entitled "Tools", is rather big. I'm thinking of taking it out and putting it into a new page. Any objections/comments? -- Cabalamat on wp 19:31, 15 January 2008 (UTC)

I think that makes a lot of sense. Perhaps a (much shorter) summary could be kept in place here. Alai 00:14, 13 April 2008 (UTC)

Broken image (enwiki-20080103-pages-meta-current.xml.bz2)


I have downloaded http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pages-meta-current.xml.bz2, it has the same md5sum as reported in enwiki-20080103-md5sums.txt. When I decompress, I get the following error:

 [3492: huff+mtf rt+rld]
 [3493: huff+mtf rt+rld]
 [3494: huff+mtf rt+rld]
 [3495: huff+mtf data integrity (CRC) error in data

I then downloaded the file twice more from different hosts. They all have the same md5sum (9aa19d3a871071f4895431f19d674650) but they all fail at the same segment of the file.

FWIW I see the same thing. I'll mention it on wikitech-l. LevSB 02:48, 28 January 2008 (UTC)

I thought I should point this out, as well as ask for someone to check their file ('bzip2 -tvv enwiki-20080103-pages-meta-current.xml.bz2'). It would be possible to make .PAR2 files of the unbroken compressed image, and others would be able to repair their own broken image (like mine) by downloading the .PAR2 files. For Linux, parchive is available and for Windows, QuickPar does the trick nicely. Nessus 16:09, 27 January 2008 (UTC)
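Until such recovery files exist, one partial workaround (it won't repair the bad block, but it isolates the damage) is the bzip2recover tool that ships with bzip2:

 bzip2recover enwiki-20080103-pages-meta-current.xml.bz2   # splits the archive into one small rec*.bz2 file per 900k compressed block
 bzip2 -t rec*.bz2                                         # testing each piece shows exactly which block is corrupt

The good blocks can still be decompressed, so only the data inside the damaged block is actually lost.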

Main page data dump


I've used dumpBackup.php to dump all data from a wiki site which I built myself, and imported it into another one. Everything ran well except the Main page: it didn't display the dumped data!! The history of the Main page shows the dumped version, but it cannot be rolled back or undone.

Does anyone have ideas about this problem? Thanks. 140.114.134.198 08:03, 28 March 2008 (UTC)
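For reference, the round trip being described is roughly the following (run from the MediaWiki installation directory; exact script options vary by version):

 php maintenance/dumpBackup.php --full > site-backup.xml   # on the source wiki
 php maintenance/importDump.php < site-backup.xml          # on the destination wiki
 php maintenance/rebuildrecentchanges.php                  # recommended after a large import

One thing to check is whether the destination wiki already had its own Main Page revision; the imported revision then sits in the history behind the locally newer one rather than being displayed.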

Wiki markup (Extra characters) in xml dumps

Hello, hope you're doing well.

My question is: Is there any database dump (xml or sql) without the wiki markup (extra characters and formatting), references, and links in its content (like a clean page)?


Checking the xml files that contain the pages and articles (db dumps), I noticed that the text has all the wiki markup in it: special formatting such as brackets, references, links and slashes. For example, links are enclosed between double square brackets, and so on. I also see references to images, to other sites, and to other things. I'll describe what I mean with the following example: if I open the enwiki-20151002-pages-articles.xml file, I get the following formatting in its content:

Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}\n{{Anarchism sidebar}}\n\'\'\'Anarchism\'\'\' is a [[political philosophy]] that advocates [[self-governance|self-governed]] societies with voluntary institutions. These are often described as [[stateless society|stateless societies]],<ref>\"ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man\'s natural social tendencies.\" George Woodcock. \"Anarchism\" at The Encyclopedia of Philosophy</ref><ref>\"In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute themselves for the state in all its functions.\" [http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html Peter Kropotkin. \"Anarchism\" from the Encyclopædia Britannica]</ref><ref>\"Anarchism.\" The Shorter Routledge Encyclopedia of Philosophy. 2005. p. 14 \"Anarchism is the view that a society without the state, or government, is both possible and desirable.\"</ref><ref>Sheehan, Sean. Anarchism, London: Reaktion Books Ltd., 2004. p. 85</ref> but several authors have defined them more specifically as institutions based on non-[[Hierarchy|hierarchical]] [[Free association (communism and anarchism)|free associations]].<ref>\"as many anarchists have stressed, it is not government as such that they find objectionable, but the hierarchical forms of government associated with the nation state.\" Judith Suissa. \'\'Anarchism and Education: a Philosophical Perspective\'\'. Routledge. New York. 2006. p. 7</ref><ref name=\"iaf-ifa.org\"/><ref>\"That is why Anarchy, when it works and so on...

Wiki markup example


Is there a database dump without the wiki markup and references, or a way to remove them, so that I can import the xml file into MySQL as clean pages?
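As far as I know there is no markup-free dump; the wikitext is the master format, so stripping has to happen on the consumer's side. Third-party extractors (the WikiExtractor script is a commonly cited one) do a far better job than regular expressions, but as a very crude illustration of the idea, something like the following knocks out simple links and <ref> tags (it will not handle templates, tables or nesting, and the output filename is just a placeholder):

 bzcat enwiki-20151002-pages-articles.xml.bz2 \
   | sed -e 's/\[\[\([^]|]*|\)\{0,1\}\([^]]*\)\]\]/\2/g' \
         -e 's/<ref[^>]*>[^<]*<\/ref>//g' \
         -e 's/<ref[^>]*\/>//g' > pages-articles-plain.xml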

Thank you very much for your help
