Questions tagged [data-dump]
For questions about the quarterly Creative Commons data dumps of all public data in the Stack Exchange network Q&A sites.
521 questions
26 votes, 1 answer, 441 views
Has Stack Exchange Inc. actually declined anyone access to future downloads of SE data dumps?
Over one year ago, Stack Exchange Inc. drastically changed their data dump process and, aside from making it a major pain to download the entire dump of the Stack Exchange network, decided to add the ...
81 votes, 0 answers, 2k views, +50 bounty
Full data dump requests - the company doesn’t seem to be holding up on its end
More than a year ago, during the initial discussions about the new per-site data dump system, I asked whether I could get a full copy of the data dumps. I followed up during the public release and I ...
75 votes, 6 answers, 3k views
Fabricated data in posts.xml for multiple/all data dumps
TL;DR: as I wrote this post in parallel with troubleshooting, I went from thinking there was an odd bug to thinking that the company is doing something intentional (or nefarious?) with the data dump again. ...
15 votes, 1 answer, 302 views
Posts.xml is empty in latest webapps.stackexchange.com.7z
$ sha256sum webapps.stackexchange.com.7z
7af2cfa857eed56f9396261b2985b387122b28f4a7fc43efc45629b20bf488c3 webapps.stackexchange.com.7z
$ 7z e -so webapps.stackexchange.com.7z Posts.xml
<?xml ...
23 votes, 1 answer, 356 views
The data dump access page is throwing a 500 server error
Attempting to access the data dump page (this is a "current" user link) throws a server error. This happens across the network, e.g. on Stack Overflow.
This has been happening for 2 days now,...
-42 votes, 2 answers, 503 views
Could SE remove the "For inquiries about using the data for LLM training, contact us." clause? It's pointless now that a US judge ruled it's fair use
Feature request: remove that clause when accessing an SE data dump:
You can access this site's data for personal use. For inquiries about using the data for large language model (LLM) training, ...
15 votes, 0 answers, 189 views
Tags starting with dot are missing from `posts.tags` column
Best demonstrated by a query that returns no results when run on e.g. Ask Ubuntu:
select * from posts where posttypeid = 1 and tags like '%<.%'
This should return, for example, the 1,080 questions ...
15 votes, 1 answer, 385 views
AI-generated Answers experiment & The Data Dump
Following the announcements of the "AI-generated Answers experiment on Stack Exchange sites that volunteered to participate", I wanted to ask if the company has already planned a way to filter out the (few?) AI-...
28 votes, 3 answers, 1k views
Posts from deleted users are missing from Data Explorer and Data Dump
I wanted to query for answers where the user account has been deleted, but the answer is still up. According to this link (https://stackoverflow.com/help/deleting-account), deleting your account will ...
5 votes, 0 answers, 168 views
Sites.xml is not present anymore in StackExchange data dumps
In recent changes more or less linked to "Shifting the data dump schedule: A proposal", I noticed that the Sites.xml file is no longer included in Stack Exchange data dumps.
The file was still present ...
10 votes, 0 answers, 226 views
What is a pre vote?
As seen in "New Vote Types in latest data dump?", some new vote types appeared (likely) exclusively in the Stack Overflow data dump. A couple of users helped find out what each one meant. However, what ...
12 votes, 1 answer, 221 views
Checksums for data dumps should be included in the dump announcement and on the dump download page itself
Following on from my previous request for current checksums, I'm looking at 2 use cases for a data dump:
- verifying if a current download for a data dump is correct
- verifying if a specific historical ...
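The first use case above, checking a downloaded archive against a known checksum, can be sketched in a few lines of Python. This is only an illustration: it assumes the reference value is a SHA-256 hex digest (the algorithm used with `sha256sum` in another question on this page), and the function names `sha256_of_file` and `matches_checksum` are hypothetical, not part of any Stack Exchange tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a reference value, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_checksum("webapps.stackexchange.com.7z", expected)` would return `True` only when the local file hashes to the reference digest; where that reference value comes from is exactly what the question asks the company to publish.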
15 votes, 2 answers, 3k views
Data Dumps - updates and bug fixes
Thanks to everyone who posted bug reports and feature requests related to the updated data dumps process. Below, we’ve detailed some work on those reports and requests.
Issues reported on this post:
...
9 votes, 1 answer, 309 views
Am I allowed to publicly reshare some JSON file containing SE data created after the introduction of the new data dump process?
I ran across some JSON (magnet link) containing SE data that was created after the introduction of the new data dump process. Am I allowed to publicly reshare it (e.g., on https://archive.org), or ...
21 votes, 3 answers, 765 views
Latest Data Dump has invalid XML and invalid characters
As I have been looking through the latest StackExchange data dump, it seems like a non-compliant XML serializer was used. There are numerous escape sequences that are simply invalid XML such as &#...