Data repositories for long-tail science: setting the scene?
I’m assuming we all believe that we need data repositories for science; that there are about 10 different reasons why (not all consistent); and that many of us (including communities I am in) are starting to build them. So what should they be like?
I’m talking here and in the future about long-tail science – where there are tens of thousands of separate laboratories with no infrastructural coordination and with a huge variety of interest, support, antagonism, suspicion, excitement and boredom about data. A 6-10 person lab which analyses living organisms, or dead organisms, or properties of materials, or makes chemicals, or photographs meteorites, or correlates economic indicators, or runs climate models, or studies nanotubes or... Where the individuals may work together on one or more projects, or where each worker has a separate project. Where there is a mix of postdocs, technical staff, central support, graduate students, interns, undergraduates, visiting staff, public projects, private projects and commercially exploitable material. With 3-month projects, 3-year projects, 3-decade projects. With ground-breaking new science, minor tweaks to existing science, application of science to make new things or develop new processes. Not usually all in the same lab. But often in the same department.
Indeed almost as diverse and heterogeneous as you can imagine. The common theme is that people create new raw stuff through lab experiments, field observations, computer simulations, or analysing existing data or text. We’ll call the non-material part (i.e. the bits and bytes, not the atoms or photons) of the raw stuff “data” (a singular noun). This data is very important to each researcher, but they have generally had no formal training in how to manage it. They base their “data management policy” on software already on their machine, what their neighbours suggest, what they see on the public web, and what the student last year submitted for their thesis.
And unfortunately data is often very complicated. So generic data management systems are likely to be either very abstract, or very complicated, or both.
So here I’ll try to suggest some views as to how long-tail scientists regard data... I’ll also treat it as a 3-4 year problem – the length of a typical PhD.
- At the start, data is not an issue. In undergraduate work the environment of the experiment has been designed for you, so that you only capture and record a very small amount of stuff. In many PhDs you aren’t expected to start collecting data at this stage. You are meant to be reading and thinking.
- You read the literature. In the literature data is a second-class citizen. It’s hidden away, never published. Maybe you read some theses from last year. They have a bit more data in them. Usually tables. But it’s still something that you read, rather than interact with. There are also graphs and photographs of stuff. They are self-consistent and make sense (they have to make sense to the examiners).
- You learn how to use the equipment, or grow the bugs, or grow the crystals or collect fruit flies or photograph mating bats or whatever. Sometimes this is fun; sometimes it doesn’t work. You’ve been trained to have a lab book (a blue hard-covered one with 128 numbered pages and “University of Hogwarts” on each page). You’ve been trained (perhaps) to write down your experiment plan. This is generally required if you work with stuff which has legal or mortal consequences if you do it wrong. Hydrogen peroxide can be used for homeland insecurity. In some cases someone has to sign off what you say you are going to do.
- Now you do your experiment. You write down – in ballpoint – the date. Then what you are doing, and what happened. And now you have got some stuff. You record it, perhaps as a photograph, perhaps as a spectrum, perhaps in a spreadsheet if it changes with time. Your first data. By this time you are well into your PhD. You’re computer-literate so you have it as a file. But you also have to record it in your lab-book. So, easy – you print it out! Then get some glue and glue it into the book. Now it’s a permanent record of the experiment. [For additional fun, some glues degrade with time, so by the third year all your pasted spectra fall out. Naturally you haven’t labelled which page they were stuck to – why would you? So you have to make an educated guess as to where they used to be.]
- Oh, and that file with the spectrum in? You have to give it a name – so “first spectrum” and we’ll put it on the Desktop because then we know where to find it. At least it’s safe.
- 6 months in, and the experiments are going well. Now your desktop is pretty full, so we’ll make the icons smaller. They are called “first spectrum after second purification” and so forth. You can always remember what page in the lab book this relates to.
- A year on, and you’ve gone to the “Unseen University” to use their new entranceometer. The data is all collected and stored on their machine. You get a paper copy of each experiment for your book. There is no point in taking a copy of the data as it’s in binary and the only processing software is at UU. And you are going back next year so any problems can be sorted then.
- Two years on and you are proud of the new system you have devised. Each bit of stuff has a separate name. Something like “carbon-magnesium-high-temperature/1/aug/200atmosphere/version5”. You’ve finished this part of the study and you need a new machine. So you save your data as 2010.zip. Your shiny new machine has 5 times as much disk space so you copy the file into “old machine”/2010.zip. It’s safe.
- Three years on. Time to start thinking about writing up. The entranceometer stuff has been reported at a meeting and got quite a lot of interest. Your supervisor has started to write a paper (some supervisors write their students’ papers. Others don’t). This is good practice for the thesis. You give him the entranceometer diagrams. The paper is sent off.
- And reviewer 2 doesn’t like the diagram. They’ve used a different design of entranceometer and it plots the data on logarithmic axes. They ask you to replot.
- What to do? You have the data, so we’ll replot it. Where is it? “old machine”/something_or_other. Finally you find the zip file.
- It doesn’t unzip – “auxiliary file missing”. You have no idea what that means. So let’s mail the UU quoting the reference number on the printed diagram. After a week or so there’s no answer, so you try again. A mail arrives with a binary file: “these are the only files we could find”. You unzip it – “imaginary component of Fourier transform missing”. Basically you’re stuffed. You cannot recompute the diagram.
- Then you have a bright idea. You could replot the spectra by hand. By measuring every point on the curve you get X and Y. And they want logX. Your mate writes a Javascript tool that reads points off an image and records the X-Y values (a sketch of such a tool appears after this list). So you can digitize the spectrum by clicking each point. It only takes 2 hours per spectrum. There are 30 altogether. So you can do it in under a week if you spend most of the day working on it...
- Now this is not fun, and it’s not science and it’s against health and safety. But it will get the data measured for the paper. And you are now knackered.
- Wow – the paper got a “highly accessed”. (You can’t actually read it because you’re now visiting a lab which doesn’t subscribe to that journal. So it will have to wait till you can read your own paper.)
- And now the thesis. It’s a bit of a rush because the boss made you present the results at a conference first. But you got a job offer – assuming you finish your thesis.
...
- Help, what does this file mean: “first compound second version really latest version 3.1”? What type of data is it (it doesn’t seem to have an extension)? And should you not use “first compound (version 2) first spectrum”? You can’t find the dates because when you copied the files they all got changed to the date of copying, so they all have the same date (the second sketch after this list shows why, and how to avoid it). So you talk to the second-year PhD student: “One of the files was run after the machine software changed; which is similar to yours?” “Ah, I have only seen that type.” “Thanks, this must be the later one, I’ll use it.”
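The point-picking tool in the story is simple enough to sketch. Here is a minimal browser-Javascript version – my illustration, not the mate’s actual tool; the image name, the calibration values and the log10 column are all invented for the example:

```html
<!-- digitize.html: click points on a scanned spectrum to read off X-Y values -->
<img id="spectrum" src="spectrum.png">
<pre id="log"></pre>
<script>
// Calibration: pixel positions of the axis origin and of one known point
// on each axis, read off the scan by eye (all values here are invented).
const origin = { px: 40,  py: 460, x: 0,  y: 0   }; // where the axes cross
const xRef   = { px: 640, x: 4000 };                // a known x-axis tick
const yRef   = { py: 60,  y: 100  };                // a known y-axis tick

const points = [];
document.getElementById('spectrum').addEventListener('click', (e) => {
  // Linear interpolation from the clicked pixel to data coordinates.
  const x = origin.x + (e.offsetX - origin.px) * (xRef.x - origin.x) / (xRef.px - origin.px);
  const y = origin.y + (e.offsetY - origin.py) * (yRef.y - origin.y) / (yRef.py - origin.py);
  points.push([x, y]);
  // The reviewer wanted logarithmic axes, so record log10(x) as well
  // (points at x <= 0 would need handling before taking the log).
  document.getElementById('log').textContent +=
    `${x.toFixed(1)}\t${Math.log10(x).toFixed(3)}\t${y.toFixed(2)}\n`;
});
</script>
```

Calibrate once against two known axis ticks, then every click appends a line you can paste straight into a spreadsheet – two hours per spectrum sounds about right.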
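And the date problem in the last bullet is real: a naive copy stamps every file with the moment of copying, which is how a whole directory of spectra ends up sharing one date. A small Node.js sketch (again my illustration, not anything from the post) shows how the original times can be carried across:

```js
// copy-keep-times.js – copy a file without losing its modification date.
const fs = require('fs');

function copyPreservingTimes(src, dest) {
  fs.copyFileSync(src, dest);                // copies the contents only
  const { atime, mtime } = fs.statSync(src); // original access/modify times
  fs.utimesSync(dest, atime, mtime);         // stamp them onto the copy
}

copyPreservingTimes('old machine/2010.zip', 'new machine/2010.zip');
```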
So, as I continue to stress, the person who needs the data management plan is the researcher themselves. Forget preserving for posterity if you cannot preserve for the present. So let’s formulate principle 1:
“the people who need repositories are the people who will want to put data into them”.
The difficult word is “want”. If we can solve that we have a chance of moving forward. If we can’t we shall not succeed.
One Response to Data repositories for long-tail science: setting the scene?
Nice story Peter, I’m half tempted to turn it into a powerpoint with some pictures 😉