MTZ appendix for synchrotron metadata · project-gemmi/gemmi · Discussion #116

wojdyr
May 17, 2021
Maintainer

Problem

During a PDB deposition, in most cases the user needs to enter the details of data collection manually in a web form.

This is tedious and error-prone¹, and the web form includes only the most important items. If the software that prepares mmCIF files had access to information about data collection, it could include it in the mmCIF file and the form would be pre-filled.
Such information is available at the synchrotron, but how can it be transferred all the way down to the deposition?

Proposed solution

It's proposed here that a small file with metadata (info about the detector, collection time, processing software, etc.) be stored together with the images.

Let us use the PDBx/mmCIF format – the same format that is used for deposition to the PDB. The metadata from the synchrotron (any items⁴ that software developers in the synchrotron can provide) could be appended in a block named collection. Here is an example:

data_collection
_diffrn_source.source SYNCHROTRON
_diffrn_source.pdbx_synchrotron_site 'Light Source X'
_diffrn_source.pdbx_synchrotron_beamline 'Beamline A'
_diffrn_source.pdbx_wavelength_list 0.9792,0.9794,0.9796
_diffrn_radiation.pdbx_scattering_type x-ray
_diffrn_radiation.pdbx_diffrn_protocol MAD
_diffrn_detector.detector PIXEL
_diffrn_detector.type 'DECTRIS PILATUS 6M'
_diffrn_detector.pdbx_collection_date 2020-20-20
_diffrn.ambient_temp 295
# if possible, links to experiment info and/or data files could be useful
_synchrotron.data_info https://my.synchrotron/dataset/0123456789/

To make it more likely that the user takes this metadata, this file should be appended to the MTZ files (both unmerged and merged) produced by pipelines that are run at the synchrotron. This is possible because the MTZ format supports storing extra text at the end of the file². Here, we call this extra text an appendix.
Such text can be added even from the Unix shell:

$ cat collection.cif >> file.mtz
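
The same append can also be done from a script. The following is a minimal Python sketch, not part of any existing tool: the function name `append_appendix` is hypothetical, and the checks it makes (the file starts with the `MTZ ` signature, the text is ASCII and contains no NUL byte) are assumptions about what is worth verifying before touching the file.

```python
from pathlib import Path


def append_appendix(mtz_path, text):
    """Append `text` to the end of an MTZ file, like `cat text >> file.mtz`.

    Hypothetical helper; returns the number of bytes appended.
    """
    # Assumption: the appendix is plain ASCII (raises UnicodeEncodeError otherwise).
    data = text.encode("ascii")
    if b"\0" in data:
        raise ValueError("appendix must not contain NUL bytes")
    path = Path(mtz_path)
    with path.open("rb") as f:
        if f.read(4) != b"MTZ ":
            raise ValueError(f"{path} does not look like an MTZ file")
    # Binary append mode: nothing before the end of the file is modified.
    with path.open("ab") as f:
        f.write(data)
    return len(data)
```

Note that this sketch does not detect an existing appendix, so calling it twice appends the text twice.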

This metadata will then be used when preparing PDB deposition³.

The appendix may get lost during further processing⁵, but the software that prepares mmCIF files for deposition may still have access to the original files. In particular, that software may ask for scaled unmerged data⁶, so it is good if the appendix is also present there.

If you plan to (or already did) implement it please leave a comment below. 👋

To check if an MTZ file has an appendix you can print it with one of the following commands:

cctbx.python -c 'import iotbx.mtz; print(iotbx.mtz.object("file.mtz").xml())'
gemmi mtz --appendix file.mtz
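
If neither tool is at hand, the appendix can be pulled out with a few lines of plain Python. This is only a rough sketch under stated assumptions: the function name `read_mtz_appendix` is made up, it assumes that header records are fixed-width 80-byte lines and that the appendix starts right after the `MTZENDOFHEADERS` record, and it finds that record by a byte search rather than by following the header offset stored at the start of the file, which is what a proper reader would do.

```python
HEADER_RECORD = 80  # assumption: MTZ header records are 80-byte lines


def read_mtz_appendix(mtz_path):
    """Return the text stored after the MTZENDOFHEADERS record ('' if none).

    Hypothetical helper; reads the whole file into memory.
    """
    with open(mtz_path, "rb") as f:
        data = f.read()
    if data[:4] != b"MTZ ":
        raise ValueError("not an MTZ file")
    # Shortcut: search for the record instead of parsing the header offset.
    pos = data.find(b"MTZENDOFHEADERS")
    if pos < 0:
        raise ValueError("MTZENDOFHEADERS record not found")
    # Everything after the 80-byte record is treated as the appendix.
    return data[pos + HEADER_RECORD:].decode("ascii", errors="replace")
```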

Footnotes:

¹ For example, Synchrotron Work Patterns show many collection dates during shutdowns. It has also been observed that the alphabetically first beamlines at their synchrotrons get more credits than others.
² Extra text at the end of an MTZ file is an undocumented feature of the MTZ format, supported by the cmtz library (part of libccp4) since 2010 and by gemmi since 2021. In the cmtz code this text is referred to as the XML datablock, although it is read and written as generic text.
³ For example, in the deposition task in ccp4i2 or in the Global Phasing scripts that generate mmCIF files for deposition. In a software suite that can be run in various ways (CCP4 has 3 alternative GUIs, and the programs can still be run from pipelines or the command line), it would be best if the refinement program also transferred data from the MTZ to the coordinate mmCIF file.
⁴ Check the documentation for _diffrn_source.pdbx_synchrotron_site, _diffrn_source.pdbx_synchrotron_beamline, _diffrn_source.type and _diffrn_detector.type to see the allowed values. The PDB adds new allowed values when needed.
⁵ Almost all MX programs (including CCP4, PHENIX and Global Phasing) use the cmtz library, which reads the appendix, but only some programs pass it from the input MTZ to the output MTZ (e.g. pointless, reindex, truncate); others don't (e.g. aimless, ctruncate, cad, uniqueify). The latter programs could be modified to retain the appendix.
⁶ In April 2021 the PDB started supporting the deposition of scaled unmerged data (it was possible before, but only in a limited way). This made it even more important that synchrotron pipelines that do data scaling provide users with scaled unmerged data. Ideally, this data will then be passed to the software that prepares mmCIF files for deposition.

Replies: 4 comments 11 replies

CV-GPhL
May 20, 2021
Maintainer

Can you clarify what exactly you mean with

"MTZ format supports storing extra text at the end of a file."
"feature is undocumented, but it's supported by the cmtz library"

As you say, this is not in the official MTZ format description nor in the C API or Fortran API. At the moment I can't find it in either cmtzlib.c or cmtzlib.h - at least not with the phrase "appendix". It would be great if you could provide explicit pointers to the actual part of the CCP4 library that deals with this.

I'm mainly concerned by this part in the MTZ format specification

END of main header card
Up to 30 Character*80 lines containing history information
For multi-record files:
Batch title (Character*70) and (optionally) orientation data for each batch present in the file
End of all headers record

Does that mean that one could have fewer than 30 history records? And what about those optional orientation data blocks: could one have just a batch title without orientation data? This seems imprecise enough that I wonder if the fact that some programs seemingly pass additions at the end of an MTZ file into output - while others don't - could be due to unclear handling of the data after the END record within those programs.

Would it make sense to define a new section explicitly, e.g.

BEGIN_METADATA
FORMAT_METADATA 'PDBx/mmCIF'
<content>
END_METADATA

to (1) have section delimiters and (2) provide a format definition. The latter could allow storage of other formats - JSON, XML, CBF etc. There are probably better ways of doing this, but you get the idea.
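
To make the idea concrete, here is a small Python sketch of how such a delimited section could be written and read back. Everything in it is hypothetical: the keywords are just the ones suggested above (they are not part of cmtzlib or of any MTZ specification), and the function names `wrap_metadata` and `unwrap_metadata` are made up for illustration.

```python
def wrap_metadata(content, fmt="PDBx/mmCIF"):
    """Wrap `content` in the proposed (hypothetical) section delimiters."""
    body = content.rstrip("\n")
    return f"BEGIN_METADATA\nFORMAT_METADATA '{fmt}'\n{body}\nEND_METADATA\n"


def unwrap_metadata(text):
    """Return (format, content) of the first delimited section in `text`.

    Raises ValueError if the delimiters or the format line are missing.
    """
    lines = text.splitlines()
    start = lines.index("BEGIN_METADATA")
    end = lines.index("END_METADATA", start)
    keyword, _, value = lines[start + 1].partition(" ")
    if keyword != "FORMAT_METADATA":
        raise ValueError("missing FORMAT_METADATA line")
    content = "\n".join(lines[start + 2:end]) + "\n"
    return value.strip().strip("'"), content
```

A scheme this simple breaks if the content itself contains a line equal to `END_METADATA`, so a real definition would need an escaping rule or a length field.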

Small addition: all this obviously applies not just to synchrotron pipelines, but to any (processing) program producing MTZ files. A stable and portable method of storing additional information and/or helping with provenance tracking is valuable in all environments. :-)

2 replies

wojdyr May 20, 2021
Maintainer Author

It's called XML datablock in libccp4. Search cmtzlib.c for xml. But I suppose the Fortran API is missing.

It's stored after MTZENDOFHEADERS, so it doesn't affect history and other records.

It was proposed and added in 2010. Here is the proposal:

Date: 24 May 2010 09:34:10 +0100
From: Kevin Cowtan <cowtan@ysbl.york.ac.uk>
To: CCP4-dev <ccp4-dev@dl.ac.uk>
Here's new versions of mtzdata.h and cmtzlib.h/c, starting from Martyn's
version with the spacegroup confidence code. This version includes
column groups, column provenance, an XML datablock, and (as an
afterthought) extensibility.
The XML datablock just appears after the end of the existing headers.
It's just a string. It can't contain '\0', otherwise it can be used any
way you like.

CV-GPhL May 20, 2021
Maintainer

Thanks for the clarification: not having a Fortran API in the core CCP4 library is probably ok (one could always write that missing wrapper oneself if needed).

As far as I can see, there is nothing XML-like at all in the way cmtzlib handles any dangling lines after MTZENDOFHEADERS. So one could definitely wrap any appendix into something slightly more structured as mentioned above, right?

dagewa
Jul 8, 2021

xia2 writes scaled, unmerged mmCIF under the assumption that this will be used for deposition (see dials/dials#1457 (comment)). As such, the information is already there. Adding it to the MTZ as well implies that users won't necessarily use the mmCIF during deposition, but will take the scaled, unmerged MTZ. But why then are we writing the mmCIF?

5 replies

CV-GPhL Jul 8, 2021
Maintainer

I usually see the typical workflow as

 synchrotron/beamline/instrument
 -> raw data
 -> processed data
 -> MTZ file
 -> structure solution
 -> refinement
 -> deposition

with a distinct break between the first part (up to processed data) and the second part (from structure solution to deposition). This can be a break in location (beamline vs home lab), computer system used (synchrotron compute and storage cluster vs lab fileserver or local storage) or just access (not all data/files from original data processing are still accessible at the time the deposition is started).

Since MTZ is very often the working format of choice for reflection data, it would be great if it could provide a bridge/connection between these two parts: not necessarily by providing the same full set of meta data, but some kind of (provenance) tracking information. If (1) some identifier (DOI?) can (a) uniquely describe the first part and (b) provide access to the full set of data including the raw diffraction images, and if (2) this information could travel unaltered as part of the reflection data right down to the deposition stage, then it would allow us to hook the deposition preparation into the original data collection/processing information in the same way we might connect our refined model with a LIMS or sequence database.

mmCIF has the advantage of being a very rich and general deposition format (and well documented), while MTZ has the advantage of being highly tuned towards fast access to specific types of reflection data and a small set of essential meta data (and being well documented). Together they are quite powerful while a single one could always give problems at certain stages.

dagewa Jul 8, 2021

Thanks, I can see the point of inserting a "fingerprint" in the MTZ that matches that in the mmCIF, so that these can be matched together as coming from the same source later on, notwithstanding the breaks you describe.

My assumption is that this is the desired flowchart

 -> raw data
 -> processed data
 -> unmerged reflns mmCIF --╮
 -> MTZ file                |
 -> structure solution      |
 -> refinement              |
 -> deposition <------------╯

That is, the unmerged reflns mmCIF forms part of the deposition package rather than the information being pulled from an unmerged MTZ file.

wojdyr Jul 8, 2021
Maintainer Author

Can the mmCIF file from DIALS have all this information (beamline name, collection date, collection temperature)?

CV-GPhL Jul 8, 2021
Maintainer

We have to distinguish between (1) the information that is (or should be) present in the raw diffraction data and that the processing software/system will have access to and (2) the information that is in the data acquisition part but doesn't necessarily make it into the raw diffraction data.

For a data processing system that is triggered by (or close to) the data acquisition system there is the possibility of fetching relevant information from either. When the starting point is purely the raw diffraction data we might not have all information (but there could be a way of fetching e.g. ISPyB information via a web-based API even if processing at home). Ideally, any information that is currently not part of the raw diffraction meta data but should be present (in order to provide a unique fingerprint) would need to be added to that meta data in order to allow the same level of identification no matter where the raw diffraction data is processed ("transferability").

Maybe as a note how we've been approaching this in autoPROC over the years (after reading plenty of meta data out of a multitude of different diffraction data and image headers etc): initially we decided to collect any known deviations from some perceived standard on a public wiki, partially in the hope this was going to become obsolete with the advent of imCIF, full CBF and HDF5. However, this became a time-consuming and cumbersome task - both to us as maintainers (often detectives) and also to users. So we bit the bullet and went for an automatic system that would remove one of the aspects (annoying to users) ... while still leaving the maintenance work to us ;-) This is similar to the generate_XDS.INP tool from the XDS developers or the specific format readers in dxtbx as far as I understand them.

Anyway, we had to make a decision what would uniquely define a specific instrument. What we came up with is the combination of detector serial number and data collection date (beamline names are a fuzzy concept and temperature is often some hardwired value of 100K). And yes, some instruments don't write one or both of these. And yes, some instruments have incorrect values ... but they still seem to be the most reliable items in the raw diffraction meta data. Of course, we might be in trouble if a detector moves between beamlines and across timezones ... we'll get there once a synchrotron is built on Fiji ;-)

dagewa Jul 8, 2021

@CV-GPhL, yes your automatic system seems similar to dxtbx in that respect. DIALS (via dxtbx) will recognise images from specific beamlines around the world based on matching serial numbers etc., so we could capture some of the data acquisition metadata even after-the-fact. In practice we don't currently keep this information: after import of the images, the format class that was used to interpret them is not recorded as part of the data processing metadata. We could add this, and there are other good reasons to store the format class (such as when plugins are used that might override those supplied in the standard dxtbx library). However, this would not work everywhere, as you point out. There are no doubt many beamlines across the world that get a generic format base class rather than an instrument-specific match. For Diamond, auto-processing by xia2 will of course have access to the truth, by virtue of being triggered by data acquisition.

keitaroyam
Jul 8, 2021
Collaborator

We are planning to implement the MTZ appendix in KAMO for use at SPring-8. A complication is that we need a mechanism to track the information when merging multi-crystal data. Moreover, I don't know what to put when data from multiple beamlines are merged.

Another worry just came to my mind - is there any risk this appendix may be transferred undesirably? (e.g. when transferring test flags)

3 replies

wojdyr Jul 8, 2021
Maintainer Author

That's great news!

I don't know what to do in case of multiple beamlines. When the user manually selects beamline during deposition, I think they can select only one. And having one beamline is simpler. OTOH it's possible to have multiple values for any mmCIF category:

loop_
_diffrn_source.source 
_diffrn_source.pdbx_synchrotron_site 
_diffrn_source.pdbx_synchrotron_beamline 
_diffrn_source.pdbx_wavelength_list
SYNCHROTRON 'Light Source X' 'Beamline A' 0.9792
SYNCHROTRON 'Light Source X' 'Beamline B' 0.9999
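
A loop like the one above could be generated from per-beamline records with something along these lines. It is a simplified Python sketch, not gemmi's writer: `cif_quote` and `cif_loop` are hypothetical names, and the quoting rule is deliberately reduced (it handles single-line values only and ignores reserved words such as `loop_`).

```python
def cif_quote(value):
    """Quote a single-line value for mmCIF when needed (simplified rules)."""
    text = str(value)
    if "\n" in text:
        raise ValueError("multi-line values are not handled in this sketch")
    needs_quotes = (
        text == ""
        or any(ch.isspace() for ch in text)
        or text[0] in "_#$'\"[];"
    )
    if not needs_quotes:
        return text
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    raise ValueError(f"cannot quote {text!r} on a single line")


def cif_loop(category, tags, rows):
    """Return an mmCIF loop_ for one category, one output line per row."""
    lines = ["loop_"] + [f"_{category}.{tag}" for tag in tags]
    for row in rows:
        if len(row) != len(tags):
            raise ValueError("row length does not match number of tags")
        lines.append(" ".join(cif_quote(v) for v in row))
    return "\n".join(lines) + "\n"


tags = ["source", "pdbx_synchrotron_site",
        "pdbx_synchrotron_beamline", "pdbx_wavelength_list"]
rows = [
    ["SYNCHROTRON", "Light Source X", "Beamline A", "0.9792"],
    ["SYNCHROTRON", "Light Source X", "Beamline B", "0.9999"],
]
print(cif_loop("diffrn_source", tags, rows), end="")
```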

I'm pretty sure that the appendix wouldn't get copied together with test flags. If ccp4 cad is used, it ignores such appendices. Perhaps there are tools that would copy it, but I suppose it'd be from the file with data.

CV-GPhL Jul 8, 2021
Maintainer

If CCP4 CAD ignores such appendices it might get dropped a lot of times (CAD is such a workhorse). I would try and keep the content of the appendix as simple as possible: as long as it provides /some/ way of finding the original deposition-ready mmCIF file from processing/scaling it will already be a huge benefit.

On the other hand, if it carries many more values but we can't trust that this is accurate (because info from N>1 datasets was lost, wrong info copied through test-set flag transfer etc) we'll end up in another situation where stuff makes it into a deposition just because it populates some fields ... a bit like all those wwPDB entries that are clearly cut-n-paste jobs from a common template with all detail lost.

I'd rather have a NULL than an unreliable value. The NULL value might get corrected/replaced very easily later on while non-NULL values tend to stay forever.

wojdyr Jul 8, 2021
Maintainer Author

Clemens: the assumption here is that you have a GUI or a workflow that keeps track of the provenance of files. You start from an MTZ file from a synchrotron and this file doesn't get lost in the project.

Generally, providing a way to find metadata in another file would be more complex than directly providing metadata in a file.
Also, an "original deposition-ready mmCIF file from processing/scaling" is not possible: you need merged data for deposition.

dagewa
Jul 9, 2021

It might be useful if gemmi provided a parameterised function that would add an appendix to an MTZ. Something like

add_data_collection_appendix(site, beamline, detector, date, ...)

Then developers of data processing software who use this won't have to worry about the format of the appendix block, while you will be able to define this more easily, making use of the close relationship between gemmi and the PDB. You also then get control over what to do in corner cases, NULL values and so on.
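
As a rough illustration of what such a helper could look like (this is not an existing gemmi function; the name `data_collection_block`, its parameters and the choice of tags are assumptions taken from the example block in the first post), here is a Python sketch that writes `?`, the mmCIF marker for an unknown value, for anything the caller does not supply:

```python
import datetime


def data_collection_block(site=None, beamline=None, detector=None, date=None):
    """Return a `data_collection` mmCIF block as text (hypothetical helper).

    Arguments left as None are written as '?' (unknown) rather than guessed.
    """

    def fmt(value):
        if value is None:
            return "?"
        if isinstance(value, datetime.date):
            return value.isoformat()  # yyyy-mm-dd
        text = str(value)
        # Minimal quoting: enough for names with spaces, not a full CIF writer.
        return f"'{text}'" if " " in text else text

    items = [
        ("_diffrn_source.source", "SYNCHROTRON"),
        ("_diffrn_source.pdbx_synchrotron_site", site),
        ("_diffrn_source.pdbx_synchrotron_beamline", beamline),
        ("_diffrn_detector.type", detector),
        ("_diffrn_detector.pdbx_collection_date", date),
    ]
    lines = ["data_collection"]
    lines += [f"{tag} {fmt(value)}" for tag, value in items]
    return "\n".join(lines) + "\n"


print(data_collection_block(site="Light Source X", beamline="Beamline A",
                            date=datetime.date(2020, 2, 20)), end="")
```

Taking the date as a `datetime.date` object rather than a string means an impossible date cannot be written by accident.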

1 reply

wojdyr Jul 13, 2021
Maintainer Author

Are there other items, apart from the ones I listed in the example, that could be included?
mmCIF has tags for describing the monochromator, collimation, goniometer and other things. I'm wondering what is useful and possible to determine by the synchrotron automation software.
Possible additions are in categories radiation, source, detector, measurement.
