Primary Structure Section

The primary structure section of a PDB formatted file contains the sequence of residues in each chain of the macromolecule(s). Embedded in these records are chain identifiers and sequence numbers that allow other records to link into the sequence.

DBREF (standard format)

The DBREF record provides cross-reference links between PDB sequences (what appears in SEQRES record) and a corresponding database sequence.

Record Format

COLUMNS    DATA TYPE   FIELD       DEFINITION
-----------------------------------------------------------------------------------
 1 - 6    Record name  "DBREF "
 8 - 11    IDcode    idCode       ID code of this entry.
13      Character   chainID      Chain identifier.
15 - 18    Integer    seqBegin      Initial sequence number of the
 PDB sequence segment.
19      AChar     insertBegin    Initial insertion code of the
 PDB sequence segment.
21 - 24    Integer    seqEnd       Ending sequence number of the
 PDB sequence segment.
25      AChar     insertEnd     Ending insertion code of the
 PDB sequence segment.
27 - 32    LString    database      Sequence database name.
34 - 41    LString    dbAccession    Sequence database accession code.
43 - 54    LString    dbIdCode      Sequence database identification code.
56 - 60    Integer    dbseqBegin    Initial sequence number of the
 database segment.
61      AChar     idbnsBeg      Insertion code of initial residue of the
 segment, if PDB is the reference.
63 - 67    Integer    dbseqEnd     Ending sequence number of the
 database segment.
68      AChar     dbinsEnd     Insertion code of the ending residue of
 the segment, if PDB is the reference.

Note: By default this format is used as long as the information entered into these fields fits. For sequence databases that use longer accession code or long sequence numbering, the new DBREF1/DBREF2 format can be used.
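The column table above maps directly onto fixed-width slicing. The following is a minimal sketch, not an official parser: the `DbRef` type and the `parse_dbref` function are hypothetical names introduced here for illustration, and the code assumes the line follows the standard DBREF column layout exactly.

```python
from typing import NamedTuple


class DbRef(NamedTuple):
    """Hypothetical container for the fields of one standard DBREF record."""
    id_code: str
    chain_id: str
    seq_begin: int
    insert_begin: str
    seq_end: int
    insert_end: str
    database: str
    db_accession: str
    db_id_code: str
    db_seq_begin: int
    db_ins_begin: str
    db_seq_end: int
    db_ins_end: str


def parse_dbref(line: str) -> DbRef:
    """Slice a DBREF line by the 1-based columns in the table (0-based here)."""
    if line[0:6] != "DBREF ":
        raise ValueError("not a standard DBREF record")
    line = line.rstrip("\n").ljust(80)  # pad so short lines slice safely
    return DbRef(
        id_code=line[7:11],                  # columns  8 - 11
        chain_id=line[12],                   # column  13
        seq_begin=int(line[14:18]),          # columns 15 - 18
        insert_begin=line[18].strip(),       # column  19
        seq_end=int(line[20:24]),            # columns 21 - 24
        insert_end=line[24].strip(),         # column  25
        database=line[26:32].strip(),        # columns 27 - 32
        db_accession=line[33:41].strip(),    # columns 34 - 41
        db_id_code=line[42:54].strip(),      # columns 43 - 54
        db_seq_begin=int(line[55:60]),       # columns 56 - 60
        db_ins_begin=line[60].strip(),       # column  61
        db_seq_end=int(line[62:67]),         # columns 63 - 67
        db_ins_end=line[67].strip(),         # column  68
    )


record = parse_dbref(
    "DBREF  2JHQ A    1   226  UNP    Q9KPK8   UNG_VIBCH        1    226"
)
print(record.database, record.db_accession, record.db_seq_begin, record.db_seq_end)
```

Slicing by column rather than splitting on whitespace matters because fields such as the insertion codes may be blank, which would shift every later token.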

Details

PDB entries contain multi-chain molecules with sequences that may be wild type, variant, or synthetic. Sequences may also have been modified through site-directed mutagenesis experiments (engineered). A number of PDB entries report structures of individual domains cleaved from larger molecules.

The DBREF records present sequence correlations between PDB SEQRES records and corresponding GenBank (for nucleic acids) or UNIPROT/Norine (for proteins) entries. PDB entries containing heteropolymers are linked to different sequence database entries.

Database names and their abbreviations as used on DBREF records. https://www.rcsb.org/pdb/home/home.do

                 Database abbreviations
Database name           (columns 27 – 32)   
----------------------------------------------------------------------
GenBank                  GB
Protein Data Bank             PDB
UNIPROT                  UNP
Norine                   NORINE

wwPDB does not guarantee that all possible references to the listed databases will be provided. In most cases, only one reference to a sequence database will be provided.
If no reference is found in the sequence databases, then the PDB entry itself is given as the reference.
Selection of the appropriate sequence database entry or entries to be linked to a PDB entry is done on the basis of the sequence and its biological source. Questions on entry assignment that may arise are resolved by consultation with the database.

Verification/Validation/Value Authority Control

The sequence database entry found during PDB's search is compared to that provided by the depositor and any differences are resolved or annotated.

All polymers in the entry will be assigned a DBREF record.

Relationships to Other Record Types

DBREF represents the sequence as found in SEQRES records.

DBREF1/DBREF2 replaces DBREF when the accession codes or sequence numbering does not fit the DBREF format.

Examples

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
DBREF  2JHQ A    1   226  UNP    Q9KPK8   UNG_VIBCH        1    226

DBREF  3AKY A    1   219  UNP    P07170   KAD1_YEAST       3    221
DBREF  1HAN A    2   298  UNP    P47228   BPHC_BURCE       1    297
DBREF  3D3I A    0   760  UNP    P42592   YGJK_ECOLI      23    783
DBREF  3D3I B    0   760  UNP    P42592   YGJK_ECOLI      23    783
DBREF  3C2J A    1     8  PDB    3C2J     3C2J             1      8
DBREF  3C2J B  101   108  PDB    3C2J     3C2J           101    108
DBREF  1FFK 0    2  2923  GB     3377779  AF034620      2597   5518
DBREF  1FFK 9    1   122  GB     3377779  AF034620      5658   5779
DBREF  1UNJ X    6    11  NOR    NOR00228 NOR00228         6     11

DBREF1 / DBREF2 (added)

Details

This updated two-line format is used when the accession code or sequence numbering does not fit the space allotted in the standard DBREF format. This includes some GenBank sequence numbering (greater than 5 characters) and UNIMES accession numbers (greater than 12 characters).
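The choice between the two formats comes down to field widths. As a rough sketch under that assumption, the hypothetical helper below (`fits_standard_dbref` is not part of any wwPDB tool) checks the values against the widths given in the standard DBREF column table.

```python
def fits_standard_dbref(db_accession: str, db_id_code: str,
                        db_seq_begin: int, db_seq_end: int) -> bool:
    """Return True if all values fit the standard DBREF field widths."""
    return (
        len(db_accession) <= 8            # columns 34 - 41
        and len(db_id_code) <= 12         # columns 43 - 54
        and len(str(db_seq_begin)) <= 5   # columns 56 - 60
        and len(str(db_seq_end)) <= 5     # columns 63 - 67
    )


# A UniProt reference fits the one-line format.
print(fits_standard_dbref("P07170", "KAD1_YEAST", 3, 221))
# Seven-digit GenBank numbering needs DBREF1/DBREF2.
print(fits_standard_dbref("46197919", "AE017221", 1534489, 1537377))
```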

Record Format

DBREF1

COLUMNS    DATA TYPE  FIELD     DEFINITION
-----------------------------------------------------------------------------------
 1 - 6    Record name  "DBREF1"
 8 - 11    IDcode    idCode    ID code of this entry.
13      Character   chainID    Chain identifier.
15 - 18    Integer    seqBegin   Initial sequence number of the
 PDB sequence segment, right justified.
19      AChar     insertBegin  Initial insertion code of the
 PDB sequence segment.
21 - 24    Integer    seqEnd    Ending sequence number of the
 PDB sequence segment, right justified.
25      AChar     insertEnd   Ending insertion code of the
 PDB sequence segment.
27 - 32    LString    database   Sequence database name.
48 - 67    LString    dbIdCode   Sequence database identification code,
 left justified.

DBREF2

COLUMNS    DATA TYPE  FIELD     DEFINITION
-----------------------------------------------------------------------------------
 1 - 6    Record name  "DBREF2"
 8 - 11    IDcode    idCode    ID code of this entry.
13      Character   chainID    Chain identifier.
19 - 40    LString    dbAccession  Sequence database accession code,
 left justified.
46 - 55    Integer    seqBegin   Initial sequence number of the
 Database segment, right justified.
58 - 67    Integer    seqEnd    Ending sequence number of the
 Database segment, right justified.
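Because one cross-reference is split over two lines, a reader has to merge the pair. The sketch below assumes the two lines are adjacent and refer to the same entry and chain; `parse_dbref_pair` and the dictionary keys are hypothetical names chosen for illustration.

```python
def parse_dbref_pair(line1: str, line2: str) -> dict:
    """Merge a DBREF1 line and the DBREF2 line that follows it."""
    if not line1.startswith("DBREF1") or not line2.startswith("DBREF2"):
        raise ValueError("expected a DBREF1 line followed by a DBREF2 line")
    l1 = line1.rstrip("\n").ljust(80)  # pad so short lines slice safely
    l2 = line2.rstrip("\n").ljust(80)
    if l1[7:11] != l2[7:11] or l1[12] != l2[12]:
        raise ValueError("DBREF1 and DBREF2 refer to different entries or chains")
    return {
        "id_code": l1[7:11],                  # columns  8 - 11
        "chain_id": l1[12],                   # column  13
        "seq_begin": int(l1[14:18]),          # DBREF1 columns 15 - 18
        "seq_end": int(l1[20:24]),            # DBREF1 columns 21 - 24
        "database": l1[26:32].strip(),        # DBREF1 columns 27 - 32
        "db_id_code": l1[47:67].strip(),      # DBREF1 columns 48 - 67
        "db_accession": l2[18:40].strip(),    # DBREF2 columns 19 - 40
        "db_seq_begin": int(l2[45:55]),       # DBREF2 columns 46 - 55
        "db_seq_end": int(l2[57:67]),         # DBREF2 columns 58 - 67
    }


pair = parse_dbref_pair(
    "DBREF1 2J83 A   61   322  GB                   AE017221",
    "DBREF2 2J83 A     46197919                      1534489     1537377",
)
print(pair["database"], pair["db_accession"], pair["db_seq_begin"], pair["db_seq_end"])
```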

Details

The DBREF1/DBREF2 record presents sequence correlations between PDB SEQRES records and corresponding GenBank (for nucleic acids) or UNIMES (for proteins) entries. Several cases are easily represented by means of pointers between the databases using DBREF.
Database names and their abbreviations as used in DBREF1/DBREF2 records:

                 Database abbreviations
Database name           (columns 27 – 32)   
----------------------------------------------------------------------
GenBank                  GB
UNIMES                  UNIMES

wwPDB does not guarantee that all possible references to the listed databases will be provided. In most cases, only one reference to a sequence database will be provided.

Verification/Validation/Value Authority Control

The sequence database entry found by wwPDB staff is compared to answers provided by the depositor; any differences are resolved or annotated appropriately.

Relationships to Other Record Types

DBREF1/DBREF2 represents the sequence as found in SEQRES records.

Template

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
DBREF1 2J83 A   61   322  XXXXXX               YYYYYYYYYYYYYYYYYYYY
DBREF2 2J83 A     ZZZZZZZZZZZZZZZZZZZZZZ     nnnnnnnnnn  mmmmmmmmmm

Examples

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
DBREF1 2J83 A   61   322  UNIMES               UPI000148A153
DBREF2 2J83 A     MES00005880000                     61        322

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
DBREF1 2J83 A   61   322  GB                   AE017221
DBREF2 2J83 A     46197919                      1534489     1537377

SEQADV

Overview

The SEQADV record identifies differences between sequence information in the SEQRES records of the PDB entry and the sequence database entry given in DBREF. Please note that these records were designed to identify differences and not errors. No assumption is made as to which database contains the correct data. A comment explaining any engineered differences in the sequence between the PDB and the sequence database may also be included here.

Record Format

COLUMNS    DATA TYPE   FIELD     DEFINITION
-----------------------------------------------------------------
 1 - 6    Record name  "SEQADV"
 8 - 11    IDcode    idCode    ID code of this entry.
13 - 15    Residue name resName    Name of the PDB residue in conflict.
17      Character   chainID    PDB chain identifier.
19 - 22    Integer    seqNum    PDB sequence number.
23      AChar     iCode     PDB insertion code.
25 - 28    LString    database   Sequence database name.
30 - 38    LString    dbIdCode   Sequence database accession number.
40 - 42    Residue name dbRes     Sequence database residue name.
44 - 48    Integer    dbSeq     Sequence database sequence number.
50 - 70    LString    conflict   Conflict comment.
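As an illustration of the table above, the sketch below slices one SEQADV line by column. The function name `parse_seqadv` and the dictionary keys are hypothetical. Blank dbRes and dbSeq fields, as seen in expression-tag records, are returned as `None`; that is a design choice of this sketch, not something the format mandates.

```python
def parse_seqadv(line: str) -> dict:
    """Slice a SEQADV line by the 1-based columns in the table (0-based here)."""
    if not line.startswith("SEQADV"):
        raise ValueError("not a SEQADV record")
    line = line.rstrip("\n").ljust(80)  # pad so short lines slice safely
    db_seq = line[43:48].strip()        # columns 44 - 48, may be blank
    return {
        "id_code": line[7:11],                    # columns  8 - 11
        "res_name": line[12:15].strip(),          # columns 13 - 15
        "chain_id": line[16],                     # column  17
        "seq_num": int(line[18:22]),              # columns 19 - 22
        "i_code": line[22].strip(),               # column  23
        "database": line[24:28].strip(),          # columns 25 - 28
        "db_id_code": line[29:38].strip(),        # columns 30 - 38
        "db_res": line[39:42].strip() or None,    # columns 40 - 42
        "db_seq": int(db_seq) if db_seq else None,
        "conflict": line[49:70].strip(),          # columns 50 - 70
    }


engineered = parse_seqadv(
    "SEQADV 3ABC GLY A   50  UNP  P10725    VAL    50 ENGINEERED"
)
tag = parse_seqadv(
    "SEQADV 3ABC MET A   -1  UNP  P10725              EXPRESSION TAG"
)
print(engineered["res_name"], engineered["db_res"], engineered["conflict"])
print(tag["seq_num"], tag["db_res"], tag["conflict"])
```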

Details

In a number of cases, conflicts between the sequences found in PDB entries and in sequence database reference entries have been noted. There are several possible reasons for these conflicts, including natural variants or engineered sequences (mutants), polymorphic sequences, or ambiguous or conflicting experimental results. These discrepancies are reported in SEQADV. Additional details may be included in REMARK 999.
When conflicts arise which are not classifiable by these terms, a reference to either a published paper, a PDB entry, or a REMARK within the entry is given.
The comment "SEE REMARK 999" is included when the explanation for the conflict is too long to fit the SEQADV record.
Some of the possible conflict comments:

- Cloning artifact
- Expression tag
- Conflict
- Engineered
- Variant 
- Insertion
- Deletion
- Microheterogeneity
- Chromophore

Microheterogeneity is to be represented as a variant with one of the possible residues in the site being selected (arbitrarily) as the primary residue. The residues which do not match to the UNP reference will be listed in SEQADV records with the explanation of "microheterogeneity".

Verification/Validation/Value Authority Control

SEQADV records are automatically generated.

Relationships to Other Record Types

SEQADV refers to the sequence as found in the SEQRES records, and to the sequence database reference found on DBREF.

REMARK 999 contains text that explains discrepancies when the explanation is too lengthy to fit in SEQADV.

Examples

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
SEQADV 3ABC MET A   -1  UNP  P10725              EXPRESSION TAG
SEQADV 3ABC GLY A   50  UNP  P10725    VAL    50 ENGINEERED
SEQADV 2QLE CRO A   66  UNP  P42212    SER    65 CHROMOPHORE
SEQADV 2OKW LEU A   64  UNP  P42212    PHE    64 SEE REMARK 999
SEQADV 2OKW LEU A   64  NOR  NOR00669  PHE    14 SEE REMARK 999

SEQRES (updated)

Overview

SEQRES records contain a listing of the consecutive chemical components covalently linked in a linear fashion to form a polymer. The chemical components included in this listing may be standard or modified amino acid and nucleic acid residues. It may also include other residues that are linked to the standard backbone in the polymer. Chemical components or groups covalently linked to side-chains (in peptides) or sugars and/or bases (in nucleic acid polymers) will not be listed here.

Record Format

COLUMNS    DATA TYPE   FIELD    DEFINITION
-------------------------------------------------------------------------------------
 1 - 6    Record name  "SEQRES"
 8 - 10    Integer    serNum    Serial number of the SEQRES record for the
 current chain. Starts at 1 and increments
 by one each line. Reset to 1 for each chain.
12      Character   chainID   Chain identifier. This may be any single
 legal character, including a blank, which
 is used if there is only one chain.
14 - 17    Integer    numRes    Number of residues in the chain.
 This value is repeated on every record.
20 - 22    Residue name  resName   Residue name.
24 - 26    Residue name  resName   Residue name.
28 - 30    Residue name  resName   Residue name.
32 - 34    Residue name  resName   Residue name.
36 - 38    Residue name  resName   Residue name.
40 - 42    Residue name  resName   Residue name.
44 - 46    Residue name  resName   Residue name.
48 - 50    Residue name  resName   Residue name.
52 - 54    Residue name  resName   Residue name.
56 - 58    Residue name  resName   Residue name.
60 - 62    Residue name  resName   Residue name.
64 - 66    Residue name  resName   Residue name.
68 - 70    Residue name  resName   Residue name.
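Since a chain's sequence is spread over several records of up to 13 residues each, a reader must concatenate the records per chain. The sketch below does that and checks the total against numRes; `read_seqres` is a hypothetical helper, and it assumes the records for a chain appear in serNum order.

```python
from collections import defaultdict


def read_seqres(lines):
    """Collect residue names per chain from SEQRES lines and verify numRes."""
    chains = defaultdict(list)
    declared = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        line = line.rstrip("\n").ljust(80)     # pad so short lines slice safely
        chain_id = line[11]                    # column 12
        declared[chain_id] = int(line[13:17])  # columns 14 - 17
        # 13 residue slots: columns 20-22, 24-26, ..., 68-70
        for start in range(19, 70, 4):
            name = line[start:start + 3].strip()
            if name:
                chains[chain_id].append(name)
    for chain_id, residues in chains.items():
        if len(residues) != declared[chain_id]:
            raise ValueError(
                f"chain {chain_id!r}: numRes is {declared[chain_id]}, "
                f"but {len(residues)} residues were listed"
            )
    return dict(chains)


sequences = read_seqres([
    "SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU",
    "SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN",
])
print(len(sequences["A"]), sequences["A"][0], sequences["A"][-1])
```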

Verification/Validation/Value Authority Control

The residues presented in the ATOM records must agree with those on the SEQRES records.

The SEQRES records are checked using sequence databases and information provided by the depositor.

SEQRES is compared to the ATOM records during processing, and both are checked against the sequence databases. All discrepancies are either resolved or annotated appropriately in the entry.

The ribo- and deoxyribonucleotides in the SEQRES records are distinguished. The ribo- forms of these residues are identified with the residue names A, C, G, U and I. The deoxy- forms of these residues are identified with the residue names DA, DC, DG, DT and DI. Modified nucleotides in the sequence are identified by separate 3-letter residue codes. The plus character prefix to label modified nucleotides (e.g. +A, +C, +T) is no longer used.
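The naming rule above can be expressed as a simple lookup. This is only a sketch: `classify_nucleotide` is a hypothetical helper, and any name outside the two standard sets (amino acids, modified nucleotides with their own 3-letter codes, and so on) is reported as "other".

```python
RIBO_NAMES = {"A", "C", "G", "U", "I"}
DEOXY_NAMES = {"DA", "DC", "DG", "DT", "DI"}


def classify_nucleotide(res_name: str) -> str:
    """Classify a SEQRES residue name as 'ribo', 'deoxy', or 'other'."""
    name = res_name.strip()  # names are right-justified in a 3-column field
    if name in RIBO_NAMES:
        return "ribo"
    if name in DEOXY_NAMES:
        return "deoxy"
    return "other"


print(classify_nucleotide("  U"), classify_nucleotide(" DT"), classify_nucleotide("1MG"))
```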

Example

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN
SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 B   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 B   30  THR PRO LYS ALA
SEQRES   1 C   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 C   21  TYR GLN LEU GLU ASN TYR CYS ASN
SEQRES   1 D   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU
SEQRES   2 D   30  ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR
SEQRES   3 D   30  THR PRO LYS ALA

SEQRES   1 A    8   DA  DA  DC  DC  DG  DG  DT  DT
SEQRES   1 B    8   DA  DA  DC  DC  DG  DG  DT  DT

SEQRES   1 X   39    U   C   C   C   C   C   G   U   G   C   C   C   A
SEQRES   2 X   39    U   A   G   C   G   G   C   G   U   G   G   A   A
SEQRES   3 X   39    C   C   A   C   C   C   G   U   U   C   C   C   A

Known Problems

Polysaccharides do not lend themselves to being represented in SEQRES.

There is no mechanism provided to describe the order of sequence segments when their starting positions are unknown.

For cyclic peptides, a residue is arbitrarily assigned as the N-terminus.

MODRES (updated)

Overview

The MODRES record provides descriptions of modifications (e.g., chemical or post-translational) to protein and nucleic acid residues. Included are correlations between residue names given in a PDB entry and standard residues.

Record Format

COLUMNS    DATA TYPE   FIELD    DEFINITION
--------------------------------------------------------------------------------
 1 - 6    Record name  "MODRES"
 8 - 11    IDcode    idCode   ID code of this entry.
13 - 15    Residue name resName   Residue name used in this entry.
17      Character   chainID   Chain identifier.
19 - 22    Integer    seqNum   Sequence number.
23      AChar     iCode    Insertion code.
25 - 27    Residue name stdRes   Standard residue name.
30 - 70    String    comment   Description of the residue modification.
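One common use of these fields is to map a modified residue back to its standard parent. The sketch below parses MODRES lines by column and builds such a lookup; `parse_modres`, `standard_name_map`, and the key layout (chain, sequence number, insertion code) are hypothetical choices made for illustration.

```python
def parse_modres(line: str) -> dict:
    """Slice a MODRES line by the 1-based columns in the table (0-based here)."""
    if not line.startswith("MODRES"):
        raise ValueError("not a MODRES record")
    line = line.rstrip("\n").ljust(80)  # pad so short lines slice safely
    return {
        "id_code": line[7:11],              # columns  8 - 11
        "res_name": line[12:15].strip(),    # columns 13 - 15
        "chain_id": line[16],               # column  17
        "seq_num": int(line[18:22]),        # columns 19 - 22
        "i_code": line[22].strip(),         # column  23
        "std_res": line[24:27].strip(),     # columns 25 - 27
        "comment": line[29:70].strip(),     # columns 30 - 70
    }


def standard_name_map(lines) -> dict:
    """Map (chainID, seqNum, iCode) to the standard residue name."""
    result = {}
    for line in lines:
        if line.startswith("MODRES"):
            rec = parse_modres(line)
            result[(rec["chain_id"], rec["seq_num"], rec["i_code"])] = rec["std_res"]
    return result


lookup = standard_name_map([
    "MODRES 1IL2 1MG D 1937    G  1N-METHYLGUANOSINE-5'-MONOPHOSPHATE",
    "MODRES 4ABC MSE B   32  MET  SELENOMETHIONINE",
])
print(lookup[("D", 1937, "")], lookup[("B", 32, "")])
```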

Details

Residues modified post-translationally, enzymatically, or by design are described in MODRES records. In those cases where the wwPDB has opted to use a non-standard residue name for the residue, MODRES also correlates the new name to the precursor standard residue name.
Modified nucleotides in the sequence are now identified by separate 3-letter residue codes. The plus character prefix to label modified nucleotides (e.g. +A, +C, +T) is no longer used.
MODRES is mandatory when modified standard residues exist in the entry. Examples of some modification descriptions:

- Glycosylation site
- Post-translational modification
- Designed chemical modification
- Phosphorylation site
- D-configuration

A MODRES record is not required if coordinate records are not provided for the modified residue.
D-amino acids are given their own residue name (resName), e.g., DAL for D-alanine. This resName appears in the SEQRES records and has associated MODRES, HET, and FORMUL records. The coordinates are given as HETATMs within the ATOM records and occur in the correct order within the chain. This ordering is an exception to the stated Order of Records.
When a standard residue name is used to describe a modified site, resName (columns 13-15) and stdRes (columns 25-27) contain the same value.

Verification/Validation/Value Authority Control

MODRES is generated by the wwPDB.

Relationships to Other Record Types

MODRES maps ATOM and HETATM records to the standard residue names. HET and FORMUL records may also appear.

Example

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
MODRES 2R0L ASN A   74  ASN  GLYCOSYLATION SITE
MODRES 1IL2 1MG D 1937    G  1N-METHYLGUANOSINE-5'-MONOPHOSPHATE
MODRES 4ABC MSE B   32  MET  SELENOMETHIONINE