Gene Association File (GAF) format description

A Gene Association File (GAF) contains annotation data provided by the Gene Ontology Consortium as standardized tab-delimited text. Each line in the file represents an association between a gene product and a GO term, with an evidence code, a reference to support the association, and other data associated with the gene product or the annotation. This page describes the GAF 2.2 file format.

GO also provides annotations as GPAD+GPI.

For general information on GO annotations, please see the introduction to GO annotation page.

GAF 2.2 format description

File Header

Mandatory elements of the GAF 2.2 file header

!gaf-version: 2.2
!generated-by: database (must be listed in dbxrefs.yaml)
!date-generated: YYYY-MM-DD or YYYY-MM-DDTHH:MM

Other header elements may be included, such as links to the submitter's project page, funding sources, ontology versions, etc.

! URL: e.g. http://www.yeastgenome.org/
! Project-release: e.g. WS275
! Funding: e.g. NHGRI
! Columns: file format written out
! go-version: PURL
! ro-version: PURL
! gorel-version: PURL
! eco-version: PURL
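The header rules above can be sketched as a small reader that collects the `!key: value` lines and reports which mandatory elements are absent. This is a minimal illustration, not part of the specification: the function names `parse_gaf_header` and `missing_header_keys` are hypothetical, and lowercasing the keys is an assumption made here for convenience.

```python
# Mandatory GAF 2.2 header elements, as listed above.
REQUIRED_KEYS = {"gaf-version", "generated-by", "date-generated"}


def parse_gaf_header(lines):
    """Collect '!key: value' header lines into a dict (keys lowercased)."""
    header = {}
    for line in lines:
        if not line.startswith("!"):
            break  # the header ends at the first non-'!' line
        body = line[1:].strip()
        if ":" not in body:
            continue  # free-text comment, not a key/value pair
        # Split on the first colon only, so URLs and times stay intact.
        key, value = body.split(":", 1)
        header[key.strip().lower()] = value.strip()
    return header


def missing_header_keys(header):
    """Return the mandatory keys that the parsed header lacks."""
    return sorted(REQUIRED_KEYS - set(header))
```

Splitting on the first colon only keeps values such as `http://www.yeastgenome.org/` or `2020-01-01T10:30` in one piece.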

GAF 2.2 fields

The annotation flat file format consists of 17 tab-delimited fields.

Column	Content	Required?	Cardinality	Example
1	DB	required	1	UniProtKB
2	DB Object ID	required	1	P12345
3	DB Object Symbol	required	1	GOT2
4	Relation	required	1 (pipe-separated with NOT for negation)	involved_in or NOT|involved_in
5	GO ID	required	1	GO:0006457
6	Reference	required	1 or greater	PMID:9683573
7	Evidence Code	required	1	TAS
8	With (or) From	optional	0 or greater	GO:0000346|UniProtKB:P00508
9	Aspect	required	1	P
10	DB Object Name	optional	0 or 1	Aspartate aminotransferase, mitochondrial
11	DB Object Synonym	optional	0 or greater	mAspAT
12	DB Object Type	required	1	protein
13	Taxon	required	1 or 2	taxon:9986 or taxon:9986|taxon:652611
14	Date	required	1	20111018
15	Assigned By	required	1	HGNC
16	Annotation Extension	optional	0 or greater	part_of(CL:0000576)
17	Gene Product Form ID	optional	0 or 1	UniProtKB:P12345-2
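As an illustration of the table above, the following sketch splits one annotation line into its 17 fields and checks that the required columns are not empty. The snake_case field names and the function name `parse_gaf_line` are assumptions chosen for this example; the specification itself only defines the column order.

```python
# Column names in file order (snake_case names are an assumption).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "relation", "go_id",
    "reference", "evidence_code", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]

# Columns 8, 10, 11, 16 and 17 are optional; everything else is required.
OPTIONAL_COLUMNS = {
    "with_from", "db_object_name", "db_object_synonym",
    "annotation_extension", "gene_product_form_id",
}


def parse_gaf_line(line):
    """Split one annotation line into a dict keyed by column name."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected 17 fields, found {len(fields)}")
    record = dict(zip(GAF_COLUMNS, fields))
    missing = [
        name for name in GAF_COLUMNS
        if name not in OPTIONAL_COLUMNS and not record[name]
    ]
    if missing:
        raise ValueError(f"empty required field(s): {', '.join(missing)}")
    return record
```

Only the line ending is stripped, not all trailing whitespace, so that empty optional columns at the end of a line are still counted as fields.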

Definitions and requirements for field contents

DB (column 1)

Refers to the database from which the identifier of the biological entity described in DB object ID (column 2) is drawn. For example, if a UniProtKB ID is the DB object ID (column 2), DB (column 1) should be UniProtKB.

Must be one of the values from GO database cross-references.

Cardinality = 1

DB Object ID (column 2)

A unique identifier from the database in DB (column 1) describing the biological entity annotated.

Cardinality = 1

Note that the identifier must reference a top-level primary gene or gene product identifier: either a gene, or a protein that has a 1:1 correspondence to a gene. Identifiers referring to particular protein isoforms or post-translationally cleaved or modified proteins are not allowed in this field of the GAF file; such identifiers are captured in the Gene Product Form ID.

DB Object Symbol (column 3)

A name for the entity represented by the DB object ID. The DB Object Symbol field should be text that means something to a biologist wherever possible (a gene symbol, for example). If the entity has no name, the DB object ID can be used as a DB Object Symbol.

Cardinality = 1

Relation (column 4)

This column is populated with relations from the Relation Ontology that describe how the annotated biological entity relates to the GO term with which it is associated.

See also the documentation on qualifiers in the GO annotation guide.

Cardinality = 1
- For negation, a pipe must be used to separate the "NOT" from the relation (e.g. "NOT|contributes_to" or "NOT|enables").
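The negation rule can be sketched as a small helper that separates an optional leading "NOT" from the relation. The function name `parse_relation` is hypothetical, and the helper does not check that the relation itself is a valid Relation Ontology term.

```python
def parse_relation(value):
    """Return (negated, relation) for a column 4 value."""
    parts = value.split("|")
    if len(parts) == 1 and parts[0] and parts[0] != "NOT":
        return (False, parts[0])
    if len(parts) == 2 and parts[0] == "NOT" and parts[1]:
        return (True, parts[1])
    raise ValueError(f"malformed relation field: {value!r}")
```

For example, `parse_relation("NOT|enables")` yields `(True, "enables")`, while a bare `"NOT"` or an empty value is rejected.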

GO ID (column 5)

The GO identifier for the term associated with the DB object ID.

Cardinality = 1

Reference (column 6)

One or more unique identifiers for a single reference cited as the source experiment or method for attributing the GO ID to the DB Object ID. This may be a literature reference (PMID or DOI), a GOREF internal reference record or a Model Organism Database (MOD) internal reference. The syntax is DB:accession_number.

Note that only one unique reference can be cited on a single line in the GAF. If a reference has identifiers in more than one database, multiple identifiers for that reference can be included on a single line, separated by a pipe. For example, if a reference has a PMID and a model organism database reference, the PMID must be included but the model organism database identifier may also be included. Note that if a model organism database has an identifier for the reference, that identifier should always be included, even if a PubMed ID is also used.

Cardinality = 1, >1
For cardinality >1, values must be pipe-separated (e.g. PMID:2676709|SGD_REF:S000047763).
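A minimal sketch of reading the Reference column, assuming each pipe-separated value follows the DB:accession_number syntax described above. The function name `parse_references` is hypothetical; the helper only checks the shape of each identifier, not whether the database prefix is registered.

```python
def parse_references(value):
    """Split column 6 into a list of (db, accession) pairs."""
    references = []
    for item in value.split("|"):
        # Partition on the first colon only; accessions may contain colons.
        db, sep, accession = item.partition(":")
        if not (db and sep and accession):
            raise ValueError(f"not in DB:accession_number form: {item!r}")
        references.append((db, accession))
    return references
```

All pairs returned for one line are alternative identifiers for the same reference, not separate references.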

Evidence Code (column 7)

See GO-ECO mapping file and the GO Evidence code guide for the list of valid evidence codes for GO annotations.

Cardinality = 1

With [or] From (column 8)

This field is used with specific ECO codes to capture an additional identifier supporting the evidence for the annotation. For example, it can identify another gene product to which the annotated gene product is similar (ISS) or with which it interacts (IPI). Population of the With/From field is mandatory for certain evidence codes; see the documentation for the individual evidence codes for more information.

Cardinality = 0, 1, >1, with the following rules:
- Cardinality must be 0 for evidence codes IDA, TAS, NAS, or ND.
- Cardinality must be 1 or >1 for IEA, IC, IGI, IPI, ISS and child terms of ISS.
- For cardinality >1 pipes or commas may be used. A pipe is used to separate independent evidence (e.g. FB:FBgn1111111|FB:FBgn2222222). A comma indicates grouped evidence, e.g. two of three genes in a triply mutant organism.
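The separator and cardinality rules above can be sketched as follows. Pipes separate independent evidence and commas group evidence within one entry, so the field is read as a list of groups. The names `parse_with_from` and `with_from_is_consistent` are hypothetical, and the child terms of ISS are not enumerated here, so this check covers only the codes named explicitly in the rules.

```python
# Evidence codes taken from the rules above; ISS child terms are omitted
# because this page does not list them (an acknowledged gap in the sketch).
MUST_BE_EMPTY = {"IDA", "TAS", "NAS", "ND"}
MUST_BE_PRESENT = {"IEA", "IC", "IGI", "IPI", "ISS"}


def parse_with_from(value):
    """Return a list of groups; each group is a list of identifiers."""
    if not value:
        return []
    return [group.split(",") for group in value.split("|")]


def with_from_is_consistent(evidence_code, value):
    """Check the column 8 cardinality rule for the given evidence code."""
    groups = parse_with_from(value)
    if evidence_code in MUST_BE_EMPTY:
        return not groups
    if evidence_code in MUST_BE_PRESENT:
        return bool(groups)
    return True  # no rule stated above for other codes
```

With this representation, `FB:FBgn1111111|FB:FBgn2222222` becomes two single-member groups, while a comma-separated value becomes one group with two members.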

Aspect (column 9)

Refers to the specific branch of the GO to which the GO ID (column 5) belongs: P (biological process), F (molecular function) or C (cellular component).

Cardinality = 1

DB Object Name (column 10)

Name of the annotated gene or gene product.

Cardinality = 0 or 1
- White space is allowed.

DB Object Synonym (column 11)

A gene symbol (or other text) that denotes another name by which the annotated gene or gene product might be known.

Cardinality = 0, 1, >1
- For cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene).
- White space is allowed.

DB Object Type (column 12)

A label corresponding to the ontology identifier describing the class of biological entity of the DB Object ID (column 2). The values used are shown below. The full list of entity types and their corresponding identifiers can be found in biological_entity_mapping.yaml.

protein (PR:000000001)
protein-containing complex (GO:0032991)
ncRNA or any SO child term (SO:0000655)

Cardinality = 1

Taxon (column 13)

The NCBI taxonomic identifier(s) of the annotated entity (column 1). Identifiers must come from the NCBI Taxonomy database and have the taxon: prefix (e.g. taxon:1|taxon:1000). It is also possible to capture a second taxonomic identifier for an interacting organism, in conjunction with terms that have the biological process term ‘GO:0044419 biological process involved in interspecies interaction between organisms’ or the cellular component term ‘GO:0018995 host cellular component’ as an ancestor.

Cardinality = 1 or 2
For cardinality 2, values must be pipe-separated.
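A minimal sketch of validating the Taxon column, assuming the numeric part of each identifier is a plain decimal integer. The function name `parse_taxon` is hypothetical, and the helper does not check whether the GO term permits a second taxon.

```python
def parse_taxon(value):
    """Return one or two NCBI taxon numbers from a column 13 value."""
    parts = value.split("|")
    if not 1 <= len(parts) <= 2:
        raise ValueError(f"expected 1 or 2 taxon identifiers: {value!r}")
    taxon_ids = []
    for part in parts:
        prefix, _, number = part.partition(":")
        if prefix != "taxon" or not (number.isascii() and number.isdigit()):
            raise ValueError(f"not in taxon:<number> form: {part!r}")
        taxon_ids.append(int(number))
    return taxon_ids
```

When two identifiers are returned, the second is the one captured for the interacting organism.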

Date (column 14)

Date on which the annotation was made; format is YYYYMMDD.

Cardinality = 1
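The YYYYMMDD format can be checked with the standard library, as in this sketch. The function name `parse_gaf_date` is hypothetical. The explicit length and digit check is there because `strptime` alone can accept some shorter strings.

```python
from datetime import datetime


def parse_gaf_date(value):
    """Parse a column 14 value in YYYYMMDD form into a date object."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"date must be 8 digits (YYYYMMDD): {value!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    return datetime.strptime(value, "%Y%m%d").date()
```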

Assigned By (column 15)

The database that made the annotation. The value of this column is used for tracking the source of an individual annotation, and it will differ from the value in column 1 for any annotation that is made by one database and incorporated into another.

This column must use one of the values from the set of GO database cross-references.

Cardinality = 1

Annotation Extension (column 16)

Annotation extensions allow GO terms in standard annotations to be further specified, using gene products, chemicals, cell types, anatomical structures, to provide additional biological context. The cross-reference is prefaced by an appropriate relationship from the Relation Ontology. Multiple extensions may be entered.

Cardinality = 0, 1, >1
For cardinality >1, use of a pipe (|) specifies an independent statement (OR) and is equivalent to making separate annotations, i.e. not all conditions are required to infer the annotated GO term. Use of a comma (,) specifies a connected statement (AND) and indicates that all conditions are required to infer the annotated GO term. In this case, ‘OR’ is a weaker statement than ‘AND’ and will therefore be correct in all cases. Pipe and comma separators may be used together in the same annotation extension field.
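The OR/AND structure described above can be sketched as a nested list: the outer list holds the pipe-separated alternatives and each inner list holds the comma-separated conditions that must all apply. The function name `parse_annotation_extension` and the regular expression are assumptions for illustration; the pattern expects each item to look like relation(ID), as in the part_of(CL:0000576) example from the field table.

```python
import re

# One extension item: a relation name followed by an identifier in parentheses.
_EXTENSION_ITEM = re.compile(r"^(\w+)\(([^()]+)\)$")


def parse_annotation_extension(value):
    """Return a list of OR-alternatives, each a list of AND-conditions."""
    if not value:
        return []
    alternatives = []
    for alternative in value.split("|"):
        conditions = []
        for item in alternative.split(","):
            match = _EXTENSION_ITEM.match(item.strip())
            if match is None:
                raise ValueError(f"not in relation(ID) form: {item!r}")
            conditions.append((match.group(1), match.group(2)))
        alternatives.append(conditions)
    return alternatives
```

Under this reading, a field with one pipe and one comma produces two alternatives, one of which carries two conditions.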

Gene Product Form ID (column 17)

This column captures specific isoforms or post-translationally processed forms of a gene or gene product that are associated with the annotation. As the DB Object ID (column 2) must be a canonical entity, i.e. a gene or a representative protein that has a 1:1 correspondence to a gene, this column allows for capturing greater specificity about the annotated entity. Content may include identifiers for distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.

The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206-2

When the Gene Product Form ID is filled with a protein identifier, the value in DB Object Type (column 12) must be protein. Protein identifiers can include UniProtKB accessions or Protein Ontology (PRO) identifiers.
When the Gene Product Form ID is filled with a functional RNA identifier, the DB Object Type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA.

Cardinality = 0 or 1
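The consistency rules between column 17 and column 12 can be sketched as below. Deciding whether an identifier denotes a protein or an RNA requires knowing its source database, so the prefix sets here are assumptions made for illustration: `UniProtKB` and `PR` are treated as protein sources, and `RNAcentral` as an RNA source. The function name `check_gene_product_form` is hypothetical.

```python
# Assumed mapping from identifier prefix to entity kind (illustrative only).
PROTEIN_PREFIXES = {"UniProtKB", "PR"}
RNA_PREFIXES = {"RNAcentral"}

# DB Object Types allowed when column 17 holds a functional RNA identifier.
RNA_TYPES = {"ncRNA", "rRNA", "tRNA", "snRNA", "snoRNA"}


def check_gene_product_form(form_id, db_object_type):
    """Return a list of problems found; an empty list means no issue."""
    if not form_id:
        return []  # the column is optional
    prefix, sep, local_id = form_id.partition(":")
    if not (prefix and sep and local_id):
        return [f"not a two-part identifier: {form_id}"]
    if prefix in PROTEIN_PREFIXES and db_object_type != "protein":
        return [f"protein form ID with DB Object Type {db_object_type}"]
    if prefix in RNA_PREFIXES and db_object_type not in RNA_TYPES:
        return [f"RNA form ID with DB Object Type {db_object_type}"]
    return []  # prefixes outside the assumed sets are not checked
```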