PhenCode: Programs for data conversion
To prepare data from a source database for loading into the
Locus Variants track, we use a number of Perl scripts to transform it
into a standardized format. The ones listed here are fundamental
utility routines for performing common tasks; an additional customized
script that calls these will be necessary for each data source (see
"Example of adding a new data source" for more information).
To place the variants on the chromosomes, these scripts use alignments
between a particular genome assembly and the reference sequence(s)
used by the source database. The alignments are run beforehand using
whatever program you wish (I use Blat), but the results must be in
PSL format to serve as input for the scripts. (For Blat, specify the database's
reference sequence as the "query", and the assembly's chromosome
sequence as the "target".) The optional 'pslLine' parameter allows the
calling program to supply a tab-separated PSL line directly, in which
case the alignment file will be ignored. This option was added as a
way to speed up the processing of some large data sets.
Note that all position parameters here are 1-based: the first position
in a sequence is "1", not "0".
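As an illustration of the PSL input described above, here is a minimal Python sketch (it is not one of the PhenCode Perl scripts) that splits one tab-separated PSL line into the fields that matter for coordinate conversion. The column order follows the standard 21-column PSL layout; the names PslRecord and parse_psl_line, and the example line with the query "refSeq1", are invented for this example.

```python
from typing import List, NamedTuple


class PslRecord(NamedTuple):
    """Subset of the 21 PSL columns needed to map query positions to the target."""
    strand: str
    q_name: str
    q_size: int
    t_name: str
    block_sizes: List[int]
    q_starts: List[int]  # 0-based block starts in the query (reference sequence)
    t_starts: List[int]  # 0-based block starts in the target (chromosome)


def _int_list(field: str) -> List[int]:
    # PSL list columns are comma-separated with a trailing comma, e.g. "100,200,"
    return [int(x) for x in field.split(",") if x]


def parse_psl_line(line: str) -> PslRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 21:
        raise ValueError("expected 21 tab-separated PSL columns, got %d" % len(fields))
    return PslRecord(
        strand=fields[8],
        q_name=fields[9],
        q_size=int(fields[10]),
        t_name=fields[13],
        block_sizes=_int_list(fields[18]),
        q_starts=_int_list(fields[19]),
        t_starts=_int_list(fields[20]),
    )


# Invented example: a 300-base reference aligned to chr11 in two blocks.
EXAMPLE_LINE = "\t".join([
    "300", "0", "0", "0", "0", "0", "1", "400", "+",
    "refSeq1", "300", "0", "300",
    "chr11", "135000000", "1000", "1700",
    "2", "100,200,", "0,100,", "1000,1500,",
])

record = parse_psl_line(EXAMPLE_LINE)
print(record.q_name, record.t_name, record.block_sizes)
```

Note that block starts inside a PSL file are 0-based, whereas the position parameters passed to the scripts are 1-based, so a conversion step is needed when moving between the two.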
If you have any questions or problems with these scripts, please contact
Belinda Giardine.
- parseHgvsName2
- USAGE: parseHgvsName2 alignmentFile.psl referenceSequenceName 'hgvsName' ['pslLine']
This script calls the others as needed to find the chromosome position
for a variant, given its HGVS-style name. The hgvsName is quoted to
keep the shell from acting on special characters in it.
(Note that some name formats may not be supported.)
- convert_coors2
- USAGE: convert_coors2 alignmentFile.psl referenceSequenceName 999[±99] ['pslLine']
This script finds the chromosome coordinate for a mutation position,
using a PSL-formatted alignment between the database's reference
sequence and the chromosome. The numeric parameter specifies the
nucleotide position in the reference sequence. If the database uses
a coding sequence as the reference instead of a genomic sequence,
positions within introns are specified as a coding position plus or
minus an offset into the intron, e.g. "128+2". (See the HGVS nomenclature
for a more detailed discussion of this convention.)
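The conversion described for convert_coors2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the script's actual implementation: it handles only alignments on the "+" strand, takes the PSL block lists as plain Python lists, and the function names parse_position and ref_to_chrom are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def parse_position(text: str) -> Tuple[int, int]:
    """Split '128', '128+2' or '128-5' into (reference position, intron offset)."""
    match = re.fullmatch(r"(\d+)([+-]\d+)?", text.strip())
    if match is None:
        raise ValueError("unsupported position format: %r" % text)
    return int(match.group(1)), int(match.group(2) or 0)


def ref_to_chrom(pos: int, offset: int,
                 block_sizes: List[int],
                 q_starts: List[int],
                 t_starts: List[int]) -> Optional[int]:
    """Map a 1-based reference position to a 1-based chromosome position.

    Assumes a '+' strand alignment. PSL block starts are 0-based, so the
    position is shifted down by one before the lookup and back up afterwards.
    Returns None when the position is not covered by any aligned block.
    """
    q0 = pos - 1
    for size, q_start, t_start in zip(block_sizes, q_starts, t_starts):
        if q_start <= q0 < q_start + size:
            return t_start + (q0 - q_start) + 1 + offset
    return None


# Invented alignment: two blocks (exons) separated by a 400-base gap (intron).
sizes, q_starts, t_starts = [100, 200], [0, 100], [1000, 1500]

pos, offset = parse_position("100+2")
print(ref_to_chrom(pos, offset, sizes, q_starts, t_starts))
```

With a coding sequence as the reference, each aligned block corresponds to an exon, so adding the offset to the chromosome position of the last (or first) exon base lands inside the intron. On the "-" strand the direction of the offset would be reversed, which this sketch does not attempt.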
- convert_prot_coors2
- USAGE: convert_prot_coors2 alignmentFile.psl referenceSequenceName 999 ['pslLine']
This is similar to convert_coors2, but it converts a codon position
instead of a nucleotide position. It uses an alignment where the
lengths in the PSL format are in codons, not nucleotides.
- convert_prot_coors3
- USAGE: convert_prot_coors3 alignmentFile.psl referenceSequenceName 999 ['pslLine']
This converts a codon position like convert_prot_coors2 does, but
uses an alignment with normal PSL numbering where the lengths are
in nucleotides.
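The relationship between codon numbering and nucleotide numbering that separates these two scripts comes down to simple arithmetic, sketched below in Python. It assumes the reference sequence starts at the first base of codon 1; whether the scripts make the same assumption is not stated here, and the function name codon_to_nucleotides is hypothetical.

```python
from typing import Tuple


def codon_to_nucleotides(codon: int) -> Tuple[int, int]:
    """Return the 1-based, inclusive nucleotide range covered by a 1-based codon.

    Assumes nucleotide 1 of the reference is the first base of codon 1.
    """
    if codon < 1:
        raise ValueError("codon positions start at 1")
    first = 3 * (codon - 1) + 1
    return first, first + 2


print(codon_to_nucleotides(1))
print(codon_to_nucleotides(43))
```

Under this assumption, codon 43 covers nucleotides 127 through 129, and those nucleotide positions are the ones that would be looked up in an alignment with normal PSL numbering.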
- sequenceCheck
- USAGE: sequenceCheck sequenceFile.fa 'hgvsName'
This script will verify the sequence for a variant where possible.
For example, it will check the wild-type sequence in the case of
substitutions or deletions where the deleted nucleotides are listed.
It uses getSubSeq to extract the sequence for comparison.
- getSubSeq
- USAGE: getSubSeq from,to sequenceFile.fa
This extracts a particular range from a FASTA-formatted
sequence. The 'from,to' parameter specifies the endpoint positions
(inclusive) and must not contain any spaces; e.g. "1,100" returns
the first 100 nucleotides.
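The range extraction done by getSubSeq, and the kind of wild-type check described for sequenceCheck, can be sketched in Python as follows. This is only an illustration of the 1-based, inclusive convention, not the Perl scripts themselves: the function names are hypothetical, only the first record of the FASTA text is read, and the check recognizes only simple substitutions such as "c.3G>A".

```python
import re
from typing import Optional


def read_fasta_sequence(text: str) -> str:
    """Return the first sequence in FASTA-formatted text, without its header."""
    parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header or parts:
                break  # a second record starts here; stop after the first
            seen_header = True
            continue
        if line:
            parts.append(line)
    return "".join(parts)


def get_sub_seq(seq: str, start: int, end: int) -> str:
    """Extract positions start..end, both 1-based and inclusive."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("range %d,%d is outside the sequence" % (start, end))
    return seq[start - 1:end]


def check_substitution(seq: str, hgvs_name: str) -> Optional[bool]:
    """Compare the wild-type base named in a simple substitution with seq.

    Returns None for name formats this sketch does not recognize.
    """
    match = re.fullmatch(r"(?:[cg]\.)?(\d+)([ACGT])>([ACGT])", hgvs_name)
    if match is None:
        return None
    pos = int(match.group(1))
    if pos < 1 or pos > len(seq):
        return False
    return get_sub_seq(seq, pos, pos).upper() == match.group(2)


# Invented ten-base sequence for demonstration.
demo = read_fasta_sequence(">demo\nACGTAC\nGTTT\n")
print(get_sub_seq(demo, 1, 4))
print(check_substitution(demo, "c.3G>A"))
```

Because both endpoints are inclusive, a range of "1,4" yields four bases, in the same way that "1,100" yields the first 100.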