PhenCode: Programs for data conversion

To prepare data from a source database for loading into the Locus Variants track, we use a set of Perl scripts that transform it into a standardized format. The scripts listed here are general-purpose utilities for common tasks; each data source also needs its own customized script that calls them (see Example of adding a new data source for more information).

To place the variants on the chromosomes, these scripts use alignments between a particular genome assembly and the reference sequence(s) used by the source database. The alignments are run beforehand using whatever program you wish (I use Blat), but the results must be in PSL format to serve as input for the scripts. (For Blat, specify the database's reference sequence as the "query", and the assembly's chromosome sequence as the "target".) The optional 'pslLine' parameter allows the calling program to supply a tab-separated PSL line directly, in which case the alignment file will be ignored. This option was added as a way to speed up the processing of some large data sets.
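
As an illustration of what a PSL line contains, here is a minimal Python sketch (not part of the PhenCode scripts, which are written in Perl) that splits one tab-separated line into the fields relevant for coordinate conversion. The field order follows the standard 21-column PSL layout; the names PslRecord and parse_psl_line, and the example line itself, are made up for this sketch. Note that PSL start coordinates are 0-based, unlike the 1-based position parameters of the scripts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PslRecord:
    """Subset of the 21 standard PSL columns needed to map positions."""
    strand: str
    q_name: str          # query: the database's reference sequence
    q_size: int
    q_start: int         # 0-based
    q_end: int
    t_name: str          # target: the assembly's chromosome
    t_size: int
    t_start: int         # 0-based
    t_end: int
    block_sizes: List[int]
    q_starts: List[int]  # 0-based start of each block in the query
    t_starts: List[int]  # 0-based start of each block in the target


def _int_list(field: str) -> List[int]:
    # PSL list columns are comma-separated and end with a trailing comma.
    return [int(x) for x in field.split(",") if x]


def parse_psl_line(line: str) -> PslRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    return PslRecord(
        strand=fields[8],
        q_name=fields[9],
        q_size=int(fields[10]),
        q_start=int(fields[11]),
        q_end=int(fields[12]),
        t_name=fields[13],
        t_size=int(fields[14]),
        t_start=int(fields[15]),
        t_end=int(fields[16]),
        block_sizes=_int_list(fields[18]),
        q_starts=_int_list(fields[19]),
        t_starts=_int_list(fields[20]),
    )


# Hypothetical alignment: a 290-base reference in two blocks on chr11,
# separated by a 100-base gap in the chromosome.
EXAMPLE_PSL = "\t".join([
    "290", "0", "0", "0", "0", "0", "1", "100", "+",
    "refSeq1", "290", "0", "290",
    "chr11", "135000000", "5200000", "5200390",
    "2", "90,200,", "0,90,", "5200000,5200190,",
])

record = parse_psl_line(EXAMPLE_PSL)
print(record.q_name, record.t_name, record.block_sizes)
```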

Note that all position parameters here are 1-based: the first position is numbered 1, not 0.

If you have any questions or problems with these scripts, please contact Belinda Giardine.

parseHgvsName2: USAGE: parseHgvsName2 alignmentFile.psl referenceSequenceName 'hgvsName' ['pslLine']

This script calls the others as needed to find the chromosome position for a variant, given its HGVS-style name. The hgvsName is quoted to keep the shell from acting on special characters in it. (Note that some name formats may not be supported.)

convert_coors2: USAGE: convert_coors2 alignmentFile.psl referenceSequenceName 999[±99] ['pslLine']

This script finds the chromosome coordinate for a mutation position, using a PSL-formatted alignment between the database's reference sequence and the chromosome. The numeric parameter specifies the nucleotide position in the reference sequence. If the database uses a coding sequence as the reference instead of a genomic sequence, positions within introns are specified as a coding position plus or minus an offset into the intron, e.g. "128+2". (See the HGVS nomenclature for a more detailed discussion of this convention.)
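
The idea can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the logic of convert_coors2 itself: it handles plus-strand alignments only, takes the PSL block lists as already-parsed integers, and assumes an intron offset is simply added to the mapped chromosome coordinate. The function names are invented for the example.

```python
import re


def parse_position(text):
    """Split '128' or '128+2' into (position, offset); offset defaults to 0."""
    m = re.fullmatch(r"(\d+)([+-]\d+)?", text)
    if m is None:
        raise ValueError("unrecognized position: " + repr(text))
    return int(m.group(1)), int(m.group(2) or 0)


def map_to_target(pos, block_sizes, q_starts, t_starts):
    """Map a 1-based reference position to a 1-based chromosome position.

    PSL block starts are 0-based, so convert on the way in and out.
    Returns None if the position falls outside every aligned block.
    """
    zero_based = pos - 1
    for size, q0, t0 in zip(block_sizes, q_starts, t_starts):
        if q0 <= zero_based < q0 + size:
            return t0 + (zero_based - q0) + 1
    return None


def convert(text, block_sizes, q_starts, t_starts):
    """Convert a position such as '128+2' (plus strand only in this sketch)."""
    pos, offset = parse_position(text)
    mapped = map_to_target(pos, block_sizes, q_starts, t_starts)
    if mapped is None:
        return None
    return mapped + offset


# Hypothetical alignment: two blocks of 90 and 200 bases with a
# 100-base intron between them on the chromosome.
sizes, q_starts, t_starts = [90, 200], [0, 90], [5200000, 5200190]
print(convert("90", sizes, q_starts, t_starts))    # 5200090
print(convert("90+2", sizes, q_starts, t_starts))  # 5200092
print(convert("91-2", sizes, q_starts, t_starts))  # 5200189
```

For a minus-strand alignment the block coordinates and the direction of the offset would have to be reversed, which this sketch does not attempt.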

convert_prot_coors2: USAGE: convert_prot_coors2 alignmentFile.psl referenceSequenceName 999 ['pslLine']

This is similar to convert_coors2, but it converts a codon position instead of a nucleotide position. It uses an alignment where the lengths in the PSL format are in codons, not nucleotides.

convert_prot_coors3: USAGE: convert_prot_coors3 alignmentFile.psl referenceSequenceName 999 ['pslLine']

This converts a codon position like convert_prot_coors2 does, but uses an alignment with normal PSL numbering where the lengths are in nucleotides.
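
When the alignment is numbered in nucleotides, a codon position first has to be turned into nucleotide positions. A small Python sketch of that arithmetic, assuming the reference is a coding sequence whose first base is the first base of codon 1 (the function name is made up, and the real script may handle this differently):

```python
def codon_to_nucleotide_range(codon):
    """Return the 1-based (first, last) nucleotide positions of a codon.

    Assumes position 1 of the reference is the first base of codon 1.
    """
    if codon < 1:
        raise ValueError("codon positions start at 1")
    first = 3 * (codon - 1) + 1
    return first, first + 2


print(codon_to_nucleotide_range(1))  # (1, 3)
print(codon_to_nucleotide_range(6))  # (16, 18)
```

Each end of the range can then be converted like any other nucleotide position.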

sequenceCheck: USAGE: sequenceCheck sequenceFile.fa 'hgvsName'

This script will verify the sequence for a variant where possible. For example, it will check the wild-type sequence in the case of substitutions or deletions where the deleted nucleotides are listed. It uses getSubSeq to extract the sequence for comparison.
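
For the simplest case, a single-nucleotide substitution, the check might look like the Python sketch below. It is an illustration only: the pattern accepts just names of the form "c.3G>A" or "g.3G>A" (no gene prefix, no intron offsets), the sequence is passed in as a string rather than read from a file, and the function name is invented.

```python
import re

# Matches simple substitutions only, e.g. "c.3G>A"; anything else is skipped.
_SUBSTITUTION = re.compile(r"^(?:[cg]\.)?(\d+)([ACGT])>([ACGT])$")


def check_substitution(hgvs_name, sequence):
    """Compare the wild-type base in the name with the reference sequence.

    Returns True or False for a simple substitution, or None when the
    name is in a format this sketch cannot verify.
    """
    m = _SUBSTITUTION.match(hgvs_name)
    if m is None:
        return None
    pos = int(m.group(1))      # 1-based position in the reference
    wild_type = m.group(2)
    if pos < 1 or pos > len(sequence):
        return False
    return sequence[pos - 1].upper() == wild_type


print(check_substitution("c.3G>A", "ATGCC"))    # True
print(check_substitution("c.3C>A", "ATGCC"))    # False
print(check_substitution("c.3_4del", "ATGCC"))  # None
```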

getSubSeq: USAGE: getSubSeq from,to sequenceFile.fa

This extracts a particular range from a FASTA-formatted sequence. The 'from,to' parameter specifies the endpoint positions (inclusive) and must not contain any spaces; e.g. "1,100" returns the first 100 nucleotides.
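
The same extraction can be sketched in Python. This is not the getSubSeq script itself, only an illustration of the 1-based inclusive convention; it reads just the first record of the FASTA input, and both function names are made up.

```python
def read_fasta_sequence(lines):
    """Join the sequence lines of the first FASTA record into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if parts:
                break      # a second record starts here; stop
            continue       # header of the first record
        parts.append(line)
    return "".join(parts)


def get_sub_seq(range_spec, sequence):
    """Extract 'from,to' (1-based, inclusive) from a sequence string."""
    start_text, end_text = range_spec.split(",")
    start, end = int(start_text), int(end_text)
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("range out of bounds: " + range_spec)
    # 1-based inclusive [start, end] is the 0-based slice [start-1, end).
    return sequence[start - 1:end]


fasta_lines = [">refSeq1 example", "ACGT", "TTGA"]
seq = read_fasta_sequence(fasta_lines)
print(get_sub_seq("1,4", seq))  # ACGT
print(get_sub_seq("3,5", seq))  # GTT
```

To read from a file instead, pass the open file handle to read_fasta_sequence in place of the list of lines.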