Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

cobilab/jarvis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

93 Commits

Repository files navigation

Build Status Conda License: GPL v3

JARVIS

Efficient lossless compression of genomic sequences

As a compression tool, JARVIS is able to provide additional compression gains over GeCo [ https://github.com/pratas/geco ] and GeCo2 [ https://github.com/pratas/geco2 ], however it uses slightly higher computational resources (RAM and processing time). JARVIS only affords reference-free compression. The core of the JARVIS method is a competitive prediction between two different classes of models: weighted stochastic repeat models and weighted context models. The latter models can be either context models or substitutional tolerant context models.

INSTALLATION

Conda

Install Miniconda, then run the following:

conda install -y -c bioconda jarvis

Unix

git clone https://github.com/pratas/jarvis.git
cd jarvis/src/
make

EXECUTION

Run JARVIS

Run JARVIS using level 3:

./JARVIS -v -l 3 File.seq

PARAMETERS

To see the possible options type

./JARVIS

or

./JARVIS -h

This will print the following options:

SYNOPSIS 
 ./JARVIS [OPTION]... [FILE] 
 
SAMPLE 
 Run Compression : ./JARVIS -v -l 4 sequence.txt 
 Run Decompression : ./JARVIS -v -d sequence.txt.jc 
 
DESCRIPTION 
 Compress and decompress lossless genomic sequences for 
 storage and analysis purposes. 
 Measure an upper bound of the sequence entropy. 
 
 -h, --help 
 usage guide (help menu). 
 
 -V, --version 
 Display program and version information. 
 
 -F, --force 
 force mode. Overwrites old files. 
 
 -v, --verbose 
 verbose mode (more information). 
 
 -s, --show-levels 
 show pre-computed compression levels (configured). 
 
 -l [NUMBER], --level [NUMBER] 
 Compression level (integer). 
 Default level: 1. 
 It defines compressibility in balance with computational 
 resources (RAM & time). Use -s for levels perception. 
 
 -cm [NB_C]:[NB_D]:[NB_I]:[NB_G]/[NB_S]:[NB_E]:[NB_I]:[NB_A] 
 Template of a context model. 
 Parameters: 
 [NB_C]: (integer [1;20]) order size of the regular context 
 model. Higher values use more RAM but, usually, are 
 related to a better compression score. 
 [NB_D]: (integer [1;5000]) denominator to build alpha, which 
 is a parameter estimator. Alpha is given by 1/[NB_D].
 Higher values are usually used with higher [NB_C], 
 and related to confiant bets. When [NB_D] is one, 
 the probabilities assume a Laplacian distribution. 
 [NB_I]: (integer {0,1}) number to define if a sub-program 
 which addresses the specific properties of DNA 
 sequences (Inverted repeats) is used or not. The 
 number 1 turns ON the sub-program using at the same 
 time the regular context model. The number 0 does 
 not contemple its use (Inverted repeats OFF). The 
 use of this sub-program increases the necessary time 
 to compress but it does not affect the RAM. 
 [NB_G]: (real [0;1)) real number to define gamma. This value 
 represents the decayment forgetting factor of the 
 regular context model in definition. 
 [NB_S]: (integer [0;20]) maximum number of editions allowed 
 to use a substitutional tolerant model with the same 
 memory model of the regular context model with 
 order size equal to [NB_C]. The value 0 stands for 
 turning the tolerant context model off. When the 
 model is on, it pauses when the number of editions 
 is higher that [NB_C], while it is turned on when 
 a complete match of size [NB_C] is seen again. This 
 is probabilistic-algorithmic model very usefull to 
 handle the high substitutional nature of genomic 
 sequences. When [NB_S] > 0, the compressor used more 
 processing time, but uses the same RAM and, usually, 
 achieves a substantial higher compression ratio. The 
 impact of this model is usually only noticed for 
 [NB_C] >= 14. 
 [NB_R]: (integer {0,1}) number to define if a sub-program 
 which addresses the specific properties of DNA 
 sequences (Inverted repeats) is used or not. It is 
 similar to the [NR_I] but for tolerant models. 
 [NB_E]: (integer [1;5000]) denominator to build alpha for 
 substitutional tolerant context model. It is 
 analogous to [NB_D], however to be only used in the 
 probabilistic model for computing the statistics of 
 the substitutional tolerant context model. 
 [NB_A]: (real [0;1)) real number to define gamma. This value 
 represents the decayment forgetting factor of the 
 substitutional tolerant context model in definition. 
 Its definition and use is analogus to [NB_G]. 
 
 ... (you may use several context models) 
 
 
 -rm [NB_R]:[NB_C]:[NB_A]:[NB_B]:[NB_L]:[NB_G]:[NB_I] 
 Template of a repeat model. 
 Parameters: 
 [NB_R]: (integer [1;10000] maximum number of repeat models 
 for the class. On very repetive sequences the RAM 
 increases along with this value, however it also 
 improves the compression capability. 
 [NB_C]: (integer [1;20]) order size of the repeat context 
 model. Higher values use more RAM but, usually, are 
 related to a better compression score. 
 [NB_A]: (real (0;1]) alpha is a real value, which is a 
 parameter estimator. Higher values are usually used 
 in lower [NB_C]. When [NB_A] is one, the 
 probabilities assume a Laplacian distribution. 
 [NB_B]: (real (0;1]) beta is a real value, which is a 
 parameter for discarding or maintaining a certain 
 repeat model. 
 [NB_L]: (integer (1;20]) a limit threshold to play with 
 [NB_B]. It accepts or not a certain repeat model. 
 [NB_G]: (real [0;1)) real number to define gamma. This value 
 represents the decayment forgetting factor of the 
 regular context model in definition. 
 [NB_I]: (integer {0,1}) number to define if a sub-program 
 which addresses the specific properties of DNA 
 sequences (Inverted repeats) is used or not. The 
 number 1 turns ON the sub-program using at the same 
 time the regular context model. The number 0 does 
 not contemple its use (Inverted repeats OFF). The 
 use of this sub-program increases the necessary time 
 to compress but it does not affect the RAM. 
 
 -z [NUMBER], --selection [NUMBER] 
 Size of the context selection model (integer). 
 Default context selection: 12. 
 
 [FILE] 
 Input sequence filename (to compress) -- MANDATORY. 
 File to compress is the last argument. 

If you are not interested in setting the template for each model, then use the levels mode. To see the possible levels type:

./JARVIS -s

CITATION

On using this software/method please cite:

Pratas D, Hosseini M, Silva JM, Pinho AJ. A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models. Entropy. 2019 Nov;21(11):1074.

ISSUES

For any issue let us know at issues link.

LICENSE

GPL v3.

For more information:

http://www.gnu.org/licenses/gpl-3.0.html

About

Efficient lossless compression of genomic sequences

Topics

Resources

License

Stars

Watchers

Forks

Packages

Contributors

AltStyle によって変換されたページ (->オリジナル) /