GitHub - cobilab/jarvis: Efficient lossless compression of genomic sequences


Efficient lossless compression of genomic sequences

As a compression tool, JARVIS provides additional compression gains over GeCo [ https://github.com/pratas/geco ] and GeCo2 [ https://github.com/pratas/geco2 ], at the cost of slightly higher computational resources (RAM and processing time). JARVIS supports only reference-free compression. The core of the JARVIS method is a competitive prediction between two classes of models: weighted stochastic repeat models and weighted context models. The latter can be either regular context models or substitution-tolerant context models.
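The competitive idea can be illustrated with a toy sketch in Python. It is not the JARVIS implementation: the function name `competitive_cost`, the history length `k`, and the majority-vote selector are assumptions made purely for illustration. At each position two candidate predictions are available (one per model class), and a selector that sees only past outcomes decides which class to trust, so a decoder could reproduce the same choices.

```python
import math
from collections import defaultdict


def competitive_cost(probs_a, probs_b, k=2):
    """Toy competitive prediction between two model classes.

    probs_a[i] and probs_b[i] are the probabilities that class A and
    class B assigned to the symbol that actually occurred at position i.
    The selector keeps, for each context of the last k winners, how often
    each class won next, and picks the class with more wins (ties go to A).
    Returns the total cost in bits of the selected predictions.
    """
    wins_by_context = defaultdict(lambda: [0, 0])
    history = (0,) * k  # winners of the previous k positions (hypothetical start)
    total_bits = 0.0
    for pa, pb in zip(probs_a, probs_b):
        wins = wins_by_context[history]
        choice = 0 if wins[0] >= wins[1] else 1  # decided before seeing the symbol
        total_bits += -math.log2(pa if choice == 0 else pb)
        winner = 0 if pa >= pb else 1  # known only after the symbol is coded
        wins[winner] += 1
        history = history[1:] + (winner,)
    return total_bits
```

In this toy version the selector needs a few positions to learn that one class is consistently better, after which it follows that class. The actual selection mechanism is described in the paper listed under CITATION.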

INSTALLATION

Conda

Install Miniconda, then run the following:

conda install -y -c bioconda jarvis

Unix

git clone https://github.com/pratas/jarvis.git
cd jarvis/src/
make

EXECUTION

Run JARVIS

Run JARVIS using level 3:

./JARVIS -v -l 3 File.seq
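When JARVIS is driven from a script, the command line can be assembled programmatically. The following is a minimal Python sketch, assuming the binary sits in the current directory as `./JARVIS`; the helper names `compress_command` and `decompress_command` are hypothetical. Only the `-v`, `-l`, and `-d` flags and the `.jc` extension, all of which appear in the tool's own usage text, are used.

```python
def compress_command(seq_path, level=3, binary="./JARVIS", verbose=True):
    """Build the argument list for compressing seq_path at a given level."""
    cmd = [binary]
    if verbose:
        cmd.append("-v")
    # The input file must be the last argument.
    cmd += ["-l", str(level), str(seq_path)]
    return cmd


def decompress_command(compressed_path, binary="./JARVIS", verbose=True):
    """Build the argument list for decompressing a .jc file."""
    cmd = [binary]
    if verbose:
        cmd.append("-v")
    cmd += ["-d", str(compressed_path)]
    return cmd
```

The returned lists can be passed to `subprocess.run(cmd, check=True)` from the Python standard library; passing a list rather than a single string avoids shell quoting problems with unusual file names.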

PARAMETERS

To see the possible options, run either of the following:

./JARVIS

./JARVIS -h

This will print the following options:

SYNOPSIS 
 ./JARVIS [OPTION]... [FILE] 
 
SAMPLE 
 Run Compression : ./JARVIS -v -l 4 sequence.txt 
 Run Decompression : ./JARVIS -v -d sequence.txt.jc 
 
DESCRIPTION 
 Compress and decompress lossless genomic sequences for 
 storage and analysis purposes. 
 Measure an upper bound of the sequence entropy. 
 
 -h, --help 
 usage guide (help menu). 
 
 -V, --version 
 Display program and version information. 
 
 -F, --force 
 force mode. Overwrites old files. 
 
 -v, --verbose 
 verbose mode (more information). 
 
 -s, --show-levels 
 show pre-computed compression levels (configured). 
 
 -l [NUMBER], --level [NUMBER] 
 Compression level (integer). 
 Default level: 1. 
 It defines compressibility in balance with computational 
 resources (RAM & time). Use -s for levels perception. 
 
 -cm [NB_C]:[NB_D]:[NB_I]:[NB_G]/[NB_S]:[NB_E]:[NB_I]:[NB_A] 
 Template of a context model. 
 Parameters: 
 [NB_C]: (integer [1;20]) order size of the regular context 
 model. Higher values use more RAM but, usually, are 
 related to a better compression score. 
 [NB_D]: (integer [1;5000]) denominator to build alpha, which 
 is a parameter estimator. Alpha is given by 1/[NB_D].
 Higher values are usually used with higher [NB_C], 
 and related to confident bets. When [NB_D] is one,
 the probabilities assume a Laplacian distribution. 
 [NB_I]: (integer {0,1}) number to define if a sub-program 
 which addresses the specific properties of DNA 
 sequences (Inverted repeats) is used or not. The 
 number 1 turns ON the sub-program using at the same 
 time the regular context model. The number 0 does 
 not enable it (Inverted repeats OFF). The
 use of this sub-program increases the necessary time 
 to compress but it does not affect the RAM. 
 [NB_G]: (real [0;1)) real number to define gamma. This value 
 represents the decay (forgetting) factor of the
 regular context model in definition. 
 [NB_S]: (integer [0;20]) maximum number of editions allowed 
 to use a substitutional tolerant model with the same 
 memory model of the regular context model with 
 order size equal to [NB_C]. The value 0 stands for 
 turning the tolerant context model off. When the 
 model is on, it pauses when the number of editions 
 is higher than [NB_C], while it is turned on when
 a complete match of size [NB_C] is seen again. This
 is a probabilistic-algorithmic model, very useful to
 handle the high substitutional nature of genomic
 sequences. When [NB_S] > 0, the compressor uses more
 processing time, but uses the same RAM and, usually,
 achieves a substantially higher compression ratio. The
 impact of this model is usually only noticed for 
 [NB_C] >= 14. 
 [NB_R]: (integer {0,1}) number to define if a sub-program 
 which addresses the specific properties of DNA 
 sequences (Inverted repeats) is used or not. It is 
 similar to the [NB_I] but for tolerant models.
 [NB_E]: (integer [1;5000]) denominator to build alpha for 
 substitutional tolerant context model. It is 
 analogous to [NB_D], however to be only used in the 
 probabilistic model for computing the statistics of 
 the substitutional tolerant context model. 
 [NB_A]: (real [0;1)) real number to define gamma. This value 
 represents the decay (forgetting) factor of the
 substitutional tolerant context model in definition.
 Its definition and use is analogous to [NB_G].
 
 ... (you may use several context models) 
 
 
 -rm [NB_R]:[NB_C]:[NB_A]:[NB_B]:[NB_L]:[NB_G]:[NB_I] 
 Template of a repeat model. 
 Parameters: 
 [NB_R]: (integer [1;10000]) maximum number of repeat models
 for the class. On very repetitive sequences the RAM
 increases along with this value, however it also 
 improves the compression capability. 
 [NB_C]: (integer [1;20]) order size of the repeat context 
 model. Higher values use more RAM but, usually, are 
 related to a better compression score. 
 [NB_A]: (real (0;1]) alpha is a real value, which is a 
 parameter estimator. Higher values are usually used 
 in lower [NB_C]. When [NB_A] is one, the 
 probabilities assume a Laplacian distribution. 
 [NB_B]: (real (0;1]) beta is a real value, which is a 
 parameter for discarding or maintaining a certain 
 repeat model. 
 [NB_L]: (integer (1;20]) a limit threshold to play with 
 [NB_B]. It accepts or not a certain repeat model. 
 [NB_G]: (real [0;1)) real number to define gamma. This value 
 represents the decay (forgetting) factor of the
 regular context model in definition. 
 [NB_I]: (integer {0,1}) number to define if a sub-program 
 which addresses the specific properties of DNA 
 sequences (Inverted repeats) is used or not. The 
 number 1 turns ON the sub-program using at the same 
 time the regular context model. The number 0 does 
 not enable it (Inverted repeats OFF). The
 use of this sub-program increases the necessary time 
 to compress but it does not affect the RAM. 
 
 -z [NUMBER], --selection [NUMBER] 
 Size of the context selection model (integer). 
 Default context selection: 12. 
 
 [FILE] 
 Input sequence filename (to compress) -- MANDATORY. 
 File to compress is the last argument.
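To make the `-cm` template easier to read, the sketch below splits a template string into named fields, following the positional order shown in the synopsis line, and derives alpha as 1/[NB_D]. It is an illustration in Python, not part of JARVIS: the names `CMTemplate`, `parse_cm`, and `estimate` are hypothetical, and the example string used with it is an arbitrary value, not one of the pre-computed levels. The `estimate` function uses the usual additive-smoothing estimator for a four-symbol alphabet, which is an assumption here (the help text only states that alpha = 1 corresponds to the Laplacian case).

```python
from dataclasses import dataclass


@dataclass
class CMTemplate:
    order: int                 # [NB_C]
    alpha_den: int             # [NB_D]
    inverted_repeats: int      # [NB_I]
    gamma: float               # [NB_G]
    max_subs: int              # [NB_S]
    tol_alpha_den: int         # [NB_E]
    tol_inverted_repeats: int  # third field after the slash
    tol_gamma: float           # [NB_A]

    @property
    def alpha(self) -> float:
        """Alpha is given by 1/[NB_D]."""
        return 1.0 / self.alpha_den


def parse_cm(template: str) -> CMTemplate:
    """Parse 'C:D:I:G/S:E:I:A' and check the ranges listed in the help text."""
    regular, tolerant = template.split("/")
    c, d, i, g = regular.split(":")
    s, e, r, a = tolerant.split(":")
    cm = CMTemplate(int(c), int(d), int(i), float(g),
                    int(s), int(e), int(r), float(a))
    if not 1 <= cm.order <= 20:
        raise ValueError("[NB_C] must be in [1;20]")
    if not 1 <= cm.alpha_den <= 5000:
        raise ValueError("[NB_D] must be in [1;5000]")
    if cm.inverted_repeats not in (0, 1):
        raise ValueError("[NB_I] must be 0 or 1")
    if not 0.0 <= cm.gamma < 1.0:
        raise ValueError("[NB_G] must be in [0;1)")
    if not 0 <= cm.max_subs <= 20:
        raise ValueError("[NB_S] must be in [0;20]")
    return cm


def estimate(symbol_count, context_count, alpha, alphabet_size=4):
    """Assumed additive-smoothing estimate of P(symbol | context)."""
    return (symbol_count + alpha) / (context_count + alphabet_size * alpha)
```

For instance, `parse_cm("12:50:1:0.9/2:10:1:0.9").alpha` gives 1/50. With a small alpha the estimate stays close to the observed counts, which matches the help text's remark that high [NB_D] values correspond to confident bets.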

If you are not interested in setting the template for each model, then use the levels mode. To see the possible levels type:

./JARVIS -s

CITATION

If you use this software or method, please cite:

Pratas D, Hosseini M, Silva JM, Pinho AJ. A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models. Entropy. 2019 Nov;21(11):1074.

ISSUES

For any issue, let us know through the repository's issue tracker.

LICENSE

GPL v3.

For more information:

http://www.gnu.org/licenses/gpl-3.0.html
