seqmagick: sequence manipulation

Guangchuang Yu

Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University
guangchuangyu@gmail.com

2024-01-09

Download sequences

Genbank

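library(seqmagick)
library(magrittr)  # provides the %>% pipe used below

## download the same accession as a GenBank record and as FASTA (needs network access)
tmpgb <- tempfile(fileext = '.gb')
tmpfa <- tempfile(fileext = '.fa')
download_genbank(acc='AB115403', format='genbank', outfile=tmpgb)
download_genbank(acc='AB115403', format='fasta', outfile=tmpfa)
## readLines(tmpgb)[1:10]
## readLines(tmpfa)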

File conversion

fasta and phylip conversion

fa_file <- system.file("extdata/HA.fas", package="seqmagick")
## use the small subset to save compilation time of the vignette
fa2 <- tempfile(fileext = '.fa')
fa_read(fa_file) %>%
    bs_filter('ATGAAAGTAAAA', by='sequence') %>%
    fa_write(fa2, type='interleaved')

alnfas <- tempfile(fileext = ".fas")
fa_read(fa2) %>% bs_aln(quiet=TRUE) %>% fa_write(alnfas)

## phylip format is only for aligned sequences
tmpphy <- tempfile(fileext = ".phy")
fas2phy(alnfas, tmpphy, type = 'sequential')

seqmagick supports both sequential and interleaved formats; choose between them with the type parameter.

phy2fas(tmpphy, alnfas, type = 'interleaved')
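To make the difference between the two layouts concrete, here is a minimal base-R sketch that prints a toy alignment both ways. The helper toy_phylip() and the three sequences are hypothetical and not part of seqmagick; the output only approximates a PHYLIP-style file and is not guaranteed to match what fas2phy writes.

```r
# Toy alignment: three aligned sequences of equal length (made-up data).
aln <- c(seq_a = "ATGAAAGTAAAACTG",
         seq_b = "ATGAAAGTAAAATTG",
         seq_c = "ATGAAGGTAAAACTG")

# Hypothetical helper (not a seqmagick function): render a PHYLIP-like layout.
toy_phylip <- function(x, type = c("sequential", "interleaved"), width = 5) {
  type <- match.arg(type)
  stopifnot(length(unique(nchar(x))) == 1)      # sequences must be aligned
  header <- sprintf(" %d %d", length(x), nchar(x[[1]]))
  labels <- sprintf("%-10s", names(x))          # names padded to 10 characters
  if (type == "sequential") {
    # one line per sequence: label followed by the full sequence
    return(c(header, paste0(labels, x)))
  }
  # interleaved: cut the alignment into blocks of `width` columns;
  # only the first block carries the labels
  starts <- seq(1, nchar(x[[1]]), by = width)
  blocks <- lapply(seq_along(starts), function(i) {
    chunk  <- substr(x, starts[i], starts[i] + width - 1)
    prefix <- if (i == 1) labels else strrep(" ", 10)
    c(paste0(prefix, chunk), "")                # blank line after each block
  })
  c(header, head(unlist(blocks), -1))           # drop the trailing blank line
}

seq_out <- toy_phylip(aln, "sequential")
int_out <- toy_phylip(aln, "interleaved")
writeLines(seq_out)
writeLines(int_out)
```

In the sequential layout each sequence sits on a single line, while the interleaved layout spreads every sequence across several blocks of columns.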

interleaved and sequential format conversion

tmpfas <- tempfile(fileext='.fa')
fa_read(fa2) %>% fa_write(tmpfas, type="sequential")
tmpphy2 <- tempfile(fileext = '.phy')
phy_read(tmpphy) %>% phy_write(tmpphy2, type="interleaved")

Sequence manipulation

bs <- fa_read(fa_file)
bs_filter(bs, 'ATGAAAGTAAAA', by='sequence')
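As a rough illustration of what filtering by sequence means, the base-R sketch below keeps only the toy sequences that contain the motif. It assumes a plain substring match; whether bs_filter treats the pattern as a literal string or as a regular expression is not stated here, and the sequences are made up rather than taken from HA.fas.

```r
# Toy sequences (made-up data, not taken from HA.fas)
seqs <- c(s1 = "CCATGAAAGTAAAACT",
          s2 = "CCATGAAGGTAAAACT",
          s3 = "ATGAAAGTAAAATTGG")

# Assumption: keep sequences that contain the motif as a literal substring
keep <- grepl("ATGAAAGTAAAA", seqs, fixed = TRUE)
filtered <- seqs[keep]
names(filtered)
```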
 
aln <- bs_filter(bs, 'ATGAAAGTAAAA', by='sequence') %>% bs_aln(quiet=TRUE)

bs_consensus(aln)
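For intuition about what a consensus sequence is, here is a minimal base-R sketch of a majority-rule consensus over a toy alignment. It is an assumption-laden illustration: the data are made up, and bs_consensus may handle ties, gaps, and ambiguity codes differently.

```r
# Toy alignment (made-up data): rows are sequences, all of equal length
toy_aln <- c("ATGAAAGTAAAACTG",
             "ATGAAAGTAAAATTG",
             "ATGAAGGTAAAACTG")

# Character matrix: one row per sequence, one column per alignment position
mat <- do.call(rbind, strsplit(toy_aln, ""))

# Most frequent character in each column; which.max() returns the first
# maximum, so ties go to whichever character table() lists first
majority <- apply(mat, 2, function(col) names(which.max(table(col))))
consensus <- paste(majority, collapse = "")
consensus
```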

Bugs/Feature requests

If you find a bug or have a feature request, please let me know. Thanks!

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=C 
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8 
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C 
## [5] LC_TIME=Chinese (Simplified)_China.utf8 
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods 
## [8] base 
## 
## other attached packages:
## [1] seqmagick_0.1.7 Biostrings_2.70.1 GenomeInfoDb_1.38.1
## [4] XVector_0.42.0 IRanges_2.36.0 S4Vectors_0.40.2 
## [7] BiocGenerics_0.48.1 magrittr_2.0.3 
## 
## loaded via a namespace (and not attached):
## [1] crayon_1.5.2 cli_3.6.1 knitr_1.45 
## [4] rlang_1.1.2 xfun_0.41 jsonlite_1.8.7 
## [7] RCurl_1.98-1.13 htmltools_0.5.7 sass_0.4.7 
## [10] rmarkdown_2.25 evaluate_0.23 jquerylib_0.1.4 
## [13] prettydoc_0.4.1 bitops_1.0-7 fastmap_1.1.1 
## [16] yaml_2.3.7 lifecycle_1.0.4 memoise_2.0.1 
## [19] compiler_4.3.2 fs_1.6.3 digest_0.6.33 
## [22] R6_2.5.1 GenomeInfoDbData_1.2.11 bslib_0.6.0 
## [25] tools_4.3.2 zlibbioc_1.48.0 yulab.utils_0.1.3 
## [28] cachem_1.0.8
