Next-Gen Sequencing

Nov 19, 2008

Metagenomics of the effects of antibiotics on the human gut

The Pervasive Effects of an Antibiotic on the Human Gut Microbiota, as Revealed by Deep 16S rRNA Sequencing

Dethlefsen L, Huse S, Sogin ML, Relman DA
PLoS Biology Vol. 6, No. 11, e280 doi:10.1371/journal.pbio.0060280

A paper in PLoS Biology from the Relman lab investigates how a course of the antibiotic ciprofloxacin affects the bacteria in the human intestine. The authors collected over 7,000 full-length 16S rDNA sequences (1100-1400 bp) by Sanger sequencing and over 900,000 reads (~250 bp) from 454 sequencing of the V3 and V6 regions.

There are many important results in this paper, but it is particularly relevant that 454 sequencing reveals more taxonomic variation with greater stability than traditional sequencing. In my own work, I have found that sequence variants that occur only once in the experiment cannot be used to differentiate samples. Deep sequencing reveals more taxa, and also reduces the frequency of singletons. A rare sequence variant (OTU) that occurs only once in the ~7000 full-length sequences occurs about 65 times in the 454 data set, providing more than enough "probability of detection" to be used for comparisons between samples.
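To make the "probability of detection" argument concrete, here is a rough sketch in Python. It assumes reads are drawn independently from the community (simple binomial sampling), which ignores PCR and primer bias. The function name and the 450,000-read depth are made up for illustration and are not values from the paper.

```python
def detection_probability(rel_abundance: float, n_reads: int) -> float:
    """Chance of seeing an OTU at least once in n_reads independent draws."""
    return 1.0 - (1.0 - rel_abundance) ** n_reads

# An OTU seen once among ~7,000 clones has an estimated abundance of ~1/7000.
p = 1.0 / 7000

# 450_000 is a hypothetical deep-sequencing depth, chosen only for contrast.
for depth in (7_000, 450_000):
    expected = p * depth
    print(f"depth={depth:>7}: expected copies={expected:6.1f}, "
          f"P(detect)={detection_probability(p, depth):.3f}")
```

At the shallow depth the OTU is missed in roughly one experiment out of three, which is why singletons are so unreliable for comparing samples.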

"This set of 7,208 sequences is among the largest datasets of full-length 16S rRNA sequences from the human microbiota (or any environment), the rarefaction curves for V6 and V3 tag pyrosequencing eventually rise higher and display more curvature toward the horizontal than the OTU0.01 curve. These features show that a single run of the [454] FLX sequencer targeting V6 or V3 tags from the human gut microbiota can reveal more taxa, and capture a larger proportion of the detectable taxa, than a more extensive effort directed toward full-length 16S rRNA clone sequencing."

[Image: journal-pbio-0060280-g003]
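Rarefaction curves like the ones discussed in the quote can be computed directly from a table of OTU counts. As a rough sketch (not the authors' code), the Python function below uses the standard analytic rarefaction formula: the expected number of OTUs seen in a random subsample of n reads drawn without replacement. The `rarefy` name and the toy counts are invented for illustration.

```python
from math import comb

def rarefy(otu_counts, n):
    """Expected number of OTUs in a random subsample of n reads (no replacement)."""
    total = sum(otu_counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total read count")
    all_draws = comb(total, n)
    # An OTU with c reads is missed when all n reads come from the other total - c.
    return sum(1 - comb(total - c, n) / all_draws for c in otu_counts)

# Hypothetical toy community: two common OTUs and a tail of rare ones.
counts = [50, 30, 10, 5, 2, 1, 1, 1]
for n in (1, 10, 25, 50, 100):
    print(f"{n:>4} reads -> {rarefy(counts, n):.2f} expected OTUs")
```

A curve that is still climbing at the full sample size means more sampling would keep finding new taxa; a curve that bends toward the horizontal means most detectable taxa have been captured.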

Posted by Health & Wisdom at 10:04 AM 7 comments:

Nov 12, 2008

CisGenome: new software for ChIP-Seq

CisGenome - just published in Nov. Nature Biotechnology.

An integrated software system for analyzing ChIP-chip and ChIP-seq data.
Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH.
Nat Biotechnol. 2008 Nov;26(11):1293-300.

A full-function integrated bioinformatics suite for ChIP-chip and ChIP-Seq, including peak-finding, FDR control for single samples, subtraction of a control lane, visualization and annotation of peaks on known genomes, and motif finding. Functional GUI on Windows and Mac. Wow.

Software website here: CisGenome

http://www.biostat.jhsph.edu/~hji/cisgenome/index.htm

Abstract:

We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP–microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically for ChIP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions.

Posted by Health & Wisdom at 12:51 PM 14 comments:

Oct 28, 2008

Gene-Boosted Assembly

Steven Salzberg describes a method for de novo assembly of a bacterial genome (Pseudomonas aeruginosa strain PAb1, 6.2 Mb) from a set of 33 bp Solexa reads, using two closely related strains as reference sequences and "boosting" the assembly with predicted protein-coding regions.

PLOS Computational Biology 4(9), Sept 26, 2008

Salzberg SL, Sommer DD, Puiu D, Lee VT (2008) Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads. PLoS Comput Biol 4(9): e1000186. doi:10.1371/journal.pcbi.1000186

The AMOS assembler used in this project employs several different software modules and a considerable amount of hands-on effort.

AMOScmp is a comparative alignment tool - it aligns short reads to a similar reference genome, and then builds contigs. This avoids the challenge of all-vs-all assembly for de novo genome sequencing projects.

Minimus is a highly stringent assembler that uses Smith-Waterman alignments to identify overlaps between reads.
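For readers who have not met Smith-Waterman before, here is a minimal sketch of the local alignment score in Python. It is only an illustration of the dynamic programming idea, not Minimus's actual overlap code; the scoring values and the simple linear gap penalty are arbitrary choices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Two made-up reads that share a 10-base overlap (AACTTGCATG).
read1 = "GTTAGATTTTGTGTAACTTGCATG"
read2 = "AACTTGCATGTAATGTTAAAA"
print(smith_waterman(read1, read2))
```

A stringent assembler would only accept an overlap when the score is close to the maximum possible for the overlap length, that is, when there are few or no mismatches.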

Contigs were then scanned for protein coding sequences using a combination of Glimmer and BLAST. The ABBA program uses protein coding information, especially at the ends of contigs and singletons, to close gaps.

Velvet was also used to independently assemble all the reads into contigs, then MUMmer was used to combine contigs and fill gaps.

==================

This method is not going to work for every de novo sequencing problem, but we are going to try something similar for some new Plasmodium and Trichomonas species.

All software from the Salzberg lab at the Univ. of Maryland is freely available here:

http://cbcb.umd.edu/software/

and a page describing the Short Read Assembly methods here:

http://www.cbcb.umd.edu/research/SR-assembly.shtml

Posted by Health & Wisdom at 9:25 AM 8 comments:

Oct 20, 2008

Public ChIP-Seq Data

Here are some ChIP-Seq data sets that have been published and are available in the public domain.

Broad Institute

NHLBI

Jothi et al, - Site Identification from Short Sequence Reads

Barski et al - High-Resolution Profiling of Histone Methylations

Valouev et al, Sidow lab @ Stanford,

sample data to validate QuEST software

Robertson et al, 2007, Nature Methods 4(8) 651-7.

Eland processed sequence reads and FindPeaks output for Stat1 and FoxA2 transcription factors

NCBI GEO

NCBI Short Read Archive

Posted by Health & Wisdom at 4:08 PM 9 comments:

File Formats

What is it with bioinformatics people and file formats?!

Why is it so bloody hard to produce and agree on a single standard to represent sequence data (with quality scores) and a standard for sequence reads aligned on a reference genome? With so many formats, we are all spending exponential amounts of time writing converters between all possible combinations.

Here are some of the file formats that I've dealt with in the past couple of weeks:

SEQUENCE FORMATS

FASTQ

Sequence plus Phred quality scores, each encoded as a single ASCII character

@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-
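Decoding the quality line is a one-liner once you know the offset. The sketch below assumes the simple four-line FASTQ layout and the standard Phred+33 encoding, where the character '!' means quality 0. The function names are my own, and blank lines between the four parts are tolerated because pasted examples often contain them.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes four-line records; blank lines are skipped and an incomplete
    trailing record is silently dropped.
    """
    clean = (ln.strip() for ln in lines if ln.strip())
    for header, seq, plus, qual in zip(clean, clean, clean, clean):
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual

def phred_scores(quality, offset=33):
    """Turn a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]

record = """@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-
"""

for name, seq, qual in read_fastq(record.splitlines()):
    print(name, len(seq), phred_scores(qual)[:6])
```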

Solexa/Illumina FASTQ like thing...

s_*_sequence.txt

@HWI-EAS305_3-30gf5aaxx:8:1:415:1852
GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA
+HWI-EAS305_3-30gf5aaxx:8:1:415:1852
YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU
@HWI-EAS305_3-30gf5aaxx:8:1:187:1286
GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT
+HWI-EAS305_3-30gf5aaxx:8:1:187:1286
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV
@HWI-EAS305_3-30gf5aaxx:8:1:202:440
GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA
+HWI-EAS305_3-30gf5aaxx:8:1:202:440
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVUVV
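The quality strings above use letters such as Y and V rather than the punctuation seen in standard FASTQ, which points to a different ASCII offset. The sketch below assumes these are Solexa-scaled scores stored at offset 64, as early Illumina pipelines wrote them. That is an assumption about these particular files, so check your pipeline version before relying on it. The Solexa-to-Phred conversion itself is the usual formula, Q_phred = 10 * log10(10^(Q_solexa/10) + 1).

```python
import math

def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality value to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

def solexa64_to_sanger(quality):
    """Re-encode an offset-64 Solexa quality string as Phred+33 text."""
    out = []
    for ch in quality:
        q = solexa_to_phred(ord(ch) - 64)  # offset 64 is assumed, see above
        out.append(chr(int(round(q)) + 33))
    return "".join(out)

print(solexa64_to_sanger("YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU"))
```

For high scores the two scales are nearly identical; they only diverge noticeably below about 10, so mixing them up mostly distorts the low-quality bases.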

s_*_eland_extended.txt

Solexa output format from Eland extended

>HWI-EAS305_3-30gf5aaxx:8:1:63:487 GGAGGTAGAGGTATATGGCAAGAAAACTGAAAATC NM -
>HWI-EAS305_3-30gf5aaxx:8:1:415:1852 GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA 3:1:0 chr14.fa:35121238F35,35121282F35,35121326F32T1T,35121354F4T30
>HWI-EAS305_3-30gf5aaxx:8:1:187:1286 GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT 0:4:5 chr6.fa:103599157R16C17A,chr2.fa:98502709R16C18,98502829R6A9C18,98505080F4AC29,98505200F1A14C18,98505320F16C18,98506416R16C13C2CA,98506537R16C18,chrX.fa:139917587R16C2A13CA
>HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA 3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35
>HWI-EAS305_3-30gf5aaxx:8:1:359:505 TATTCAATTTACATACTCTGGCTTTGCCAACATTT 1:0:0 chr9.fa:31339651R35
>HWI-EAS305_3-30gf5aaxx:8:1:1290:135 TTGATTGTATAGTAGGGGTGAAATGGAATTTTATC 1:0:1 chrM.fa:14790R35
>HWI-EAS305_3-30gf5aaxx:8:1:627:596 GTGATTTTGAAAGTTGTAGATTGTGTGTTTGTGAT NM -
>HWI-EAS305_3-30gf5aaxx:8:1:379:298 GACGTGAAATATGGCGAGGAAAACTGAAAAAGGTG 31:56:28 -
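The fourth column packs several hits into one string: a chromosome name is given once and then carried over to the following positions until a new name appears. Here is a minimal parsing sketch in Python based only on the sample lines above. The field meanings are inferred from that sample rather than from Illumina documentation, and the function and key names are my own.

```python
import re

# One hit looks like "chr14.fa:35121238F35" or, with the chromosome
# carried over from the previous hit, just "35121282F35".
HIT = re.compile(r"^(?:(?P<chrom>[^:,]+):)?(?P<pos>\d+)(?P<strand>[FR])(?P<desc>\w*)$")

def parse_eland_extended(line):
    """Split one eland_extended line into name, sequence, counts and hits."""
    fields = line.split()
    name, seq, counts = fields[0].lstrip(">"), fields[1], fields[2]
    hits = []
    chrom = None
    if len(fields) > 3 and fields[3] != "-":
        for token in fields[3].split(","):
            m = HIT.match(token)
            if m is None:
                raise ValueError(f"unrecognised hit {token!r}")
            chrom = m.group("chrom") or chrom
            hits.append((chrom, int(m.group("pos")), m.group("strand"), m.group("desc")))
    # Codes such as "NM" carry no colon-separated counts.
    match_counts = tuple(int(c) for c in counts.split(":")) if ":" in counts else None
    return {"name": name, "seq": seq, "match_counts": match_counts, "hits": hits}

line = (">HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA "
        "3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35")
for hit in parse_eland_extended(line)["hits"]:
    print(hit)
```

Real files are tab-delimited; splitting on any whitespace is used here only so that the pasted sample parses as shown.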

s_*_eland_multi.txt

Solexa output format from Eland multi

>HWI-EAS305_3-30gf5aaxx:8:1:414:208 GTAAACTATCAATAAAATAATTTGTTACTCTGTAT 20:7:0
>HWI-EAS305_3-30gf5aaxx:8:1:59:857 TAAATTGTCCACCTTTTTCAGTTTTCCTCGCTATA 0:0:35
>HWI-EAS305_3-30gf5aaxx:8:1:1414:307 GAGAAAACTGTAAATAAAGGTAAATGAGAAAAAAA NM
>HWI-EAS305_3-30gf5aaxx:8:1:330:1758 GGTAAAGTCCACTAAGGAAAAGAAAGAAACAATGT 1:0:0 chr7.fa:97764095R0
>HWI-EAS305_3-30gf5aaxx:8:1:576:127 GAAGTCAATCTTATGAGTTATTAGGATGGCTACTC 0:7:255 chr7.fa:111867683F1,chr12.fa:51788781R1,115833262F1,chr6.fa:21403822R1,89734675R1,89780759R1,chrX.fa:15525553R1
>HWI-EAS305_3-30gf5aaxx:8:1:88:1045 GTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCC 66:110:72
>HWI-EAS305_3-30gf5aaxx:8:1:939:613 TACTTTACTTTCTAGGGAATGTTCACTTCTAAGTG 1:0:0 chr1.fa:150051845R0

s_*_sorted.txt

filtered eland_extended alignments w/ quality scores and genome positions

HWI-EAS305 3-30gf5aaxx 8 66 580 1584 AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU chr10.fa 3001045 F 35 11
HWI-EAS305 3-30gf5aaxx 8 100 534 1062 ATTTTCAGGTTGGAGTGACTCGCTAAAACAGCCAA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV chr10.fa 3002892 R 35 29
HWI-EAS305 3-30gf5aaxx 8 59 199 495 CCACATGCTGTGGCAAAGCCCTTCTGAGCGGGGCG YYYYTYYYYYYYYYYYRYYYYYYYYYYYYYTVUVV chr10.fa 3008958 F 34A 20
HWI-EAS305 3-30gf5aaxx 8 76 779 1406 AGATGTACAAATGCTCCTCAGATGTTTGTGTCATA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV chr10.fa 3009290 F 35 3
HWI-EAS305 3-30gf5aaxx 8 83 547 1480 ATCCAAACAGTTACACAAAGTTTTGAGAACATTAT YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV

GENOME ALIGNMENT FORMATS

SGA ('Simplified' Genome Annotation)

GFF (General Feature Format)

UCSC Genome Browser

Sanger

EXAMPLE:

track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2

FPS (Functional Position Set)

Native format for Eukaryotic Promoter Database

EXAMPLE:

FP Pv snRNA U1 :+S EM:J03563.1 1+ 352; 17001.098
FP Ath snRNA U2.5 :+S EM:AL353994.1 1- 73709; 24016.116
FP Ath snRNA U5 :+S EM:X13012.1 1+ 678; 23040.
FP Ta histone H3 :+S EM:X00937.1 1+ 186; 07001.

WIG (Wiggle)

UCSC Genome Browser track format

EXAMPLE

track type=wiggle_0 name="Bed Format" description="BED format" \
visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

BED

UCSC Genome Browser

Example:
Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
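One reason the converters are tedious to write is that the formats disagree about coordinates: GFF positions are 1-based and inclusive, while BED positions are 0-based and half-open. Here is a small sketch of a GFF-to-BED converter in Python, run on the GFF example shown earlier. The function name is made up, and the choice to put the GFF group field into the BED name column is my own assumption about what is useful.

```python
def gff_to_bed(line):
    """Convert one GFF feature line to a six-column BED line."""
    seqname, _source, _feature, start, end, score, strand, _frame, group = line.split(None, 8)
    bed_start = int(start) - 1  # GFF is 1-based inclusive; BED is 0-based half-open
    if score == ".":
        score = "0"  # BED expects a number in the score column
    return "\t".join([seqname, str(bed_start), end, group.strip(), score, strand])

gff_lines = [
    'track name=regulatory description="TeleGene(tm) Regulatory Regions"',
    "chr22 TeleGene enhancer 1000000 1001000 500 + . touch1",
    "chr22 TeleGene promoter 1010000 1010100 900 + . touch1",
]

for line in gff_lines:
    if line.startswith(("track", "browser", "#")):
        continue  # header lines carry no features
    print(gff_to_bed(line))
```

Only the start coordinate changes; the end stays the same because an inclusive 1-based end and an exclusive 0-based end are the same number. Real GFF files are tab-delimited, so a production converter should split on tabs rather than on any whitespace.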

ALN

Alignment format for CisGenome


chr1[tab]359077[tab]F
chr1[tab]376890[tab]R

….

column1 = chromosome where the read is aligned;
column2 = coordinate where the read is aligned;
column3 = ‘F’ or ‘+’: if the read is aligned to the forward strand of the genome assembly;
 ‘R’ or ‘-’: if the read is aligned to the reverse complement strand of the genome.
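Since ALN needs only three columns, it is easy to produce from the s_*_sorted.txt files shown earlier. The sketch below is in Python and assumes the column order seen in that sample (chromosome, position and strand as the 9th, 10th and 11th fields). It also assumes that the ".fa" suffix should be dropped and that the two formats use the same coordinate convention. None of this is taken from the CisGenome documentation, so treat it as a starting point.

```python
def sorted_to_aln(line):
    """Convert one s_*_sorted.txt line to a CisGenome ALN line, or None."""
    fields = line.split()
    if len(fields) < 11:
        return None  # truncated or unaligned record
    chrom, pos, strand = fields[8], fields[9], fields[10]
    if strand not in ("F", "R"):
        return None
    if chrom.endswith(".fa"):
        chrom = chrom[:-3]  # assumed: ALN wants "chr10", not "chr10.fa"
    return f"{chrom}\t{int(pos)}\t{strand}"

sample = ("HWI-EAS305 3-30gf5aaxx 8 66 580 1584 "
          "AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA "
          "YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU chr10.fa 3001045 F 35 11")
print(sorted_to_aln(sample))
```

Real sorted files are tab-delimited and may contain empty fields, so splitting on tabs would be safer there than the whitespace split used for this pasted sample.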

Posted by Health & Wisdom at 2:38 PM 3 comments:

Stuart Brown
