Nov 19, 2008

Metagenomics of the effects of antibiotics on the human gut


Dethlefsen L, Huse S, Sogin ML, Relman DA
PLoS Biology Vol. 6, No. 11, e280 doi:10.1371/journal.pbio.0060280

A paper in PLoS Biology from the Relman lab investigates the effect of treatment with the antibiotic ciprofloxacin on the bacteria of the human intestine. The authors collected over 7,000 full-length 16S rDNA sequences (1100-1400 bp) by Sanger sequencing and over 900,000 reads (~250 bp) from 454 sequencing of the V3 and V6 regions.

There are many important results in this paper, but it is particularly relevant that 454 sequencing reveals more taxonomic variation with greater stability than traditional sequencing. In my own work, I have found that sequence variants that occur only once in the experiment cannot be used to differentiate samples. Deep sequencing reveals more taxa, and also reduces the frequency of singletons. A rare sequence variant (OTU) that occurs only once in the ~7000 full-length sequences occurs about 65 times in the 454 data set, providing more than enough "probability of detection" to be used for comparisons between samples. 
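To put rough numbers on that "probability of detection", here is a back-of-the-envelope sketch in Python. It assumes the ~900,000 tags are split about evenly between V3 and V6 (my assumption, so roughly 450,000 reads per region) and that reads are sampled independently, which real libraries only approximate. The function names are made up for this example.

```python
def expected_count(frequency, depth):
    """Expected number of reads from a variant with the given relative frequency."""
    return frequency * depth


def detection_probability(frequency, depth):
    """Chance of seeing the variant at least once in `depth` independent reads."""
    return 1.0 - (1.0 - frequency) ** depth


# A variant seen once in ~7,000 full-length clones has frequency ~1/7000.
rare = 1.0 / 7000

# Assumed depths: 7,000 Sanger clones vs. ~450,000 tags for one 454 region.
for label, depth in [("Sanger", 7000), ("454, one region", 450000)]:
    print(label,
          "expected reads: %.1f" % expected_count(rare, depth),
          "P(detect): %.3f" % detection_probability(rare, depth))
```

Under these assumptions the variant is expected about 64 times per 454 region, in line with the "about 65" figure above, while at Sanger depth there is a substantial chance of missing it entirely.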


"This set of 7,208 sequences is among the largest datasets of full-length 16S rRNA sequences from the human microbiota (or any environment), the rarefaction curves for V6 and V3 tag pyrosequencing eventually rise higher and display more curvature toward the horizontal than the OTU0.01 curve. These features show that a single run of the [454] FLX sequencer targeting V6 or V3 tags from the human gut microbiota can reveal more taxa, and capture a larger proportion of the detectable taxa, than a more extensive effort directed toward full-length 16S rRNA clone sequencing."



Nov 12, 2008

CisGenome: new software for ChIP-Seq

CisGenome - just published in Nov. Nature Biotechnology.
An integrated software system for analyzing ChIP-chip and ChIP-seq data.
Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH.
Nat Biotechnol. 2008 Nov;26(11):1293-300.

A full-function integrated bioinformatics suite for ChIP-chip and ChIP-Seq including peak-finding, FDR control for single samples, subtraction of control lane, visualization and annotation of peaks on known genomes, and Motif finding.  Functional GUI on Windows and Mac. Wow. 

Software website here:  CisGenome
http://www.biostat.jhsph.edu/~hji/cisgenome/index.htm

Abstract:
We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome
is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false
discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously
published ChIP–microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically
for ChIP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of
CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode
computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure,
conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of
ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative
control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions.

Oct 28, 2008

Gene-Boosted Assembly

Steven Salzberg describes a method for de novo assembly of a bacterial genome (Pseudomonas aeruginosa strain PAb1, 6.2 Mb) from a set of 33 bp Solexa reads, using two closely related strains as reference sequences and "boosting" the assembly with predicted protein coding regions.

Salzberg SL, Sommer DD, Puiu D, Lee VT (2008) Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads. PLoS Comput Biol 4(9): e1000186. doi:10.1371/journal.pcbi.1000186

The AMOS assembler used in this project employs several different software modules and a considerable amount of hands-on effort. 

AMOScmp is a comparative alignment tool - it aligns short reads to a similar reference genome, and then builds contigs. This avoids the challenge of all-vs-all assembly for de novo genome sequencing projects. 

Minimus is a highly stringent assembler that uses Smith-Waterman alignments to identify overlaps between reads.
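For anyone who has not looked at Smith-Waterman in a while, here is a minimal scoring-only sketch in Python. This is the textbook local alignment recurrence with a linear gap penalty, not the code Minimus actually runs, and the scoring values (match=2, mismatch=-1, gap=-2) are arbitrary choices of mine.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]              # column 0 is always 0 for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0,                  # start a new local alignment
                        diag,               # extend with a match or mismatch
                        prev[j] + gap,      # gap in b
                        curr[j - 1] + gap)  # gap in a
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two reads that overlap by "ACGT": the suffix of one matches the prefix of the other.
print(smith_waterman_score("TTACGT", "ACGTGG"))
```

A real overlap detector would also keep the traceback so it knows where the overlap starts and ends, and would require the alignment to reach the ends of the reads.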

Contigs were then scanned for protein coding sequences using a combination of Glimmer and BLAST. The ABBA program uses protein coding information, especially at the ends of contigs and singletons, to close gaps.

Velvet was also used to independently assemble all the reads into contigs, then MUMmer was used to combine contigs and fill gaps.

==================

This method is not going to work for every de novo sequencing problem, but we are going to try something similar for some new Plasmodium and Trichomonas species. 

All software from the Salzberg lab at the Univ. of Maryland is freely available here:

and a page describing the Short Read Assembly methods here:




Oct 20, 2008

Public ChIP-Seq Data

Here are some ChIP-Seq data sets that have been published and are in the public domain.



NHLBI

Valouev et al, Sidow lab @ Stanford, 

Robertson et al, 2007, Nature Methods  4(8) 651-7.
Eland processed sequence reads and FindPeaks output for Stat1 and FoxA2 transcription factors






File Formats

What is it with bioinformatics people and file formats?!

Why is it so bloody hard to produce and agree on a single standard to represent sequence data (with quality scores) and a standard for sequence reads aligned on a reference genome? With so many formats, we are all spending exponential amounts of time writing converters between all possible combinations. 

Here are some of the file formats that I've dealt with in the past couple of weeks:

SEQUENCE FORMATS

FASTQ: sequence plus Phred quality scores, each encoded as a single ASCII byte

@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-
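Decoding that quality line is a one-liner once you know the offset. This Python sketch assumes the standard Sanger convention (ASCII code minus 33 gives the Phred score), which fits the '!' at the start of the example; the function names are my own.

```python
def decode_phred33(quality_string):
    """Convert a Sanger-style FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(phred):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10.0)


scores = decode_phred33("!)))))****(((***%%")
print(scores)
print(["%.3f" % error_probability(q) for q in scores[:3]])
```

The '!' character decodes to Q0, meaning the base call carries no confidence at all.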


Solexa/Illumina FASTQ like thing...
s_*_sequence.txt
@HWI-EAS305_3-30gf5aaxx:8:1:415:1852
GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA
+HWI-EAS305_3-30gf5aaxx:8:1:415:1852
YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU
@HWI-EAS305_3-30gf5aaxx:8:1:187:1286
GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT
+HWI-EAS305_3-30gf5aaxx:8:1:187:1286
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV
@HWI-EAS305_3-30gf5aaxx:8:1:202:440
GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA
+HWI-EAS305_3-30gf5aaxx:8:1:202:440
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVUVV
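The catch with the Solexa flavor is the quality offset. The sketch below reads four-line records and subtracts 64 instead of 33, which is my understanding of the convention used by these files. Early versions of the pipeline also used a Solexa-scaled score that is not quite the same as Phred (the two are close at high quality), so treat the decoded numbers as approximate. The `read_fastq` name and the `offset` parameter are mine.

```python
def read_fastq(lines, offset=64):
    """Yield (name, sequence, scores) from an iterable of FASTQ lines.

    offset=64 is assumed for Solexa/Illumina files; use offset=33 for Sanger FASTQ.
    """
    # Drop blank lines and trailing whitespace so records stay aligned in groups of four.
    cleaned = [line.rstrip() for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for " + header)
        yield header[1:], seq, [ord(ch) - offset for ch in qual]


example = [
    "@read1",
    "ACGT",
    "+read1",
    "YVTU",
]
for name, seq, scores in read_fastq(example):
    print(name, seq, scores)
```

Reading one of these files with the Sanger offset of 33 would make every base look about 31 points better than it is, which is exactly the kind of silent error that makes the format zoo so painful.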

s_*_eland_extended.txt
Solexa output format from Eland extended
>HWI-EAS305_3-30gf5aaxx:8:1:63:487 GGAGGTAGAGGTATATGGCAAGAAAACTGAAAATC NM -
>HWI-EAS305_3-30gf5aaxx:8:1:415:1852 GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA 3:1:0 chr14.fa:35121238F35,35121282F35,35121326F32T1T,35121354F4T30
>HWI-EAS305_3-30gf5aaxx:8:1:187:1286 GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT 0:4:5 chr6.fa:103599157R16C17A,chr2.fa:98502709R16C18,98502829R6A9C18,98505080F4AC29,98505200F1A14C18,98505320F16C18,98506416R16C13C2CA,98506537R16C18,chrX.fa:139917587R16C2A13CA
>HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA 3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35
>HWI-EAS305_3-30gf5aaxx:8:1:359:505 TATTCAATTTACATACTCTGGCTTTGCCAACATTT 1:0:0 chr9.fa:31339651R35
>HWI-EAS305_3-30gf5aaxx:8:1:1290:135 TTGATTGTATAGTAGGGGTGAAATGGAATTTTATC 1:0:1 chrM.fa:14790R35
>HWI-EAS305_3-30gf5aaxx:8:1:627:596 GTGATTTTGAAAGTTGTAGATTGTGTGTTTGTGAT NM -
>HWI-EAS305_3-30gf5aaxx:8:1:379:298 GACGTGAAATATGGCGAGGAAAACTGAAAAAGGTG 31:56:28 -
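Here is a rough Python parser for one of these lines, based only on what can be read off the examples above: four whitespace-separated fields (read name, sequence, hit counts, hit list), where a hit without a "chrN.fa:" prefix inherits the chromosome of the previous hit, and "-" means no hits are listed. The mismatch descriptor after the strand letter is kept as an opaque string. The function name and the returned dictionary keys are mine.

```python
import re

# Optional "chrom:" prefix, then position, strand (F/R), then the match descriptor.
HIT_PATTERN = re.compile(r"^(?:(?P<chrom>[^:]+):)?(?P<pos>\d+)(?P<strand>[FR])(?P<desc>\S*)$")


def parse_eland_extended(line):
    """Parse one eland_extended line into a dict with a list of hit tuples."""
    fields = line.split()
    hit_field = fields[3] if len(fields) > 3 else "-"
    hits = []
    if hit_field != "-":
        chrom = None
        for token in hit_field.split(","):
            match = HIT_PATTERN.match(token)
            if match is None:
                raise ValueError("unrecognized hit: " + token)
            # A hit with no chromosome prefix reuses the most recent one.
            chrom = match.group("chrom") or chrom
            hits.append((chrom, int(match.group("pos")),
                         match.group("strand"), match.group("desc")))
    return {"name": fields[0].lstrip(">"), "sequence": fields[1],
            "counts": fields[2], "hits": hits}


line = (">HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA "
        "3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35")
for hit in parse_eland_extended(line)["hits"]:
    print(hit)
```

The chromosome carry-over is the part that is easy to miss: a naive split on commas loses the chromosome for every hit after the first.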


s_*_eland_multi.txt
Solexa output format from Eland multi
>HWI-EAS305_3-30gf5aaxx:8:1:414:208 GTAAACTATCAATAAAATAATTTGTTACTCTGTAT 20:7:0
>HWI-EAS305_3-30gf5aaxx:8:1:59:857 TAAATTGTCCACCTTTTTCAGTTTTCCTCGCTATA 0:0:35
>HWI-EAS305_3-30gf5aaxx:8:1:1414:307 GAGAAAACTGTAAATAAAGGTAAATGAGAAAAAAA NM
>HWI-EAS305_3-30gf5aaxx:8:1:330:1758 GGTAAAGTCCACTAAGGAAAAGAAAGAAACAATGT 1:0:0 chr7.fa:97764095R0
>HWI-EAS305_3-30gf5aaxx:8:1:576:127 GAAGTCAATCTTATGAGTTATTAGGATGGCTACTC 0:7:255 chr7.fa:111867683F1,chr12.fa:51788781R1,115833262F1,chr6.fa:21403822R1,89734675R1,89780759R1,chrX.fa:15525553R1
>HWI-EAS305_3-30gf5aaxx:8:1:88:1045 GTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCC 66:110:72
>HWI-EAS305_3-30gf5aaxx:8:1:939:613 TACTTTACTTTCTAGGGAATGTTCACTTCTAAGTG 1:0:0 chr1.fa:150051845R0

s_*_sorted.txt
filtered eland_extended alignments w/ quality  scores and genome positions
HWI-EAS305 3-30gf5aaxx 8 66 580 1584 AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU chr10.fa 3001045 F 35 11
HWI-EAS305 3-30gf5aaxx 8 100 534 1062 ATTTTCAGGTTGGAGTGACTCGCTAAAACAGCCAA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV chr10.fa 3002892 R 35 29
HWI-EAS305 3-30gf5aaxx 8 59 199 495 CCACATGCTGTGGCAAAGCCCTTCTGAGCGGGGCG YYYYTYYYYYYYYYYYRYYYYYYYYYYYYYTVUVV chr10.fa 3008958 F 34A 20
HWI-EAS305 3-30gf5aaxx 8 76 779 1406 AGATGTACAAATGCTCCTCAGATGTTTGTGTCATA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV chr10.fa 3009290 F 35 3
HWI-EAS305 3-30gf5aaxx 8 83 547 1480 ATCCAAACAGTTACACAAAGTTTTGAGAACATTAT YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV



GENOME ALIGNMENT FORMATS

SGA ('Simplified' Genome Annotation)

GFF  (General Feature Format)
EXAMPLE:
track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2

FPS (Functional Position Set)
Native format for Eukaryotic Promoter Database

EXAMPLE:
FP Pv snRNA U1 :+S EM:J03563.1 1+ 352; 17001.098
FP Ath snRNA U2.5 :+S EM:AL353994.1 1- 73709; 24016.116
FP Ath snRNA U5 :+S EM:X13012.1 1+ 678; 23040.
FP Ta histone H3 :+S EM:X00937.1 1+ 186; 07001.

WIG (Wiggle)
UCSC Genome Browser track format

EXAMPLE
track type=wiggle_0 name="Bed Format" description="BED format" \
visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

BED (Browser Extensible Data)
UCSC Genome Browser track format
EXAMPLE:
Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
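As an example of the converters we all end up writing, here is a minimal GFF-to-BED sketch in Python. The one detail that bites everybody: GFF coordinates are 1-based and inclusive, while BED starts are 0-based and BED ends are exclusive, so the start shifts down by one and the end stays the same. The sketch assumes tab-delimited input with the nine standard GFF columns and uses the GFF group column as the BED name, which is my choice; it does not handle a "." score, which BED would not accept.

```python
def gff_to_bed(line):
    """Convert one tab-delimited GFF feature line to a 6-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    chrom, _source, _feature, start, end, score, strand, _frame, group = fields[:9]
    bed_start = int(start) - 1   # 1-based inclusive start -> 0-based start
    bed_end = int(end)           # inclusive end equals the exclusive BED end
    return "\t".join([chrom, str(bed_start), str(bed_end), group, score, strand])


gff_line = "chr22\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1"
print(gff_to_bed(gff_line))
```

Track and browser lines would need to be skipped or rewritten separately before feeding a whole file through this.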

ALN
Alignment format for CisGenome 

chr1[tab]359077[tab]F
chr1[tab]376890[tab]R

….

column1 = chromosome where the read is aligned;
column2 = coordinate where the read is aligned;
column3 = ‘F’ or ‘+’: if the read is aligned to the forward strand of the genome assembly;
‘R’ or ‘-’: if the read is aligned to the reverse complement strand of the genome.
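With that three-column layout, getting from the s_*_sorted.txt records shown earlier to ALN is short. This Python sketch assumes the sorted.txt fields are whitespace-separated in the order seen in the example (chromosome, position and strand in columns 9 to 11) and that the ".fa" suffix should be dropped from the chromosome name; both are guesses from the sample data, and I have not checked whether the two formats agree on which end of the read the coordinate marks.

```python
def sorted_to_aln(line):
    """Turn one s_*_sorted.txt record into a CisGenome ALN line (chrom, pos, strand)."""
    fields = line.split()
    if len(fields) < 11:
        raise ValueError("record has no alignment columns")
    chrom, pos, strand = fields[8], fields[9], fields[10]
    if chrom.endswith(".fa"):
        chrom = chrom[:-3]          # assumed: ALN wants "chr10", not "chr10.fa"
    if strand not in ("F", "R"):
        raise ValueError("unexpected strand: " + strand)
    return "%s\t%d\t%s" % (chrom, int(pos), strand)


record = ("HWI-EAS305 3-30gf5aaxx 8 66 580 1584 "
          "AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA "
          "YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU chr10.fa 3001045 F 35 11")
print(sorted_to_aln(record))
```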