Get it at ftp://ftp.fruitfly.org/pub/people/cwiel/chado_gadflly_08.gz

Changes from version 0.7:
1. Eliminated duplicate analysis_results.
2. Fixed timeaccessioned for genes and exons.
3. Fixed problem causing some FBti's to be missing.
4. Fixed implementation of feature_synonym.is_current.
5. Fixed seqlen for transcripts.

There are no known bugs.

Cheers,
Colin
For wildcard queries you can use ILIKE instead of LIKE to do case-insensitive matching. I don't know if this is any faster than lower(col_x).

-allen

On 22 Aug 2003, Don Gilbert wrote:
> Charles, this PG doc suggests you have to explicitly lower/upper text
> data to do case-insensitive queries -- Don
>
> PostgreSQL 7.3.2 Documentation
> Chapter 8. Indexes
> 8.5. Functional Indexes
> ...
> For example, a common way to do case-insensitive comparisons is to use the
> lower function:
> SELECT * FROM test1 WHERE lower(col1) = 'value';
>
> This query can use an index, if one has been defined on the result of the
> lower(column) operation:
> CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
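A minimal sketch of the two approaches side by side, against a hypothetical gene_symbol(symbol) table (the table and column names are illustrative only, not part of chado):

 -- Case-insensitive equality via lower(); this can use a functional index:
 CREATE INDEX gene_symbol_lower_idx ON gene_symbol (lower(symbol));
 SELECT * FROM gene_symbol WHERE lower(symbol) = lower('Rps7');

 -- ILIKE gives case-insensitive pattern matching, but a plain btree index
 -- generally will not be used for it, so on large tables the lower() plus
 -- functional-index route is usually the safer bet:
 SELECT * FROM gene_symbol WHERE symbol ILIKE 'rps7';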
Charles, this PG doc suggests you have to explicitly lower/upper text data to do case-insensitive queries -- Don

PostgreSQL 7.3.2 Documentation
Chapter 8. Indexes
8.5. Functional Indexes
...
For example, a common way to do case-insensitive comparisons is to use the lower function:

 SELECT * FROM test1 WHERE lower(col1) = 'value';

This query can use an index, if one has been defined on the result of the lower(column) operation:

 CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
All,

Curious if postgres is case-sensitive or insensitive (have not found it in the docs yet). When stored as varchar, will rps7 be distinguished from Rps7 and RPS7?

Charles
Hello, This is a reminder that the quarterly GMOD meeting is being held at Lawrence Berkeley Lab on September 15-16. The Bancroft Hotel is holding rooms and giving a group discount for the 14-16th, but they are going to release that hold 30 days prior, which is tomorrow. So if you haven't booked a room and plan on attending from out of town, now is the time! Contact the hotel at 510-549-1000 and mention the GMOD meeting to receive the discount. Also, if you have topics in mind that you would like to see added to the agenda, please let me know, as I am formulating it now. And I don't know why I keep forgetting this: please let me know if you plan to attend! Thanks, Scott -- ------------------------------------------------------------------------ Scott Cain, Ph. D. ca...@cs... GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory
XML::Twig is one of the easier and faster perl libs.

On 11 Aug 2003, Charles Hauser wrote:
> Stan,
>
> Thanks for the overview.
>
> I'll make a go at the conversion using perl. Not sure if I'll use SAX
> or DOM, so pointers to appropriate libraries for either appreciated.
>
> regards,
> Charles

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu
Stan,

Thanks for the overview.

I'll make a go at the conversion using perl. Not sure if I'll use SAX or DOM, so pointers to appropriate libraries for either appreciated.

regards,
Charles
Don,

Attached are excerpts from the various classes within ChlamyDB in AceXML. As you can tell, this is used by the stock center and is used primarily to handle the genetics, strains and pubs, with no sequence data being maintained in it.

Understandably, the sequence module of chado is furthest along, and my suspicion is that the genetics module of chado, while still being actively developed(?), may be the most difficult to generalize due to how each group defines locus, allele etc. For Chlamy, Lib has a brief synopsis at:

http://www.biology.duke.edu/chlamy_genome/locus_allele.html

Appreciate the interest.

regards,
Charles
Charles, Stan is right about XSLT - hard to use for anything complex. If you want to share some of your AceXML format data with us to look at, there may be someone here with an xml tool that could be adapted to get it into chado format. The main problem I see is that Ace data structure is very object oriented, and Chado is more relationally structured, so a conversion will need more than renaming or slight restructuring. - Don -- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405 -- gil...@bi...--http://marmot.bio.indiana.edu/
Charles,

There is no magic to it, you just need to parse the input format and write out the output format. XML has an associated technology called XSLT which represents XML rewrite rules in an astonishingly cumbersome and impenetrable way; you can look into that but I predict you will find it is too ugly to warrant the effort of climbing the learning curve. Failing that, Java and Perl both have libraries that simplify parsing XML documents (two flavors: SAX parses an element at a time and lets you supply handlers that can grab control at the begin and end of each element and do things; DOM slurps in the whole thing and gives it to you as a big structure). Once you have it in memory you can either traverse the tree and issue print statements to generate XML, or you can construct another XML data structure in the new form and call routines to externalize.

If you tell us your language of choice people can probably point you to the appropriate XML libraries.

Regarding the all important understanding of the data formats, I assume there is documentation on AceXML. There is some spotty documentation of the XORT chado XML on the gmod site, including the autogenerated header in gmod/schema/chado/doc/chado.dtd. Pinglei and I can answer questions as needed.

Cheers, -Stan
All, I'm blissfully ignorant of XML. I have generated an AceXML dump from our AceDB and am pondering how one goes about converting one flavor of XML to another, AceXML ->XORT/chado XML in particular. If someone could shed some light on this I'd appreciate it. Charles
On 7 Aug 2003, Scott Cain wrote:
> Hello,
>
> I mentioned on the Harvard call earlier this week that some of the
> Postgres developers found it unbelievable that I got better performance
> doing substring operations on shredded DNA than on a properly configured
> text column in Postgres. Well, it turns out that they were right to be
> unbelieving. While I did change the storage method on the column,
> Postgres never got around to uncompressing and restoring the DNA. When
> I forced it to do so, the substring operations got much faster. Here is
> a summary of the average time per operation:
>
> substr in perl 0.83 sec/op
> substring on default text column 0.24 sec/op
> substring on EXTERNAL column 0.0069 sec/op
>
> That is pretty darn fast! This was done in the current build of gadfly,
> not with the human data I was working with before, so the results are
> not directly comparable. Others on the Pg mailing list indicated that
> the function I wrote would get nearly as good performance, but the Pg
> substring should nearly always win.

Nice work! This seems to be the pattern quite often - pg at first seems disappointingly slow at doing something, then one of the friendly folk on pg-admin suggests something that radically speeds things up.

> Presumably, we should still move residues out of feature to avoid any
> penalties associated with really large rows, but when we do, we can
> configure the table to have EXTERNAL storage for the DNA strings.

Hmm, is it possible to preserve the logical model (everything in the feature table) but change the underlying physical model (residues in a separate table) using views/triggers?

> Scott
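On the views/triggers question above, a rough sketch of the read-only half, assuming a hypothetical one-row-per-feature side table (the featureseq table and view names are illustrative, not part of the chado schema; making the view writable would additionally need rules or triggers, which is the harder part):

 -- side table holding the large residues out of the main feature table
 CREATE TABLE featureseq (
 feature_id integer PRIMARY KEY REFERENCES feature (feature_id),
 residues text
 );

 -- read-only view preserving the "everything on feature" logical model
 CREATE VIEW feature_with_residues AS
 SELECT f.feature_id, f.name, f.uniquename, f.type_id, s.residues
 FROM feature f
 LEFT JOIN featureseq s USING (feature_id);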
Hello,

I mentioned on the Harvard call earlier this week that some of the Postgres developers found it unbelievable that I got better performance doing substring operations on shredded DNA than on a properly configured text column in Postgres. Well, it turns out that they were right to be unbelieving. While I did change the storage method on the column, Postgres never got around to uncompressing and restoring the DNA. When I forced it to do so, the substring operations got much faster. Here is a summary of the average time per operation:

 substr in perl 0.83 sec/op
 substring on default text column 0.24 sec/op
 substring on EXTERNAL column 0.0069 sec/op

That is pretty darn fast! This was done in the current build of gadfly, not with the human data I was working with before, so the results are not directly comparable. Others on the Pg mailing list indicated that the function I wrote would get nearly as good performance, but the Pg substring should nearly always win.

Presumably, we should still move residues out of feature to avoid any penalties associated with really large rows, but when we do, we can configure the table to have EXTERNAL storage for the DNA strings.

Scott

--
Scott Cain, Ph. D. ca...@cs...
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
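The storage tweak Scott describes boils down to one ALTER TABLE per column; a sketch against the chado feature.residues column (assuming that is where the sequence lives):

 -- EXTERNAL = out-of-line but uncompressed TOAST storage, so substring()
 -- can fetch just the slices it needs instead of decompressing the whole
 -- sequence.
 ALTER TABLE feature ALTER COLUMN residues SET STORAGE EXTERNAL;

 -- Note: this only affects values stored from now on; rows loaded before
 -- the change keep their old storage until the value is rewritten, e.g.
 -- by an UPDATE that reassigns the column to itself.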
what's the recommended way to store weight matrices in the current chado schema, if at all possible? i have some motifs that look like this for a 4mer. bases across the top, positions in the motif down the side.

 A T C G
 1 0.1 0.1 0.4 0.4
 2 0.0 0.0 1.0 0.0
 3 0.75 0.25 0.0 0.0
 4 0.25 0.25 0.25 0.25

is it a feature with some properties to represent the weight matrix? it seems like a shoehorn fit to do it that way.

-allen
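If you do go the feature-plus-properties route, one possible shoehorn is a motif feature with one featureprop row per matrix position: rank carries the position, value carries that row of frequencies. This is only a sketch; the 'pwm_row' term and the :placeholder ids are made up for illustration and are not an agreed convention.

 -- assumes a motif feature and a 'pwm_row' cvterm already exist;
 -- :motif_feature_id and :pwm_row_type_id are placeholders
 INSERT INTO featureprop (feature_id, type_id, value, rank) VALUES
 (:motif_feature_id, :pwm_row_type_id, 'A=0.10 T=0.10 C=0.40 G=0.40', 1),
 (:motif_feature_id, :pwm_row_type_id, 'A=0.00 T=0.00 C=1.00 G=0.00', 2),
 (:motif_feature_id, :pwm_row_type_id, 'A=0.75 T=0.25 C=0.00 G=0.00', 3),
 (:motif_feature_id, :pwm_row_type_id, 'A=0.25 T=0.25 C=0.25 G=0.25', 4);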
In another life I developed an object-relational datatype for brain MRI
images in Illustra (a PostGres relative) which stored the data in the file
system and implemented SQL functions on them in C, which then had access
to the files. Something similar might be possible here -- I think PGSQL
lacks file I/O, so you would need C, or maybe Perl.
However, it seems to me that Scott has already gotten this
down to the subsecond range that should be adequate for interactive
applications retrieving a single sequence. Further optimization might
make sense if you were worried about queries that do substrings
on lots of sequences in a single query (eg get me 1KB upstream from
all TSS's). Not sure if making this efficient in R-DBMSs is a priority
though, given that there are usually other ways...
Cheers, -Stan
Don,
That doesn't surprise me at all. We have to fight a lot of overhead to
do these operations in an RDBMS. Doing the search in the file system
should be faster; unfortunately, doing it in the file system would
introduce more complexity into an already complex system, don't you
think?
Scott
On Mon, 04 Aug 2003 at 20:28, Don Gilbert wrote:
> Scott,
>
> For comparison, I ran your dna segment fetch test using the raw dna
> files that I keep for euGenes/gnomap displays (e.g. human chr1), with methods
> like "fseek(seqstart); fread(buf,seqlength)". This file method is still
> faster by an order of magnitude - 0.015 sec average compared to the 0.27
> sec you got with shredded dna in RDBMS.
>
> -- Don
--
------------------------------------------------------------------------
Scott Cain, Ph. D. ca...@cs...
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
Scott,
For comparison, I ran your dna segment fetch test using the raw dna
files that I keep for euGenes/gnomap displays (e.g. human chr1), with methods
like "fseek(seqstart); fread(buf,seqlength)". This file method is still
faster by an order of magnitude - 0.015 sec average compared to the 0.27
sec you got with shredded dna in RDBMS.
-- Don
for len in (1000,5000,10000,40000)
for start in (1..240) x 1MB
bases= readcsome('/bio/eugenes/genomes/man/dna-1.raw',start,len)
-- for 960 trials: ave=0.015625 sec.
-- read total of 13,440,000 bases
-- Perl method (equivalents in C and Java)
sub readcsome($$$) {
my($seqfile, $seqstart, $seqlen)= @_;
return undef unless(open(FD,$seqfile));
return undef unless(seek(FD,$seqstart,0));
my $buf;
my $n= read(FD,$buf,$seqlen);
close(FD);
return $buf;
}
In a message dated 8/4/2003 2:09:45 PM Eastern Standard Time, ca...@cs... writes:
> Here's the data:
> ave. time/substring operation
> shredded DNA 0.27 sec
> EXTENDED storage 38 sec
> normal text storage 40 sec
>
> So, what do you think?

Awesome!
Hello,

As promised, I have been looking at storing DNA in a table separate from the feature table so that it can be "shredded" into smallish chunks so that substrings can be assembled quickly.

I performed tests in the following way: I identified the "worst case" scenario: human chromosome 1 (250 Mb). Then I created two tables:

 string=> \d longdna
 Table "public.longdna"
 Column | Type | Modifiers
 --------+------+-----------
 dna | text |

 string=> \d dna
 Table "public.dna"
 Column | Type | Modifiers
 ---------+---------+-----------
 foffset | integer |
 pdna | text |
 Indexes: foffset_idx btree (foffset)

In practice of course, the dna table would have another column for the foreign key to its feature_id, as well as a primary key. Into the longdna table, I loaded chr1 as a long string. I did this two times, once with the default text storage mechanism and once with the storage set to EXTENDED as described in ALTER TABLE (http://www.postgresql.org/docs/7.3/static/sql-altertable.html).

Into the dna table, I put dna shredded to 2000 bp per chunk and the cumulative offset for each chunk. I wrote a function (at the bottom of this email) to assemble a substring given a start and stop coordinate. (I just realized I used base coordinates, but it should be easy to convert to interbase.)

Then I wrote a perl script to do the substring operation over chromosome 1 at various locations and for lengths of 1000, 5000, 10000 and 40000 bp. Initially, I had it run over 4000 substrings to get a good average, but I had to shorten it for the standard SQL based substrings because they took so long.

Here's the data:

 ave. time/substring operation
 shredded DNA 0.27 sec
 EXTENDED storage 38 sec
 normal text storage 40 sec

Clearly, shredding is a preferable method for substringing. The only reason I can think of to not use shredding is for searching for a particular substring, but I don't think a relational database is well suited for that type of search anyway--best to use BLAST in some way for that.

So, what do you think?
Scott

---------------------------
Here's the function:

 CREATE OR REPLACE FUNCTION dna_string (integer, integer) RETURNS TEXT AS '
 DECLARE
 smin ALIAS FOR 1ドル;
 smax ALIAS FOR 2ドル;
 longdna TEXT := '''';
 dna_row dna%ROWTYPE;
 dnastring TEXT;
 firstchunk INTEGER;
 lastchunk INTEGER;
 in_longdnastart INTEGER;
 in_longdnalen INTEGER;
 chunksize INTEGER;
 BEGIN
 SELECT INTO chunksize min(foffset) FROM dna WHERE foffset>0;
 firstchunk := chunksize*(smin/chunksize);
 lastchunk := chunksize*(smax/chunksize);

 --RAISE NOTICE ''first chunk is %'',firstchunk;
 --RAISE NOTICE ''last chunk is %'',lastchunk;

 in_longdnastart := smin % chunksize;
 in_longdnalen := smax - smin + 1;

 --RAISE NOTICE ''in_longdnastart is %'',in_longdnastart;
 --RAISE NOTICE ''in_longdnalen is %'',in_longdnalen;

 FOR dna_row IN
 SELECT * FROM dna
 WHERE foffset >= firstchunk AND foffset <= lastchunk
 ORDER BY foffset
 LOOP
 --RAISE NOTICE ''foffset is %'',dna_row.foffset;
 longdna := longdna || dna_row.pdna;
 END LOOP;

 dnastring := substring(longdna FROM in_longdnastart FOR in_longdnalen);

 RETURN dnastring;
 END;
 ' LANGUAGE 'plpgsql';

--
Scott Cain, Ph. D. ca...@cs...
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
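As a usage note on the function above: chado featureloc coordinates are interbase, so the base-coordinate convention Scott mentions translates to calling it with (fmin + 1, fmax). A hedged sketch only - it assumes the shredded dna table has been populated for the single srcfeature of interest, since the function as posted has no srcfeature_id column or filter, and the feature name is a placeholder:

 SELECT f.uniquename,
 dna_string(fl.fmin + 1, fl.fmax) AS seq
 FROM feature f
 JOIN featureloc fl ON fl.feature_id = f.feature_id
 WHERE f.uniquename = 'example-feature-0001'; -- placeholder name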
Also look in gmod/schema/chado/bin; there is a file in there called pg2cdbi.pl. You can run it using the makefile to autogenerate a Class::DBI middleware layer to interact with the database. This requires Class::DBI, Class::DBI::Pg, and the CVS version of SQL::Translator.

-Allen
Simon,

Here are some tools that flybase has in development for getting and putting gene object info from/to PostgreSQL db:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/XMLTools/

XORT/ Postgres Chado DB <--> Chado XML 'objects', for loading/dumping to/from Postgres DB, including gene annotation records (perl)

Convertor/ Chado XML <--> GAME XML object conversion, including gene objects (java); flybase uses include Apollo <> Postgres data passing

ChadoSax/ Chado XML --> parse to other flatfile, software objects (perl, java) -- still in progress; flybase will use to support current public web software

-- Don
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gil...@in...--http://marmot.bio.indiana.edu/
For the next chado release I think it would be useful to have feature.sequence_alphabet.

The alphabet cv should probably be encoded in the actual relational model - enum DNA, RNA, AA. RNA would virtually never be used, but it is there for completeness.

Most of the time sequence_alphabet is derivable from feature.type, but not always. DNA vs AA is often derivable from feature.residues - but this can go wrong, and certainly will if degenerate DNA symbols are used. So I think it is highly desirable to have this column.
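A sketch of what the proposal could look like as DDL; the column name comes from the post, but the check-constraint style of "enum" and the constraint name are just one possible encoding, not an agreed schema change:

 ALTER TABLE feature
 ADD COLUMN sequence_alphabet varchar(3)
 CONSTRAINT feature_sequence_alphabet_check
 CHECK (sequence_alphabet IN ('DNA', 'RNA', 'AA'));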
Hi Dave,

PubSearch stores data about Genes so we can use that info when screening article text for gene symbols, etc. to identify literature pertinent to a particular gene. It's also used as a curation tool, so extra info about these genes is then added into PubSearch and then piped back to the parent db. As PubSearch is currently built to be standalone, we need to import gene data from the parent MOD so it can be used in the article identification steps. This necessitates going via a flat file format that is exported from the parent MOD and imported into PubSearch. The current format is TAIR specific and we want to make it generic, and hence adopting a Chado-compatible format early on seemed to be a good idea and would be a step towards integration of the tools within GMOD.

We were hoping to match the Chado schema to what we have in PubSearch so we can see how the gene attributes map over, then see if there is already a file format that you have in place (Chado-XML?) that we could build our gene import and export scripts around, so that PubSearch would be somewhat chado-compatible out of the box.

Simon.

--
Simon Twigger, Ph.D.
Assistant Professor, Bioinformatics Research Center
Medical College of Wisconsin
8701 Watertown Plank Road, Milwaukee, WI, 53226
tel. 414-456-8802, fax 414-456-6595
Behzad,

We did away with the "gene" table some time ago. I'm sorry we haven't updated the diagram to correspond. If I can get some time, I'll make a new diagram.

Genes are implemented in the feature table (sequence module), along with everything else. Are you looking for information on how to implement a "gene" in chado?

-Dave
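Since there is no longer a gene table, a gene ends up as an ordinary row in feature whose type_id points at a 'gene' cvterm. A hedged sketch of what that might look like; the organism_id, the uniquename, and the assumption that a single 'gene' term is already loaded in cvterm are all placeholders:

 INSERT INTO feature (organism_id, uniquename, name, type_id)
 VALUES (
 1, -- placeholder organism_id
 'example-gene-0001', -- placeholder uniquename
 'example gene',
 (SELECT cvterm_id FROM cvterm WHERE name = 'gene') -- assumes one loaded 'gene' term
 );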
Hi,

I am working on the GMOD's PubSearch module, and am in the process of generating a bulk data parser for uploading "genes" into our database. We are interested in adapting Chado's model & schema when it comes to genes, and found a "gene" table in Chado's Diagrams:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/chado/doc/diagrams/

(with field names: dbxref_id, gene_id, name, timeentered, timelastmod, type_id, unnamed1, unnamed2). I am wondering if that is the only information available on genes in Chado?

Looking further, I noticed in regards to Chado's schema & sql tables for specific modules (of which genes are not one of them!), somewhere in Chado's documentation there was a reference made to the following location on CVS -- however the following URL did not help me in providing me with any additional information on genes:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/chado/modules/

Could you point me to the right place (assuming there is more information beyond what I found in the diagrams section of Chado)?

Thanks,
Behzad