Generic Model Organism Database Project

gmod-schema Mailing List for Generic Model Organism Database Project

gmod-schema — For discussion of GMOD schema development

Re: [Gmod-schema] New benchmarking results

From: Hilmar L. <hl...@gn...> - 2003-03-27 20:31:39

On Thursday, March 27, 2003, at 12:09 PM, Scott Cain wrote:
> On Thu, 2003-03-27 at 14:52, Hilmar Lapp wrote:
>>
>> BTW the fact that turning off X-windows helps means you don't have a
>> lot of memory on the box? What that would do is not give PostgreSQL
>> more memory, but more memory available to the kernel disk cache (which
>> is essentially what Postgres needs).
>
> Hilmar,
>
> "a lot of memory" is very subjective. When I ordered this laptop (a
> Dell Latitude C840) I thought a half a gig was a lot of memory. Is more
> memory added to the disk cache via /etc/sysctl.conf by changing
> (upping?) kernel.shmmax?
The kernel should grab whatever is available in free buffers. But you 
may indeed have to adjust the kernel parameters for shmmax before your 
shared_buffers setting will take full effect. There is a doc on the Pg 
site I believe that says how to do this. Don't have the link off hand 
and don't remember the commands anymore, but I can tell that I did have 
to do this for Mac OSX.
	-hilmar
> That was a suggested optimization given at
> http://www.lyris.com/lm_help/6.0/tuning_postgresql.html.
>
> Thanks,
> Scott
>
> -- 
> ------------------------------------------------------------------------
> Scott Cain, Ph. D. ca...@cs...
> GMOD Coordinator (http://www.gmod.org/) 216-392-3087
> Cold Spring Harbor Laboratory
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------

Re: [Gmod-schema] New benchmarking results

From: Scott C. <ca...@cs...> - 2003-03-27 20:10:18

On Thu, 2003-03-27 at 14:52, Hilmar Lapp wrote:
> 
> BTW the fact that turning off X-windows helps means you don't have a 
> lot of memory on the box? What that would do is not give PostgreSQL 
> more memory, but more memory available to the kernel disk cache (which 
> is essentially what Postgres needs).
Hilmar,
"a lot of memory" is very subjective. When I ordered this laptop (a
Dell Latitude C840) I thought a half a gig was a lot of memory. Is more
memory added to the disk cache via /etc/sysctl.conf by changing
(upping?) kernel.shmmax? That was a suggested optimization given at
http://www.lyris.com/lm_help/6.0/tuning_postgresql.html.
Thanks,
Scott
-- 
------------------------------------------------------------------------
Scott Cain, Ph. D. ca...@cs...
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory

Re: [Gmod-schema] New benchmarking results

From: Hilmar L. <hl...@gn...> - 2003-03-27 19:52:33

On Thursday, March 27, 2003, at 10:17 AM, Scott Cain wrote:
> So, what did I miss?
>
Nothing at first sight. The most notable (and totally expected) result 
of the timings is that the range overlap query has huge variance and 
hence its performance is unreliable, although excellent if you happen 
to slice the data in a fortunate way. The geometric variant won't 
return lightning fast ever, but it also won't be terribly slow ever.
The reason the range overlap query is unreliable is that it depends so 
much on how you slice the index tree with your first two conditions 
(feature_id and max, assuming the index is (feature_id,max,min)). It is 
easily possible that you have to read half of the index from disk in 
order to filter for the third condition (min), which is going to be 
expensive (and given enough memory for disk cache even more expensive 
than a sequential table scan, because you need to read from the table 
anyway subsequently). With the geometric query it appears the size of 
the slice is much more consistent, although never as small as with the 
composite index in the lucky cases.
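
The slicing effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not PostgreSQL's actual B-tree code: the index is modelled as a Python list sorted by (max, min) for a single source feature, the feature coordinates are randomly generated, and `build_index` and `scan_cost` are hypothetical helper names.

```python
import bisect
import random

def build_index(features):
    """Model a composite (max, min) index for one source feature as a sorted list."""
    return sorted((fmax, fmin) for fmin, fmax in features)

def scan_cost(index, qmin, qmax):
    """Return (entries_read, entries_matched) for max >= qmin AND min <= qmax.

    The leading column (max) only bounds the scan from below, so every
    entry from that point to the end of the index is read and then
    filtered on min.
    """
    start = bisect.bisect_left(index, (qmin, float("-inf")))
    read = index[start:]
    matched = [entry for entry in read if entry[1] <= qmax]
    return len(read), len(matched)

random.seed(0)
ARM_LENGTH = 20_000_000  # hypothetical arm length in bp
features = []
for _ in range(50_000):
    fmin = random.randrange(ARM_LENGTH)
    features.append((fmin, fmin + random.randrange(100, 5_000)))
index = build_index(features)

# The same 10 kb window, placed near the start and near the end of the arm.
near_start = scan_cost(index, 1_000_000, 1_010_000)
near_end = scan_cost(index, 19_000_000, 19_010_000)
print("near start: read %d entries, matched %d" % near_start)
print("near end:   read %d entries, matched %d" % near_end)
```

Both windows match a similar, small number of features, but the window near the start of the arm has to read most of the modelled index to find them, which is the "unlucky slice" case.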
BTW the fact that turning off X-windows helps means you don't have a 
lot of memory on the box? What that would do is not give PostgreSQL 
more memory, but more memory available to the kernel disk cache (which 
is essentially what Postgres needs).
	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------

[Gmod-schema] New benchmarking results

From: Scott C. <ca...@cs...> - 2003-03-27 18:17:24

Attachments: benchmark.pl parse_log.pl

Hello all,
I've spent the last couple of days learning what I can about Postgres
tuning, though I'm not sure it did much good. Here is what I did to
benchmark comparisons between two queries, one using min/max columns in
featureloc (Query1), the other using an RTree index in featureloc
(Query2). I wrote a short Perl script to run several ranges on all of
the arms in gadfly (the specific build of gadfly-chado I used was 3b
without residues for the arms). It ran 720 queries for each Query,
covering ranges from 1000 to 500000 bp. Here is a table of my results
(all numbers are in seconds):
       |                Query1                 |                Query2
       |  mean   stdev    var     min     max  |  mean   stdev    var     min     max
-------+---------------------------------------+--------------------------------------
native |  2.47    8.27  68.47  0.0015    62.2  |  1.82    2.00   4.02   0.028   14.16
opt1   |  2.57    8.85  78.32  0.0002    69.3  |  1.76    1.91   3.66   0.013   12.74
opt2   |  2.57    8.64  74.58  0.0019    67.9  |  1.77    2.14   4.57   0.014   29.66
opt3   |  2.47    8.29  68.78  0.0004    62.9  |  1.77    1.93   3.71   0.019   12.64
opt4   |  2.16    8.51  72.49  0.0034    66.6  |  1.52    1.65   2.71   0.016   12.04
Now for boatloads of notes:
native: no optimizations done, only VACUUM ANALYZE before running
opt1:   effective_cache_size = 2000   # default 1000
        sort_mem = 4096               # default 1000
        shared_buffers = 2000         # default 64
opt2:   effective_cache_size = 2000
        sort_mem = 4096
        shared_buffers = 1000
opt3:   effective_cache_size = 2000
        sort_mem = 2048
        shared_buffers = 1000
opt4:   effective_cache_size = 2000
        sort_mem = 2048
        shared_buffers = 1000
        XWindows OFF
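
For a sense of scale, the settings above can be converted to megabytes. This is a rough sketch that assumes the units documented for PostgreSQL releases of that period: shared_buffers and effective_cache_size counted in 8 kB blocks (the default block size) and sort_mem in kilobytes per sort operation. The "native" values are the defaults quoted in the notes, and `footprint_mb` is a hypothetical helper name.

```python
BLOCK_SIZE = 8192  # bytes per buffer/page; assumes the default block size
KB = 1024
MB = 1024 * 1024

# Values copied from the notes above; "native" uses the defaults quoted there.
CONFIGS = {
    "native": {"effective_cache_size": 1000, "sort_mem": 1000, "shared_buffers": 64},
    "opt1": {"effective_cache_size": 2000, "sort_mem": 4096, "shared_buffers": 2000},
    "opt2": {"effective_cache_size": 2000, "sort_mem": 4096, "shared_buffers": 1000},
    "opt3/opt4": {"effective_cache_size": 2000, "sort_mem": 2048, "shared_buffers": 1000},
}

def footprint_mb(cfg):
    """Convert one configuration to megabytes.

    effective_cache_size is only a planner hint and allocates nothing;
    sort_mem applies to each sort operation, so several may be in use at once.
    """
    return {
        "shared_buffers_mb": cfg["shared_buffers"] * BLOCK_SIZE / MB,
        "sort_mem_mb_per_sort": cfg["sort_mem"] * KB / MB,
        "effective_cache_size_mb": cfg["effective_cache_size"] * BLOCK_SIZE / MB,
    }

for name, cfg in CONFIGS.items():
    sizes = footprint_mb(cfg)
    print(name, ", ".join("%s=%.1f" % item for item in sizes.items()))
```

The kernel's shmmax limit has to cover the whole shared memory segment, which is somewhat larger than the shared_buffers figure alone.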
The times are in wall clock seconds, as extracted from syslog. EXPLAIN
ANALYZE generally gives more optimistic numbers.
Query1:
select distinct f.name,fl.min,fl.max,fl.strand,f.type_id,f.feature_id
from feature f, featureloc fl 
where fl.srcfeature_id = ? and 
 f.feature_id = fl.feature_id and 
 fl.max >= ? and fl.min <= ?
Query2:
select distinct f.name,fl.min,fl.max,fl.strand,f.type_id,f.feature_id
from feature f, featureslice(?,?) fl 
where fl.srcfeature_id = ? and 
 f.feature_id = fl.feature_id
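
To see what the overlap test in Query1 selects, it can be run against a toy dataset. The sketch below uses Python's sqlite3 module with a heavily simplified, hypothetical two-table schema (the real chado tables have more columns) and made-up features; Query2 is not reproduced because featureslice is a database-side function. `features_in_range` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        name TEXT,
        type_id INTEGER
    );
    CREATE TABLE featureloc (
        feature_id INTEGER,
        srcfeature_id INTEGER,
        "min" INTEGER,
        "max" INTEGER,
        strand INTEGER
    );
""")
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
    (10, "geneA", 1), (11, "geneB", 1), (12, "geneC", 1), (13, "geneD", 1),
])
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)", [
    (10, 6, 100, 500, 1),     # ends before the query window
    (11, 6, 900, 1200, -1),   # straddles the left edge of the window
    (12, 6, 1500, 1800, 1),   # lies entirely inside the window
    (13, 5, 1500, 1800, 1),   # same coordinates, different source feature
])

QUERY1 = """
    SELECT DISTINCT f.name, fl.min, fl.max, fl.strand, f.type_id, f.feature_id
    FROM feature f, featureloc fl
    WHERE fl.srcfeature_id = ?
      AND f.feature_id = fl.feature_id
      AND fl.max >= ? AND fl.min <= ?
"""

def features_in_range(conn, srcfeature_id, start, end):
    """Names of features on srcfeature_id that overlap [start, end]."""
    rows = conn.execute(QUERY1, (srcfeature_id, start, end)).fetchall()
    return sorted(row[0] for row in rows)

print(features_in_range(conn, 6, 1000, 2000))  # ['geneB', 'geneC']
```

A feature overlaps the window when it ends at or after the window's start and begins at or before the window's end, which is why the first placeholder after srcfeature_id is the window start and the second is the window end.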
Comments about the data:
These data present a mixed bag. Clearly, if all we look at is lump
statistics, Query2 wins. In every case, it is faster, and has smaller
standard deviations and variances. However, it is interesting to note
that Query1 nearly always wins easily in the minimum time column. So
why would it be so fast sometimes and so slow at others? While I don't
have a good answer to that, I looked at the raw data for trends.
Query1 and Query2 perform comparably for most of the test; however,
when Query1 gets some distance into srcfeature_id 6 (arm X), it falls
apart for every range size, with query times consistently going over
60 seconds. (Let me explain "some distance" a little better: when it
starts with srcfeature_id 6, it does fine for 20 or so queries, with
times less than 3 seconds, then query times abruptly go to about 60
seconds and stay there.) That explains the very large variance for
that data set. As I recall from statistics, variance is a measure of
how spread out a data set is, and this data set is bimodal. I can tell
you that it is not because postgres suddenly decided to start using
seqscans. I interrupted one run while it was doing these slow queries
and ran EXPLAIN ANALYZE on a query, and it was using appropriate
indexes for each table. The most time-consuming step was the index
scan on feature_id in feature.
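
A small numeric sketch shows how a handful of very slow queries can dominate the mean and variance of an otherwise fast run. The timings below are invented for illustration and are not taken from the benchmark logs.

```python
import statistics

# Hypothetical timings in seconds: most queries are fast, a few are very slow.
fast = [0.5] * 696
slow = [60.0] * 24
times = fast + slow

mean = statistics.mean(times)
median = statistics.median(times)
stdev = statistics.stdev(times)        # sample standard deviation
variance = statistics.variance(times)  # sample variance, equal to stdev squared

print("mean=%.2f median=%.2f stdev=%.2f var=%.2f" % (mean, median, stdev, variance))
```

With two clusters this far apart, the mean lands between them where no query actually falls, and the standard deviation exceeds the mean. The median stays with the fast cluster, so it may be worth reporting alongside the mean for runs like this.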
As for the "optimizations," they mostly don't seem to matter; in fact,
for Query1, they made it worse as often as better. I believe this is
because I tried too hard to optimize, allocating more memory than I had
to give, causing disk swapping sometimes. The only "optimization" that
mattered noticeably was running with XWindows off, which is essentially
giving the database more memory. Go figure, give the database more
memory and it behaves better. I believe that the take home message is
this: while sometimes Query1 performs well, Query2 is generally safer
and probably ought to be used for ROI queries.
A few comments about methods:
I've attached the two Perl scripts I wrote to do this work. The first,
benchmark.pl, uses DBI to run through arms and ranges. The order in
which it does things is this: pick a query, pick an arm, pick a starting
point, pick a range. Putting range iteration on the inner loop was done
on purpose to simulate what you might expect to happen when using
gbrowse, and perhaps letting postgres take advantage of caching. The
second script parses /var/log/messages to get durations for each query
and performs statistics on the data. To get duration data to go to
syslog, I set log_pid = true, log_statement = true, log_duration = true
in postgresql.conf and restarted postgres. As I noted above, this puts
wall clock time in syslog, so it is actually a better representation of
real world performance. I used /usr/sbin/logrotate -f
/etc/logrotate.conf to force log rotation between runs.
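
The iteration order described above can be sketched as follows. This is not the attached benchmark.pl: the srcfeature_ids, start points and range sizes are hypothetical values, chosen only so that each query is run 720 times with sizes between 1000 and 500000 bp, and `benchmark_plan` is a hypothetical name.

```python
from itertools import product

QUERIES = ("Query1", "Query2")
ARMS = (1, 2, 3, 4, 5, 6)  # hypothetical srcfeature_ids; the post only says 6 is arm X
STARTS = tuple(range(0, 12_000_000, 1_000_000))  # hypothetical: 12 start points
RANGES = (1_000, 2_000, 5_000, 10_000, 20_000, 50_000,
          100_000, 200_000, 300_000, 500_000)  # hypothetical: 10 range sizes

def benchmark_plan(queries, arms, starts, ranges):
    """Yield (query, arm, start, end) tuples with the range size varying fastest.

    itertools.product advances its rightmost argument first, which gives the
    order: pick a query, pick an arm, pick a starting point, pick a range.
    """
    for query, arm, start, size in product(queries, arms, starts, ranges):
        yield query, arm, start, start + size

plan = list(benchmark_plan(QUERIES, ARMS, STARTS, RANGES))
per_query = sum(1 for step in plan if step[0] == "Query1")
print(len(plan), per_query)
for step in plan[:3]:
    print(step)
```

Because the range is the innermost loop, consecutive queries share a start point and grow outward, so each query re-reads pages the previous one just touched.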
So, what did I miss?
Scott
-- 
------------------------------------------------------------------------
Scott Cain, Ph. D. ca...@cs...
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory