From 242fca75910fce347dfc7764dda4891097b43dd5 Mon Sep 17 00:00:00 2001 From: Bruce Momjian <bruce@momjian.us> Date: Wed, 10 Nov 2004 02:48:59 +0000 Subject: [PATCH] Remove performance TODO.detail. In TODO. --- doc/TODO.detail/performance | 5156 ----------------------------------- 1 file changed, 5156 deletions(-) delete mode 100644 doc/TODO.detail/performance diff --git a/doc/TODO.detail/performance b/doc/TODO.detail/performance deleted file mode 100644 index 90397ba2bc4..00000000000 --- a/doc/TODO.detail/performance +++ /dev/null @@ -1,5156 +0,0 @@ -From owner-pgsql-hackers@hub.org Sun Jun 14 18:45:04 1998 -Received: from hub.org (hub.org [209.47.148.200]) - by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id SAA03690 - for <maillist@candle.pha.pa.us>; Sun, 14 Jun 1998 18:45:00 -0400 (EDT) -Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id SAA28049; Sun, 14 Jun 1998 18:39:42 -0400 (EDT) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 14 Jun 1998 18:36:06 +0000 (EDT) -Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id SAA27943 for pgsql-hackers-outgoing; Sun, 14 Jun 1998 18:36:04 -0400 (EDT) -Received: from angular.illustra.com (ifmxoak.illustra.com [206.175.10.34]) by hub.org (8.8.8/8.7.5) with ESMTP id SAA27925 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 18:35:47 -0400 (EDT) -Received: from hawk.illustra.com (hawk.illustra.com [158.58.61.70]) by angular.illustra.com (8.7.4/8.7.3) with SMTP id PAA21293 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 15:35:12 -0700 (PDT) -Received: by hawk.illustra.com (5.x/smail2.5/06-10-94/S) - id AA07922; Sun, 14 Jun 1998 15:35:13 -0700 -From: dg@illustra.com (David Gould) -Message-Id: <9806142235.AA07922@hawk.illustra.com> -Subject: [HACKERS] performance tests, initial results -To: pgsql-hackers@postgreSQL.org -Date: Sun, 14 Jun 1998 15:35:13 -0700 (PDT) -Mime-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 
7bit -Sender: owner-pgsql-hackers@hub.org -Precedence: bulk -Status: RO - - -I have been playing a little with the performance tests found in -pgsql/src/tests/performance and have a few observations that might be of -minor interest. - -The tests themselves are simple enough although the result parsing in the -driver did not work on Linux. I am enclosing a patch below to fix this. I -think it will also work better on the other systems. - -A summary of results from my testing are below. Details are at the bottom -of this message. - -My test system is 'leslie': - - linux 2.0.32, gcc version 2.7.2.3 - P133, HX chipset, 512K L2, 32MB mem - NCR810 fast scsi, Quantum Atlas 2GB drive (7200 rpm). - - - Results Summary (times in seconds) - - Single txn 8K txn Create 8K idx 8K random Simple -Case Description 8K insert 8K insert Index Insert Scans Orderby -=================== ========== ========= ====== ====== ========= ======= -1 From Distribution - P90 FreeBsd -B256 39.56 1190.98 3.69 46.65 65.49 2.27 - IDE - -2 Running on leslie - P133 Linux 2.0.32 15.48 326.75 2.99 20.69 35.81 1.68 - SCSI 32M - -3 leslie, -o -F - no forced writes 15.90 24.98 2.63 20.46 36.43 1.69 - -4 leslie, -o -F - no ASSERTS 14.92 23.23 1.38 18.67 33.79 1.58 - -5 leslie, -o -F -B2048 - more buffers 21.31 42.28 2.65 25.74 42.26 1.72 - -6 leslie, -o -F -B2048 - more bufs, no ASSERT 20.52 39.79 1.40 24.77 39.51 1.55 - - - - - Case to Case Difference Factors (+ is faster) - - Single txn 8K txn Create 8K idx 8K random Simple -Case Description 8K insert 8K insert Index Insert Scans Orderby -=================== ========== ========= ====== ====== ========= ======= - -leslie vs BSD P90. 
2.56 3.65 1.23 2.25 1.83 1.35 - -(noflush -F) vs no -F -1.03 13.08 1.14 1.01 -1.02 1.00 - -No Assert vs Assert 1.05 1.07 1.90 1.06 1.07 1.09 - --B256 vs -B2048 1.34 1.69 1.01 1.26 1.16 1.02 - - -Observations: - - - leslie (P133 linux) appears to be about 1.8 times faster than the - P90 BSD system used for the test result distributed with the source, not - counting the 8K txn insert case which was completely disk bound. - - - SCSI disks make a big (factor of 3.6) difference. During this test the - disk was hammering and cpu utilization was < 10%. - - - Assertion checking seems to cost about 7% except for create index where - it costs 90% - - - the -F option to avoid flushing buffers has tremendous effect if there are - many very small transactions. Or, another way, flushing at the end of the - transaction is a major disaster for performance. - - - Something is very wrong with our buffer cache implementation. Going from - 256 buffers to 2048 buffers costs an average of 25%. In the 8K txn case - it costs about 70%. I see looking at the code and profiling that in the 8K - txn case this is in BufferSync() which examines all the buffers at commit - time. I don't quite understand why it is so costly for the single 8K row - txn (35%) though. - -It would be nice to have some more tests. Maybe the Wisconsin stuff will -be useful. - - - ------------------ patch to test harness. apply from pgsql ------------ -*** src/test/performance/runtests.pl.orig Sun Jun 14 11:34:04 1998 ---- src/test/performance/runtests.pl Sun Jun 14 12:07:30 1998 -*************** -*** 84,123 **** - open (STDERR, ">$TmpFile") or die; - select (STDERR); $| = 1; - -! for ($i = 0; $i <= $#perftests; $i++) -! { - $test = $perftests[$i]; - ($test, $XACTBLOCK) = split (/ /, $test); - $runtest = $test; -! if ( $test =~ /\.ntm/ ) -! { -! 
# - # No timing for this queries -- # - close (STDERR); # close $TmpFile - open (STDERR, ">/dev/null") or die; - $runtest =~ s/\.ntm//; - } -! else -! { - close (STDOUT); - open(STDOUT, ">&SAVEOUT"); - print STDOUT "\nRunning: $perftests[$i+1] ..."; - close (STDOUT); - open (STDOUT, ">/dev/null") or die; - select (STDERR); $| = 1; -! printf "$perftests[$i+1]: "; - } - - do "sqls/$runtest"; - - # Restore STDERR to $TmpFile -! if ( $test =~ /\.ntm/ ) -! { - close (STDERR); - open (STDERR, ">>$TmpFile") or die; - } -- - select (STDERR); $| = 1; - $i++; - } ---- 84,116 ---- - open (STDERR, ">$TmpFile") or die; - select (STDERR); $| = 1; - -! for ($i = 0; $i <= $#perftests; $i++) { - $test = $perftests[$i]; - ($test, $XACTBLOCK) = split (/ /, $test); - $runtest = $test; -! if ( $test =~ /\.ntm/ ) { - # No timing for this queries - close (STDERR); # close $TmpFile - open (STDERR, ">/dev/null") or die; - $runtest =~ s/\.ntm//; - } -! else { - close (STDOUT); - open(STDOUT, ">&SAVEOUT"); - print STDOUT "\nRunning: $perftests[$i+1] ..."; - close (STDOUT); - open (STDOUT, ">/dev/null") or die; - select (STDERR); $| = 1; -! print "$perftests[$i+1]: "; - } - - do "sqls/$runtest"; - - # Restore STDERR to $TmpFile -! if ( $test =~ /\.ntm/ ) { - close (STDERR); - open (STDERR, ">>$TmpFile") or die; - } - select (STDERR); $| = 1; - $i++; - } -*************** -*** 128,138 **** - open (TMPF, "<$TmpFile") or die; - open (RESF, ">$ResFile") or die; - -! while (<TMPF>) -! { -! $str = $_; -! ($test, $rtime) = split (/:/, $str); -! ($tmp, $rtime, $rest) = split (/[ ]+/, $rtime); -! print RESF "$test: $rtime\n"; - } - ---- 121,130 ---- - open (TMPF, "<$TmpFile") or die; - open (RESF, ">$ResFile") or die; - -! while (<TMPF>) { -! if (m/^(.*: ).* ([0-9:.]+) *elapsed/) { -! ($test, $rtime) = ($1, $2); -! print RESF $test, $rtime, "\n"; -! 
} - } - ------------------------------------------------------------------------- - - -------------------------- testcase detail -------------------------- - -1. from distribution - DBMS: PostgreSQL 6.2b10 - OS: FreeBSD 2.1.5-RELEASE - HardWare: i586/90, 24M RAM, IDE - StartUp: postmaster -B 256 '-o -S 2048' -S - Compiler: gcc 2.6.3 - Compiled: -O, without CASSERT checking, with - -DTBL_FREE_CMD_MEMORY (to free memory - if BEGIN/END after each query execution) - DB connection startup: 0.20 - 8192 INSERTs INTO SIMPLE (1 xact): 39.58 - 8192 INSERTs INTO SIMPLE (8192 xacts): 1190.98 - Create INDEX on SIMPLE: 3.69 - 8192 INSERTs INTO SIMPLE with INDEX (1 xact): 46.65 - 8192 random INDEX scans on SIMPLE (1 xact): 65.49 - ORDER BY SIMPLE: 2.27 - - -2. run on leslie with asserts - DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01) - OS: Linux 2.0.32 leslie - HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm - StartUp: postmaster -B 256 '-o -S 2048' -S - Compiler: gcc 2.7.2.3 - Compiled: -O, WITH CASSERT checking, with - -DTBL_FREE_CMD_MEMORY (to free memory - if BEGIN/END after each query execution) - DB connection startup: 0.10 - 8192 INSERTs INTO SIMPLE (1 xact): 15.48 - 8192 INSERTs INTO SIMPLE (8192 xacts): 326.75 - Create INDEX on SIMPLE: 2.99 - 8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.69 - 8192 random INDEX scans on SIMPLE (1 xact): 35.81 - ORDER BY SIMPLE: 1.68 - - -3. 
with -F to avoid forced i/o - DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01) - OS: Linux 2.0.32 leslie - HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm - StartUp: postmaster -B 256 '-o -S 2048 -F' -S - Compiler: gcc 2.7.2.3 - Compiled: -O, WITH CASSERT checking, with - -DTBL_FREE_CMD_MEMORY (to free memory - if BEGIN/END after each query execution) - DB connection startup: 0.10 - 8192 INSERTs INTO SIMPLE (1 xact): 15.90 - 8192 INSERTs INTO SIMPLE (8192 xacts): 24.98 - Create INDEX on SIMPLE: 2.63 - 8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.46 - 8192 random INDEX scans on SIMPLE (1 xact): 36.43 - ORDER BY SIMPLE: 1.69 - - -4. no asserts, -F to avoid forced I/O - DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01) - OS: Linux 2.0.32 leslie - HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm - StartUp: postmaster -B 256 '-o -S 2048' -S - Compiler: gcc 2.7.2.3 - Compiled: -O, No CASSERT checking, with - -DTBL_FREE_CMD_MEMORY (to free memory - if BEGIN/END after each query execution) - DB connection startup: 0.10 - 8192 INSERTs INTO SIMPLE (1 xact): 14.92 - 8192 INSERTs INTO SIMPLE (8192 xacts): 23.23 - Create INDEX on SIMPLE: 1.38 - 8192 INSERTs INTO SIMPLE with INDEX (1 xact): 18.67 - 8192 random INDEX scans on SIMPLE (1 xact): 33.79 - ORDER BY SIMPLE: 1.58 - - -5. with more buffers (2048 vs 256) and -F to avoid forced i/o - DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01) - OS: Linux 2.0.32 leslie - HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm - StartUp: postmaster -B 2048 '-o -S 2048 -F' -S - Compiler: gcc 2.7.2.3 - Compiled: -O, WITH CASSERT checking, with - -DTBL_FREE_CMD_MEMORY (to free memory - if BEGIN/END after each query execution) - DB connection startup: 0.11 - 8192 INSERTs INTO SIMPLE (1 xact): 21.31 - 8192 INSERTs INTO SIMPLE (8192 xacts): 42.28 - Create INDEX on SIMPLE: 2.65 - 8192 INSERTs INTO SIMPLE with INDEX (1 xact): 25.74 - 8192 random INDEX scans on SIMPLE (1 xact): 42.26 - ORDER BY SIMPLE: 1.72 - - -6. 
No Asserts, more buffers (2048 vs 256) and -F to avoid forced i/o - DBMS: PostgreSQL 6.3.2 (plus changes to 98/06/01) - OS: Linux 2.0.32 leslie - HardWare: i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm - StartUp: postmaster -B 2048 '-o -S 2048 -F' -S - Compiler: gcc 2.7.2.3 - Compiled: -O, No CASSERT checking, with - -DTBL_FREE_CMD_MEMORY (to free memory - if BEGIN/END after each query execution) - DB connection startup: 0.11 - 8192 INSERTs INTO SIMPLE (1 xact): 20.52 - 8192 INSERTs INTO SIMPLE (8192 xacts): 39.79 - Create INDEX on SIMPLE: 1.40 - 8192 INSERTs INTO SIMPLE with INDEX (1 xact): 24.77 - 8192 random INDEX scans on SIMPLE (1 xact): 39.51 - ORDER BY SIMPLE: 1.55 ---------------------------------------------------------------------- - --dg - -David Gould dg@illustra.com 510.628.3783 or 510.305.9468 -Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612 -"Don't worry about people stealing your ideas. If your ideas are any - good, you'll have to ram them down people's throats." 
-- Howard Aiken - - -From owner-pgsql-hackers@hub.org Tue Oct 19 10:31:10 1999 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087 - for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.17 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id KAA30328; - Tue, 19 Oct 1999 10:12:10 -0400 (EDT) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 10:11:55 -0400 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id KAA30030 - for pgsql-hackers-outgoing; Tue, 19 Oct 1999 10:11:00 -0400 (EDT) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by hub.org (8.9.3/8.9.3) with ESMTP id KAA29914 - for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 10:10:33 -0400 (EDT) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) - by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id KAA09038; - Tue, 19 Oct 1999 10:09:15 -0400 (EDT) -To: "Hiroshi Inoue" <Inoue@tpf.co.jp> -cc: "Vadim Mikheev" <vadim@krs.ru>, pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations -In-reply-to: Your message of Tue, 19 Oct 1999 19:03:22 +0900 - <000801bf1a19$2d88ae20$2801007e@cadzone.tpf.co.jp> -Date: Tue, 19 Oct 1999 10:09:15 -0400 -Message-ID: <9036.940342155@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: -> 1. shared cache holds committed system tuples. -> 2. private cache holds uncommitted system tuples. -> 3. 
relpages of shared cache are updated immediately by -> physical change and corresponding buffer pages are -> marked dirty. -> 4. on commit, the contents of uncommitted tuples except -> relpages, reltuples, ... are copied to corresponding tuples -> in shared cache and the combined contents are -> committed. -> If so, catalog cache invalidation would be no longer needed. -> But synchronization of the step 4. may be difficult. - -I think the main problem is that relpages and reltuples shouldn't -be kept in pg_class columns at all, because they need to have -very different update behavior from the other pg_class columns. - -The rest of pg_class is update-on-commit, and we can lock down any one -row in the normal MVCC way (if transaction A has modified a row and -transaction B also wants to modify it, B waits for A to commit or abort, -so it can know which version of the row to start from). Furthermore, -there can legitimately be several different values of a row in use in -different places: the latest committed, an uncommitted modification, and -one or more old values that are still being used by active transactions -because they were current when those transactions started. (BTW, the -present relcache is pretty bad about maintaining pure MVCC transaction -semantics like this, but it seems clear to me that that's the direction -we want to go in.) - -relpages cannot operate this way. To be useful for avoiding lseeks, -relpages *must* change exactly when the physical file changes. It -matters not at all whether the particular transaction that extended the -file ultimately commits or not. Moreover there can be only one correct -value (per relation) across the whole system, because there is only one -length of the relation file. - -If we want to take reltuples seriously and try to maintain it -on-the-fly, then I think it needs still a third behavior. 
Clearly -it cannot be updated using MVCC rules, or we lose all writer -concurrency (if A has added tuples to a rel, B would have to wait -for A to commit before it could update reltuples...). Furthermore -"updating" isn't a simple matter of storing what you think the new -value is; otherwise two transactions adding tuples in parallel would -leave the wrong answer after B commits and overwrites A's value. -I think it would work for each transaction to keep track of a net delta -in reltuples for each table it's changed (total tuples added less total -tuples deleted), and then atomically add that value to the table's -shared reltuples counter during commit. But that still leaves the -problem of how you use the counter during a transaction to get an -accurate answer to the question "If I scan this table now, how many tuples -will I see?" At the time the question is asked, the current shared -counter value might include the effects of transactions that have -committed since your transaction started, and therefore are not visible -under MVCC rules. I think getting the correct answer would involve -making an instantaneous copy of the current counter at the start of -your xact, and then adding your own private net-uncommitted-delta to -the saved shared counter value when asked the question. This doesn't -look real practical --- you'd have to save the reltuples counts of -*all* tables in the database at the start of each xact, on the off -chance that you might need them. Ugh. Perhaps someone has a better -idea. In any case, reltuples clearly needs different mechanisms than -the ordinary fields in pg_class do, because updating it will be a -performance bottleneck otherwise. - -If we allow reltuples to be updated only by vacuum-like events, as -it is now, then I think keeping it in pg_class is still OK. 
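
The net-delta scheme described above (each transaction accumulates a private tuple-count delta per table, adds it to a shared counter at commit, and answers "how many tuples will I see?" from a copy of the counters saved at transaction start) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (SharedTupleCounts, Transaction, visible_count) are hypothetical and are not PostgreSQL code, and a Python lock stands in for whatever shared-memory synchronization a real backend would need.

```python
import threading
from collections import defaultdict


class SharedTupleCounts:
    """Hypothetical stand-in for a shared per-table reltuples counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def add(self, table, delta):
        # Atomically fold one transaction's net delta into the shared value.
        with self._lock:
            self._counts[table] += delta

    def snapshot(self):
        # Copy of every counter; this is the costly "save all tables" step.
        with self._lock:
            return dict(self._counts)


class Transaction:
    """Tracks a private net delta per table until commit or abort."""

    def __init__(self, shared):
        self._shared = shared
        self._start = shared.snapshot()   # counters as of transaction start
        self._delta = defaultdict(int)    # tuples added minus tuples deleted

    def insert(self, table, n=1):
        self._delta[table] += n

    def delete(self, table, n=1):
        self._delta[table] -= n

    def visible_count(self, table):
        # Saved shared value plus our own uncommitted changes; commits made
        # by other transactions after we started are deliberately excluded.
        return self._start.get(table, 0) + self._delta[table]

    def commit(self):
        for table, delta in self._delta.items():
            self._shared.add(table, delta)
        self._delta.clear()

    def abort(self):
        # Nothing was ever published, so dropping the deltas is enough.
        self._delta.clear()
```

The snapshot taken in the constructor is exactly the part the message calls impractical: it copies the count of every table on the off chance that one is needed.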
- -In short, it seems clear to me that relpages should be removed from -pg_class and kept somewhere else if we want to make it more reliable -than it is now, and the same for reltuples (but reltuples doesn't -behave the same as relpages, and probably ought to be handled -differently). - - regards, tom lane - -************ - -From owner-pgsql-hackers@hub.org Tue Oct 19 21:25:30 1999 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130 - for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.17 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id VAA50745; - Tue, 19 Oct 1999 21:07:23 -0400 (EDT) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 21:07:01 -0400 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id VAA50644 - for pgsql-hackers-outgoing; Tue, 19 Oct 1999 21:06:06 -0400 (EDT) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by hub.org (8.9.3/8.9.3) with ESMTP id VAA50584 - for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 21:05:26 -0400 (EDT) - (envelope-from Inoue@tpf.co.jp) -Received: from cadzone ([126.0.1.40] (may be forged)) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id KAA01715; Wed, 20 Oct 1999 10:05:14 +0900 -From: "Hiroshi Inoue" <Inoue@tpf.co.jp> -To: "Tom Lane" <tgl@sss.pgh.pa.us> -Cc: <pgsql-hackers@postgreSQL.org> -Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations -Date: Wed, 20 Oct 1999 10:09:13 +0900 -Message-ID: <000501bf1a97$b925a860$2801007e@cadzone.tpf.co.jp> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit 
-X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 -X-Mimeole: Produced By Microsoft MimeOLE V4.72.2106.4 -Importance: Normal -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -> -----Original Message----- -> From: Hiroshi Inoue [mailto:Inoue@tpf.co.jp] -> Sent: Tuesday, October 19, 1999 6:45 PM -> To: Tom Lane -> Cc: pgsql-hackers@postgreSQL.org -> Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge -> relations -> -> -> > -> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: -> -> [snip] -> -> > -> > > Deletion is necessary only not to consume disk space. -> > > -> > > For example vacuum could remove not deleted files. -> > -> > Hmm ... interesting idea ... but I can hear the complaints -> > from users already... -> > -> -> My idea is only an analogy of PostgreSQL's simple recovery -> mechanism of tuples. -> -> And my main point is -> "delete fails after commit" doesn't harm the database -> except that not deleted files consume disk space. -> -> Of course, it's preferable to delete relation files immediately -> after (or just when) commit. -> Useless files are visible though useless tuples are invisible. -> - -Anyway I don't need "DROP TABLE inside transactions" now -and my idea is originally for that issue. - -After some thought, I propose the following solution. - -1. mdcreate() couldn't create existent relation files. - If the existent file is of length zero, we would overwrite - the file. (It seems the comment in md.c says so but the - code doesn't do so.) - If the file is an Index relation file, we would overwrite - the file. - -2. mdunlink() couldn't unlink non-existent relation files. - mdunlink() doesn't call elog(ERROR) even if the file - doesn't exist, though I couldn't find where to change - now. - mdopen() doesn't call elog(ERROR) even if the file - doesn't exist and leaves the relation as CLOSED. - -Comments? - -Regards. 
- -Hiroshi Inoue -Inoue@tpf.co.jp - -************ - -From pgsql-hackers-owner+M6267@hub.org Sun Aug 27 21:46:37 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA07972 - for <pgman@candle.pha.pa.us>; Sun, 27 Aug 2000 20:46:36 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7S0kaL27996; - Sun, 27 Aug 2000 20:46:36 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by hub.org (8.10.1/8.10.1) with ESMTP id e7S05aL24107 - for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:36 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id UAA01604 - for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:29 -0400 (EDT) -To: pgsql-hackers@postgreSQL.org -Subject: [HACKERS] Possible performance improvement: buffer replacement policy -Date: Sun, 27 Aug 2000 20:05:29 -0400 -Message-ID: <1601.967421129@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: RO - -Those of you with long memories may recall a benchmark that Edmund Mergl -drew our attention to back in May '99. That test showed extremely slow -performance for updating a table with many indexes (about 20). At the -time, it seemed the problem was due to bad performance of btree with -many equal keys, so I thought I'd go back and retry the benchmark after -this latest round of btree hackery. - -The good news is that btree itself seems to be pretty well fixed; the -bad news is that the benchmark is still slow for large numbers of rows. -The problem is I/O: the CPU mostly sits idle waiting for the disk. -As best I can tell, the difficulty is that the working set of pages -needed to update this many indexes is too large compared to the number -of disk buffers Postgres is using. 
(I was running with -B 1000 and -looking at behavior for a 100000-row test table. This gave me a table -size of 3876 pages, plus 11526 pages in 20 indexes.) - -Of course, there's only so much we can do when the number of buffers -is too small, but I still started to wonder if we are using the buffers -as effectively as we can. Some tracing showed that most of the pages -of the indexes were being read and written multiple times within a -single UPDATE query, while most of the pages of the table proper were -fetched and written only once. That says we're not using the buffers -as well as we could; the index pages are not being kept in memory when -they should be. In a query like this, we should displace main-table -pages sooner to allow keeping more index pages in cache --- but with -the simple LRU replacement method we use, once a page has been loaded -it will stay in cache for at least the next NBuffers (-B) page -references, no matter what. With a large NBuffers that's a long time. - -I've come across an interesting article: - The LRU-K Page Replacement Algorithm For Database Disk Buffering - Elizabeth J. O'Neil, Patrick E. O'Neil, Gerhard Weikum - Proceedings of the 1993 ACM SIGMOD international conference - on Management of Data, May 1993 -(If you subscribe to the ACM digital library, you can get a PDF of this -from there.) This article argues that standard LRU buffer management is -inherently not great for database caches, and that it's much better to -replace pages on the basis of time since the K'th most recent reference, -not just time since the most recent one. K=2 is enough to get most of -the benefit. The big win is that you are measuring an actual page -interreference time (between the last two references) and not just -dealing with a lower-bound guess on the interreference time. Frequently -used pages are thus much more likely to stay in cache. 
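
As a rough illustration of the replacement rule summarized above, here is a minimal LRU-2 sketch: the victim is the page whose second most recent reference is oldest, and pages referenced only once count as having an infinitely old second reference. It is a simplification made for this note, not the algorithm as specified in the paper (which also handles correlated references and retains history for evicted pages) and not PostgreSQL's buffer manager. The name LRU2Cache and its methods are hypothetical.

```python
import itertools


class LRU2Cache:
    """Toy LRU-2 page cache: evict by age of the second most recent reference."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._clock = itertools.count(1)   # logical reference counter
        self._history = {}                 # page -> (last_ref, previous_ref)

    def __contains__(self, page):
        return page in self._history

    def reference(self, page):
        """Record a reference; return the evicted page, or None."""
        now = next(self._clock)
        evicted = None
        if page in self._history:
            last, _ = self._history[page]
            self._history[page] = (now, last)
        else:
            if len(self._history) >= self.capacity:
                evicted = self._evict()
            # previous_ref == 0 means "never referenced twice", which sorts
            # before every real timestamp and so is evicted first.
            self._history[page] = (now, 0)
        return evicted

    def _evict(self):
        # Oldest second-most-recent reference loses; ties fall back to LRU.
        victim = min(
            self._history,
            key=lambda p: (self._history[p][1], self._history[p][0]),
        )
        del self._history[victim]
        return victim
```

With a capacity of 3, referencing an index page twice and then three distinct table pages evicts the first table page, whereas plain LRU would have evicted the index page. That is the behavior the UPDATE benchmark above would want.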
- -It looks like it wouldn't take too much work to replace shared buffers -on the basis of LRU-2 instead of LRU, so I'm thinking about trying it. - -Has anyone looked into this area? Is there a better method to try? - - regards, tom lane - -From prlw1@newn.cam.ac.uk Fri Jan 19 12:54:45 2001 -Received: from henry.newn.cam.ac.uk (henry.newn.cam.ac.uk [131.111.204.130]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA29822 - for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 12:54:44 -0500 (EST) -Received: from [131.111.204.180] (helo=quartz.newn.cam.ac.uk) - by henry.newn.cam.ac.uk with esmtp (Exim 3.13 #1) - id 14JfkU-0001WA-00; Fri, 19 Jan 2001 17:54:54 +0000 -Received: from prlw1 by quartz.newn.cam.ac.uk with local (Exim 3.13 #1) - id 14Jfj6-0001cq-00; Fri, 19 Jan 2001 17:53:28 +0000 -Date: Fri, 19 Jan 2001 17:53:28 +0000 -From: Patrick Welche <prlw1@newn.cam.ac.uk> -To: Bruce Momjian <pgman@candle.pha.pa.us> -Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy -Message-ID: <20010119175328.A6223@quartz.newn.cam.ac.uk> -Reply-To: prlw1@cam.ac.uk -References: <1601.967421129@sss.pgh.pa.us> <200101191703.MAA25873@candle.pha.pa.us> -Mime-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2i -In-Reply-To: <200101191703.MAA25873@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Fri, Jan 19, 2001 at 12:03:58PM -0500 -Status: RO - -On Fri, Jan 19, 2001 at 12:03:58PM -0500, Bruce Momjian wrote: -> -> Tom, did we ever test this? I think we did and found that it was the -> same or worse, right? 
- -(Funnily enough, I just read that message:) - -To: Bruce Momjian <pgman@candle.pha.pa.us> -cc: pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy -In-reply-to: <200010161541.LAA06653@candle.pha.pa.us> -References: <200010161541.LAA06653@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us> - message dated "Mon, 16 Oct 2000 11:41:41 -0400" -Date: Mon, 16 Oct 2000 11:49:52 -0400 -Message-ID: <26100.971711392@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: RO -Content-Length: 947 -Lines: 19 - -Bruce Momjian <pgman@candle.pha.pa.us> writes: ->> It looks like it wouldn't take too much work to replace shared buffers ->> on the basis of LRU-2 instead of LRU, so I'm thinking about trying it. ->> ->> Has anyone looked into this area? Is there a better method to try? - -> Sounds like a perfect idea. Good luck. :-) - -Actually, the idea went down in flames :-(, but I neglected to report -back to pghackers about it. I did do some code to manage buffers as -LRU-2. I didn't have any good performance test cases to try it with, -but Richard Brosnahan was kind enough to re-run the TPC tests previously -published by Great Bridge with that code in place. Wasn't any faster, -in fact possibly a little slower, likely due to the extra CPU time spent -on buffer freelist management. It's possible that other scenarios might -show a better result, but right now I feel pretty discouraged about the -LRU-2 idea and am not pursuing it. 
- - regards, tom lane - - -From pgsql-hackers-owner+M3455@postgresql.org Fri Jan 19 13:18:12 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA02092 - for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 13:18:12 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0JIFJ037872; - Fri, 19 Jan 2001 13:15:19 -0500 (EST) - (envelope-from pgsql-hackers-owner+M3455@postgresql.org) -Received: from sectorbase2.sectorbase.com ([208.48.122.131]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0JI7V036780 - for <pgsql-hackers@postgreSQL.org>; Fri, 19 Jan 2001 13:07:31 -0500 (EST) - (envelope-from vmikheev@SECTORBASE.COM) -Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19) - id <DG1W4LRZ>; Fri, 19 Jan 2001 09:46:14 -0800 -Message-ID: <8F4C99C66D04D4118F580090272A7A234D329F@sectorbase1.sectorbase.com> -From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> -To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us> -Cc: pgsql-hackers@postgresql.org -Subject: RE: [HACKERS] Possible performance improvement: buffer replacemen - t policy -Date: Fri, 19 Jan 2001 10:07:27 -0800 -MIME-Version: 1.0 -X-Mailer: Internet Mail Service (5.5.2653.19) -Content-Type: text/plain; - charset="iso-8859-1" -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: RO - -> > Tom, did we ever test this? I think we did and found that -> > it was the same or worse, right? -> -> I tried it and didn't see any noticeable improvement on the particular -> test case I was using, so I got discouraged and didn't pursue the idea -> further. I'd like to come back to it someday, though. - -I don't know how much useful could be LRU-2 but with WAL we should try -to reuse undirty free buffers first, not dirty ones, just to postpone -writes as long as we can. 
(BTW, this is what Oracle does.) -So, we probably should put new unfree dirty buffer just before first -dirty one in LRU. - -Vadim - -From markw@mohawksoft.com Thu Jun 7 14:40:02 2001 -Return-path: <markw@mohawksoft.com> -Received: from gromit.dotclick.com (ipn9-f8366.net-resource.net [216.204.83.66]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Ie1c14004 - for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 14:40:02 -0400 (EDT) -Received: from mohawksoft.com (IDENT:markw@localhost.localdomain [127.0.0.1]) - by gromit.dotclick.com (8.9.3/8.9.3) with ESMTP id OAA04973; - Thu, 7 Jun 2001 14:37:00 -0400 -Sender: markw@gromit.dotclick.com -Message-ID: <3B1FC9CB.57C72AD6@mohawksoft.com> -Date: Thu, 07 Jun 2001 14:36:59 -0400 -From: mlw <markw@mohawksoft.com> -X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.2 i686) -X-Accept-Language: en -MIME-Version: 1.0 -To: Bruce Momjian <pgman@candle.pha.pa.us>, - "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org> -Subject: Re: 7.2 items -References: <200106071503.f57F32n03924@candle.pha.pa.us> -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Status: RO - -Bruce Momjian wrote: - -> > Bruce Momjian <pgman@candle.pha.pa.us> writes: -> > -> > > Here is a small list of big TODO items. I was wondering which ones -> > > people were thinking about for 7.2? -> > -> > A friend of mine wants to use PostgreSQL instead of Oracle for a large -> > application, but has run into a snag when speed comparisons looked -> > good until the Oracle folks added a couple of BITMAP indexes. I can't -> > recall seeing any discussion about that here -- are there any plans? -> -> It is not on our list and I am not sure what they do. - -Do you have access to any Oracle Documentation? There is a good explanation -of them. - -However, I will try to explain. - -If you have a table, locations. It has 1,000,000 records. 
- -In oracle you do this: - -create bitmap index bitmap_foo on locations (state) ; - -For each unique value of 'state' oracle will create a bitmap with 1,000,000 -bits in it. With a one representing a match and a zero representing no -match. Record '0' in the table is represented by bit '0' in the bitmap, -record '1' is represented by bit '1', record two by bit '2' and so on. - -In a table where comparatively few different values are to be indexed in a -large table, a bitmap index can be quite small and not suffer the N * log(N) -disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or -dense (or have periods of denseness and sparseness), it can be compressed -very efficiently as well. - -When the statement: - -select * from locations where state = 'MA'; - -Is executed, the bitmap is read into memory in very few disk operations. -(Perhaps even as few as one or two). It is a simple operation of rifling -through the bitmap for '1's that indicate the record has the property, -'state' = 'MA'; - - -From mascarm@mascari.com Thu Jun 7 15:36:25 2001 -Return-path: <mascarm@mascari.com> -Received: from corvette.mascari.com (dhcp065-024-161-045.columbus.rr.com [65.24.161.45]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57JaOc21943 - for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:36:24 -0400 (EDT) -Received: from ferrari (ferrari.mascari.com [192.168.2.1]) - by corvette.mascari.com (8.9.3/8.9.3) with SMTP id PAA25607; - Thu, 7 Jun 2001 15:29:31 -0400 -Received: by localhost with Microsoft MAPI; Thu, 7 Jun 2001 15:34:18 -0400 -Message-ID: <01C0EF67.5105D2E0.mascarm@mascari.com> -From: Mike Mascari <mascarm@mascari.com> -Reply-To: "mascarm@mascari.com" <mascarm@mascari.com> -To: "'mlw'" <markw@mohawksoft.com>, Bruce Momjian <pgman@candle.pha.pa.us>, - "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org> -Subject: RE: [HACKERS] Re: 7.2 items -Date: Thu, 7 Jun 2001 15:34:17 -0400 -Organization: Mascari Development Inc. 
-X-Mailer: Microsoft Internet E-mail/MAPI - 8.0.0.4211 -MIME-Version: 1.0 -Content-Type: text/plain; charset="us-ascii" -Content-Transfer-Encoding: 7bit -Status: RO - -And in addition, - -If you submitted the query: - -SELECT * FROM addresses WHERE state = 'OH' -AND areacode = '614' - -Then, with bitmap indexes, the bitmaps are just logically ANDed -together, and the final bitmap determines the matching rows. - -Mike Mascari -mascarm@mascari.com - ------Original Message----- -From: mlw [SMTP:markw@mohawksoft.com] - -Bruce Momjian wrote: - -> > Bruce Momjian <pgman@candle.pha.pa.us> writes: -> > -> > > Here is a small list of big TODO items. I was wondering which -ones -> > > people were thinking about for 7.2? -> > -> > A friend of mine wants to use PostgreSQL instead of Oracle for a -large -> > application, but has run into a snag when speed comparisons -looked -> > good until the Oracle folks added a couple of BITMAP indexes. I -can't -> > recall seeing any discussion about that here -- are there any -plans? -> -> It is not on our list and I am not sure what they do. - -Do you have access to any Oracle Documentation? There is a good -explanation -of them. - -However, I will try to explain. - -If you have a table, locations. It has 1,000,000 records. - -In oracle you do this: - -create bitmap index bitmap_foo on locations (state) ; - -For each unique value of 'state' oracle will create a bitmap with -1,000,000 -bits in it. With a one representing a match and a zero representing -no -match. Record '0' in the table is represented by bit '0' in the -bitmap, -record '1' is represented by bit '1', record two by bit '2' and so -on. - -In a table where comparatively few different values are to be indexed -in a -large table, a bitmap index can be quite small and not suffer the N * -log(N) -disk I/O most tree based indexes suffer. If the bitmap is fairly -sparse or -dense (or have periods of denseness and sparseness), it can be -compressed -very efficiently as well. 
- -When the statement: - -select * from locations where state = 'MA'; - -Is executed, the bitmap is read into memory in very few disk -operations. -(Perhaps even as few as one or two). It is a simple operation of -rifling -through the bitmap for '1's that indicate the record has the -property, -'state' = 'MA'; - - - -From oleg@sai.msu.su Thu Jun 7 15:39:15 2001 -Return-path: <oleg@sai.msu.su> -Received: from ra.sai.msu.su (ra.sai.msu.su [158.250.29.2]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Jd7c22010 - for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:39:08 -0400 (EDT) -Received: from ra (ra [158.250.29.2]) - by ra.sai.msu.su (8.9.3/8.9.3) with ESMTP id WAA07783; - Thu, 7 Jun 2001 22:38:20 +0300 (GMT) -Date: Thu, 7 Jun 2001 22:38:20 +0300 (GMT) -From: Oleg Bartunov <oleg@sai.msu.su> -X-X-Sender: <megera@ra.sai.msu.su> -To: mlw <markw@mohawksoft.com> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, - "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Re: 7.2 items -In-Reply-To: <3B1FC9CB.57C72AD6@mohawksoft.com> -Message-ID: <Pine.GSO.4.33.0106072234120.6015-100000@ra.sai.msu.su> -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: RO - -I think it's possible to implement bitmap indexes with a little -effort using GiST. at least I know one implementation -http://www.it.iitb.ernet.in/~rvijay/dbms/proj/ -if you have interests you could implement bitmap indexes yourself -unfortunately, we're very busy - - Oleg -On Thu, 7 Jun 2001, mlw wrote: - -> Bruce Momjian wrote: -> -> > > Bruce Momjian <pgman@candle.pha.pa.us> writes: -> > > -> > > > Here is a small list of big TODO items. I was wondering which ones -> > > > people were thinking about for 7.2? -> > > -> > > A friend of mine wants to use PostgreSQL instead of Oracle for a large -> > > application, but has run into a snag when speed comparisons looked -> > > good until the Oracle folks added a couple of BITMAP indexes. 
I can't -> > > recall seeing any discussion about that here -- are there any plans? -> > -> > It is not on our list and I am not sure what they do. -> -> Do you have access to any Oracle Documentation? There is a good explanation -> of them. -> -> However, I will try to explain. -> -> If you have a table, locations. It has 1,000,000 records. -> -> In oracle you do this: -> -> create bitmap index bitmap_foo on locations (state) ; -> -> For each unique value of 'state' oracle will create a bitmap with 1,000,000 -> bits in it. With a one representing a match and a zero representing no -> match. Record '0' in the table is represented by bit '0' in the bitmap, -> record '1' is represented by bit '1', record two by bit '2' and so on. -> -> In a table where comparatively few different values are to be indexed in a -> large table, a bitmap index can be quite small and not suffer the N * log(N) -> disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or -> dense (or have periods of denseness and sparseness), it can be compressed -> very efficiently as well. -> -> When the statement: -> -> select * from locations where state = 'MA'; -> -> Is executed, the bitmap is read into memory in very few disk operations. -> (Perhaps even as few as one or two). It is a simple operation of rifling -> through the bitmap for '1's that indicate the record has the property, -> 'state' = 'MA'; -> -> -> ---------------------------(end of broadcast)--------------------------- -> TIP 6: Have you searched our list archives? 
-> -> http://www.postgresql.org/search.mpl -> - - Regards, - Oleg -_____________________________________________________________ -Oleg Bartunov, sci.researcher, hostmaster of AstroNet, -Sternberg Astronomical Institute, Moscow University (Russia) -Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ -phone: +007(095)939-16-83, +007(095)939-23-83 - - -From pgsql-general-owner+M2497@hub.org Fri Jun 16 18:31:03 2000 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA04165 - for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:31:01 -0400 (EDT) -Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.17 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e5GLDaM14477; - Fri, 16 Jun 2000 17:13:36 -0400 (EDT) -Received: from home.dialix.com ([203.15.150.26]) - by hub.org (8.10.1/8.10.1) with ESMTP id e5GLCQM14064 - for <pgsql-general@postgresql.org>; Fri, 16 Jun 2000 17:12:27 -0400 (EDT) -Received: from nemeton.com.au ([202.76.153.71]) - by home.dialix.com (8.9.3/8.9.3/JustNet) with SMTP id HAA95516 - for <pgsql-general@postgresql.org>; Sat, 17 Jun 2000 07:11:44 +1000 (EST) - (envelope-from giles@nemeton.com.au) -Received: (qmail 10213 invoked from network); 16 Jun 2000 09:52:29 -0000 -Received: from nemeton.com.au (203.8.3.17) - by nemeton.com.au with SMTP; 16 Jun 2000 09:52:29 -0000 -To: Jurgen Defurne <defurnj@glo.be> -cc: Mark Stier <kalium@gmx.de>, - postgreSQL general mailing list <pgsql-general@postgresql.org> -Subject: Re: [GENERAL] optimization by removing the file system layer? -In-Reply-To: Message from Jurgen Defurne <defurnj@glo.be> - of "Thu, 15 Jun 2000 20:26:57 +0200." 
<39491FF1.E1E583F8@glo.be> -Date: Fri, 16 Jun 2000 19:52:28 +1000 -Message-ID: <10210.961149148@nemeton.com.au> -From: Giles Lean <giles@nemeton.com.au> -X-Mailing-List: pgsql-general@postgresql.org -Precedence: bulk -Sender: pgsql-general-owner@hub.org -Status: OR - - - -> I think that the Un*x filesystem is one of the reasons that large -> database vendors rather use raw devices, than filesystem storage -> files. - -This used to be the preference, back in the late 80s and possibly -early 90s. I'm seeing a preference toward using the filesystem now, -possibly with some sort of async I/O and co-operation from the OS -filesystem about interactions with the filesystem cache. - -Performance preferences don't stand still. The hardware changes, the -software changes, the volume of data changes, and different solutions -become preferable. - -> Using a raw device on the disk gives them the possibility to have -> complete control over their files, indices and objects without being -> bothered by the operating system. -> -> This speeds up things in several ways : -> - the least possible OS intervention - -Not that this is especially useful, necessarily. If the "raw" device -is in fact managed by a logical volume manager doing mirroring onto -some sort of storage array there is still plenty of OS code involved. - -The cost of using a filesystem in addition may not be much if anything -and of course a filesystem is considerably more flexible to -administer (backup, move, change size, check integrity, etc.) - -> - choose block sizes according to applications -> - reducing fragmentation -> - packing data in nearby cilinders - -... but when this storage area is spread over multiple mechanisms in a -smart storage array with write caching, you've no idea what is where -anyway. 
Better to let the hardware or at least the OS manage this; -there are so many levels of caching between a database and the -magnetic media that working hard to influence layout is almost -certainly a waste of time. - -Kirk McKusick tells a lovely story that once upon a time it used to be -sensible to check some registers on a particular disk controller to -find out where the heads were when scheduling I/O. Needless to say, -that is history now! - -There's a considerable cost in complexity and code in using "raw" -storage too, and it's not a one-off cost: as the technologies change, -the "fast" way to do things will change and the code will have to be -updated to match. Better to leave this to the OS vendor where -possible, and take advantage of the tuning they do. - -> - Anyone other ideas -> the sky is the limit here - -> It also aids portability, at least on platforms that have an -> equivalent of a raw device. - -I don't understand that claim. Not much is portable about raw -devices, and they're typically not nearly as well documented as the -filesystem interfaces. - -> It is also independent of the standard implemented Un*x filesystems, -> for which you will have to pay extra if you want to take extra -> measures against power loss. - -Rather, it is worse. With a Unix filesystem you get well-defined -semantics about what is written when. - -> The problem with e.g. e2fs, is that it is not robust enough if a CPU -> fails. - -ext2fs doesn't even claim to have Unix filesystem semantics.
- -Regards, - -Giles - - - -From pgsql-hackers-owner+M1795@postgresql.org Thu Dec 7 18:47:52 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA09172 - for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 18:47:52 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB7NjFP10612; - Thu, 7 Dec 2000 18:45:15 -0500 (EST) - (envelope-from pgsql-hackers-owner+M1795@postgresql.org) -Received: from thor.tht.net (thor.tht.net [209.47.145.4]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB7N6BP08233 - for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:06:11 -0500 (EST) - (envelope-from bright@fw.wintelcom.net) -Received: from fw.wintelcom.net (bright@ns1.wintelcom.net [209.1.153.20]) - by thor.tht.net (8.9.3/8.9.3) with ESMTP id SAA97456 - for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:57:32 GMT - (envelope-from bright@fw.wintelcom.net) -Received: (from bright@localhost) - by fw.wintelcom.net (8.10.0/8.10.0) id eB7MvWE21269 - for pgsql-hackers@postgresql.org; Thu, 7 Dec 2000 14:57:32 -0800 (PST) -Date: Thu, 7 Dec 2000 14:57:32 -0800 -From: Alfred Perlstein <bright@wintelcom.net> -To: pgsql-hackers@postgresql.org -Subject: [HACKERS] Patches with vacuum fixes available for 7.0.x -Message-ID: <20001207145732.X16205@fw.wintelcom.net> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.5i -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -We recently had a very satisfactory contract completed by -Vadim. - -Basically Vadim has been able to reduce the amount of time -taken by a vacuum from 10-15 minutes down to under 10 seconds. 
- -We've been running with these patches under heavy load for -about a week now without any problems except one: - don't 'lazy' (new option for vacuum) a table which has just - had an index created on it, or at least don't expect it to - take any less time than a normal vacuum would. - -There's three patchsets and they are available at: - -http://people.freebsd.org/~alfred/vacfix/ - -complete diff: -http://people.freebsd.org/~alfred/vacfix/v.diff - -only lazy vacuum option to speed up index vacuums: -http://people.freebsd.org/~alfred/vacfix/vlazy.tgz - -only lazy vacuum option to only scan from start of modified -data: -http://people.freebsd.org/~alfred/vacfix/mnmb.tgz - -Although the patches are for 7.0.x I'm hoping that they -can be forward ported (if Vadim hasn't done it already) -to 7.1. - -enjoy! - --- --Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] -"I have the heart of a child; I keep it in a jar on my desk." - -From pgsql-hackers-owner+M1809@postgresql.org Thu Dec 7 20:27:39 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA11827 - for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 20:27:38 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB81PsP22362; - Thu, 7 Dec 2000 20:25:54 -0500 (EST) - (envelope-from pgsql-hackers-owner+M1809@postgresql.org) -Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB81JkP21783 - for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 20:19:46 -0500 (EST) - (envelope-from bright@fw.wintelcom.net) -Received: (from bright@localhost) - by fw.wintelcom.net (8.10.0/8.10.0) id eB81JwU25447; - Thu, 7 Dec 2000 17:19:58 -0800 (PST) -Date: Thu, 7 Dec 2000 17:19:58 -0800 -From: Alfred Perlstein <bright@wintelcom.net> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: 
pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Patches with vacuum fixes available for 7.0.x -Message-ID: <20001207171958.B16205@fw.wintelcom.net> -References: <20001207145732.X16205@fw.wintelcom.net> <28791.976236143@sss.pgh.pa.us> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.5i -In-Reply-To: <28791.976236143@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Thu, Dec 07, 2000 at 07:42:23PM -0500 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -* Tom Lane <tgl@sss.pgh.pa.us> [001207 17:10] wrote: -> Alfred Perlstein <bright@wintelcom.net> writes: -> > Basically Vadim has been able to reduce the amount of time -> > taken by a vacuum from 10-15 minutes down to under 10 seconds. -> -> Cool. What's it do, exactly? - -================================================================ - -The first is a bonus that Vadim gave us to speed up index -vacuums. I'm not sure I understand it completely, but it -works really well. :) - -here's the README he gave us: - - Vacuum LAZY index cleanup option - -The LAZY vacuum option introduces a new way of cleaning up indices. -Instead of reading the entire index file to remove index tuples -pointing to deleted table records, with the LAZY option vacuum -performs index scans using keys fetched from the table records -to be deleted. Vacuum checks whether each result returned by the -index scan points to the target heap record and, if so, removes -the corresponding index tuple. -This can greatly speed up index cleanup if not many -table records were deleted/modified between vacuum runs. -Vacuum uses the new option only on user demand.
- -New vacuum syntax is: - -vacuum [verbose] [analyze] [lazy] [table [(columns)]] - -================================================================ - -The second is one of the suggestions I gave on the lists a while -back, keeping track of the "last dirtied" block in the data files -to only scan the tail end of the file for deleted rows, I think -what he instead did was keep a table that holds all the modified -blocks and vacuum only scans those: - - Minimal Number Modified Block (MNMB) - -This feature is to track MNMB of required tables with triggers -to avoid reading unmodified table pages by vacuum. Triggers -store MNMB in per-table files in specified directory -($LIBDIR/contrib/mnmb by default) and create these files if not -existed. - -Vacuum first looks up functions - -mnmb_getblock(Oid databaseId, Oid tableId) -mnmb_setblock(Oid databaseId, Oid tableId, Oid block) - -in catalog. If *both* functions were found *and* there was no -ANALYZE option specified then vacuum calls mnmb_getblock to obtain -MNMB for table being vacuumed and starts reading this table from -block number returned. After table was processed vacuum calls -mnmb_setblock to update data in file to last table block number. -Neither mnmb_getblock nor mnmb_setblock try to create file. -If there was no file for table being vacuumed then mnmb_getblock -returns 0 and mnmb_setblock does nothing. -mnmb_setblock() may be used to set in file MNMB to 0 and force -vacuum to read entire table if required. - -To compile MNMB you have to add -DMNMB to CUSTOM_COPT -in src/Makefile.custom. - --- --Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] -"I have the heart of a child; I keep it in a jar on my desk." 
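The MNMB bookkeeping described in the README above can be sketched roughly as follows. This is only an illustrative Python model, not Vadim's patch: the MnmbStore class, record_modification() and vacuum_blocks() are hypothetical names, an in-memory dict stands in for the per-table files, and the "keep the lowest modified block" rule is an assumption inferred from the name "Minimal Number Modified Block". The getblock/setblock behaviour (return 0 when there is no file, do nothing when there is no file, skip MNMB under ANALYZE) follows the README text.

```python
class MnmbStore:
    """In-memory stand-in (hypothetical) for the per-table MNMB files."""

    def __init__(self):
        self._blocks = {}

    def record_modification(self, database_id, table_id, block):
        # Trigger side: create the entry if missing, otherwise keep the
        # lowest modified block number (assumed from the name MNMB).
        key = (database_id, table_id)
        current = self._blocks.get(key)
        self._blocks[key] = block if current is None else min(current, block)

    def getblock(self, database_id, table_id):
        # Per the README: returns 0 when there is no file for the table.
        return self._blocks.get((database_id, table_id), 0)

    def setblock(self, database_id, table_id, block):
        # Per the README: never creates the file, so do nothing if missing.
        key = (database_id, table_id)
        if key in self._blocks:
            self._blocks[key] = block


def vacuum_blocks(store, database_id, table_id, table_blocks, analyze=False):
    """Return the list of block numbers a vacuum run would read."""
    if analyze:
        # With ANALYZE the MNMB functions are not used: read everything.
        return list(range(table_blocks))
    start = store.getblock(database_id, table_id)
    scanned = list(range(start, table_blocks))
    # After processing, point MNMB at the last block of the table.
    store.setblock(database_id, table_id, max(table_blocks - 1, 0))
    return scanned


store = MnmbStore()
store.record_modification(1, 42, 7)
store.record_modification(1, 42, 4)
print(vacuum_blocks(store, 1, 42, 10))  # reads blocks 4 through 9 only
print(store.getblock(1, 42))            # MNMB now points at the last block
```

In this model a table with no MNMB entry is always read from block 0, which matches the README's fallback of reading the entire table.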
- -From pgsql-general-owner+M4010@postgresql.org Mon Feb 5 18:50:47 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA02209 - for <pgman@candle.pha.pa.us>; Mon, 5 Feb 2001 18:50:46 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15Nn8x86486; - Mon, 5 Feb 2001 18:49:08 -0500 (EST) - (envelope-from pgsql-general-owner+M4010@postgresql.org) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15N7Ux81124 - for <pgsql-general@postgresql.org>; Mon, 5 Feb 2001 18:07:30 -0500 (EST) - (envelope-from pgsql-general-owner@postgresql.org) -Received: from news.tht.net (news.hub.org [216.126.91.242]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0V0Twq69854 - for <pgsql-general@postgresql.org>; Tue, 30 Jan 2001 19:29:58 -0500 (EST) - (envelope-from news@news.tht.net) -Received: (from news@localhost) - by news.tht.net (8.11.1/8.11.1) id f0V0RAO01011 - for pgsql-general@postgresql.org; Tue, 30 Jan 2001 19:27:10 -0500 (EST) - (envelope-from news) -From: Mike Hoskins <mikehoskins@yahoo.com> -X-Newsgroups: comp.databases.postgresql.general -Subject: Re: [GENERAL] MySQL file system -Date: Tue, 30 Jan 2001 18:30:36 -0600 -Organization: Hub.Org Networking Services (http://www.hub.org) -Lines: 120 -Message-ID: <3A775CAB.C416AA16@yahoo.com> -References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -X-Complaints-To: scrappy@hub.org -X-Mailer: Mozilla 4.76 [en] (Windows NT 5.0; U) -X-Accept-Language: en -To: pgsql-general@postgresql.org -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org -Status: OR - -This idea is such a popular (even old) one that Oracle developed it for 8i -- -IFS. 
Yep, AS/400 has had it forever, and BeOS is another example. Informix has -had its DataBlades for years, as well. In fact, Reiser-FS is an FS implemented -on a DB, albeit probably not a SQL DB. AIX's LVM and JFS is extent/DB-based, as -well. Let's see now, why would all those guys do that? (Now, some of those that -aren't SQL-based probably won't allow SQL queries on files, so just think about -those that do, for a minute).... - -Rather than asking why, a far better question is why not? There is SO much -functionality to be gained here that it's silly to ask why. At a higher level, -treating BLOBs as files and as DB entries simultaneously has so many uses, that -one has trouble answering the question properly without the puzzled stare back -at the questioner. Again, look at the above list, particularly at AS/400 -- the -entire OS's FS sits on top of DB/2! - -For example, think how easy dynamically generated web sites could access online -catalog information, with all those JPEG's, GIFs, PNGs, HTML files, Text files, -.PDF's, etc., both in the DB and in the FS. This would be so much easier to -maintain, when you have webmasters, web designers, artists, programmers, -sysadmins, dba's, etc., all trying to manage a big, dynamic, graphics-rich web -site. Who cares if the FS is a bit slow, as long as it's not too slow? That's -not the point, anyway. - -The point is easy access to data: asset management, version control, the -ability to access the same data as a file and as a BLOB simultaneously, the -ability to replicate easier, the ability to use more tools on the same info, -etc. It's not for speed, per se; instead, it's for accessibility. - -Think about this issue. You have some already compiled text-based program that -works on binary files, but not on databases -- it was simply never designed into -the program. How are you going to get your graphics BLOBs into that program? 
-Oh yeah, let's write another program to transform our data into files, first, -then after processing delete them in some cleanup routine.... Why? If you have -a DB'ed FS, then file data can simultaneously have two views -- one for the DB -and one as an FS. (You can easily reverse the scenario.) Not only does this -save time and disk space; it saves you from having to pay for the most expensive -element of all -- programmer time. - -BTW, once this FS-on-a-DB concept really sinks in, imagine how tightly -integrated Linux/Unix apps could be written. Imagine if a bunch of GPL'ed -software started coding for this and used this as a means to exchange data, all -using a common set of libraries. You could get to the point of uniting files, -BLOBs, data of all sorts, IPC, version control, etc., all under one umbrella, -especially if XML was the means data was exchanged. Heck, distributed -authentication, file access, data access, etc., could be improved greatly. -Well, this paragraph sounds like flame bait, but really consider the -ramifications. Also, read the next paragraph.... - -Something like this *has* existed for Postgres for a long time -- PGFS, by Brian -Bartholomew. It's even supposedly matured with age. Unfortunately, I cannot -get to http://www.wv.com/ (Working Version's main site). Working Version is a -version control system that keeps old versions of files around in the FS. It -uses PG as the back-end DB and lets you mount it like another FS. It's -supposedly an awesome system, but where is it? It's not some clunky korbit -thingy, either. (If someone can find it, please let me know by email, if -possible.) 
- -The only thing I can find on this is from a Google search, which caches -everything but the actual software: - -http://www.google.com/search?q=pgfs+postgres&num=100&hl=en&lr=lang_en&newwindow=1&safe=active - -Also, there is the Perl-FS that can be transformed into something like PGFS: -http://www.assurdo.com/perlfs/ It allows you to write Perl code that can mount -various protocols or data types as an FS, in user space. (One example is the -ability to mount FTP sites, BTW.) - -Instead of ridiculing something you've never tried, consider that MySQL-FS, -Oracle (IFS), Informix (DataBlades), AS/400 (DB/2), BeOS, and Reiser-FS are -doing this today. Do you want to be left behind and let them tell us what it's -good for? Or, do we want this for PG? (Reiser-FS, BTW, is FASTER than ext2, -but has no SQL hooks). - -There were many posts on this on slashdot: - http://slashdot.org/article.pl?sid=01/01/16/1855253&mode=thread - (I wrote some comments here, as well, just look for mikehoskins) - -I, for one, want to see this succeed for MySQL, PostgreSQL, msql, etc. It's an -awesome feature that doesn't need to be speedy because it can save HUMANS time. - -The question really is, "When do we want to catch up to everyone else?" We are -always moving to higher levels of abstraction, anyway, so it's just a matter of -time. PG should participate. - - -Adam Lang wrote: - -> I wasn't following the thread too closely, but database for a filesystem has -> been done. BeOS uses a database for a filesystem as well as AS/400 and -> Mainframes. -> -> Adam Lang -> Systems Engineer -> Rutgers Casualty Insurance Company -> http://www.rutgersinsurance.com -> ----- Original Message ----- -> From: "Alfred Perlstein" <bright@wintelcom.net> -> To: "Robert D. 
Nelson" <RDNELSON@co.centre.pa.us> -> Cc: "Joseph Shraibman" <jks@selectacast.net>; "Karl DeBisschop" -> <karl@debisschop.net>; "Ned Lilly" <ned@greatbridge.com>; "PostgreSQL -> General" <pgsql-general@postgresql.org> -> Sent: Wednesday, January 17, 2001 12:23 PM -> Subject: Re: [GENERAL] MySQL file system -> -> > * Robert D. Nelson <RDNELSON@co.centre.pa.us> [010117 05:17] wrote: -> > > >Raw disk access allows: -> > > -> > > If I'm correct, mysql is providing a filesystem, not a way to access raw -> > > disk, like Oracle does. Huge difference there - with a filesystem, you -> have -> > > overhead of FS *and* SQL at the same time. -> > -> > Oh, so it's sort of like /proc for mysql? -> > -> > What a terrible waste of time and resources. :( -> > -> > -- -> > -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] -> > "I have the heart of a child; I keep it in a jar on my desk." - - -From pgsql-general-owner+M4049@postgresql.org Tue Feb 6 01:26:19 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA21425 - for <pgman@candle.pha.pa.us>; Tue, 6 Feb 2001 01:26:18 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f166Nxx26400; - Tue, 6 Feb 2001 01:23:59 -0500 (EST) - (envelope-from pgsql-general-owner+M4049@postgresql.org) -Received: from simecity.com ([202.188.254.2]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f166GUx25754 - for <pgsql-general@postgresql.org>; Tue, 6 Feb 2001 01:16:30 -0500 (EST) - (envelope-from lyeoh@pop.jaring.my) -Received: (from mail@localhost) - by simecity.com (8.9.3/8.8.7) id OAA23910; - Tue, 6 Feb 2001 14:28:48 +0800 -Received: from <lyeoh@pop.jaring.my> (ilab2.mecomb.po.my [192.168.3.22]) by cirrus.simecity.com via smap (V2.1) - id xma023908; Tue, 6 Feb 01 14:28:34 +0800 -Message-ID: <3.0.5.32.20010206141555.00a3d100@192.228.128.13> -X-Sender: 
lyeoh@192.228.128.13 -X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.5 (32) -Date: Tue, 06 Feb 2001 14:15:55 +0800 -To: Mike Hoskins <mikehoskins@yahoo.com>, pgsql-general@postgresql.org -From: Lincoln Yeoh <lyeoh@pop.jaring.my> -Subject: [GENERAL] Re: MySQL file system -In-Reply-To: <3A775CF7.3C5F1909@yahoo.com> -References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006> -MIME-Version: 1.0 -Content-Type: text/plain; charset="us-ascii" -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org -Status: OR - -What you're saying seems to be to have a data structure where the same data -can be accessed in both the filesystem style and the RDBMS style. How does -that work? How is the mapping done between both structures? Slapping a -filesystem on top of an RDBMS doesn't do that, does it? - -Most filesystems are basically databases already, just differently -structured and featured databases. And so far most of them do their job -pretty well. You move a folder/directory somewhere, and everything inside -it moves. Tons of data are already arranged in that form. Though porting -over data from one filesystem to another is not always straightforward, -RDBMSes are far worse. - -Maybe what would be nice is not a filesystem based on a database, but rather -one influenced by databases. One with a decent full-text index for data and -filenames, where you have the option to ignore or not ignore -nonalphanumerics and still get an indexed search. - -Then perhaps we could do something like the following: - -select file.name from path '/var/logs/' where file.name like '%.log%' and -file.lastmodified > '2000/1/1' and file.contents =~ 'te_st[0-9]+\.gif$' use -index - -Checkpoints would be nice too. Then I can roll back to a known point if I -screw up ;). - -In fact the SQL style interface doesn't have to be built in at all. Neither -does the index have to be realtime. I suppose there could be an option to -make it realtime if performance is not an issue.
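To make the hypothetical query above concrete, here is a minimal Python sketch of the same three conditions (name match, modification time, contents pattern) run against an ordinary filesystem. It is an assumption-laden illustration: find_files() is a made-up helper, it does a plain directory walk with no index at all (so it does not deliver the "use index" part), and it reads each candidate file fully into memory.

```python
import os
import re


def find_files(root, name_substring, modified_after, contents_pattern):
    """Yield paths under `root` that satisfy all three conditions.

    name_substring   -- plain substring, standing in for LIKE '%...%'
    modified_after   -- Unix timestamp; only newer files are considered
    contents_pattern -- regular expression searched in the file contents
    """
    pattern = re.compile(contents_pattern, re.MULTILINE)
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            if name_substring not in filename:
                continue
            path = os.path.join(directory, filename)
            try:
                if os.path.getmtime(path) <= modified_after:
                    continue
                with open(path, "r", errors="replace") as handle:
                    if pattern.search(handle.read()):
                        yield path
            except OSError:
                # Unreadable or vanished files are simply skipped.
                continue
```

Calling list(find_files("/var/logs/", ".log", cutoff, r"te_st[0-9]+\.gif$")) with cutoff set to the timestamp for 2000-01-01 would mirror the query; a real tool would keep a separately maintained index instead of scanning every file on each search.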
- -What could be done is to use some fast filesystem. Then we add tools to -maintain indexes, for SQL style interfaces and other style interfaces. -Checkpoints and rollbacks would be harder of course. - -Cheerio, -Link. - - -From pgsql-hackers-owner+M20329@postgresql.org Tue Mar 19 18:00:15 2002 -Return-path: <pgsql-hackers-owner+M20329@postgresql.org> -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g2K00EA02465 - for <pgman@candle.pha.pa.us>; Tue, 19 Mar 2002 19:00:14 -0500 (EST) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 8C7164763EF; Tue, 19 Mar 2002 18:22:08 -0500 (EST) -Received: from CopelandConsulting.Net (dsl-24293-ld.customer.centurytel.net [209.142.135.135]) - by postgresql.org (Postfix) with ESMTP id E4DAD475F1F - for <pgsql-hackers@postgresql.org>; Tue, 19 Mar 2002 18:02:17 -0500 (EST) -Received: from mouse.copelandconsulting.net (mouse.copelandconsulting.net [192.168.1.2]) - by CopelandConsulting.Net (8.10.1/8.10.1) with ESMTP id g2JN0jh13185; - Tue, 19 Mar 2002 17:00:45 -0600 (CST) -X-Trade-Id: <CCC.Tue, 19 Mar 2002 17:00:45 -0600 (CST).Tue, 19 Mar 2002 17:00:45 -0600 (CST).200203192300.g2JN0jh13185.g2JN0jh13185@CopelandConsulting.Net. -Subject: Re: [HACKERS] Bitmap indexes? 
-From: Greg Copeland <greg@CopelandConsulting.Net> -To: Matthew Kirkwood <matthew@hairy.beasts.org> -cc: Oleg Bartunov <oleg@sai.msu.su>, - PostgresSQL Hackers Mailing List <pgsql-hackers@postgresql.org> - <Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com> - <Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com> -Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; - boundary="=-Ivchb84S75fOMzJ9DxwK" -X-Mailer: Evolution/1.0.2 -Date: 19 Mar 2002 17:00:53 -0600 -Message-ID: <1016578854.14670.450.camel@mouse.copelandconsulting.net> -MIME-Version: 1.0 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - ---=-Ivchb84S75fOMzJ9DxwK -Content-Type: text/plain -Content-Transfer-Encoding: quoted-printable - -On Tue, 2002-03-19 at 15:30, Matthew Kirkwood wrote: -> On Tue, 19 Mar 2002, Oleg Bartunov wrote: ->=20 -> Sorry to reply over you, Oleg. ->=20 -> > On 13 Mar 2002, Greg Copeland wrote: -> > -> > > One of the reasons why I originally stated following the hackers list= - is -> > > because I wanted to implement bitmap indexes. I found in the archive= -s, -> > > the follow link, http://www.it.iitb.ernet.in/~rvijay/dbms/proj/, which -> > > was extracted from this, -> > > http://groups.google.com/groups?hl=3Den&threadm=3D01C0EF67.5105D2E0.m= -ascarm%40mascari.com&rnum=3D1&prev=3D/groups%3Fq%3Dbitmap%2Bindex%2Bgroup:c= -omp.databases.postgresql.hackers%26hl%3Den%26selm%3D01C0EF67.5105D2E0.masca= -rm%2540mascari.com%26rnum%3D1, archive thread. ->=20 -> For every case I have used a bitmap index on Oracle, a -> partial index[0] made more sense (especialy since it -> could usefully be compound). - -That's very true, however, often bitmap indexes are used where partial -indexes may not work well. It maybe you were trying to apply the cure -for the wrong disease. 
;) - ->=20 -> Our troublesome case (on Oracle) is a table of "events" -> where maybe fifty to a couple of hundred are "published" -> (ie. web-visible) at any time. The events are categorised -> by sport (about a dozen) and by "event type" (about five). -> We never really query events except by PK or by sport/type/ -> published. - -The reason why bitmap indexes are primarily used for DSS and data -warehousing applications is because they are best used on extremely -large to very large tables which have low cardinality (e.g., 10,000,000 -rows having 200 distinct values). On top of that, bitmap indexes also -tend to be much smaller than their "standard" cousins. On large and -very large tables, this can sometimes save gigs in index space alone -(serious space benefit). Plus, their small index size tends to result -in much less I/O (serious speed benefit). This, of course, can result -in several orders of magnitude speed improvements when index scans are -required. As an added bonus, AND, OR, XOR and NOT predicates are -exceptionally fast and if implemented properly, can even take advantage -of some 64-bit hardware for further speed improvements. This, of -course, further speeds lookups. The primary downside is that inserts -and updates to bitmap indexes are very costly (comparatively) which is, -yet again, why they excel in read-only environments (DSS & data -warehousing). - -It should also be noted that RDBMSs, such as Oracle, often use multiple -types of bitmap indexes. This further impedes insert/update -performance, however, the additional bitmap index types usually allow -for range predicates while still making use of the bitmap index. If I'm -not mistaken, several other types of bitmaps are available as well as -many ways to encode and compress (rle, quad compression, etc) bitmap -indexes which further save on an already compact indexing scheme. - -Given the proper problem domain, bitmap indexes can be a big win.
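To make the AND/OR point concrete, here is a minimal sketch of a bitmap index over low-cardinality columns. It is an illustration only, not Oracle's or anyone's actual implementation: the column data and the helper names build_bitmap_index and rows_of are hypothetical, and a Python integer stands in for the bit vector.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Return {distinct value: int bitmap}; bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Expand a bitmap back into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical low-cardinality columns for a six-row table.
sport = ["golf", "tennis", "golf", "rugby", "tennis", "golf"]
published = ["Y", "N", "Y", "Y", "N", "N"]

sport_idx = build_bitmap_index(sport)
published_idx = build_bitmap_index(published)

# sport = 'golf' AND published = 'Y' is a single bitwise AND.
golf_published = rows_of(sport_idx["golf"] & published_idx["Y"])

# sport = 'tennis' OR sport = 'rugby' is a single bitwise OR.
tennis_or_rugby = rows_of(sport_idx["tennis"] | sport_idx["rugby"])
```

Each distinct value costs one bit per row, which is why the scheme only pays off when the number of distinct values is small relative to the row count, and why every insert or update has to touch a bitmap per affected value.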
- ->=20 -> We make a bitmap index on "published", and trust Oracle to -> use it correctly, and hope that our other indexes are also -> useful. ->=20 -> On Postgres[1] we would make a partial compound index: ->=20 -> create index ... on events(sport_id,event_type_id) -> where published=3D'Y'; - - -Generally speaking, bitmap indexes will not serve you very will on -tables having a low row counts, high cardinality or where they are -attached to tables which are primarily used in an OLTP capacity.=20 -Situations where you have a low row count and low cardinality or high -row count and high cardinality tend to be better addressed by partial -indexes; which seem to make much more sense. In your example, it sounds -like you did "the right thing"(tm). ;) - - -Greg - - ---=-Ivchb84S75fOMzJ9DxwK -Content-Type: application/pgp-signature; name=signature.asc -Content-Description: This is a digitally signed message part - ------BEGIN PGP SIGNATURE----- -Version: GnuPG v1.0.6 (GNU/Linux) -Comment: For info see http://www.gnupg.org - -iD8DBQA8l8Ml4lr1bpbcL6kRAhldAJ9Aoi9dwm1OteZjySfsd1o42trWLACfegQj -OEV6eO8MnBSlbJMHiQ08gNE= -=PQvW ------END PGP SIGNATURE----- - ---=-Ivchb84S75fOMzJ9DxwK-- - - -From pgsql-hackers-owner+M26157@postgresql.org Tue Aug 6 23:06:34 2002 -Date: Wed, 7 Aug 2002 13:07:38 +1000 (EST) -From: Gavin Sherry <swm@linuxworld.com.au> -To: Curt Sampson <cjs@cynic.net> -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] CLUSTER and indisclustered -In-Reply-To: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net> -Message-ID: <Pine.LNX.4.21.0208071259210.13438-100000@linuxworld.com.au> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Length: 1357 - -On Wed, 7 Aug 2002, Curt Sampson wrote: - -> But after doing some benchmarking of various sorts of random reads -> and writes, it occurred to me that there might be optimizations -> that could help a lot 
with this sort of thing. What if, when we've -> got an index block with a bunch of entries, instead of doing the -> reads in the order of the entries, we do them in the order of the -> blocks the entries point to? That would introduce a certain amount -> of "sequentialness" to the reads that the OS is not capable of -> introducing (since it can't reschedule the reads you're doing, the -> way it could reschedule, say, random writes). - -This sounds more or less like the method employed by Firebird as described -by Ann Douglas to Tom at OSCON (correct me if I get this wrong). - -Basically, firebird populates a bitmap with entries the scan is interested -in. The bitmap is populated in page order so that all entries on the same -heap page can be fetched at once. - -This is totally different to the way postgres does things and would -require significant modification to the index access methods. - -Gavin - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M26162@postgresql.org Wed Aug 7 00:42:35 2002 -To: Curt Sampson <cjs@cynic.net> -cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] CLUSTER and indisclustered -In-Reply-To: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net> -References: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net> -Comments: In-reply-to Curt Sampson <cjs@cynic.net> - message dated "Wed, 07 Aug 2002 11:31:32 +0900" -Date: Wed, 07 Aug 2002 00:41:47 -0400 -Message-ID: <12593.1028695307@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS 
new-20020517 -Content-Length: 3063 - -Curt Sampson <cjs@cynic.net> writes: -> But after doing some benchmarking of various sorts of random reads -> and writes, it occurred to me that there might be optimizations -> that could help a lot with this sort of thing. What if, when we've -> got an index block with a bunch of entries, instead of doing the -> reads in the order of the entries, we do them in the order of the -> blocks the entries point to? - -I thought to myself "didn't I just post something about that?" -and then realized it was on a different mailing list. Here ya go -(and no, this is not the first time around on this list either...) - - -I am currently thinking that bitmap indexes per se are not all that -interesting. What does interest me is bitmapped index lookup, which -came back into mind after hearing Ann Harrison describe how FireBird/ -InterBase does it. - -The idea is that you don't scan the index and base table concurrently -as we presently do it. Instead, you scan the index and make a list -of the TIDs of the table tuples you need to visit. This list can -be conveniently represented as a sparse bitmap. After you've finished -looking at the index, you visit all the required table tuples *in -physical order* using the bitmap. This eliminates multiple fetches -of the same heap page, and can possibly let you get some win from -sequential access. - -Once you have built this mechanism, you can then move on to using -multiple indexes in interesting ways: you can do several indexscans -in one query and then AND or OR their bitmaps before doing the heap -scan. This would allow, for example, "WHERE a = foo and b = bar" -to be handled by ANDing results from separate indexes on the a and b -columns, rather than having to choose only one index to use as we do -now. - -Some thoughts about implementation: FireBird's implementation seems -to depend on an assumption about a fixed number of tuple pointers -per page. 
We don't have that, but we could probably get away with -just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page. -Also, the main downside of this approach is that the bitmap could -get large --- but you could have some logic that causes you to fall -back to plain sequential scan if you get too many index hits. (It's -interesting to think of this as lossy compression of the bitmap... -which leads to the idea of only being fuzzy in limited areas of the -bitmap, rather than losing all the information you have.) - -A possibly nasty issue is that lazy VACUUM has some assumptions in it -about indexscans holding pins on index pages --- that's what prevents -it from removing heap tuples that a concurrent indexscan is just about -to visit. It might be that there is no problem: even if lazy VACUUM -removes a heap tuple and someone else then installs a new tuple in that -same TID slot, you should be okay because the new tuple is too new to -pass your visibility test. But I'm not convinced this is safe. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? 
- -http://archives.postgresql.org - -From pgsql-hackers-owner+M26172@postgresql.org Wed Aug 7 02:49:56 2002 -X-Authentication-Warning: rh72.home.ee: hannu set sender to hannu@tm.ee using -f -Subject: Re: [HACKERS] CLUSTER and indisclustered -From: Hannu Krosing <hannu@tm.ee> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, - Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -In-Reply-To: <12776.1028697148@sss.pgh.pa.us> -References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> - <12776.1028697148@sss.pgh.pa.us> -X-Mailer: Ximian Evolution 1.0.7 -Date: 07 Aug 2002 09:46:29 +0500 -Message-ID: <1028695589.2133.11.camel@rh72.home.ee> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Length: 1064 - -On Wed, 2002-08-07 at 10:12, Tom Lane wrote: -> Curt Sampson <cjs@cynic.net> writes: -> > On Wed, 7 Aug 2002, Tom Lane wrote: -> >> Also, the main downside of this approach is that the bitmap could -> >> get large --- but you could have some logic that causes you to fall -> >> back to plain sequential scan if you get too many index hits. -> -> > Well, what I was thinking of, should the list of TIDs to fetch get too -> > long, was just to break it down in to chunks. -> -> But then you lose the possibility of combining multiple indexes through -> bitmap AND/OR steps, which seems quite interesting to me. If you've -> visited only a part of each index then you can't apply that concept. - -When the tuples are small relative to pagesize, you may get some -"compression" by saving just pages and not the actual tids in the the -bitmap. 
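A rough sketch of the page-only idea, under stated assumptions: the heap is modelled as a list of pages, each page a list of tuples, and the names pages_to_visit and scan_with_page_bitmap are made up for illustration. Because only page numbers are kept, every tuple on a candidate page has to be rechecked against the query condition.

```python
def pages_to_visit(tids):
    """Collapse (page, offset) TIDs to the sorted list of distinct heap pages."""
    return sorted({page for page, _offset in tids})


def scan_with_page_bitmap(heap, tids, predicate):
    """Visit each candidate page once, in physical order, rechecking every tuple."""
    hits = []
    for page in pages_to_visit(tids):
        for offset, tup in enumerate(heap[page]):
            if predicate(tup):
                hits.append((page, offset))
    return hits


# Hypothetical heap: four pages of three tuples each.
heap = [[1, 7, 3], [7, 7, 2], [4, 5, 6], [7, 0, 9]]

# TIDs for "value = 7" as an index scan might return them, in index order.
tids_from_index = [(3, 0), (0, 1), (1, 1), (1, 0)]

visited = pages_to_visit(tids_from_index)
matches = scan_with_page_bitmap(heap, tids_from_index, lambda t: t == 7)
```

Page 2 is never read, and page 1 is read once even though the index pointed at it twice.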
- -------------- -Hannu - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M26166@postgresql.org Wed Aug 7 00:55:52 2002 -Date: Wed, 7 Aug 2002 13:55:41 +0900 (JST) -From: Curt Sampson <cjs@cynic.net> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] CLUSTER and indisclustered -In-Reply-To: <12593.1028695307@sss.pgh.pa.us> -Message-ID: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Length: 1840 - -On Wed, 7 Aug 2002, Tom Lane wrote: - -> I thought to myself "didn't I just post something about that?" -> and then realized it was on a different mailing list. Here ya go -> (and no, this is not the first time around on this list either...) - -Wow. I'm glad to see you looking at this, because this feature would do -*so* much for the performance of some of my queries, and really, really -impress my "billion-row-database" client. - -> The idea is that you don't scan the index and base table concurrently -> as we presently do it. Instead, you scan the index and make a list -> of the TIDs of the table tuples you need to visit. - -Right. - -> Also, the main downside of this approach is that the bitmap could -> get large --- but you could have some logic that causes you to fall -> back to plain sequential scan if you get too many index hits. - -Well, what I was thinking of, should the list of TIDs to fetch get too -long, was just to break it down in to chunks.
If you want to limit to, -say, 1000 TIDs, and your index has 3000, just do the first 1000, then -the next 1000, then the last 1000. This would still result in much less -disk head movement and speed the query immensely. - -(BTW, I have verified this empirically during testing of random read vs. -random write on a RAID controller. The writes were 5-10 times faster -than the reads because the controller was caching a number of writes and -then doing them in the best possible order, whereas the reads had to be -satisfied in the order they were submitted to the controller.) - -cjs --- -Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M26167@postgresql.org Wed Aug 7 01:12:54 2002 -To: Curt Sampson <cjs@cynic.net> -cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] CLUSTER and indisclustered -In-Reply-To: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> -References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> -Comments: In-reply-to Curt Sampson <cjs@cynic.net> - message dated "Wed, 07 Aug 2002 13:55:41 +0900" -Date: Wed, 07 Aug 2002 01:12:28 -0400 -Message-ID: <12776.1028697148@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Length: 1428 - -Curt Sampson <cjs@cynic.net> writes: -> On Wed, 7 Aug 2002, Tom Lane wrote: ->> Also, the main downside of this approach is that the bitmap could ->> get large --- but you could have some logic that causes you to fall ->> back to plain sequential scan
if you get too many index hits. - -> Well, what I was thinking of, should the list of TIDs to fetch get too -> long, was just to break it down in to chunks. - -But then you lose the possibility of combining multiple indexes through -bitmap AND/OR steps, which seems quite interesting to me. If you've -visited only a part of each index then you can't apply that concept. - -Another point to keep in mind is that the bigger the bitmap gets, the -less useful an indexscan is, by definition --- sooner or later you might -as well fall back to a seqscan. So the idea of lossy compression of a -large bitmap seems really ideal to me. In principle you could seqscan -the parts of the table where matching tuples are thick on the ground, -and indexscan the parts where they ain't. Maybe this seems natural -to me as an old JPEG campaigner, but if you don't see the logic I -recommend thinking about it a little ... - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From tgl@sss.pgh.pa.us Wed Aug 7 09:27:05 2002 -To: Hannu Krosing <hannu@tm.ee> -cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, - Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] CLUSTER and indisclustered -In-Reply-To: <1028726966.13418.12.camel@taru.tm.ee> -References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> <1028726966.13418.12.camel@taru.tm.ee> -Comments: In-reply-to Hannu Krosing <hannu@tm.ee> - message dated "07 Aug 2002 15:29:26 +0200" -Date: Wed, 07 Aug 2002 09:26:42 -0400 -Message-ID: <15010.1028726802@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -Content-Length: 1120 - -Hannu 
Krosing <hannu@tm.ee> writes: -> Now I remembered my original preference for page bitmaps (vs. tuple -> bitmaps): one can't actually make good use of a bitmap of tuples because -> there is no fixed tuples/page ratio and thus no way to quickly go from -> bit position to actual tuple. You mention the same problem but propose a -> different solution. - -> Using page bitmap, we will at least avoid fetching any unneeded pages - -> essentially we will have a sequential scan over possibly interesting -> pages. - -Right. One form of the "lossy compression" idea I suggested is to -switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets -too large to work with. Again, one could imagine doing that only in -denser areas of the bitmap. - -> But I guess that CLUSTER support for INSERT will not be touched for 7.3 -> as will real bitmap indexes ;) - -All of this is far-future work I think. Adding a new scan type to the -executor would probably be pretty localized, but the ramifications in -the planner could be extensive --- especially if you want to do plans -involving ANDed or ORed bitmaps. 
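One way to picture the "lossy only in denser areas" idea is sketched below. This is purely an assumption-laden illustration, not a proposed executor design: the bitmap is a dict from page number to a set of tuple offsets, LOSSY is a made-up marker meaning "recheck every tuple on this page", and max_entries is a hypothetical non-negative memory budget counted in exact TID entries.

```python
LOSSY = None  # marker: the page is a candidate, but individual offsets were dropped


def build_tid_bitmap(tids, max_entries):
    """Map page -> set of offsets, degrading the densest pages to LOSSY
    until the number of exact entries fits within max_entries (>= 0)."""
    bitmap = {}
    for page, offset in tids:
        bitmap.setdefault(page, set()).add(offset)

    def exact_entries():
        return sum(len(offs) for offs in bitmap.values() if offs is not LOSSY)

    while exact_entries() > max_entries:
        # Give up precision first where matching tuples are thickest.
        densest = max(
            (page for page, offs in bitmap.items() if offs is not LOSSY),
            key=lambda page: len(bitmap[page]),
        )
        bitmap[densest] = LOSSY
    return bitmap


# Hypothetical index hits: page 0 is dense, pages 1 and 2 are sparse.
tids = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (2, 5), (2, 6)]
bitmap = build_tid_bitmap(tids, max_entries=3)
```

Sparse pages keep exact offsets, so only the listed tuples need fetching there, while a lossy page is scanned in full and each tuple is rechecked against the original condition.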
- - regards, tom lane - -From pgsql-hackers-owner+M26178@postgresql.org Wed Aug 7 08:28:14 2002 -X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f -Subject: Re: [HACKERS] CLUSTER and indisclustered -From: Hannu Krosing <hannu@tm.ee> -To: Hannu Krosing <hannu@tm.ee> -cc: Tom Lane <tgl@sss.pgh.pa.us>, Curt Sampson <cjs@cynic.net>, - mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -In-Reply-To: <1028695589.2133.11.camel@rh72.home.ee> -References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> - <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> -X-Mailer: Ximian Evolution 1.0.3.99 -Date: 07 Aug 2002 15:29:26 +0200 -Message-ID: <1028726966.13418.12.camel@taru.tm.ee> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Length: 1837 - -On Wed, 2002-08-07 at 06:46, Hannu Krosing wrote: -> On Wed, 2002-08-07 at 10:12, Tom Lane wrote: -> > Curt Sampson <cjs@cynic.net> writes: -> > > On Wed, 7 Aug 2002, Tom Lane wrote: -> > >> Also, the main downside of this approach is that the bitmap could -> > >> get large --- but you could have some logic that causes you to fall -> > >> back to plain sequential scan if you get too many index hits. -> > -> > > Well, what I was thinking of, should the list of TIDs to fetch get too -> > > long, was just to break it down in to chunks. -> > -> > But then you lose the possibility of combining multiple indexes through -> > bitmap AND/OR steps, which seems quite interesting to me. If you've -> > visited only a part of each index then you can't apply that concept. -> -> When the tuples are small relative to pagesize, you may get some -> "compression" by saving just pages and not the actual tids in the the -> bitmap. - -Now I remembered my original preference for page bitmaps (vs. 
tuple -bitmaps): one can't actually make good use of a bitmap of tuples because -there is no fixed tuples/page ratio and thus no way to quickly go from -bit position to actual tuple. You mention the same problem but propose a -different solution. - -Using page bitmap, we will at least avoid fetching any unneeded pages - -essentially we will have a sequential scan over possibly interesting -pages. - -If we were to use page-bitmap index for something with only a few values -like booleans, some insert-time local clustering should be useful, so -that TRUEs and FALSEs end up on different pages. - -But I guess that CLUSTER support for INSERT will not be touched for 7.3 -as will real bitmap indexes ;) - ---------------- -Hannu - - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org - -From pgsql-hackers-owner+M26192@postgresql.org Wed Aug 7 10:26:30 2002 -To: Hannu Krosing <hannu@tm.ee> -cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, - Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] CLUSTER and indisclustered -In-Reply-To: <1028733234.13418.113.camel@taru.tm.ee> -References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> <1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us> <1028733234.13418.113.camel@taru.tm.ee> -Comments: In-reply-to Hannu Krosing <hannu@tm.ee> - message dated "07 Aug 2002 17:13:54 +0200" -Date: Wed, 07 Aug 2002 10:26:13 -0400 -Message-ID: <15622.1028730373@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Length: 1224 - -Hannu Krosing <hannu@tm.ee> writes: -> On Wed, 2002-08-07 at 15:26, Tom 
Lane wrote: ->> Right. One form of the "lossy compression" idea I suggested is to ->> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets ->> too large to work with. - -> If it is a real bitmap, should it not be easyeast to allocate at the -> start ? - -But it isn't a "real bitmap". That would be a really poor -implementation, both for space and speed --- do you really want to scan -over a couple of megs of zeroes to find the few one-bits you care about, -in the typical case? "Bitmap" is a convenient term because it describes -the abstract behavior we want, but the actual data structure will -probably be nontrivial. If I recall Ann's description correctly, -Firebird's implementation uses run length coding of some kind (anyone -care to dig in their source and get all the details?). If we tried -anything in the way of lossy compression then there'd be even more stuff -lurking under the hood. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M26188@postgresql.org Wed Aug 7 10:12:26 2002 -X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f -Subject: Re: [HACKERS] CLUSTER and indisclustered -From: Hannu Krosing <hannu@tm.ee> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, - Gavin Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -In-Reply-To: <15010.1028726802@sss.pgh.pa.us> -References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> - <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> - <1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us> -X-Mailer: Ximian Evolution 1.0.3.99 -Date: 07 Aug 2002 17:13:54 +0200 -Message-ID: 
<1028733234.13418.113.camel@taru.tm.ee> -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Length: 2812 - -On Wed, 2002-08-07 at 15:26, Tom Lane wrote: -> Hannu Krosing <hannu@tm.ee> writes: -> > Now I remembered my original preference for page bitmaps (vs. tuple -> > bitmaps): one can't actually make good use of a bitmap of tuples because -> > there is no fixed tuples/page ratio and thus no way to quickly go from -> > bit position to actual tuple. You mention the same problem but propose a -> > different solution. -> -> > Using page bitmap, we will at least avoid fetching any unneeded pages - -> > essentially we will have a sequential scan over possibly interesting -> > pages. -> -> Right. One form of the "lossy compression" idea I suggested is to -> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets -> too large to work with. - -If it is a real bitmap, should it not be easiest to allocate at the -start? - -A page bitmap for a 100 000 000 tuple table with 10 tuples/page will be -sized 10 000 000/8 = 1.25 MB, which does not look too big to me for that -amount of data (the data table itself would occupy 80 GB). - -Even having the bitmap of 16 bits/page (with the bits 0-14 meaning -tuples 0-14 and bit 15 meaning "seq scan the rest of page") would -consume just 20 MB of _local_ memory, and would be quite justifiable -for a query on a table that large. - -For a real bitmap index the tuples-per-page should be a user-supplied -tuning parameter. - -> Again, one could imagine doing that only in denser areas of the bitmap. - -I would hardly call the resulting structure "a bitmap" ;) - -And I'm not sure the overhead for a more complex structure would win us -any additional performance for most cases.
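The sizes quoted above can be checked with a few lines of arithmetic. The 8 KB page size is an assumption (it matches the 80 GB figure and the default block size discussed elsewhere in this thread); the variable names are only for illustration.

```python
tuples = 100_000_000
tuples_per_page = 10
page_size_bytes = 8 * 1024  # assumed 8 KB heap pages

pages = tuples // tuples_per_page            # 10,000,000 heap pages

# One bit per page: the plain page bitmap.
page_bitmap_bytes = pages // 8               # 1,250,000 bytes, i.e. 1.25 MB

# Sixteen bits per page: 15 per-tuple bits plus one "scan the rest" bit.
wide_bitmap_bytes = pages * 16 // 8          # 20,000,000 bytes, i.e. 20 MB

# Size of the heap itself, for comparison.
table_bytes = pages * page_size_bytes        # roughly 80 GB
```

Even the wider layout is about 0.02% of the table it describes, which is the point being made: for a table this size the bitmap is small next to the data it saves you from reading.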
- -> > But I guess that CLUSTER support for INSERT will not be touched for 7.3 -> > as will real bitmap indexes ;) -> -> All of this is far-future work I think. - -After we do that we will probably be able to claim support for -"datawarehousing" ;) - -> Adding a new scan type to the -> executor would probably be pretty localized, but the ramifications in -> the planner could be extensive --- especially if you want to do plans -> involving ANDed or ORed bitmaps. - -Also, going to a "smart inserter" which can do local clustering on sets of -real bitmap indexes for INSERTS (and INSERT side of UPDATE) would -probably be a major change from our current "stupid inserter" ;) - -This will not be needed for bitmap resolution higher than 1bit/page but -default local clustering on bitmap indexes will probably buy us some -extra performance by avoiding data page fetches when such indexes are -used. - -And anyway, the support for INSERT being aware of clustering will probably -come up sometime. - ------------- -Hannu - - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From hannu@tm.ee Wed Aug 7 11:22:53 2002 -X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f -Subject: Re: [HACKERS] CLUSTER and indisclustered -From: Hannu Krosing <hannu@tm.ee> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, - Gavin - Sherry <swm@linuxworld.com.au>, - Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org -In-Reply-To: <15622.1028730373@sss.pgh.pa.us> -References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> - <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> - <1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us> - <1028733234.13418.113.camel@taru.tm.ee>
<15622.1028730373@sss.pgh.pa.us> -X-Mailer: Ximian Evolution 1.0.3.99 -Date: 07 Aug 2002 18:24:30 +0200 -Message-ID: <1028737470.13419.182.camel@taru.tm.ee> -Content-Length: 2382 - -On Wed, 2002-08-07 at 16:26, Tom Lane wrote: -> Hannu Krosing <hannu@tm.ee> writes: -> > On Wed, 2002-08-07 at 15:26, Tom Lane wrote: -> >> Right. One form of the "lossy compression" idea I suggested is to -> >> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets -> >> too large to work with. -> -> > If it is a real bitmap, should it not be easyeast to allocate at the -> > start ? -> -> But it isn't a "real bitmap". That would be a really poor -> implementation, both for space and speed --- do you really want to scan -> over a couple of megs of zeroes to find the few one-bits you care about, -> in the typical case? - -I guess that depends on data. The typical case should be something the -stats process will find out so the optimiser can use it. - -The bitmap must be less than 1/48 (a TID is 48 bits) full for the best -uncompressed "active-tid-list" to be smaller than plain bitmap. If there -were some structure above list then this ratio would be even higher. - -I have had good experience using "compressed delta lists", which will -scale well over the whole "fullness" spectrum of bitmap, but this is for -storage, not for initial constructing of lists. - -> "Bitmap" is a convenient term because it describes -> the abstract behavior we want, but the actual data structure will -> probably be nontrivial. If I recall Ann's description correctly, -> Firebird's implementation uses run length coding of some kind (anyone -> care to dig in their source and get all the details?). - -Plain RLL is probably a good way to store it and for merging two or more -bitmaps, but not as good for constructing it bit-by-bit.
I guess the -most effective structure for updating is often still a plain bitmap -(maybe not if it is very sparse and all of it does not fit in cache), -followed by some kind of balanced tree (maybe rb-tree). - -If the bitmap is relatively full then the plain bitmap is almost always -the most effective to update. - -> If we tried anything in the way of lossy compression then there'd -> be even more stuff lurking under the hood. - -Having three-valued (0,1,maybe) RLL-encoded "tritmap" would be a good -way to represent lossy compression, and it would also be quite -straightforward to merge two of these using AND or OR. It may even be -possible to easily construct it using a fixed-length b-tree and going -from 1 to "maybe" for nodes that get too dense. - ---------------- -Hannu - - -From pgsql-hackers-owner+M21991@postgresql.org Wed Apr 24 23:37:37 2002 -Return-path: <pgsql-hackers-owner+M21991@postgresql.org> -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3ba416337 - for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:37:36 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id CF13447622B; Wed, 24 Apr 2002 23:37:31 -0400 (EDT) -Received: from sraigw.sra.co.jp (sraigw.sra.co.jp [202.32.10.2]) - by postgresql.org (Postfix) with ESMTP id 3EE92474E4B - for <pgsql-hackers@postgresql.org>; Wed, 24 Apr 2002 23:37:19 -0400 (EDT) -Received: from srascb.sra.co.jp (srascb [133.137.8.65]) - by sraigw.sra.co.jp (8.9.3/3.7W-sraigw) with ESMTP id MAA76393; - Thu, 25 Apr 2002 12:35:44 +0900 (JST) -Received: (from root@localhost) - by srascb.sra.co.jp (8.11.6/8.11.6) id g3P3ZCK64299; - Thu, 25 Apr 2002 12:35:12 +0900 (JST) - (envelope-from t-ishii@sra.co.jp) -Received: from sranhm.sra.co.jp (sranhm [133.137.170.62]) - by srascb.sra.co.jp (8.11.6/8.11.6av) with ESMTP id g3P3ZBV64291; - Thu, 25 Apr 2002 12:35:11 +0900 (JST) - (envelope-from 
t-ishii@sra.co.jp) -Received: from localhost (IDENT:t-ishii@srapc1474.sra.co.jp [133.137.170.59]) - by sranhm.sra.co.jp (8.9.3+3.2W/3.7W-srambox) with ESMTP id MAA25562; - Thu, 25 Apr 2002 12:35:43 +0900 -To: tgl@sss.pgh.pa.us -cc: cjs@cynic.net, pgman@candle.pha.pa.us, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <12342.1019705420@sss.pgh.pa.us> -References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> - <12342.1019705420@sss.pgh.pa.us> -X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1 - =?iso-2022-jp?B?KBskQjAqGyhCKQ==?= -MIME-Version: 1.0 -Content-Type: Text/Plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Message-ID: <20020425123429E.t-ishii@sra.co.jp> -Date: Thu, 25 Apr 2002 12:34:29 +0900 -From: Tatsuo Ishii <t-ishii@sra.co.jp> -X-Dispatcher: imput version 20000228(IM140) -Lines: 12 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -> Curt Sampson <cjs@cynic.net> writes: -> > Grabbing bigger chunks is always optimal, AFICT, if they're not -> > *too* big and you use the data. A single 64K read takes very little -> > longer than a single 8K read. -> -> Proof? - -Long time ago I tested with the 32k block size and got 1.5-2x speed up -comparing ordinary 8k block size in the sequential scan case. -FYI, if this is the case. --- -Tatsuo Ishii - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- -http://www.postgresql.org/users-lounge/docs/faq.html - -From mloftis@wgops.com Thu Apr 25 01:43:14 2002 -Return-path: <mloftis@wgops.com> -Received: from free.wgops.com (root@dsl092-002-178.sfo1.dsl.speakeasy.net [66.92.2.178]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P5hC426529 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 01:43:13 -0400 (EDT) -Received: from wgops.com ([10.1.2.207]) - by free.wgops.com (8.11.3/8.11.3) with ESMTP id g3P5hBR43020; - Wed, 24 Apr 2002 22:43:11 -0700 (PDT) - (envelope-from mloftis@wgops.com) -Message-ID: <3CC7976F.7070407@wgops.com> -Date: Wed, 24 Apr 2002 22:43:11 -0700 -From: Michael Loftis <mloftis@wgops.com> -User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4.1) Gecko/20020314 Netscape6/6.2.2 -X-Accept-Language: en-us -MIME-Version: 1.0 -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>, - PostgreSQL-development <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> <12342.1019705420@sss.pgh.pa.us> -Content-Type: text/plain; charset=us-ascii; format=flowed -Content-Transfer-Encoding: 7bit -Status: OR - - - -Tom Lane wrote: - ->Curt Sampson <cjs@cynic.net> writes: -> ->>Grabbing bigger chunks is always optimal, AFICT, if they're not ->>*too* big and you use the data. A single 64K read takes very little ->>longer than a single 8K read. ->> -> ->Proof? -> -I contest this statement. - -It's optimal to a point. I know that my system settles into its best -read-speeds @ 32K or 64K chunks. 8K chunks are far below optimal for my -system. Most systems I work on do far better at 16K than at 8K, and -most don't see any degradation when going to 32K chunks. (this is -across numerous OSes and configs -- results are interpretations from -bonnie disk i/o marks). - -Depending on what you're doing it is more efficient to read bigger -blocks up to a point.
If you're multi-threaded or reading in non-blocking -mode, take as big a chunk as you can handle or are ready to process in -quick order. If you're picking up a bunch of little chunks here and -there and know you're not using them again then choose a size that will -hopefully cause some of the reads to overlap, failing that, pick the -smallest usable read size. - -The OS can never do that stuff for you. - - - -From cjs@cynic.net Thu Apr 25 03:29:05 2002 -Return-path: <cjs@cynic.net> -Received: from angelic.cynic.net ([202.232.117.21]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7T3404027 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:29:03 -0400 (EDT) -Received: from localhost (localhost [127.0.0.1]) - by angelic.cynic.net (Postfix) with ESMTP - id 1C44E870E; Thu, 25 Apr 2002 16:28:51 +0900 (JST) -Date: Thu, 25 Apr 2002 16:28:51 +0900 (JST) -From: Curt Sampson <cjs@cynic.net> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, - PostgreSQL-development <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <12342.1019705420@sss.pgh.pa.us> -Message-ID: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net> -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Wed, 24 Apr 2002, Tom Lane wrote: - -> Curt Sampson <cjs@cynic.net> writes: -> > Grabbing bigger chunks is always optimal, AFICT, if they're not -> > *too* big and you use the data. A single 64K read takes very little -> > longer than a single 8K read. -> -> Proof? - -Well, there are various sorts of "proof" for this assertion. What -sort do you want? - -Here's a few samples; if you're looking for something different to -satisfy you, let's discuss it. - -1. Theoretical proof: two components of the delay in retrieving a -block from disk are the disk arm movement and the wait for the -right block to rotate under the head.
- -When retrieving, say, eight adjacent blocks, these will be spread -across no more than two cylinders (with luck, only one). The worst -case access time for a single block is the disk arm movement plus -the full rotational wait; this is the same as the worst case for -eight blocks if they're all on one cylinder. If they're not on one -cylinder, they're still on adjacent cylinders, requiring a very -short seek. - -2. Proof by others using it: SQL server uses 64K reads when doing -table scans, as they say that their research indicates that the -major limitation is usually the number of I/O requests, not the -I/O capacity of the disk. BSD's explicitly separates the optimum -allocation size for storage (1K fragments) and optimum read size -(8K blocks) because they found performance to be much better when -a larger size block was read. Most file system vendors, too, do -read-ahead for this very reason. - -3. Proof by testing. I wrote a little ruby program to seek to a -random point in the first 2 GB of my raw disk partition and read -1-8 8K blocks of data. (This was done as one I/O request.) (Using -the raw disk partition I avoid any filesystem buffering.) Here are -typical results: - - 125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block - 250 reads of 8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block - 500 reads of 4x8K blocks: 2.5 sec, 199 req/sec. 5.03 ms/req, 1.26 ms/block -1000 reads of 2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block -2000 reads of 1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block - -The ratios of data retrieval speed per read for groups of adjacent -8K blocks, assuming a single 8K block reads in 1 time unit, are: - - 1 block 1.00 - 2 blocks 1.18 - 4 blocks 1.56 - 8 blocks 2.34 - 16 blocks 4.68 - -At less than 20% more expensive, certainly two-block read requests -could be considered to cost "very little more" than one-block read -requests. 
Even four-block read requests are only half-again as -expensive. And if you know you're really going to be using the -data, read in 8 block chunks and your cost per block (in terms of -time) drops to less than a third of the cost of single-block reads. - -Let me put paid to comments about multiple simultaneous readers -making this invalid. Here's a typical result I get with four -instances of the program running simultaneously: - -125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block -250 reads of 8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block -500 reads of 4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block -1000 reads of 2x8K blocks: 10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block -2000 reads of 1x8K blocks: 18 sec, 110 req/sec. 9.09 ms/req, 9.09 ms/block - -Here's the ratio table again, with another column comparing the -aggregate number of requests per second for one process and four -processes: - - 1 block 1.00 310 : 440 - 2 blocks 1.10 262 : 401 - 4 blocks 1.28 199 : 346 - 8 blocks 1.69 132 : 260 - 16 blocks 3.89 66 : 113 - -Note that, here the relative increase in performance for increasing -sizes of reads is even *better* until we get past 64K chunks. The -overall throughput is better, of course, because with more requests -per second coming in, the disk seek ordering code has more to work -with and the average seek time spent seeking vs. reading will be -reduced. - -You know, this is not rocket science; I'm sure there must be papers -all over the place about this. If anybody still disagrees that it's -a good thing to read chunks up to 64K or so when the blocks are -adjacent and you know you'll need the data, I'd like to see some -tangible evidence to support that. - -cjs --- -Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. 
--XTC - - -From cjs@cynic.net Thu Apr 25 03:55:59 2002 -Return-path: <cjs@cynic.net> -Received: from angelic.cynic.net ([202.232.117.21]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7tv405489 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:55:57 -0400 (EDT) -Received: from localhost (localhost [127.0.0.1]) - by angelic.cynic.net (Postfix) with ESMTP - id 188EC870E; Thu, 25 Apr 2002 16:55:51 +0900 (JST) -Date: Thu, 25 Apr 2002 16:55:50 +0900 (JST) -From: Curt Sampson <cjs@cynic.net> -To: Bruce Momjian <pgman@candle.pha.pa.us> -cc: PostgreSQL-development <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <200204250404.g3P44OI19061@candle.pha.pa.us> -Message-ID: <Pine.NEB.4.43.0204251636550.3111-100000@angelic.cynic.net> -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Thu, 25 Apr 2002, Bruce Momjian wrote: - -> Well, we are guilty of trying to push as much as possible on to other -> software. We do this for portability reasons, and because we think our -> time is best spent dealing with db issues, not issues that can be dealt -> with by other existing software, as long as the software is decent. - -That's fine. I think that's a perfectly fair thing to do. - -It was just the wording (i.e., "it's this other software's fault -that blah de blah") that got to me. To say, "We don't do readahead -because most OSes supply it, and we feel that other things would -help more to improve performance," is fine by me. Or even, "Well, -nobody feels like doing it. You want it, do it yourself," I have -no problem with. - -> Sure, that is certainly true. However, it is hard to know what the -> future will hold even if we had perfect knowledge of what was happening -> in the kernel. We don't know who else is going to start doing I/O once -> our I/O starts. We may have a better idea with kernel knowledge, but we -> still don't know 100% what will be cached.
- -Well, we do if we use raw devices and do our own caching, using -pages that are pinned in RAM. That was sort of what I was aiming -at for the long run. - -> We have free-behind on our list. - -Uh...can't do it, if you're relying on the OS to do the buffering. -How do you tell the OS that you're no longer going to use a page? - -> I think LRU-K will do this quite well -> and be a nice general solution for more than just sequential scans. - -LRU-K sounds like a great idea to me, as does putting pages read -for a table scan at the LRU end of the cache, rather than the MRU -(assuming we do something to ensure that they stay in cache until -read once, at any rate). - -But again, great for your own cache, but doesn't work with the OS -cache. And I'm a bit scared to crank up too high the amount of -memory I give Postgres, lest the OS try to too aggressively buffer -all that I/O in what memory remains to it, and start blowing programs -(like maybe the backend binary itself) out of RAM. But maybe this -isn't typically a problem; I don't know. - -> There may be validity in this. It is easy to do (I think) and could be -> a win. - -It didn't look too difficult to me, when I looked at the code, and -you can see what kind of win it is from the response I just made -to Tom. - -> > 1. It is *not* true that you have no idea where data is when -> > using a storage array or other similar system. While you -> > certainly ought not worry about things such as head positions -> > and so on, it's been a given for a long, long time that two -> > blocks that have close index numbers are going to be close -> > together in physical storage. -> -> SCSI drivers, for example, are pretty smart. Not sure we can take -> advantage of that from user-land I/O. - -Looking at the NetBSD ones, I don't see what they're doing that's -so smart. (Aside from some awfully clever workarounds for stupid -hardware limitations that would otherwise kill performance.) What -sorts of "smart" are you referring to?
- -> Yes, but we are seeing some db's moving away from raw I/O. - -Such as whom? And are you certain that they're moving to using the -OS buffer cache, too? MS SQL server, for example, uses the filesystem, -but turns off all buffering on those files. - -> Our performance numbers beat most of the big db's already, so we must -> be doing something right. - -Really? Do the performance numbers for simple, bulk operations -(imports, exports, table scans) beat the others handily? My intuition -says not, but I'll happily be convinced otherwise. - -> Yes, but do we spend our time doing that. Is the payoff worth it, vs. -> working on other features. Sure it would be great to have all these -> fancy things, but is this where our time should be spent, considering -> other items on the TODO list? - -I agree that these things need to be assessed. - -> Jumping in and doing the I/O ourselves is a big undertaking, and looking -> at our TODO list, I am not sure if it is worth it right now. - -Right. I'm not trying to say this is a critical priority, I'm just -trying to determine what we do right now, what we could do, and -the potential performance increase that would give us. - -cjs --- -Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light.
--XTC - - -From cjs@cynic.net Thu Apr 25 05:19:11 2002 -Return-path: <cjs@cynic.net> -Received: from angelic.cynic.net ([202.232.117.21]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P9J9412878 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 05:19:10 -0400 (EDT) -Received: from localhost (localhost [127.0.0.1]) - by angelic.cynic.net (Postfix) with ESMTP - id 50386870E; Thu, 25 Apr 2002 18:19:03 +0900 (JST) -Date: Thu, 25 Apr 2002 18:19:02 +0900 (JST) -From: Curt Sampson <cjs@cynic.net> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, - PostgreSQL-development <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net> -Message-ID: <Pine.NEB.4.43.0204251805000.3111-100000@angelic.cynic.net> -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Thu, 25 Apr 2002, Curt Sampson wrote: - -> Here's the ratio table again, with another column comparing the -> aggregate number of requests per second for one process and four -> processes: -> - -Just for interest, I ran this again with 20 processes working -simultaneously. I did six runs at each blockread size and summed -the tps for each process to find the aggregate number of reads per -second during the test. I dropped the highest and the lowest ones, -and averaged the rest. Here's the new table: - - 1 proc 4 procs 20 procs - - 1 block 310 440 260 - 2 blocks 262 401 481 - 4 blocks 199 346 354 - 8 blocks 132 260 250 - 16 blocks 66 113 116 - -I'm not sure at all why performance gets so much *worse* with a lot of -contention on the 1-block reads. This could have something to do with NetBSD, or -its buffer cache, or my laptop's crappy little disk drive.... - -Or maybe I'm just running out of CPU. - -cjs --- -Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light.
--XTC - - -From tgl@sss.pgh.pa.us Thu Apr 25 09:54:35 2002 -Return-path: <tgl@sss.pgh.pa.us> -Received: from sss.pgh.pa.us (root@[192.204.191.242]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3PDsY407038 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 09:54:34 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3PDsXF25059; - Thu, 25 Apr 2002 09:54:33 -0400 (EDT) -To: Curt Sampson <cjs@cynic.net> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, - PostgreSQL-development <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net> -References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net> -Comments: In-reply-to Curt Sampson <cjs@cynic.net> - message dated "Thu, 25 Apr 2002 16:28:51 +0900" -Date: Thu, 25 Apr 2002 09:54:32 -0400 -Message-ID: <25056.1019742872@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -Status: OR - -Curt Sampson <cjs@cynic.net> writes: -> 1. Theoretical proof: two components of the delay in retrieving a -> block from disk are the disk arm movement and the wait for the -> right block to rotate under the head. - -> When retrieving, say, eight adjacent blocks, these will be spread -> across no more than two cylinders (with luck, only one). - -Weren't you contending earlier that with modern disk mechs you really -have no idea where the data is? You're asserting as an article of -faith that the OS has been able to place the file's data blocks -optimally --- or at least well enough to avoid unnecessary seeks. -But just a few days ago I was getting told that random_page_cost -was BS because there could be no such placement. - -I'm getting a tad tired of sweeping generalizations offered without -proof, especially when they conflict. - -> 3. Proof by testing. 
I wrote a little ruby program to seek to a -> random point in the first 2 GB of my raw disk partition and read -> 1-8 8K blocks of data. (This was done as one I/O request.) (Using -> the raw disk partition I avoid any filesystem buffering.) - -And also ensure that you aren't testing the point at issue. -The point at issue is that *in the presence of kernel read-ahead* -it's quite unclear that there's any benefit to a larger request size. -Ideally the kernel will have the next block ready for you when you -ask, no matter what the request is. - -There's been some talk of using the AIO interface (where available) -to "encourage" the kernel to do read-ahead. I don't foresee us -writing our own substitute filesystem to make this happen, however. -Oracle may have the manpower for that sort of boondoggle, but we -don't... - - regards, tom lane - -From pgsql-hackers-owner+M22053@postgresql.org Thu Apr 25 20:45:42 2002 -Return-path: <pgsql-hackers-owner+M22053@postgresql.org> -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q0jg405210 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 20:45:42 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 17CE6476270; Thu, 25 Apr 2002 20:45:38 -0400 (EDT) -Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146]) - by postgresql.org (Postfix) with ESMTP id 257DC47591C - for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 20:45:25 -0400 (EDT) -Received: (from kaf@localhost) - by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3Q0erX14397; - Thu, 25 Apr 2002 17:40:53 -0700 -From: Kyle <kaf@nwlink.com> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Message-ID: <15560.41493.529847.635632@doppelbock.patentinvestor.com> -Date: Thu, 25 Apr 2002 17:40:53 -0700 -To: PostgreSQL-development <pgsql-hackers@postgresql.org> 
-Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <25056.1019742872@sss.pgh.pa.us> -References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net> - <25056.1019742872@sss.pgh.pa.us> -X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -Tom Lane wrote: -> ... -> Curt Sampson <cjs@cynic.net> writes: -> > 3. Proof by testing. I wrote a little ruby program to seek to a -> > random point in the first 2 GB of my raw disk partition and read -> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using -> > the raw disk partition I avoid any filesystem buffering.) -> -> And also ensure that you aren't testing the point at issue. -> The point at issue is that *in the presence of kernel read-ahead* -> it's quite unclear that there's any benefit to a larger request size. -> Ideally the kernel will have the next block ready for you when you -> ask, no matter what the request is. -> ... - -I have to agree with Tom. I think the numbers below show that with -kernel read-ahead, block size isn't an issue. - -The big_file1 file used below is 2.0 gig of random data, and the -machine has 512 mb of main memory. This ensures that we're not -just getting cached data. - -foreach i (4k 8k 16k 32k 64k 128k) - echo $i - time dd bs=$i if=big_file1 of=/dev/null -end - -and the results: - -bs user kernel elapsed -4k: 0.260 7.740 1:27.25 -8k: 0.210 8.060 1:30.48 -16k: 0.090 7.790 1:30.88 -32k: 0.060 8.090 1:32.75 -64k: 0.030 8.190 1:29.11 -128k: 0.070 9.830 1:28.74 - -so with kernel read-ahead, we have basically the same elapsed (wall -time) regardless of block size. Sure, user time drops to a low at 64k -blocksize, but kernel time is increasing. - - -You could argue that this is a contrived example, no other I/O is -being done. Well I created a second 2.0g file (big_file2) and did two -simultaneous reads from the same disk. 
Sure performance went to hell -but it shows blocksize is still irrelevant in a multi I/O environment -with sequential read-ahead. - -foreach i ( 4k 8k 16k 32k 64k 128k ) - echo $i - time dd bs=$i if=big_file1 of=/dev/null & - time dd bs=$i if=big_file2 of=/dev/null & - wait -end - -bs user kernel elapsed -4k: 0.480 8.290 6:34.13 bigfile1 - 0.320 8.730 6:34.33 bigfile2 -8k: 0.250 7.580 6:31.75 - 0.180 8.450 6:31.88 -16k: 0.150 8.390 6:32.47 - 0.100 7.900 6:32.55 -32k: 0.190 8.460 6:24.72 - 0.060 8.410 6:24.73 -64k: 0.060 9.350 6:25.05 - 0.150 9.240 6:25.13 -128k: 0.090 10.610 6:33.14 - 0.110 11.320 6:33.31 - - -the differences in read times are basically in the mud. Blocksize -just doesn't matter much with the kernel doing readahead. - --Kyle - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org - -From pgsql-hackers-owner+M22055@postgresql.org Thu Apr 25 22:19:07 2002 -Return-path: <pgsql-hackers-owner+M22055@postgresql.org> -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2J7411254 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:19:07 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id F3924476208; Thu, 25 Apr 2002 22:19:02 -0400 (EDT) -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 6741D474E71 - for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 22:18:50 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g3Q2Ili11246; - Thu, 25 Apr 2002 22:18:47 -0400 (EDT) -From: Bruce Momjian <pgman@candle.pha.pa.us> -Message-ID: <200204260218.g3Q2Ili11246@candle.pha.pa.us> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <15560.41493.529847.635632@doppelbock.patentinvestor.com> -To: Kyle <kaf@nwlink.com> 
-Date: Thu, 25 Apr 2002 22:18:47 -0400 (EDT) -cc: PostgreSQL-development <pgsql-hackers@postgresql.org> -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -Nice test. Would you test simultaneous 'dd' on the same file, perhaps -with a slight delay between the two so they don't read each other's -blocks? - -seek() in the file will turn off read-ahead in most OS's. I am not -saying this is a major issue for PostgreSQL but the numbers would be -interesting. - - ---------------------------------------------------------------------------- - -Kyle wrote: -> Tom Lane wrote: -> > ... -> > Curt Sampson <cjs@cynic.net> writes: -> > > 3. Proof by testing. I wrote a little ruby program to seek to a -> > > random point in the first 2 GB of my raw disk partition and read -> > > 1-8 8K blocks of data. (This was done as one I/O request.) (Using -> > > the raw disk partition I avoid any filesystem buffering.) -> > -> > And also ensure that you aren't testing the point at issue. -> > The point at issue is that *in the presence of kernel read-ahead* -> > it's quite unclear that there's any benefit to a larger request size. -> > Ideally the kernel will have the next block ready for you when you -> > ask, no matter what the request is. -> > ... -> -> I have to agree with Tom. I think the numbers below show that with -> kernel read-ahead, block size isn't an issue. -> -> The big_file1 file used below is 2.0 gig of random data, and the -> machine has 512 mb of main memory. This ensures that we're not -> just getting cached data.
-> -> foreach i (4k 8k 16k 32k 64k 128k) -> echo $i -> time dd bs=$i if=big_file1 of=/dev/null -> end -> -> and the results: -> -> bs user kernel elapsed -> 4k: 0.260 7.740 1:27.25 -> 8k: 0.210 8.060 1:30.48 -> 16k: 0.090 7.790 1:30.88 -> 32k: 0.060 8.090 1:32.75 -> 64k: 0.030 8.190 1:29.11 -> 128k: 0.070 9.830 1:28.74 -> -> so with kernel read-ahead, we have basically the same elapsed (wall -> time) regardless of block size. Sure, user time drops to a low at 64k -> blocksize, but kernel time is increasing. -> -> -> You could argue that this is a contrived example, no other I/O is -> being done. Well I created a second 2.0g file (big_file2) and did two -> simultaneous reads from the same disk. Sure performance went to hell -> but it shows blocksize is still irrelevant in a multi I/O environment -> with sequential read-ahead. -> -> foreach i ( 4k 8k 16k 32k 64k 128k ) -> echo $i -> time dd bs=$i if=big_file1 of=/dev/null & -> time dd bs=$i if=big_file2 of=/dev/null & -> wait -> end -> -> bs user kernel elapsed -> 4k: 0.480 8.290 6:34.13 bigfile1 -> 0.320 8.730 6:34.33 bigfile2 -> 8k: 0.250 7.580 6:31.75 -> 0.180 8.450 6:31.88 -> 16k: 0.150 8.390 6:32.47 -> 0.100 7.900 6:32.55 -> 32k: 0.190 8.460 6:24.72 -> 0.060 8.410 6:24.73 -> 64k: 0.060 9.350 6:25.05 -> 0.150 9.240 6:25.13 -> 128k: 0.090 10.610 6:33.14 -> 0.110 11.320 6:33.31 -> -> -> the differences in read times are basically in the mud. Blocksize -> just doesn't matter much with the kernel doing readahead. -> -> -Kyle -> -> ---------------------------(end of broadcast)--------------------------- -> TIP 6: Have you searched our list archives? -> -> http://archives.postgresql.org -> - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? 
- -http://archives.postgresql.org - -From cjs@cynic.net Thu Apr 25 22:27:23 2002 -Return-path: <cjs@cynic.net> -Received: from angelic.cynic.net ([202.232.117.21]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2RL411868 - for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:27:22 -0400 (EDT) -Received: from localhost (localhost [127.0.0.1]) - by angelic.cynic.net (Postfix) with ESMTP - id AF60C870E; Fri, 26 Apr 2002 11:27:17 +0900 (JST) -Date: Fri, 26 Apr 2002 11:27:17 +0900 (JST) -From: Curt Sampson <cjs@cynic.net> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, - PostgreSQL-development <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <25056.1019742872@sss.pgh.pa.us> -Message-ID: <Pine.NEB.4.43.0204261028110.449-100000@angelic.cynic.net> -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Thu, 25 Apr 2002, Tom Lane wrote: - -> Curt Sampson <cjs@cynic.net> writes: -> > 1. Theoretical proof: two components of the delay in retrieving a -> > block from disk are the disk arm movement and the wait for the -> > right block to rotate under the head. -> -> > When retrieving, say, eight adjacent blocks, these will be spread -> > across no more than two cylinders (with luck, only one). -> -> Weren't you contending earlier that with modern disk mechs you really -> have no idea where the data is? - -No, that was someone else. I contend that with pretty much any -large-scale storage mechanism (i.e., anything beyond ramdisks), -you will find that accessing two adjacent blocks is almost always -1) close to as fast as accessing just the one, and 2) much, much -faster than accessing two blocks that are relatively far apart. - -There will be the odd case where the two adjacent blocks are -physically far apart, but this is rare. 
- -If this idea doesn't hold true, the whole idea that sequential -reads are faster than random reads falls apart, and the optimizer -shouldn't even have the option to make random reads cost more, much -less have it set to four rather than one (or whatever it's set to). - -> You're asserting as an article of -> faith that the OS has been able to place the file's data blocks -> optimally --- or at least well enough to avoid unnecessary seeks. - -So are you, in the optimizer. But that's all right; the OS often -can and does do this placement; the FFS filesystem is explicitly -designed to do this sort of thing. If the filesystem isn't empty -and the files grow a lot they'll be split into large fragments, -but the fragments will be contiguous. - -> But just a few days ago I was getting told that random_page_cost -> was BS because there could be no such placement. - -I've been arguing against that point as well. - -> And also ensure that you aren't testing the point at issue. -> The point at issue is that *in the presence of kernel read-ahead* -> it's quite unclear that there's any benefit to a larger request size. - -I will test this. - -cjs --- -Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. 
--XTC - - -From cjs@cynic.net Wed Apr 24 23:19:23 2002 -Return-path: <cjs@cynic.net> -Received: from angelic.cynic.net ([202.232.117.21]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3JM414917 - for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:19:22 -0400 (EDT) -Received: from localhost (localhost [127.0.0.1]) - by angelic.cynic.net (Postfix) with ESMTP - id 1F36F870E; Thu, 25 Apr 2002 12:19:14 +0900 (JST) -Date: Thu, 25 Apr 2002 12:19:14 +0900 (JST) -From: Curt Sampson <cjs@cynic.net> -To: Bruce Momjian <pgman@candle.pha.pa.us> -cc: PostgreSQL-development <pgsql-hackers@postgresql.org> -Subject: Re: Sequential Scan Read-Ahead -In-Reply-To: <200204250156.g3P1ufh05751@candle.pha.pa.us> -Message-ID: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Wed, 24 Apr 2002, Bruce Momjian wrote: - -> > 1. Not all systems do readahead. -> -> If they don't, that isn't our problem. We expect it to be there, and if -> it isn't, the vendor/kernel is at fault. - -It is your problem when another database kicks Postgres' ass -performance-wise. - -And at that point, *you're* at fault. You're the one who's knowingly -decided to do things inefficiently. - -Sorry if this sounds harsh, but this, "Oh, someone else is to blame" -attitude gets me steamed. It's one thing to say, "We don't support -this." That's fine; there are often good reasons for that. It's a -completely different thing to say, "It's an unrelated entity's fault we -don't support this." - -At any rate, relying on the kernel to guess how to optimise for -the workload will never work as well as the software that -knows the workload doing the optimization. - -The lack of support thing is no joke. Sure, lots of systems nowadays -support unified buffer cache and read-ahead. But how many, besides -Solaris, support free-behind, which is also very important to avoid -blowing out your buffer cache when doing sequential reads?
And who -at all supports read-ahead for reverse scans? (Or does Postgres -not do those, anyway? I can see the support is there.) - -And even when the facilities are there, you create problems by -using them. Look at the OS buffer cache, for example. Not only do -we lose efficiency by using two layers of caching, but (as people -have pointed out recently on the lists), the optimizer can't even -know how much or what is being cached, and thus can't make decisions -based on that. - -> Yes, seek() in file will turn off read-ahead. Grabbing bigger chunks -> would help here, but if you have two people already reading from the -> same file, grabbing bigger chunks of the file may not be optimal. - -Grabbing bigger chunks is always optimal, AFAICT, if they're not -*too* big and you use the data. A single 64K read takes very little -longer than a single 8K read. - -> > 3. Even when the read-ahead does occur, you're still doing more -> > syscalls, and thus more expensive kernel/userland transitions, than -> > you have to. -> -> I would guess the performance impact is minimal. - -If it were minimal, people wouldn't work so hard to build multi-level -thread systems, where multiple userland threads are scheduled on -top of kernel threads. - -However, it does depend on how much CPU your particular application -is using. You may have it to spare. - -> http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html - -Well, this message has some points in it that I feel are just incorrect. - - 1. It is *not* true that you have no idea where data is when - using a storage array or other similar system. While you - certainly ought not worry about things such as head positions - and so on, it's been a given for a long, long time that two - blocks that have close index numbers are going to be close - together in physical storage. - - 2. Raw devices are quite standard across Unix systems (except - in the unfortunate case of Linux, which I think has been - remedied, hasn't it?).
They're very portable, and have just as - well--if not better--defined write semantics as a filesystem. - - 3. My observations of OS performance tuning over the past six - or eight years contradict the statement, "There's a considerable - cost in complexity and code in using "raw" storage too, and - it's not a one off cost: as the technologies change, the "fast" - way to do things will change and the code will have to be - updated to match." While optimizations have been removed over - the years, the basic optimizations (order reads by block number, - do larger reads rather than smaller, cache the data) have - remained unchanged for a long, long time. - - 4. "Better to leave this to the OS vendor where possible, and - take advantage of the tuning they do." Well, sorry guys, but - have a look at the tuning they do. It hasn't changed in years, - except to remove now-unnecessary complexity related to really, - really old and slow disk devices, and to add a few things that - guess workload but still do a worse job than if the workload - generator just did its own optimisations in the first place. - -> http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html - -Well, this one, with statements like "Postgres does have control -over its buffer cache," I don't know what to say. You can interpret -the statement however you like, but in the end Postgres has very little -control at all over how data is moved between memory and disk. - -BTW, please don't take me as saying that all control over physical -IO should be done by Postgres. I just think that Postgres could do -a better job of managing data transfer between disk and memory than -the OS can. The rest of the things (using raw partitions, read-ahead, -free-behind, etc.) just drop out of that one idea. - -cjs --- -Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light.
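The syscall-count argument in the message above (fewer, larger reads for the same data) can be illustrated with a rough sketch. This is not from the thread: the helper names `make_test_file` and `count_reads`, the 1 MB file size, and the use of Python are all assumptions made for illustration. It only counts read calls; it says nothing about elapsed time, read-ahead, or caching on any particular kernel.

```python
import os
import tempfile


def make_test_file(size):
    """Create a temporary file of `size` zero bytes and return its path (hypothetical helper)."""
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(b"\0" * size)
    f.close()
    return f.name


def count_reads(path, chunk_size):
    """Read the file sequentially with os.read(); return (bytes_read, read_calls)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        calls = 0
        while True:
            buf = os.read(fd, chunk_size)  # one read() system call per iteration
            calls += 1
            if not buf:  # an empty result means end of file
                break
            total += len(buf)
        return total, calls
    finally:
        os.close(fd)


if __name__ == "__main__":
    path = make_test_file(1024 * 1024)
    try:
        for chunk in (8 * 1024, 64 * 1024):
            total, calls = count_reads(path, chunk)
            print(f"chunk={chunk:6d} bytes_read={total} read_calls={calls}")
    finally:
        os.unlink(path)
```

Both passes return the same bytes; the 64K pass simply crosses the kernel/userland boundary far fewer times. Whether that difference matters depends, as the message says, on how much CPU the application has to spare.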
--XTC - - -From kaf@nwlink.com Fri Apr 26 14:22:39 2002 -Return-path: <kaf@nwlink.com> -Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3QIMc400783 - for <pgman@candle.pha.pa.us>; Fri, 26 Apr 2002 14:22:38 -0400 (EDT) -Received: (from kaf@localhost) - by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3QII0l16824; - Fri, 26 Apr 2002 11:18:00 -0700 -From: Kyle <kaf@nwlink.com> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Message-ID: <15561.39384.296503.501888@doppelbock.patentinvestor.com> -Date: Fri, 26 Apr 2002 11:18:00 -0700 -To: Bruce Momjian <pgman@candle.pha.pa.us> -Subject: Re: [HACKERS] Sequential Scan Read-Ahead -In-Reply-To: <200204261444.g3QEiFh11090@candle.pha.pa.us> -References: <15561.26116.817541.950416@doppelbock.patentinvestor.com> - <200204261444.g3QEiFh11090@candle.pha.pa.us> -X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid -Status: ORr - -Hey Bruce, - -I'll forward this to the list if you think they'd benefit from it. -I'm not sure it says anything about read-ahead, I think this is more a -kernel caching issue. But I've been known to be wrong in the past. -Anyway... 
- - -the test: - -foreach i (5 15 20 25 30 ) - echo $i - time dd bs=8k if=big_file1 of=/dev/null & - sleep $i - time dd bs=8k if=big_file1 of=/dev/null & - wait -end - -I did a couple more runs in the low range since there is a drastic -jump in elapsed (wall clock) time after doing a 6 second sleep: - - first process second process -sleep user kernel elapsed user kernel elapsed -0 sec 0.200 7.980 1:26.57 0.240 7.720 1:26.56 -3 sec 0.260 7.600 1:25.71 0.260 8.100 1:22.60 -5 sec 0.160 7.890 1:26.04 0.220 8.180 1:21.04 -6 sec 0.220 8.070 1:19.59 0.230 7.620 1:25.69 -7 sec 0.210 9.270 1:57.92 0.100 8.750 1:50.76 -8 sec 0.240 8.060 4:47.47 0.300 7.800 4:40.40 -15 sec 0.200 8.500 4:51.11 0.180 7.280 4:44.36 -20 sec 0.160 8.040 4:40.72 0.240 7.790 4:37.24 -25 sec 0.170 8.150 4:37.58 0.140 8.200 4:33.08 -30 sec 0.200 7.390 4:37.01 0.230 8.220 4:31.83 - - - -with a sleep of > 6 seconds, either the second process isn't getting -cached data or readahead is being turned off. I'd guess the former, I -don't see why read-ahead would be turned off since they're both doing -sequential operations. - -Although with 512mb of memory and the disk reading at about 22 mb/sec, -maybe we're not hitting the cache. I'd guess at least ~400 megs of -kernel cache is being used for buffering this 2 gig file. free(1) -reports: - -% free - total used free shared buffers cached -Mem: 512924 508576 4348 0 2640 477960 --/+ buffers/cache: 27976 484948 -Swap: 527152 15864 511288 - -so shouldn't we be getting cached data even with a sleep of up to -about (400/22) 18 seconds...? Maybe I'm just in the dark on what's -really happening. I should point out that this is linux 2.4.18. - - - - -Bruce Momjian wrote: -> -> I am trying to illustrate how kernel read-ahead could be turned off in -> certain cases. -> -> --------------------------------------------------------------------------- -> -> Kyle wrote: -> > What are you trying to test, the kernel's cache vs disk speed?
-> > -> > -> > Bruce Momjian wrote: -> > > -> > > Nice test. Would you test simultaneous 'dd' on the same file, perhaps -> > > with a slight delay between to the two so they don't read each other's -> > > blocks? -> > > -> > > seek() in the file will turn off read-ahead in most OS's. I am not -> > > saying this is a major issue for PostgreSQL but the numbers would be -> > > interesting. - -From pgsql-hackers-owner+M49418=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 15:52:28 2004 -Return-path: <pgsql-hackers-owner+M49418=pgman=candle.pha.pa.us@postgresql.org> -Received: from vm2.hub.org ([200.46.204.60]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RKqPe07814 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 15:52:28 -0500 (EST) -Received: from postgresql.org (svr1.postgresql.org [200.46.204.71]) - by vm2.hub.org (Postfix) with ESMTP id 70DC3CD397A - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 20:52:19 +0000 (GMT) -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (neptune.hub.org [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id A93D7D1D3A4 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 20:41:43 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 54186-02 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Tue, 27 Jan 2004 16:41:12 -0400 (AST) -Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) - by svr1.postgresql.org (Postfix) with ESMTP id 33243D1E1F2 - for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 16:36:24 -0400 (AST) -Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162]) - by smtp.istop.com (Postfix) with ESMTP - id 2A41136C44; Tue, 27 Jan 2004 15:36:21 -0500 (EST) -Received: from localhost ([127.0.0.1] helo=stark.xeocode.com) - by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian)) - id 
1AlZwa-0006sL-00; Tue, 27 Jan 2004 15:36:20 -0500 -To: pgsql-hackers@postgresql.org -Subject: [HACKERS] Question about indexes -From: Greg Stark <gsstark@mit.edu> -Organization: The Emacs Conspiracy; member since 1992 -Date: 27 Jan 2004 15:36:20 -0500 -Message-ID: <87isixt9h7.fsf@stark.xeocode.com> -Lines: 9 -User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - - -How feasible would it be to have a btree index on ctid? I'm thinking it ought -to work simply enough for the normal case of insert/delete/update, but I'm not -completely certain how vacuum, vacuum full, and cluster would interact. - -You may think this would be utterly useless, but I have a cunning plan.
- --- -greg - - ----------------------------(end of broadcast)--------------------------- -TIP 8: explain analyze is your friend - -From pgsql-hackers-owner+M49439=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 18:01:59 2004 -Return-path: <pgsql-hackers-owner+M49439=pgman=candle.pha.pa.us@postgresql.org> -Received: from bricolage.postgresql.org ([200.46.204.116]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RN1we27517 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 18:01:59 -0500 (EST) -Received: from postgresql.org (svr1.postgresql.org [200.46.204.71]) - by bricolage.postgresql.org (Postfix) with ESMTP id 946B3148343C - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 23:01:52 +0000 (GMT) -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (neptune.hub.org [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 778CED1D362 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 22:52:27 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 09353-02 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Tue, 27 Jan 2004 18:51:56 -0400 (AST) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by svr1.postgresql.org (Postfix) with ESMTP id 5C5D5D1B47D - for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 18:51:55 -0400 (AST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.10/8.12.10) with ESMTP id i0RMpunX029816; - Tue, 27 Jan 2004 17:51:56 -0500 (EST) -To: Greg Stark <gsstark@mit.edu> -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <87isixt9h7.fsf@stark.xeocode.com> -References: <87isixt9h7.fsf@stark.xeocode.com> -Comments: In-reply-to Greg Stark <gsstark@mit.edu> - message dated "27 Jan 2004 15:36:20 -0500" -Date: Tue, 27 Jan 2004 17:51:56 -0500 -Message-ID: 
<29815.1075243916@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Greg Stark <gsstark@mit.edu> writes: -> How feasible would it be to have a btree index on ctid? - -Why would you want one? Direct access by ctid beats out an index lookup -every time. In any case, vacuum and friends would break such an index -entirely. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate - subscribe-nomail command to majordomo@postgresql.org so that your - message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M49440=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 18:19:13 2004 -Return-path: <pgsql-hackers-owner+M49440=pgman=candle.pha.pa.us@postgresql.org> -Received: from krusty-motorsports.com (IDENT:exim@krusty-motorsports.com [192.94.170.8]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RNJCe00301 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 18:19:13 -0500 (EST) -Received: from [200.46.204.71] (helo=postgresql.org) - by krusty-motorsports.com with esmtp (Exim 4.22) - id 1AldQ9-0007JC-2z - for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 00:19:05 +0000 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (neptune.hub.org [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 6D641D1D54A - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 23:12:01 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 14466-06 - for 
<pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Tue, 27 Jan 2004 19:11:30 -0400 (AST) -Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) - by svr1.postgresql.org (Postfix) with ESMTP id 6D58FD1D49E - for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 19:11:29 -0400 (AST) -Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162]) - by smtp.istop.com (Postfix) with ESMTP - id 9B74536ADA; Tue, 27 Jan 2004 18:11:31 -0500 (EST) -Received: from localhost ([127.0.0.1] helo=stark.xeocode.com) - by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian)) - id 1AlcMl-0007Tk-00; Tue, 27 Jan 2004 18:11:31 -0500 -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -References: <87isixt9h7.fsf@stark.xeocode.com> - <29815.1075243916@sss.pgh.pa.us> -In-Reply-To: <29815.1075243916@sss.pgh.pa.us> -From: Greg Stark <gsstark@mit.edu> -Organization: The Emacs Conspiracy; member since 1992 -Date: 27 Jan 2004 18:11:31 -0500 -Message-ID: <87d695t2ak.fsf@stark.xeocode.com> -Lines: 33 -User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Tom Lane <tgl@sss.pgh.pa.us> writes: - -> Greg Stark <gsstark@mit.edu> writes: -> -> > How feasible would it be to have a btree index on ctid? -> -> Why would you want one? Direct access by ctid beats out an index lookup -> every time. - -Of course. But as I mentioned, I have a cunning plan. - -If you have two indexes (a,ctid) and (b,ctid) and do a query where a=1 and b=2 -then it would be particularly easy to combine the two efficiently. 
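A minimal sketch of the combination described in the paragraph above, assuming each index scan for an equality condition (a=1, b=2) yields its matching ctids already in ascending order. Nothing here is PostgreSQL code: `intersect_sorted` is a hypothetical name, and ctids are modelled as plain `(block, offset)` tuples.

```python
def intersect_sorted(a_ctids, b_ctids):
    """Intersect two ascending lists of (block, offset) ctids with a single merge pass.

    No sort and no heap access is needed, because both inputs are assumed
    to arrive already ordered by ctid.
    """
    result = []
    i = j = 0
    while i < len(a_ctids) and j < len(b_ctids):
        if a_ctids[i] == b_ctids[j]:
            # Tuple satisfies both conditions.
            result.append(a_ctids[i])
            i += 1
            j += 1
        elif a_ctids[i] < b_ctids[j]:
            i += 1  # advance whichever side is behind
        else:
            j += 1
    return result


if __name__ == "__main__":
    # Hypothetical index readouts for a=1 and b=2, each sorted by ctid.
    a_matches = [(0, 1), (0, 4), (2, 3), (5, 1)]
    b_matches = [(0, 4), (1, 2), (2, 3), (7, 7)]
    print(intersect_sorted(a_matches, b_matches))
```

The merge only works because of the ordering assumption; if either input were not in ctid order, the comparison-and-advance step would skip matches.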
- -If specially marked btree indexes -- or even all btree indexes -- implicitly -had ctid as a final sort order after all the index column, then it would -esentially obviate the need for bitmap indexes. They wouldn't have the space -advantage, but they would be possible to combine using arbitrary boolean -expressions without looking at the actual tuples. - -This is essentially what is in the TODO about using bitmaps, but without -having to do any extra sorts. - -This would only really be an advantage for particularly wide tables where the -combination of boolean clauses narrows the result set down a lot more than any -one clause. - -> In any case, vacuum and friends would break such an index entirely. - -That was what I was afraid of. - --- -greg - - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - - http://www.postgresql.org/docs/faqs/FAQ.html - -From pgsql-hackers-owner+M49442=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 18:32:25 2004 -Return-path: <pgsql-hackers-owner+M49442=pgman=candle.pha.pa.us@postgresql.org> -Received: from vm2.hub.org ([200.46.204.60]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RNWNe02539 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 18:32:24 -0500 (EST) -Received: from postgresql.org (svr1.postgresql.org [200.46.204.71]) - by vm2.hub.org (Postfix) with ESMTP id DC003CD49A4 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 23:32:17 +0000 (GMT) -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (neptune.hub.org [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 34466D1D17D - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 23:25:11 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 20117-05 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Tue, 27 
Jan 2004 19:24:41 -0400 (AST) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by svr1.postgresql.org (Postfix) with ESMTP id 33E28D1D548 - for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 19:24:40 -0400 (AST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.10/8.12.10) with ESMTP id i0RNOfnX000404; - Tue, 27 Jan 2004 18:24:41 -0500 (EST) -To: Greg Stark <gsstark@mit.edu> -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <87d695t2ak.fsf@stark.xeocode.com> -References: <87isixt9h7.fsf@stark.xeocode.com> <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> -Comments: In-reply-to Greg Stark <gsstark@mit.edu> - message dated "27 Jan 2004 18:11:31 -0500" -Date: Tue, 27 Jan 2004 18:24:41 -0500 -Message-ID: <403.1075245881@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Greg Stark <gsstark@mit.edu> writes: -> If you have two indexes (a,ctid) and (b,ctid) and do a query where a=1 and b=2 -> then it would be particularly easy to combine the two efficiently. - -> If specially marked btree indexes -- or even all btree indexes -- implicitly -> had ctid as a final sort order after all the index column, then it would -> esentially obviate the need for bitmap indexes. - -I don't think so. You are thinking only of exact-equality queries --- -as soon as the WHERE clause describes a range of index entries, the -readout wouldn't be sorted by ctid anyway. 
- -Combining indexes via a bitmap intermediate step (which is not really -the same thing as bitmap indexes, IIUC) seems like a more robust -approach than relying on the index entries to be in ctid order. - -But if we did want to sort indexes that way, we could do it today, -I think. The ctid is already stored in index entries (it is the -"payload" remember...) and we could use it as a tiebreaker when -determining insertion position. This doesn't have the problems that -putting ctid into the user columns would do, because the system knows -about that ctid as being special; the difficulty with ctid in the user -columns is the code not knowing that it'd need to change on a tuple move. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - - http://www.postgresql.org/docs/faqs/FAQ.html - -From pgsql-hackers-owner+M49450=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 21:28:20 2004 -Return-path: <pgsql-hackers-owner+M49450=pgman=candle.pha.pa.us@postgresql.org> -Received: from postgresql.wavefire.com (postgresql.wavefire.com [64.141.14.48]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0S2SIe29755 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 21:28:19 -0500 (EST) -Received: from postgresql.org ([200.46.204.71]) - by postgresql.wavefire.com (8.9.3/8.9.3) with ESMTP id TBM02845 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 19:06:45 -0800 (PST) - (envelope-from pgsql-hackers-owner+M49450=pgman=candle.pha.pa.us@postgresql.org) -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (neptune.hub.org [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 6213BD1B85F - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 02:19:56 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 69438-06 - for 
<pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Tue, 27 Jan 2004 22:19:26 -0400 (AST) -Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) - by svr1.postgresql.org (Postfix) with ESMTP id 1964FD1B47D - for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 22:19:24 -0400 (AST) -Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162]) - by smtp.istop.com (Postfix) with ESMTP - id BE92136B37; Tue, 27 Jan 2004 21:19:26 -0500 (EST) -Received: from localhost ([127.0.0.1] helo=stark.xeocode.com) - by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian)) - id 1AlfIc-00084d-00; Tue, 27 Jan 2004 21:19:26 -0500 -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -References: <87isixt9h7.fsf@stark.xeocode.com> - <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> - <403.1075245881@sss.pgh.pa.us> -In-Reply-To: <403.1075245881@sss.pgh.pa.us> -From: Greg Stark <gsstark@mit.edu> -Organization: The Emacs Conspiracy; member since 1992 -Date: 27 Jan 2004 21:19:26 -0500 -Message-ID: <877jzcu85t.fsf@stark.xeocode.com> -Lines: 43 -User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - - -Tom Lane <tgl@sss.pgh.pa.us> writes: - -> I don't think so. You are thinking only of exact-equality queries --- -> as soon as the WHERE clause describes a range of index entries, the -> readout wouldn't be sorted by ctid anyway. 
- -But then even bitmap indexes would fail in that way too, or at least have a -lot of extra cost that would have to be taken into account based on the number -of values in the range. - -> Combining indexes via a bitmap intermediate step (which is not really -> the same thing as bitmap indexes, IIUC) seems like a more robust -> approach than relying on the index entries to be in ctid order. - -I would see that as the next step, But it seems to me it would be only a small -set of queries where it would really help enough to outweigh the extra work of -the sort. Whereas if the ctid is already pre-sorted then the extra cost is -fairly low. Sort of like the difference in cost between a merge join where -both sides have to be sorted and a merge join where both sides are pre-sorted. - -> But if we did want to sort indexes that way, we could do it today, -> I think. The ctid is already stored in index entries (it is the -> "payload" remember...) and we could use it as a tiebreaker when -> determining insertion position. This doesn't have the problems that -> putting ctid into the user columns would do, because the system knows -> about that ctid as being special; the difficulty with ctid in the user -> columns is the code not knowing that it'd need to change on a tuple move. - -That's exactly what I was thinking. I just don't know how badly it would -complicate the vacuum{,full}/cluster code and whether those are the only cases -to worry about. - - -Note that the space saving of bitmap indexes is still a substantial factor. -Using btree indexes the i/o costs of doing multiple index scans plus a table -scan of the relevant pages would still be quite substantial. So this doesn't -completely obviate the need for bitmap indexes, but I think it would remove a -lot of the pressure from people who just need them to handle a few select -queries. 
- --- -greg - - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M49453=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 21:53:09 2004 -Return-path: <pgsql-hackers-owner+M49453=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0S2r3e04133 - for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 21:53:08 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 791556 for pgman@candle.pha.pa.us; Tue, 27 Jan 2004 18:49:49 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (neptune.hub.org [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id C4A10D1B47D - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 02:49:28 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 76787-10 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Tue, 27 Jan 2004 22:48:59 -0400 (AST) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by svr1.postgresql.org (Postfix) with ESMTP id A5C5CD1B4DC - for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 22:48:56 -0400 (AST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0S2mxTx005814; - Tue, 27 Jan 2004 21:48:59 -0500 (EST) -To: Greg Stark <gsstark@mit.edu> -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <877jzcu85t.fsf@stark.xeocode.com> -References: <87isixt9h7.fsf@stark.xeocode.com> <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> <403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com> 
-Comments: In-reply-to Greg Stark <gsstark@mit.edu> - message dated "27 Jan 2004 21:19:26 -0500" -Date: Tue, 27 Jan 2004 21:48:59 -0500 -Message-ID: <5813.1075258139@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Greg Stark <gsstark@mit.edu> writes: ->> Combining indexes via a bitmap intermediate step (which is not really ->> the same thing as bitmap indexes, IIUC) seems like a more robust ->> approach than relying on the index entries to be in ctid order. - -> I would see that as the next step, But it seems to me it would be only a small -> set of queries where it would really help enough to outweigh the extra work of -> the sort. - -What sort? The whole point of a bitmap is that it makes it easy to -visit the tuples in heap order. You scan the index, you set the -appropriate bits in the bitmap, and then you scan the bitmap and go to -the heap tuples that have their bits set. If you are using multiple -indexes you can AND or OR their results at the bitmap phase before you -go to the heap. - -An implementation of this kind would not produce tuples in index order, -so if you have an ORDER BY to satisfy then you end up doing an explicit -sort after you have the tuples. It would be up to the planner to -consider this cost versus the advantages of being able to use multiple -indexes; we'd certainly want to keep the existing scan mechanism as an -available alternative. But if the query is suited to multiple indexes -I suspect it'd be a win pretty often. - -> Note that the space saving of bitmap indexes is still a substantial factor. 
- -I think you are still confusing what I'm talking about with a bitmap -index, ie, a persistent structure on-disk. It's not that at all, but -a transient structure built in-memory during an index scan. - -I'm a little dubious that true bitmap indexes would be worth building -for Postgres. Seems like partial indexes cover the same sorts of -applications and are more flexible. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - - http://www.postgresql.org/docs/faqs/FAQ.html - -From pgsql-hackers-owner+M49462=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 13:10:48 2004 -Return-path: <pgsql-hackers-owner+M49462=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SIAle25230 - for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 13:10:47 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 793300 for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 10:07:34 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 19389D1CCAF - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 17:56:46 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 10780-09 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Wed, 28 Jan 2004 13:56:14 -0400 (AST) -Received: from www.postgresql.com (www.postgresql.com [200.46.204.209]) - by svr1.postgresql.org (Postfix) with ESMTP id A53DAD1DF6B - for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 13:52:13 -0400 (AST) -Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) - by www.postgresql.com (Postfix) 
with ESMTP id E0414CF6FBA - for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 10:47:17 -0400 (AST) -Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162]) - by smtp.istop.com (Postfix) with ESMTP - id C4D5036BA2; Wed, 28 Jan 2004 09:13:47 -0500 (EST) -Received: from localhost ([127.0.0.1] helo=stark.xeocode.com) - by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian)) - id 1AlqRv-0001fZ-00; Wed, 28 Jan 2004 09:13:47 -0500 -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -References: <87isixt9h7.fsf@stark.xeocode.com> - <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> - <403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com> - <5813.1075258139@sss.pgh.pa.us> -In-Reply-To: <5813.1075258139@sss.pgh.pa.us> -From: Greg Stark <gsstark@mit.edu> -Organization: The Emacs Conspiracy; member since 1992 -Date: 28 Jan 2004 09:13:47 -0500 -Message-ID: <871xpktb38.fsf@stark.xeocode.com> -Lines: 38 -User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Tom Lane <tgl@sss.pgh.pa.us> writes: - -> Greg Stark <gsstark@mit.edu> writes: -> > -> > I would see that as the next step, But it seems to me it would be only a small -> > set of queries where it would really help enough to outweigh the extra work of -> > the sort. -> -> What sort? - -To build the in-memory bitmap you effectively have to do a sort. If the tuples -come out of the index in heap order then you can combine them without having -to go through that step. 
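The in-memory bitmap step discussed in this exchange can be sketched roughly as follows. This is an illustration only, not how PostgreSQL implements anything: the names `build_bitmap` and `heap_order`, the constant `TUPLES_PER_PAGE`, and the use of a Python integer as the bitmap are all assumptions. Index readouts are deliberately given in non-ctid order to show that setting bits places each entry by position, which is the sense in which building the bitmap stands in for a sort.

```python
TUPLES_PER_PAGE = 256  # hypothetical upper bound on tuple offsets within one heap page


def build_bitmap(ctids):
    """Set one bit per (block, offset) ctid; the input order does not matter."""
    bits = 0
    for block, offset in ctids:
        bits |= 1 << (block * TUPLES_PER_PAGE + offset)
    return bits


def heap_order(bits):
    """Scan the bitmap from the low end, yielding ctids in heap (physical) order."""
    out = []
    pos = 0
    while bits:
        if bits & 1:
            out.append(divmod(pos, TUPLES_PER_PAGE))  # back to (block, offset)
        bits >>= 1
        pos += 1
    return out


if __name__ == "__main__":
    # Hypothetical readouts from two index scans, in index order rather than ctid order.
    from_index_a = [(5, 1), (0, 4), (2, 3), (0, 1)]
    from_index_b = [(7, 7), (2, 3), (0, 4), (1, 2)]
    both = build_bitmap(from_index_a) & build_bitmap(from_index_b)    # a AND b
    either = build_bitmap(from_index_a) | build_bitmap(from_index_b)  # a OR b
    print(heap_order(both))
    print(heap_order(either))
```

As noted earlier in the thread, the result comes out in heap order, not index order, so a query with an ORDER BY would still need an explicit sort afterwards.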
- -> I'm a little dubious that true bitmap indexes would be worth building -> for Postgres. Seems like partial indexes cover the same sorts of -> applications and are more flexible. - -I'm clear on the distinction. I think bitmap indexes still have a place, but -if regular btree indexes could be combined efficiently then that would be an -even narrower niche. - -Partial indexes are very handy, and they're useful in corner cases where -bitmap indexes are useful, such as flags for special types of records. - -But I think bitmap indexes are specifically wanted by certain types of data -warehousing applications where you have an index on virtually every column and -then want to do arbitrary boolean combinations of all of them. btree indexes -would generate more i/o scanning all the indexes than just doing a sequential -scan would. Whereas bitmap indexes are much denser on disk. - -However my experience leans more towards the OLTP side and I very rarely saw -applications like this. - - - --- -greg - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate - subscribe-nomail command to majordomo@postgresql.org so that your - message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M49465=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 13:30:48 2004 -Return-path: <pgsql-hackers-owner+M49465=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SIUke29027 - for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 13:30:47 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 793371 for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 10:27:31 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by 
svr1.postgresql.org (Postfix) with ESMTP id 92005D1D3F7 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 18:14:02 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 21680-08 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Wed, 28 Jan 2004 14:13:31 -0400 (AST) -Received: from www.postgresql.com (www.postgresql.com [200.46.204.209]) - by svr1.postgresql.org (Postfix) with ESMTP id 088B0D1DC77 - for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 14:08:44 -0400 (AST) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by www.postgresql.com (Postfix) with ESMTP id CFF50CF77BD - for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 11:00:42 -0400 (AST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0SExBYA018093; - Wed, 28 Jan 2004 09:59:12 -0500 (EST) -To: Greg Stark <gsstark@mit.edu> -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <871xpktb38.fsf@stark.xeocode.com> -References: <87isixt9h7.fsf@stark.xeocode.com> <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> <403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com> <5813.1075258139@sss.pgh.pa.us> <871xpktb38.fsf@stark.xeocode.com> -Comments: In-reply-to Greg Stark <gsstark@mit.edu> - message dated "28 Jan 2004 09:13:47 -0500" -Date: Wed, 28 Jan 2004 09:59:11 -0500 -Message-ID: <18092.1075301951@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Greg Stark <gsstark@mit.edu> writes: -> Tom 
Lane <tgl@sss.pgh.pa.us> writes: ->> What sort? - -> To build the in-memory bitmap you effectively have to do a sort. - -Hm, you're thinking that the operation of inserting a bit into a bitmap -has to be at least O(log N). Seems to me that that depends on the data -structure you use. In principle it could be O(1), if you use a true -bitmap (linear array) -- just index and set the bit. You might be right -that practical data structures would be O(log N), but I'm not totally -convinced. - -> If the tuples come out of the index in heap order then you can combine -> them without having to go through that step. - -But considering the restrictions implied by that assumption --- no range -scans, no non-btree indexes --- I doubt we will take the trouble to -implement that variant. We'll want to do the generalized bitmap code -anyway. - -In any case, this discussion is predicated on the assumption that the -operations involving the bitmap are a significant fraction of the total -time, which I think is quite uncertain. Until we build it and profile -it, we won't know that. 
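The O(1) insertion into a true linear bitmap mentioned above can be sketched as follows. This is only an illustration in Python, not PostgreSQL code: the class name, the 0-based item numbers, and the position formula (page number times max items per page, plus item number) are assumptions made for the example.

```python
class LinearBitmap:
    """Uncompressed bitmap with one bit per possible (page, item) slot."""

    def __init__(self, n_pages, max_items_per_page):
        self.max_items = max_items_per_page
        n_bits = n_pages * max_items_per_page
        self.bits = bytearray((n_bits + 7) // 8)

    def _pos(self, page, item):
        # Assumed layout: slots of one page are contiguous.
        return page * self.max_items + item

    def set(self, page, item):
        # Constant time: compute the position, then set one bit.
        pos = self._pos(page, item)
        self.bits[pos >> 3] |= 1 << (pos & 7)

    def test(self, page, item):
        pos = self._pos(page, item)
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))
```

The cost is memory: the array is sized for every slot in the table whether or not any bit is set, which is the trade-off against the O(log N) structures discussed above.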
- - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M49457=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 10:42:58 2004 -Return-path: <pgsql-hackers-owner+M49457=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SFgue00574 - for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 10:42:57 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 792727 for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 07:39:41 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 08484D1CA01 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 15:38:28 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 36717-02 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Wed, 28 Jan 2004 11:37:55 -0400 (AST) -Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) - by svr1.postgresql.org (Postfix) with ESMTP id E27BDD1D201 - for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 11:37:55 -0400 (AST) -Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162]) - by smtp.istop.com (Postfix) with ESMTP - id 1E70F36BBA; Wed, 28 Jan 2004 10:09:35 -0500 (EST) -Received: from localhost ([127.0.0.1] helo=stark.xeocode.com) - by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian)) - id 1AlrJu-0001rj-00; Wed, 28 Jan 2004 10:09:34 -0500 -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -References: <87isixt9h7.fsf@stark.xeocode.com> - 
<29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> - <403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com> - <5813.1075258139@sss.pgh.pa.us> <871xpktb38.fsf@stark.xeocode.com> - <18092.1075301951@sss.pgh.pa.us> -In-Reply-To: <18092.1075301951@sss.pgh.pa.us> -From: Greg Stark <gsstark@mit.edu> -Organization: The Emacs Conspiracy; member since 1992 -Date: 28 Jan 2004 10:09:34 -0500 -Message-ID: <87vfmwrtxt.fsf@stark.xeocode.com> -Lines: 15 -User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: ORr - - -Tom Lane <tgl@sss.pgh.pa.us> writes: - -> In any case, this discussion is predicated on the assumption that the -> operations involving the bitmap are a significant fraction of the total -> time, which I think is quite uncertain. Until we build it and profile -> it, we won't know that. - -The other thought I had was that it would be difficult to tell when to follow -this path. Since the main case where it wins is when the individual indexes -aren't very selective but the combination is very selective, and we don't have -inter-column correlation statistics ... 
- --- -greg - - ----------------------------(end of broadcast)--------------------------- -TIP 9: the planner will ignore your desire to choose an index scan if your - joining column's datatypes do not match - -From pgsql-hackers-owner+M49467=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 17:29:11 2004 -Return-path: <pgsql-hackers-owner+M49467=pgman=candle.pha.pa.us@postgresql.org> -Received: from svr1.postgresql.org ([200.46.204.71]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SMT9e09381 - for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 17:29:10 -0500 (EST) -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 7E6A1D1D0F9 - for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 22:29:02 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 30501-10 for <pgman@candle.pha.pa.us>; - Wed, 28 Jan 2004 18:28:33 -0400 (AST) -Received: from postgresql.org (svr1.postgresql.org [200.46.204.71]) - by svr1.postgresql.org (Postfix) with ESMTP id 002FED1CCDA - for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 18:28:30 -0400 (AST) -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id BC300D1B4BD - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 22:16:19 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 29171-03 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Wed, 28 Jan 2004 18:15:50 -0400 (AST) -Received: from cmailm1.svr.pol.co.uk (cmailm1.svr.pol.co.uk [195.92.193.18]) - by svr1.postgresql.org (Postfix) with ESMTP id 99F4BD1C50E - for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 18:15:47 -0400 (AST) -Received: from modem-182.leopard.dialup.pol.co.uk 
([217.135.144.182] helo=LaptopDellXP) - by cmailm1.svr.pol.co.uk with esmtp (Exim 4.14) - id 1AlxyO-0002XD-Ab; Wed, 28 Jan 2004 22:15:48 +0000 -Reply-To: <simon@2ndquadrant.com> -From: "Simon Riggs" <simon@2ndquadrant.com> -To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, "'Greg Stark'" <gsstark@mit.edu> -cc: <pgsql-hackers@postgresql.org> -Subject: Re: [HACKERS] Question about indexes -Date: Wed, 28 Jan 2004 22:15:40 -0000 -Organization: 2nd Quadrant -Message-ID: <003701c3e5ec$44306250$efb887d9@LaptopDellXP> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="US-ASCII" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook, Build 10.0.2627 -X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300 -Importance: Normal -In-Reply-To: <18092.1075301951@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Some potentially helpful background comments on the discussion so far... - ->Tom Lane writes ->>Greg Stark writes ->> Note that the space saving of bitmap indexes is still a substantial ->> factor. ->I think you are still confusing what I'm talking about with a bitmap ->index, ie, a persistent structure on-disk. It's not that at all, but a ->transient structure built in-memory during an index scan. - -Oracle allows the creation of bitmap indices as persistent data -structures. - -The "space saving" of bitmap indices is only a saving when compared with -btree indices. If you don't have them at all because they are built -dynamically when required, as Tom is suggesting, then you "save" even -more space. - -Maintaining the bitmap index is a costly operation. 
You tend to want to -build them on "characteristic" columns, of which there tend to be more -in a database than "partial/full identity" columns on which you build -btrees (forgive the vagueness of that comment), so you end up with loads -of the damn things, so the space soon adds up. It can be hard to judge -which ones are the important ones, especially when each is used by a -different user/group. Building them dynamically is a good way of solving -the question "which ones are needed?". Ever seen 58 indices on a table? -Don't go there. - -My vote would be to implement the dynamic building capability, then return -to implement a persisted structure later if that seems like it would be -a further improvement. [The option would be nice] - -If we do it dynamically, as Tom suggests, then we don't have to code the -index maintenance logic at all and the functionality will be with us all -the sooner. Go Tom! - ->Tom Lane writes -> In any case, this discussion is predicated on the assumption that the -> operations involving the bitmap are a significant fraction of the total -> time, which I think is quite uncertain. Until we build it and profile -> it, we won't know that. - -Dynamically building the bitmaps has been the strategy in use by -Teradata for nearly a decade on many large data warehouses. I can -personally vouch for the effectiveness of this approach - I was -surprised when Oracle went for the persistent option. Certainly in that -case building the bitmaps adds much less time than is saved overall by -the better total query strategy. - ->Greg Stark writes -> > To build the in-memory bitmap you effectively have to do a sort. - -Not sure on this latter point: I think I agree with Greg on that point, -but want to believe Tom because requiring a sort will definitely add -time. - -To shed some light in this area, some other major implementations are: - -In Teradata, tables are stored based upon a primary index, which is -effectively an index-organised table. 
The index pointers are stored in -sorted order lock step with the blocks of the associated table - No sort -required. (The ordering is based upon a hashed index, but that doesn't -change the technique). - -Oracle's tables/indexes use heaps/btrees also, though they do provide an -index-organised table feature similar to Teradata. Maybe the lack of -heap/btree consistent ordering in Oracle and their subsequent design -choice of persistent bitmap indices is an indication for PostgreSQL too? - -In Oracle, bitmap indices are an important precursor to the star join -technique. AFAICS it is still possible to have a star join plan without -having persistent bitmap indices. IMHO, the longer term goal of a good -star join plan is an important one - that may influence the design -selection for this discussion. - -Hope some of that helps, - -Best regards, Simon Riggs - - ----------------------------(end of broadcast)--------------------------- -TIP 8: explain analyze is your friend - -From pgsql-hackers-owner+M49477=pgman=candle.pha.pa.us@postgresql.org Thu Jan 29 04:24:47 2004 -Return-path: <pgsql-hackers-owner+M49477=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0T9Ohe19178 - for <pgman@candle.pha.pa.us>; Thu, 29 Jan 2004 04:24:43 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 794811 for pgman@candle.pha.pa.us; Thu, 29 Jan 2004 01:21:28 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 639A8D1B4CE - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Thu, 29 Jan 2004 09:17:40 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 24681-09 
- for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Thu, 29 Jan 2004 05:17:16 -0400 (AST) -Received: from loki.hnit.is (unknown [193.4.243.180]) - by svr1.postgresql.org (Postfix) with ESMTP id 98971D1C9FD - for <pgsql-hackers@postgresql.org>; Thu, 29 Jan 2004 05:17:07 -0400 (AST) -Received: from seifur.hnit.is ([193.4.243.99]) by 193.4.243.180 with trend_isnt_name_B; Thu, 29 Jan 2004 09:17:12 -0000 -X-MimeOLE: Produced By Microsoft Exchange V6.0.6487.1 -Content-Class: urn:content-classes:message -MIME-Version: 1.0 -Content-Type: text/plain; - charset="us-ascii" -Subject: Re: [HACKERS] Question about indexes -Date: Thu, 29 Jan 2004 09:17:11 -0000 -Message-ID: <0A5B2E3C3A64CA4AB14F76DBCA76DDA44EF9B2@seifur.hnit.is> -Thread-Topic: [HACKERS] Question about indexes -Thread-Index: AcPl7J1SKohPpCtfSZq2EeeqhKLynAAW3BDw -From: <lnd@hnit.is> -To: <pgsql-hackers@postgresql.org> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id i0T9Ohe19178 -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.7 required=5.0 tests=BAYES_00,NO_REAL_NAME - autolearn=no version=2.61 -Status: OR - - -A small comment on Oracle's implementation of persistent bitmap indexes: - -Oracle's bitmap index is concurrently locked by DML, i.e. it is suited for OLAP -(basically read-only data warehouses) but in no way for OLTP. - -IMHO, -Laimis - -> Maybe the lack of heap/btree consistent ordering in Oracle -> and their subsequent design choice of persistent bitmap -> indices is an indication for PostgreSQL too? 
- - ----------------------------(end of broadcast)--------------------------- -TIP 9: the planner will ignore your desire to choose an index scan if your - joining column's datatypes do not match - -From pgsql-hackers-owner+M49497=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 01:22:15 2004 -Return-path: <pgsql-hackers-owner+M49497=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0U6MCe03385 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 01:22:14 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 797306 for pgman@candle.pha.pa.us; Thu, 29 Jan 2004 22:18:52 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 6CCBCD1C967 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 06:16:52 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 81674-05 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 02:16:22 -0400 (AST) -Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) - by svr1.postgresql.org (Postfix) with ESMTP id 6DC4BD1CC98 - for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 02:16:21 -0400 (AST) -Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162]) - by smtp.istop.com (Postfix) with ESMTP - id 8FD5F369BB; Fri, 30 Jan 2004 01:16:21 -0500 (EST) -Received: from localhost ([127.0.0.1] helo=stark.xeocode.com) - by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian)) - id 1AmRwz-0004kf-00; Fri, 30 Jan 2004 01:16:21 -0500 -To: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -References: 
<0A5B2E3C3A64CA4AB14F76DBCA76DDA44EF9B2@seifur.hnit.is> -In-Reply-To: <0A5B2E3C3A64CA4AB14F76DBCA76DDA44EF9B2@seifur.hnit.is> -From: Greg Stark <gsstark@mit.edu> -Organization: The Emacs Conspiracy; member since 1992 -Date: 30 Jan 2004 01:16:21 -0500 -Message-ID: <87y8rqx8p6.fsf@stark.xeocode.com> -Lines: 31 -User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - - -<lnd@hnit.is> writes: - -> A small comment on Oracle's implementation of persistent bitmap indexes: -> -> Oracle's bitmap index is concurrently locked by DML, i.e. it is suited for OLAP -> (basically read-only data warehouses) but in no way for OLTP. - -I knew this. I think they figured that was ok because bitmap indexes were -mainly intended to solve data warehouse problems anyways. - -Thinking out loud here, I wonder whether this would be less of a problem for -postgres. Since tuples are never updated in place there would never be a need -to lock the entire bitmap until a transaction completes. - -There would never be as much concurrency as with btrees, assuming there was any -kind of compression on the bitmap, but I don't see any reason why a long-term -lock would have to be held for updates. - -Even regular vacuum might not have to lock anything for long, just long enough -to clear the bits. And vacuum full/cluster already take table locks anyways. - -I think the problem Oracle ran into was that storing rollback ids in the -bitmap is untenable. The whole point of persistent bitmap indexes is to store -a very dense representation that represents thousands of records per page. 
-Allocating space to store thousands of pending transaction ids and having -thousands of old versions of the page in the rollback segment would defeat the -purpose. - --- -greg - - ----------------------------(end of broadcast)--------------------------- -TIP 7: don't forget to increase your free space map settings - -From pgsql-hackers-owner+M49502=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 06:37:25 2004 -Return-path: <pgsql-hackers-owner+M49502=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UBbOe07302 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 06:37:25 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 797695 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 03:34:06 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 92A3CD1CCB7 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 11:31:21 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 76882-10 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 07:31:24 -0400 (AST) -Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251]) - by svr1.postgresql.org (Postfix) with ESMTP id 59850D1CACB - for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 07:31:20 -0400 (AST) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.11.6) id i0UBVHU04169; - Fri, 30 Jan 2004 06:31:17 -0500 (EST) -From: Bruce Momjian <pgman@candle.pha.pa.us> -Message-ID: <200401301131.i0UBVHU04169@candle.pha.pa.us> -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <87vfmwrtxt.fsf@stark.xeocode.com> -To: Greg Stark 
<gsstark@mit.edu> -Date: Fri, 30 Jan 2004 06:31:17 -0500 (EST) -cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL108 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Greg Stark wrote: -> -> Tom Lane <tgl@sss.pgh.pa.us> writes: -> -> > In any case, this discussion is predicated on the assumption that the -> > operations involving the bitmap are a significant fraction of the total -> > time, which I think is quite uncertain. Until we build it and profile -> > it, we won't know that. -> -> The other thought I had was that it would be difficult to tell when to follow -> this path. Since the main case where it wins is when the individual indexes -> aren't very selective but the combination is very selective, and we don't have -> inter-column correlation statistics ... - -I like the idea of building in-memory bitmapped indexes. - -In your example, if you are restricting on A and B, and have no A,B -index but an A index and B index, why wouldn't you always create an -in-memory bitmapped index from indexes A and B, unless index A hits only -a few rows. In fact, from the optimizer statistics, you can guess on -how many bits you will hit from index A and index B, so we only have to -decide if it is better to take the more restrictive index and do heap -lookups for those, or scan the second index and then hit the heap. The -only thing A,B combined statistics would tell you is how many heap -matches you will find. The time to scan A and B indexes and create the -bitmap is already guessable from the single column statistics. - -Also, what does an in-memory bitmapped index look like? Is it: - - value: bitmap... - value: bitmap... - -with the values organized in a btree fashion? 
- --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 359-1001 - + If your life is a hard drive, | 13 Roberts Road - + Christ can be your backup. | Newtown Square, Pennsylvania 19073 - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - - http://archives.postgresql.org - -From pgsql-hackers-owner+M49505=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 09:55:27 2004 -Return-path: <pgsql-hackers-owner+M49505=pgman=candle.pha.pa.us@postgresql.org> -Received: from zippy.ims.net (IDENT:BTCTknqFfnMWdPgoZjvES928uVdg+CPr@zippy.ims.net [208.166.202.2]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UEtPe12397 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 09:55:26 -0500 (EST) -Received: from postgresql.org (svr1.postgresql.org [200.46.204.71]) - by zippy.ims.net (8.11.6/linuxconf) with ESMTP id i0UEsQt01250 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 08:54:31 -0600 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 3DF5DD1C9E1 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 14:48:26 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 55394-05 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 10:48:29 -0400 (AST) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by svr1.postgresql.org (Postfix) with ESMTP id 79B71D1C992 - for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 10:48:25 -0400 (AST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0UEmJw9012966; - Fri, 30 Jan 2004 09:48:19 -0500 (EST) -To: Bruce Momjian <pgman@candle.pha.pa.us> -cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org 
-Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <200401301131.i0UBVHU04169@candle.pha.pa.us> -References: <200401301131.i0UBVHU04169@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us> - message dated "Fri, 30 Jan 2004 06:31:17 -0500" -Date: Fri, 30 Jan 2004 09:48:19 -0500 -Message-ID: <12965.1075474099@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=no - version=2.61 -Status: ORr - -Bruce Momjian <pgman@candle.pha.pa.us> writes: -> Also, what does an in-memory bitmapped index look like? - -One idea that might work: a binary search tree in which each node -represents a single page of the table, and contains a bit array with -one bit for each possible item number on the page. You could not need -more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits -in a node, or about 36 bytes at default BLCKSZ --- for most tables you -could probably prove it would be a great deal less. You only allocate -nodes for pages that have at least one interesting row. - -I think this would represent a reasonable compromise between size and -insertion speed. It would only get large if the indexscan output -demanded visiting many different pages --- but at some point you could -abandon index usage and do a sequential scan, so I think that property -is okay. - -A variant is to make the per-page bit arrays be entries in a hash table -with page number as hash key. This would reduce insertion to a nearly -constant-time operation, but the drawback is that you'd need an explicit -sort at the end to put the per-page entries into page number order -before you scan 'em. You might come out ahead anyway, not sure. 
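The hash-table variant above can be sketched roughly as follows. This is a hedged illustration in Python, not PostgreSQL code: a dict stands in for the hash table, a Python int stands in for the per-page bit array, item numbers are taken as 0-based, and all names are hypothetical.

```python
class PageBitmap:
    """Per-page bit arrays kept in a hash table keyed by page number."""

    def __init__(self):
        # page number -> int used as a bit array (bit i means item i)
        self.pages = {}

    def add(self, page, item):
        # Near constant time: one hash lookup plus one bit OR.
        # An entry exists only for pages with at least one set bit.
        self.pages[page] = self.pages.get(page, 0) | (1 << item)

    def intersect(self, other):
        # AND two bitmaps page by page, dropping pages left empty.
        result = PageBitmap()
        for page, bits in self.pages.items():
            common = bits & other.pages.get(page, 0)
            if common:
                result.pages[page] = common
        return result

    def scan(self):
        # The explicit sort at the end: visit pages in page-number order.
        for page in sorted(self.pages):
            bits = self.pages[page]
            item = 0
            while bits:
                if bits & 1:
                    yield (page, item)
                bits >>= 1
                item += 1
```

Replacing the dict with an ordered tree would give the binary-search-tree form instead: insertion becomes O(log N) in the number of pages, but the final sort in scan() goes away.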
- -Or we could try a true linear bitmap (indexed by page number times -max-items-per-page plus item number) that's compressed in some fashion, -probably just by eliminating large runs of zeroes. The difficulty here -is that inserting a new one-bit could be pretty expensive, and we need -it to be cheap. - -Perhaps someone can come up with other better ideas ... - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 8: explain analyze is your friend - -From pgsql-hackers-owner+M49506=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 10:23:37 2004 -Return-path: <pgsql-hackers-owner+M49506=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFNZe17036 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:23:36 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 797996 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 07:20:18 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 8901ED1C9B3 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 15:14:26 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 67347-02 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 11:14:30 -0400 (AST) -Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251]) - by svr1.postgresql.org (Postfix) with ESMTP id F021AD1C95E - for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 11:14:24 -0400 (AST) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.11.6) id i0UFEMl15556; - Fri, 30 Jan 2004 10:14:22 -0500 (EST) -From: Bruce Momjian <pgman@candle.pha.pa.us> 
-Message-ID: <200401301514.i0UFEMl15556@candle.pha.pa.us> -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <12965.1075474099@sss.pgh.pa.us> -To: Tom Lane <tgl@sss.pgh.pa.us> -Date: Fri, 30 Jan 2004 10:14:22 -0500 (EST) -cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL108 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Tom Lane wrote: -> Bruce Momjian <pgman@candle.pha.pa.us> writes: -> > Also, what does an in-memory bitmapped index look like? -> -> One idea that might work: a binary search tree in which each node -> represents a single page of the table, and contains a bit array with -> one bit for each possible item number on the page. You could not need -> more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits -> in a node, or about 36 bytes at default BLCKSZ --- for most tables you -> could probably prove it would be a great deal less. You only allocate -> nodes for pages that have at least one interesting row. - -Actually, I think I made a mistake. I was wondering what on-disk -bitmapped indexes look like. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 359-1001 - + If your life is a hard drive, | 13 Roberts Road - + Christ can be your backup. 
| Newtown Square, Pennsylvania 19073 - ----------------------------(end of broadcast)--------------------------- -TIP 9: the planner will ignore your desire to choose an index scan if your - joining column's datatypes do not match - -From pgsql-hackers-owner+M49507=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 10:31:27 2004 -Return-path: <pgsql-hackers-owner+M49507=pgman=candle.pha.pa.us@postgresql.org> -Received: from zippy.ims.net (IDENT:AWZrLd+EfFmX1x4Ch6+4AfIqn908pAfY@zippy.ims.net [208.166.202.2]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFVOe18065 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:31:26 -0500 (EST) -Received: from postgresql.org (svr1.postgresql.org [200.46.204.71]) - by zippy.ims.net (8.11.6/linuxconf) with ESMTP id i0UFURt02719 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 09:30:32 -0600 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 9DF9ED1CCA7 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 15:22:35 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 66733-09 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 11:22:39 -0400 (AST) -Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251]) - by svr1.postgresql.org (Postfix) with ESMTP id 235C3D1CCB2 - for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 11:22:33 -0400 (AST) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.11.6) id i0UFMYr16926; - Fri, 30 Jan 2004 10:22:34 -0500 (EST) -From: Bruce Momjian <pgman@candle.pha.pa.us> -Message-ID: <200401301522.i0UFMYr16926@candle.pha.pa.us> -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <87vfmwrtxt.fsf@stark.xeocode.com> -To: Greg Stark <gsstark@mit.edu> -Date: Fri, 30 Jan 2004 10:22:34 -0500 (EST) 
-cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL108 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Greg Stark wrote: -> -> Tom Lane <tgl@sss.pgh.pa.us> writes: -> -> > In any case, this discussion is predicated on the assumption that the -> > operations involving the bitmap are a significant fraction of the total -> > time, which I think is quite uncertain. Until we build it and profile -> > it, we won't know that. -> -> The other thought I had was that it would be difficult to tell when to follow -> this path. Since the main case where it wins is when the individual indexes -> aren't very selective but the combination is very selective, and we don't have -> inter-column correlation statistics ... - -We actually have heap access cost and index access cost. You could -compare costs of looking at all of index A's heap vs. looking at index -B and then hopefully fewer heap rows. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 359-1001 - + If your life is a hard drive, | 13 Roberts Road - + Christ can be your backup. 
| Newtown Square, Pennsylvania 19073 - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From alvherre@CM-lcon2-51-253.cm.vtr.net Fri Jan 30 10:24:32 2004 -Return-path: <alvherre@CM-lcon2-51-253.cm.vtr.net> -Received: from CM-lcon2-51-253.cm.vtr.net (CM-lcon2-51-253.cm.vtr.net [200.83.51.253]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFOSe17199 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:24:31 -0500 (EST) -Received: by CM-lcon2-51-253.cm.vtr.net (Postfix, from userid 500) - id 9A93157578; Fri, 30 Jan 2004 10:24:18 -0500 (EST) -Date: Fri, 30 Jan 2004 12:24:18 -0300 -From: Alvaro Herrera <alvherre@dcc.uchile.cl> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>, - pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -Message-ID: <20040130152418.GB24123@dcc.uchile.cl> -References: <200401301131.i0UBVHU04169@candle.pha.pa.us> <12965.1075474099@sss.pgh.pa.us> -MIME-Version: 1.0 -Content-Type: text/plain; charset=iso-8859-1 -Content-Disposition: inline -Content-Transfer-Encoding: 8bit -In-Reply-To: <12965.1075474099@sss.pgh.pa.us> -User-Agent: Mutt/1.4.1i -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: ORr - -On Fri, Jan 30, 2004 at 09:48:19AM -0500, Tom Lane wrote: - -> A variant is to make the per-page bit arrays be entries in a hash table -> with page number as hash key. This would reduce insertion to a nearly -> constant-time operation, but the drawback is that you'd need an explicit -> sort at the end to put the per-page entries into page number order -> before you scan 'em. You might come out ahead anyway, not sure. 
- -Is there a reason sort the pages before scanning them? The result won't -come out sorted one way or the other. - --- -Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) -"Para tener más hay que desear menos" - -From pgsql-hackers-owner+M49509=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 10:39:11 2004 -Return-path: <pgsql-hackers-owner+M49509=pgman=candle.pha.pa.us@postgresql.org> -Received: from zippy.ims.net (IDENT:QumGpJuSSF+qB+W577trqd4FqP6fc1O+@zippy.ims.net [208.166.202.2]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFd9e19273 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:39:10 -0500 (EST) -Received: from postgresql.org (svr1.postgresql.org [200.46.204.71]) - by zippy.ims.net (8.11.6/linuxconf) with ESMTP id i0UFcDt02990 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 09:38:17 -0600 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id 606FBD1BA96 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 15:31:24 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 73148-04 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 11:31:28 -0400 (AST) -Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251]) - by svr1.postgresql.org (Postfix) with ESMTP id D7A47D1B4BD - for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 11:31:22 -0400 (AST) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.11.6) id i0UFUgQ18014; - Fri, 30 Jan 2004 10:30:42 -0500 (EST) -From: Bruce Momjian <pgman@candle.pha.pa.us> -Message-ID: <200401301530.i0UFUgQ18014@candle.pha.pa.us> -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <20040130152418.GB24123@dcc.uchile.cl> -To: Alvaro Herrera <alvherre@dcc.uchile.cl> -Date: Fri, 30 Jan 2004 10:30:42 -0500 (EST) -cc: Tom Lane 
<tgl@sss.pgh.pa.us>, Greg Stark <gsstark@mit.edu>, - pgsql-hackers@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL108 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Alvaro Herrera wrote: -> On Fri, Jan 30, 2004 at 09:48:19AM -0500, Tom Lane wrote: -> -> > A variant is to make the per-page bit arrays be entries in a hash table -> > with page number as hash key. This would reduce insertion to a nearly -> > constant-time operation, but the drawback is that you'd need an explicit -> > sort at the end to put the per-page entries into page number order -> > before you scan 'em. You might come out ahead anyway, not sure. -> -> Is there a reason sort the pages before scanning them? The result won't -> come out sorted one way or the other. - -I think the goal would be to hit the heap in sequential order as much as -possible. When we are doing reading right from the index, we haven't -collected all the heap values in one place, but since we have them in -memory, we might as well sort them, though I don't think that is a -requirement, just a performance enhancement, or at least that is my -guess. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 359-1001 - + If your life is a hard drive, | 13 Roberts Road - + Christ can be your backup. 
| Newtown Square, Pennsylvania 19073 - ----------------------------(end of broadcast)--------------------------- -TIP 8: explain analyze is your friend - -From hannu@tm.ee Fri Jan 30 17:44:13 2004 -Return-path: <hannu@tm.ee> -Received: from fuji.krosing.net (217-159-136-226-dsl.kt.estpak.ee [217.159.136.226]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UMi5e23093 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 17:44:12 -0500 (EST) -Received: from fuji.krosing.net (localhost.localdomain [127.0.0.1]) - by fuji.krosing.net (8.12.8/8.12.8) with ESMTP id i0UMhuEl005243; - Sat, 31 Jan 2004 00:43:57 +0200 -Received: (from hannu@localhost) - by fuji.krosing.net (8.12.8/8.12.8/Submit) id i0UMhs94005241; - Sat, 31 Jan 2004 00:43:54 +0200 -X-Authentication-Warning: fuji.krosing.net: hannu set sender to hannu@tm.ee using -f -Subject: Re: [HACKERS] Question about indexes -From: Hannu Krosing <hannu@tm.ee> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>, - pgsql-hackers@postgresql.org -In-Reply-To: <12965.1075474099@sss.pgh.pa.us> -References: <200401301131.i0UBVHU04169@candle.pha.pa.us> - <12965.1075474099@sss.pgh.pa.us> -Content-Type: text/plain; charset= -Message-ID: <1075502634.4007.32.camel@fuji.krosing.net> -MIME-Version: 1.0 -X-Mailer: Ximian Evolution 1.4.5 -Date: Sat, 31 Jan 2004 00:43:54 +0200 -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id i0UMi5e23093 -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Tom Lane kirjutas R, 30.01.2004 kell 16:48: -> Bruce Momjian <pgman@candle.pha.pa.us> writes: -> > Also, what does an in-memory bitmapped index look like? 
-> -> One idea that might work: a binary search tree in which each node -> represents a single page of the table, and contains a bit array with -> one bit for each possible item number on the page. You could not need -> more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits -> in a node, or about 36 bytes at default BLCKSZ --- for most tables you -> could probably prove it would be a great deal less. You only allocate -> nodes for pages that have at least one interesting row. - -Another idea would be using bitmaps where we have just one bit per -database page and do a seq scan but just over marked pages. - -Even when allocating them in full such indexes would occupy just -1/(8k*8bit) of the amount they describe, so index for 1GB table would be -1G/(8k*8bit) = 16 kilobytes (2 pages) - -Also, such indexes, if persistent, could also be used (together with -FSM) when deciding placement of new tuples, so they provide a form of -clustering. - -This would of course be most useful for data-warehouse type operations, -where database is significantly bigger than memory. - -And the seqscan over bitmap should not be done in simple page order, but -rather in two passes - - 1. over those pages which are already in cache (either PostgreSQL's - or the system's (if we find a way to get such info from the system)) - 2. in sequential order over the rest. - -> I think this would represent a reasonable compromise between size and -> insertion speed. It would only get large if the indexscan output -> demanded visiting many different pages --- but at some point you could -> abandon index usage and do a sequential scan, so I think that property -> is okay. - -One case where almost full intermediate bitmap could be needed is when -doing a star join or just AND of several conditions, where each single -index spans a significant part of the table, but the result does not. - -> A variant is to make the per-page bit arrays be entries in a hash table -> with page number as hash key. 
This would reduce insertion to a nearly -> constant-time operation, but the drawback is that you'd need an explicit -> sort at the end to put the per-page entries into page number order -> before you scan 'em. You might come out ahead anyway, not sure. -> -> Or we could try a true linear bitmap (indexed by page number times -> max-items-per-page plus item number) that's compressed in some fashion, -> probably just by eliminating large runs of zeroes. The difficulty here -> is that inserting a new one-bit could be pretty expensive, and we need -> it to be cheap. -> -> Perhaps someone can come up with other better ideas ... - -I have also contemplated a scenario where we could use some -not-quite-max power-of-2 bits-per-page linear bitmap and mark intra-page -wraps (when we tried to mark a point past that not-quite-max number in a -page) in high bit (or another bitmap) making info for that page folded. -An example would be setting bit 40 in 32-bits/page index - this would -set bit 40&31 and mark the page folded. - -When combining such indexes using AND or OR, we need some special -handling of folded pages, but could still get non-folded (0) results out -from AND of 2 folded pages if the bits are distributed nicely. 
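The folding scheme above can be sketched roughly as follows, in Python for brevity. The `(bits, folded)` pair, the function names, and the rule used for the flag after an AND are assumptions made for illustration, not something the message specifies. The sketch reproduces the example of setting bit 40 in a 32-bits/page bitmap and shows how ANDing two folded pages can still give an empty, non-folded result.

```python
FOLD = 32  # bits kept per page; a power of 2 below the real per-page maximum


def set_item(entry, item):
    """entry is a (bits, folded) pair; return it with `item` marked."""
    bits, folded = entry
    # items past the fold point wrap around and flag the page as folded
    return bits | (1 << (item & (FOLD - 1))), folded or item >= FOLD


def and_pages(a, b):
    """AND two per-page entries (assumed rule for the folded flag)."""
    bits = a[0] & b[0]
    # An empty result is exact: no bit position matched in both inputs.
    # Otherwise the result stays uncertain if either input had wrapped.
    return bits, bits != 0 and (a[1] or b[1])
```

Setting item 40 gives `(1 << 8, True)`, since 40 & 31 is 8. ANDing that with a page where only item 37 is set (bit 5, also folded) yields `(0, False)`, so the page can be skipped without a heap visit.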
- --------------- -Hannu - - - - - - - - - - - - - -From pgsql-hackers-owner+M49529=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 18:10:22 2004 -Return-path: <pgsql-hackers-owner+M49529=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UNAKe25860 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 18:10:21 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 799059 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 15:07:00 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id C2AB7D1CCDD - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 23:03:05 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 46819-09 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 19:03:08 -0400 (AST) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by svr1.postgresql.org (Postfix) with ESMTP id AD55DD1C967 - for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 19:03:04 -0400 (AST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0UN2wBL020777; - Fri, 30 Jan 2004 18:02:58 -0500 (EST) -To: Hannu Krosing <hannu@tm.ee> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>, - pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <1075502634.4007.32.camel@fuji.krosing.net> -References: <200401301131.i0UBVHU04169@candle.pha.pa.us> <12965.1075474099@sss.pgh.pa.us> <1075502634.4007.32.camel@fuji.krosing.net> -Comments: In-reply-to Hannu Krosing <hannu@tm.ee> - message dated "Sat, 31 Jan 2004 
00:43:54 +0200" -Date: Fri, 30 Jan 2004 18:02:58 -0500 -Message-ID: <20776.1075503778@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=no - version=2.61 -Status: OR - -Hannu Krosing <hannu@tm.ee> writes: -> Another idea would be using bitmaps where we have just one bit per -> database page and do a seq scan but just over marked pages. - -That seems a bit too lossy for me, but I really like your later idea -about folding. Generalizing that a little, we can choose any fold point -we like. We could allocate, say, one 32-bit word per page and set the -(i mod 32) bit when item i is fingered by the index. After retrieving -the heap page, we'd need to test all the valid rows that have item -numbers matching a set bit mod 32. On typical tables (with circa 100 -items per page) this would require testing only about 3 rows per page. -ORing and ANDing of such bitmaps still works, with the understanding -that it's lossy and you have to double check each retrieved tuple. - -If the fold point is above about 100, your idea of keeping track of -whether we actually set any wrapped-around bits would become useful, -but below that I think we'd just be wasting a bit. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- - http://www.postgresql.org/docs/faqs/FAQ.html - -From hannu@tm.ee Fri Jan 30 18:21:59 2004 -Return-path: <hannu@tm.ee> -Received: from fuji.krosing.net (217-159-136-226-dsl.kt.estpak.ee [217.159.136.226]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UNLue27301 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 18:21:57 -0500 (EST) -Received: from fuji.krosing.net (localhost.localdomain [127.0.0.1]) - by fuji.krosing.net (8.12.8/8.12.8) with ESMTP id i0UNLpEl006023; - Sat, 31 Jan 2004 01:21:51 +0200 -Received: (from hannu@localhost) - by fuji.krosing.net (8.12.8/8.12.8/Submit) id i0UNLgx1006021; - Sat, 31 Jan 2004 01:21:42 +0200 -X-Authentication-Warning: fuji.krosing.net: hannu set sender to hannu@tm.ee using -f -Subject: Re: [HACKERS] Question about indexes -From: Hannu Krosing <hannu@tm.ee> -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>, - pgsql-hackers@postgresql.org -In-Reply-To: <20776.1075503778@sss.pgh.pa.us> -References: <200401301131.i0UBVHU04169@candle.pha.pa.us> - <12965.1075474099@sss.pgh.pa.us> - <1075502634.4007.32.camel@fuji.krosing.net> - <20776.1075503778@sss.pgh.pa.us> -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -Message-ID: <1075504902.4007.43.camel@fuji.krosing.net> -MIME-Version: 1.0 -X-Mailer: Ximian Evolution 1.4.5 -Date: Sat, 31 Jan 2004 01:21:42 +0200 -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - -Tom Lane kirjutas L, 31.01.2004 kell 01:02: -> Hannu Krosing <hannu@tm.ee> writes: -> > Another idea would be using 
bitmaps where we have just one bit per -> > database page and do a seq scan but just over marked pages. -> -> That seems a bit too lossy for me, - -I originally thought of it in context of data-warehousing and persistent -bitmap indexes. there the use of these same bitmaps for clustering would -un-lossify this approach. - -> but I really like your later idea -> about folding. Generalizing that a little, we can choose any fold point -> we like. We could allocate, say, one 32-bit word per page and set the -> (i mod 32) bit when item i is fingered by the index. After retrieving -> the heap page, we'd need to test all the valid rows that have item -> numbers matching a set bit mod 32. On typical tables (with circa 100 -> items per page) this would require testing only about 3 rows per page. -> ORing and ANDing of such bitmaps still works, with the understanding -> that it's lossy and you have to double check each retrieved tuple. -> -> If the fold point is above about 100, your idea of keeping track of -> whether we actually set any wrapped-around bits would become useful, -> but below that I think we'd just be wasting a bit. - -Not only wasting bits, but also making the code hairier - we can't just -do simple ANDs and ORs. 
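For contrast, the simpler lossy scheme from the quoted text (one 32-bit word per page, no wrap flag) might look like the sketch below, in Python for brevity. The function names and the `qual` callback are hypothetical. ANDing and ORing stay plain word operations; the price is that every candidate row has to be rechecked against the original condition after the heap page is fetched.

```python
FOLD = 32  # one 32-bit word per heap page, as in the quoted proposal


def mark(word, item):
    """Set the (item mod 32) bit for an item fingered by the index."""
    return word | (1 << (item % FOLD))


def candidates(word, items_on_page):
    """Item numbers whose folded bit is set; may include false positives."""
    return [i for i in range(items_on_page) if word >> (i % FOLD) & 1]


def fetch(word, page_rows, qual):
    """Recheck each candidate row, because the bitmap is lossy."""
    return [i for i in candidates(word, len(page_rows)) if qual(page_rows[i])]
```

With 100 items on a page and only item 40 marked, `candidates` returns items 8, 40 and 72, which matches the "about 3 rows per page" estimate; the recheck in `fetch` then discards 8 and 72.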
- --------------- -Hannu - -From gsstark@mit.edu Fri Jan 30 19:04:21 2004 -Return-path: <gsstark@mit.edu> -Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0V04De01505 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 19:04:21 -0500 (EST) -Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162]) - by smtp.istop.com (Postfix) with ESMTP - id 7CC2436E2F; Fri, 30 Jan 2004 19:04:04 -0500 (EST) -Received: from localhost ([127.0.0.1] helo=stark.xeocode.com) - by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian)) - id 1AmicG-0007zf-00; Fri, 30 Jan 2004 19:04:04 -0500 -Sender: gsstark@mit.edu -To: Tom Lane <tgl@sss.pgh.pa.us> -cc: Hannu Krosing <hannu@tm.ee>, Bruce Momjian <pgman@candle.pha.pa.us>, - Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -References: <200401301131.i0UBVHU04169@candle.pha.pa.us> - <12965.1075474099@sss.pgh.pa.us> - <1075502634.4007.32.camel@fuji.krosing.net> - <20776.1075503778@sss.pgh.pa.us> -In-Reply-To: <20776.1075503778@sss.pgh.pa.us> -From: Greg Stark <gsstark@mit.edu> -Organization: The Emacs Conspiracy; member since 1992 -Date: 30 Jan 2004 19:04:03 -0500 -Message-ID: <87wu79vv9o.fsf@stark.xeocode.com> -Lines: 21 -User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham - version=2.61 -Status: OR - - -Tom Lane <tgl@sss.pgh.pa.us> writes: - -> That seems a bit too lossy for me, but I really like your later idea -> about folding. Generalizing that a little, we can choose any fold point -> we like. We could allocate, say, one 32-bit word per page and set the -> (i mod 32) bit when item i is fingered by the index. 
After retrieving -> the heap page, we'd need to test all the valid rows that have item -> numbers matching a set bit mod 32. On typical tables (with circa 100 -> items per page) this would require testing only about 3 rows per page. -> ORing and ANDing of such bitmaps still works, with the understanding -> that it's lossy and you have to double check each retrieved tuple. - -That would make it really hard to ever clear the bits. What do you do when you -vacuum and one of the tuples is no longer needed. You can't be sure you can -clear the bit in the index because there could be multiple tuples represented -by the bit being set. You would have to test the condition on the other tuples -covered by the bit to see if it can be cleared. - --- -greg - -From pgsql-hackers-owner+M49533=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 19:56:45 2004 -Return-path: <pgsql-hackers-owner+M49533=pgman=candle.pha.pa.us@postgresql.org> -Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86]) - by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0V0uhe05716 - for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 19:56:44 -0500 (EST) -Received: from postgresql.org ([200.46.204.71] verified) - by joeconway.com (CommuniGate Pro SMTP 4.1.8) - with ESMTP id 799253 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 16:53:23 -0800 -X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org -Received: from localhost (unknown [200.46.204.2]) - by svr1.postgresql.org (Postfix) with ESMTP id B7F53D1CC9B - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Sat, 31 Jan 2004 00:50:25 +0000 (GMT) -Received: from svr1.postgresql.org ([200.46.204.71]) - by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024) - with ESMTP id 76472-01 - for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; - Fri, 30 Jan 2004 20:50:28 -0400 (AST) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by svr1.postgresql.org (Postfix) with ESMTP id 0A06FD1CB1D - for 
<pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 20:50:25 -0400 (AST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0V0oN9U023293; - Fri, 30 Jan 2004 19:50:24 -0500 (EST) -To: Greg Stark <gsstark@mit.edu> -cc: Hannu Krosing <hannu@tm.ee>, Bruce Momjian <pgman@candle.pha.pa.us>, - pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Question about indexes -In-Reply-To: <87wu79vv9o.fsf@stark.xeocode.com> -References: <200401301131.i0UBVHU04169@candle.pha.pa.us> <12965.1075474099@sss.pgh.pa.us> <1075502634.4007.32.camel@fuji.krosing.net> <20776.1075503778@sss.pgh.pa.us> <87wu79vv9o.fsf@stark.xeocode.com> -Comments: In-reply-to Greg Stark <gsstark@mit.edu> - message dated "30 Jan 2004 19:04:03 -0500" -Date: Fri, 30 Jan 2004 19:50:23 -0500 -Message-ID: <23292.1075510223@sss.pgh.pa.us> -From: Tom Lane <tgl@sss.pgh.pa.us> -X-Virus-Scanned: by amavisd-new at postgresql.org -X-Mailing-List: pgsql-hackers -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on - candle.pha.pa.us -X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=no - version=2.61 -Status: OR - -Greg Stark <gsstark@mit.edu> writes: -> Tom Lane <tgl@sss.pgh.pa.us> writes: ->> ORing and ANDing of such bitmaps still works, with the understanding ->> that it's lossy and you have to double check each retrieved tuple. - -> That would make it really hard to ever clear the bits. - -We're speaking of in-memory bitmaps constructed on-the-fly here. You're -right that it wouldn't work for persistent indexes, but I'm not very -interested in that case at the moment ... - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 8: explain analyze is your friend - -- GitLab