diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml
index c861a656e904fcbf5dcbf53a4929613e85598a46..7ee1ba357f09ffc47930393de5b2fe76c27dc4c2 100644
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.29 2007/01/31 20:56:16 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.30 2007/07/21 04:02:41 tgl Exp $ -->
 <chapter id="overview">
  <title>Overview of PostgreSQL Internals</title>
@@ -345,9 +345,10 @@
  can be executed would take an excessive amount of time and memory space. In particular, this occurs when executing queries involving large numbers of join operations. In order to determine
- a reasonable (not optimal) query plan in a reasonable amount of
- time, <productname>PostgreSQL</productname> uses a <xref
- linkend="geqo" endterm="geqo-title">.
+ a reasonable (not necessarily optimal) query plan in a reasonable amount
+ of time, <productname>PostgreSQL</productname> uses a <xref
+ linkend="geqo" endterm="geqo-title"> when the number of joins
+ exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
  </para>
  </note>
@@ -380,20 +381,17 @@
  the index's <firstterm>operator class</>, another plan is created using the B-tree index to scan the relation. If there are further indexes present and the restrictions in the query happen to match a key of an
- index further plans will be considered.
+ index, further plans will be considered. Index scan plans are also
+ generated for indexes that have a sort ordering that can match the
+ query's <literal>ORDER BY</> clause (if any), or a sort ordering that
+ might be useful for merge joining (see below).
  </para>
  <para>
- After all feasible plans have been found for scanning single relations,
- plans for joining relations are created. The planner/optimizer
- preferentially considers joins between any two relations for which there
- exist a corresponding join clause in the <literal>WHERE</literal> qualification (i.e. for
- which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
- exists). Join pairs with no join clause are considered only when there
- is no other choice, that is, a particular relation has no available
- join clauses to any other relation. All possible plans are generated for
- every join pair considered
- by the planner/optimizer. The three possible join strategies are:
+ If the query requires joining two or more relations,
+ plans for joining relations are considered
+ after all feasible plans have been found for scanning single relations.
+ The three available join strategies are:
  <itemizedlist>
  <listitem>
@@ -439,6 +437,26 @@
  cheapest one.
  </para>
+ <para>
+ If the query uses fewer than <xref linkend="guc-geqo-threshold">
+ relations, a near-exhaustive search is conducted to find the best
+ join sequence. The planner preferentially considers joins between any
+ two relations for which there exist a corresponding join clause in the
+ <literal>WHERE</literal> qualification (i.e. for
+ which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
+ exists). Join pairs with no join clause are considered only when there
+ is no other choice, that is, a particular relation has no available
+ join clauses to any other relation. All possible plans are generated for
+ every join pair considered by the planner, and the one that is
+ (estimated to be) the cheapest is chosen.
+ </para>
+
+ <para>
+ When <varname>geqo_threshold</varname> is exceeded, the join
+ sequences considered are determined by heuristics, as described
+ in <xref linkend="geqo">. Otherwise the process is the same.
+ </para>
+
  <para>
  The finished plan tree consists of sequential or index scans of the base relations, plus nested-loop, merge, or hash join nodes as
diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml
index 6225dc4c3219ac87cbc8443bff86b44e11f42336..2f680762c13bb45c3b85bbba5c43011de112eb4b 100644
--- a/doc/src/sgml/geqo.sgml
+++ b/doc/src/sgml/geqo.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.39 2007/02/16 03:50:29 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/geqo.sgml,v 1.40 2007/07/21 04:02:41 tgl Exp $ -->
 <chapter id="geqo">
  <chapterinfo>
@@ -186,11 +186,6 @@
  <productname>PostgreSQL</productname> optimizer.
  </para>
- <para>
- Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's Genitor
- algorithm.
- </para>
-
  <para>
  Specific characteristics of the <acronym>GEQO</acronym> implementation in <productname>PostgreSQL</productname>
@@ -224,6 +219,11 @@
  </itemizedlist>
  </para>
+ <para>
+ Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
+ Genitor algorithm.
+ </para>
+
  <para>
  The <acronym>GEQO</acronym> module allows the <productname>PostgreSQL</productname> query optimizer to
@@ -231,6 +231,42 @@
  non-exhaustive search.
  </para>
+ <sect2>
+ <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
+
+ <para>
+ The <acronym>GEQO</acronym> planning process uses the standard planner
+ code to generate plans for scans of individual relations. Then join
+ plans are developed using the genetic approach. As shown above, each
+ candidate join plan is represented by a sequence in which to join
+ the base relations. In the initial stage, the <acronym>GEQO</acronym>
+ code simply generates some possible join sequences at random. For each
+ join sequence considered, the standard planner code is invoked to
+ estimate the cost of performing the query using that join sequence.
+ (For each step of the join sequence, all three possible join strategies
+ are considered; and all the initially-determined relation scan plans
+ are available. The estimated cost is the cheapest of these
+ possibilities.) Join sequences with lower estimated cost are considered
+ <quote>more fit</> than those with higher cost. The genetic algorithm
+ discards the least fit candidates. Then new candidates are generated
+ by combining genes of more-fit candidates — that is, by using
+ randomly-chosen portions of known low-cost join sequences to create
+ new sequences for consideration. This process is repeated until a
+ preset number of join sequences have been considered; then the best
+ one found at any time during the search is used to generate the finished
+ plan.
+ </para>
+
+ <para>
+ This process is inherently nondeterministic, because of the randomized
+ choices made during both the initial population selection and subsequent
+ <quote>mutation</> of the best candidates. Hence different plans may
+ be selected from one run to the next, resulting in varying run time
+ and varying output row order.
+ </para>
+
+ </sect2>
+
  <sect2 id="geqo-future">
  <title>Future Implementation Tasks for <productname>PostgreSQL</> <acronym>GEQO</acronym></title>
@@ -257,6 +293,16 @@
  </itemizedlist>
  </para>
+ <para>
+ In the current implementation, the fitness of each candidate join
+ sequence is estimated by running the standard planner's join selection
+ and cost estimation code from scratch. To the extent that different
+ candidates use similar sub-sequences of joins, a great deal of work
+ will be repeated. This could be made significantly faster by retaining
+ cost estimates for sub-joins. The problem is to avoid expending
+ unreasonable amounts of memory on retaining that state.
+ </para>
+
  <para>
  At a more basic level, it is not clear that solving query optimization with a GA algorithm designed for TSP is appropriate. In the TSP case,