diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml index 24517402bccec22a7c9f5b98d722812239320141..c2bbad42bfaa2ba7960e91643503981b5fbce0c9 100644 --- a/doc/src/sgml/gin.sgml +++ b/doc/src/sgml/gin.sgml @@ -1,4 +1,4 @@ -<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.6 2006/11/30 20:50:44 petere Exp $ --> +<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.7 2006/12/01 23:46:46 tgl Exp $ --> <chapter id="GIN"> <title>GIN Indexes</title> @@ -14,8 +14,9 @@ <para> <acronym>GIN</acronym> stands for Generalized Inverted Index. It is an index structure storing a set of (key, posting list) pairs, where - 'posting list' is a set of rows in which the key occurs. Each - row may contain many keys. + a <quote>posting list</> is a set of rows in which the key occurs. Each + indexed value may contain many keys, so the same row ID may appear in + multiple posting lists. </para> <para> @@ -45,7 +46,7 @@ <para> The <acronym>GIN</acronym> interface has a high level of abstraction, - requiring the access method implementer to only implement the semantics of + requiring the access method implementer only to implement the semantics of the data type being accessed. The <acronym>GIN</acronym> layer itself takes care of concurrency, logging and searching the tree structure. </para> @@ -53,26 +54,14 @@ <para> All it takes to get a <acronym>GIN</acronym> access method working is to implement four user-defined methods, which define the behavior of - keys in the tree. In short, <acronym>GIN</acronym> combines extensibility - along with generality, code reuse, and a clean interface. - </para> - -</sect1> - -<sect1 id="gin-implementation"> - <title>Implementation</title> - - <para> - Internally, <acronym>GIN</acronym> consists of a B-tree index constructed - over keys, where each key is an element of the indexed value - (element of array, for example) and where each tuple in a leaf page is - either a pointer to a B-tree over heap pointers (PT, posting tree), or a - list of heap pointers (PL, posting list) if the tuple is small enough. + keys in the tree and the relationships between keys, indexed values, + and indexable queries. In short, <acronym>GIN</acronym> combines + extensibility with generality, code reuse, and a clean interface. </para> <para> - There are four methods that an index operator class for - <acronym>GIN</acronym> must provide (prototypes are in pseudocode): + The four methods that an index operator class for + <acronym>GIN</acronym> must provide are: </para> <variablelist> @@ -80,9 +69,9 @@ <term>int compare(Datum a, Datum b)</term> <listitem> <para> - Compares keys (not indexed values!) and returns an integer less than - zero, zero, or greater than zero, indicating whether the first key is - less than, equal to, or greater than the second. + Compares keys (not indexed values!) and returns an integer less than + zero, zero, or greater than zero, indicating whether the first key is + less than, equal to, or greater than the second. </para> </listitem> </varlistentry> @@ -91,21 +80,26 @@ <term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term> <listitem> <para> - Returns an array of keys of value to be indexed, nkeys should - contain the number of returned keys. + Returns an array of keys given a value to be indexed. The + number of returned keys must be stored into <literal>*nkeys</>. </para> </listitem> </varlistentry> <varlistentry> - <term>Datum* extractQuery(Datum query, uint32 nkeys, - StrategyNumber n)</term> + <term>Datum* extractQuery(Datum query, uint32 *nkeys, + StrategyNumber n)</term> <listitem> <para> - Returns an array of keys of the query to be executed. n contains the - strategy number of the operation (see <xref - linkend="xindex-strategies">). Depending on n, query may be - different type. + Returns an array of keys given a value to be queried; that is, + <literal>query</> is the value on the right-hand side of an + indexable operator whose left-hand side is the indexed column. + <literal>n</> is the strategy number of the operator within the + operator class (see <xref linkend="xindex-strategies">). + Often, <function>extractQuery</> will need + to consult <literal>n</> to determine the data type of + <literal>query</> and the key values that need to be extracted. + The number of returned keys must be stored into <literal>*nkeys</>. </para> </listitem> </varlistentry> @@ -114,11 +108,16 @@ <term>bool consistent(bool check[], StrategyNumber n, Datum query)</term> <listitem> <para> - Returns TRUE if the indexed value satisfies the query qualifier with - strategy n (or may satisfy in case of RECHECK mark in operator class). - Each element of the check array is TRUE if the indexed value has a - corresponding key in the query: if (check[i] == TRUE) the i-th key of - the query is present in the indexed value. + Returns TRUE if the indexed value satisfies the query operator with + strategy number <literal>n</> (or may satisfy, if the operator is + marked RECHECK in the operator class). The <literal>check</> array has + the same length as the number of keys previously returned by + <function>extractQuery</> for this query. Each element of the + <literal>check</> array is TRUE if the indexed value contains the + corresponding query key, ie, if (check[i] == TRUE) the i-th key of the + <function>extractQuery</> result array is present in the indexed value. + The original <literal>query</> datum (not the extracted key array!) is + passed in case the <function>consistent</> method needs to consult it. </para> </listitem> </varlistentry> @@ -127,6 +126,19 @@ </sect1> +<sect1 id="gin-implementation"> + <title>Implementation</title> + + <para> + Internally, a <acronym>GIN</acronym> index contains a B-tree index + constructed over keys, where each key is an element of the indexed value + (a member of an array, for example) and where each tuple in a leaf page is + either a pointer to a B-tree over heap pointers (PT, posting tree), or a + list of heap pointers (PL, posting list) if the list is small enough. + </para> + +</sect1> + <sect1 id="gin-tips"> <title>GIN tips and tricks</title> @@ -134,44 +146,43 @@ <varlistentry> <term>Create vs insert</term> <listitem> - <para> - In most cases, insertion into a <acronym>GIN</acronym> index is slow - due to the likelihood of many keys being inserted for each value. - So, for bulk insertions into a table it is advisable to to drop the GIN - index and recreate it after finishing bulk insertion. - </para> + <para> + In most cases, insertion into a <acronym>GIN</acronym> index is slow + due to the likelihood of many keys being inserted for each value. + So, for bulk insertions into a table it is advisable to drop the GIN + index and recreate it after finishing bulk insertion. + </para> </listitem> </varlistentry> <varlistentry> - <term>gin_fuzzy_search_limit</term> + <term><xref linkend="guc-gin-fuzzy-search-limit"></term> <listitem> - <para> - The primary goal of developing <acronym>GIN</acronym> indices was - support for highly scalable, full-text search in - <productname>PostgreSQL</productname> and there are often situations when - a full-text search returns a very large set of results. Since reading - tuples from the disk and sorting them could take a lot of time, this is - unacceptable for production. (Note that the index search itself is very - fast.) + <para> + The primary goal of developing <acronym>GIN</acronym> indexes was + to create support for highly scalable, full-text search in + <productname>PostgreSQL</productname>, and there are often situations when + a full-text search returns a very large set of results. Moreover, this + often happens when the query contains very frequent words, so that the + large result set is not even useful. Since reading many + tuples from the disk and sorting them could take a lot of time, this is + unacceptable for production. (Note that the index search itself is very + fast.) + </para> + <para> + To facilitate controlled execution of such queries + <acronym>GIN</acronym> has a configurable soft upper limit on the size + of the returned set, the + <varname>gin_fuzzy_search_limit</varname> configuration parameter. + It is set to 0 (meaning no limit) by default. + If a non-zero limit is set, then the returned set is a subset of + the whole result set, chosen at random. + </para> + <para> + <quote>Soft</quote> means that the actual number of returned results + could differ slightly from the specified limit, depending on the query + and the quality of the system's random number generator. </para> - <para> - Such queries usually contain very frequent words, so the results are not - very helpful. To facilitate execution of such queries - <acronym>GIN</acronym> has a configurable soft upper limit of the size - of the returned set, determined by the - <varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by - default (no limit). - </para> - <para> - If a non-zero search limit is set, then the returned set is a subset of - the whole result set, chosen at random. - </para> - <para> - <quote>Soft</quote> means that the actual number of returned results - could slightly differ from the specified limit, depending on the query - and the quality of the system's random number generator. - </para> </listitem> </varlistentry> </variablelist> @@ -182,21 +193,30 @@ <title>Limitations</title> <para> - <acronym>GIN</acronym> doesn't support full index scans due to their - extreme inefficiency: because there are often many keys per value, - each heap pointer will be returned several times. + <acronym>GIN</acronym> doesn't support full index scans: because there are + often many keys per value, each heap pointer would be returned many times, + and there is no easy way to prevent this. </para> <para> When <function>extractQuery</function> returns zero keys, - <acronym>GIN</acronym> will emit an error: for different opclasses and - strategies the semantic meaning of a void query may be different (for - example, any array contains the void array, but they don't overlap the - void array), and <acronym>GIN</acronym> can't suggest a reasonable answer. + <acronym>GIN</acronym> will emit an error. Depending on the operator, + a void query might match all, some, or none of the indexed values (for + example, every array contains the empty array, but does not overlap the + empty array), and <acronym>GIN</acronym> can't determine the correct + answer, nor produce a full-index-scan result if it could determine that + that was correct. </para> <para> - <acronym>GIN</acronym> searches keys only by equality matching. This may + It is not an error for <function>extractValue</> to return zero keys, + but in this case the indexed value will be unrepresented in the index. + This is another reason why full index scan is not useful — it would + miss such rows. + </para> + + <para> + <acronym>GIN</acronym> searches keys only by equality matching. This may be improved in future. </para> </sect1> @@ -206,12 +226,12 @@ <para> The <productname>PostgreSQL</productname> source distribution includes - <acronym>GIN</acronym> classes for one-dimensional arrays of all internal + <acronym>GIN</acronym> classes for one-dimensional arrays of all internal types. The following <filename>contrib</> modules also contain <acronym>GIN</acronym> - operator classes: + operator classes: </para> - + <variablelist> <varlistentry> <term>intarray</term> diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index fd8bb7251edb3015906ec8c912ddaa998d31c2ac..1edceebd2d5f7619b7890aa13493d2491a747b94 100644 --- a/doc/src/sgml/indices.sgml +++ b/doc/src/sgml/indices.sgml @@ -1,4 +1,4 @@ -<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.65 2006/10/23 18:10:31 petere Exp $ --> +<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.66 2006/12/01 23:46:46 tgl Exp $ --> <chapter id="indexes"> <title id="indexes-title">Indexes</title> @@ -116,7 +116,7 @@ CREATE INDEX test1_id_index ON test1 (id); <para> <productname>PostgreSQL</productname> provides several index types: - B-tree, Hash, GIN and GiST. Each index type uses a different + B-tree, Hash, GiST and GIN. Each index type uses a different algorithm that is best suited to different types of queries. By default, the <command>CREATE INDEX</command> command will create a B-tree index, which fits the most common situations. @@ -247,8 +247,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> <primary>GIN</primary> <see>index</see> </indexterm> - GIN is a inverted index and it's usable for values which have more - than one key, arrays for example. Like GiST, GIN may support + GIN indexes are inverted indexes which can handle values that contain more + than one key, arrays for example. Like GiST, GIN can support many different user-defined indexing strategies and the particular operators with which a GIN index can be used vary depending on the indexing strategy. @@ -267,7 +267,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (See <xref linkend="functions-array"> for the meaning of these operators.) Other GIN operator classes are available in the <literal>contrib</> - <literal>tsearch2</literal> and <literal>intarray</literal> modules. For more information see <xref linkend="GIN">. + <literal>tsearch2</literal> and <literal>intarray</literal> modules. + For more information see <xref linkend="GIN">. </para> </sect1> diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml index 444839399e1ebb122a38d8dae7de272c8b5abdb0..a66dd3c4ee402020820cbf5d415c22b7d0a0af09 100644 --- a/doc/src/sgml/xindex.sgml +++ b/doc/src/sgml/xindex.sgml @@ -1,4 +1,4 @@ -<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.52 2006/10/23 18:10:32 petere Exp $ --> +<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.53 2006/12/01 23:46:46 tgl Exp $ --> <sect1 id="xindex"> <title>Interfacing Extensions To Indexes</title> @@ -243,15 +243,16 @@ </table> <para> - GIN indexes are similar to GiST's in flexibility: they don't have a fixed - et of strategies. Instead, the <quote>consistency</> support routine - interprets the strategy numbers accordingly with operator class - definition. As an example, strategies of operator class over arrays - is shown in <xref linkend="xindex-gin-array-strat-table">. + GIN indexes are similar to GiST indexes in flexibility: they don't have a + fixed set of strategies. Instead the support routines of each operator + class interpret the strategy numbers according to the operator class's + definition. As an example, the strategy numbers used by the built-in + operator classes for arrays are + shown in <xref linkend="xindex-gin-array-strat-table">. </para> <table tocentry="1" id="xindex-gin-array-strat-table"> - <title>GIN Array's Strategies</title> + <title>GIN Array Strategies</title> <tgroup cols="2"> <thead> <row> @@ -388,36 +389,35 @@ <tbody> <row> <entry>consistent - determine whether key satisfies the - query qualifier</entry> + query qualifier</entry> <entry>1</entry> </row> <row> - <entry>union - compute union of of a set of given keys</entry> + <entry>union - compute union of a set of keys</entry> <entry>2</entry> </row> <row> - <entry>compress - computes a compressed representation of a key or value - to be indexed</entry> + <entry>compress - compute a compressed representation of a key or value + to be indexed</entry> <entry>3</entry> </row> <row> - <entry>decompress - computes a decompressed representation of a - compressed key </entry> + <entry>decompress - compute a decompressed representation of a + compressed key</entry> <entry>4</entry> </row> <row> <entry>penalty - compute penalty for inserting new key into subtree - with given subtree's key</entry> + with given subtree's key</entry> <entry>5</entry> </row> <row> <entry>picksplit - determine which entries of a page are to be moved - to the new page and compute the union keys for resulting pages </entry> + to the new page and compute the union keys for resulting pages</entry> <entry>6</entry> </row> <row> - <entry>equal - compare two keys and returns true if they are equal - </entry> + <entry>equal - compare two keys and return true if they are equal</entry> <entry>7</entry> </row> </tbody> @@ -441,23 +441,22 @@ <tbody> <row> <entry> - compare - Compare two keys and return an integer less than zero, zero, or - greater than zero, indicating whether the first key is less than, equal to, - or greater than the second. - </entry> + compare - compare two keys and return an integer less than zero, zero, + or greater than zero, indicating whether the first key is less than, + equal to, or greater than the second + </entry> <entry>1</entry> </row> <row> - <entry>extractValue - extract keys from value to be indexed</entry> + <entry>extractValue - extract keys from a value to be indexed</entry> <entry>2</entry> </row> <row> - <entry>extractQuery - extract keys from query</entry> + <entry>extractQuery - extract keys from a query condition</entry> <entry>3</entry> </row> <row> - <entry>consistent - determine whether value matches by the - query</entry> + <entry>consistent - determine whether value matches query condition</entry> <entry>4</entry> </row> </tbody> @@ -822,12 +821,16 @@ CREATE OPERATOR CLASS polygon_ops STORAGE box; </programlisting> - At present, only the GiST and GIN index method supports a + At present, only the GiST and GIN index methods support a <literal>STORAGE</> type that's different from the column data type. - The GiST <literal>compress</> and <literal>decompress</> support + The GiST <function>compress</> and <function>decompress</> support routines must deal with data-type conversion when <literal>STORAGE</> - is used. Functions named <literal>extractValue</> and <literal>extractQuery</> - do conversation into internally used types for GIN. + is used. In GIN, the <literal>STORAGE</> type identifies the type of + the <quote>key</> values, which normally is different from the type + of the indexed column — for example, an operator class for + integer array columns might have keys that are just integers. The + GIN <function>extractValue</> and <function>extractQuery</> support + routines are responsible for extracting keys from indexed values. </para> </sect2>