From 7b9fe96812982b53b01e54e63d773c5454f5f199 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Sun, 17 Dec 2000 05:55:26 +0000 Subject: [PATCH] Update type-coercion discussions to reflect current reality. --- doc/src/sgml/func.sgml | 9 +- doc/src/sgml/typeconv.sgml | 338 ++++++++++++++++++++++--------------- 2 files changed, 204 insertions(+), 143 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index fdd1c34a3e1..0d59899f801 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1,4 +1,4 @@ -<!-- $Header: /cvsroot/pgsql/doc/src/sgml/func.sgml,v 1.43 2000/12/16 19:33:23 tgl Exp $ --> +<!-- $Header: /cvsroot/pgsql/doc/src/sgml/func.sgml,v 1.44 2000/12/17 05:55:26 tgl Exp $ --> <chapter id="functions"> <title>Functions and Operators</title> @@ -770,7 +770,7 @@ <para> There are two separate approaches to pattern matching provided by - <productname>Postgres</productname>: The <acronym>SQL</acronym> + <productname>Postgres</productname>: the <acronym>SQL</acronym> <function>LIKE</function> operator and <acronym>POSIX</acronym>-style regular expressions. </para> @@ -2562,8 +2562,9 @@ END </informalexample> <para> - The data types of all possible <replaceable>result</replaceable> - expressions must match. + The data types of all the <replaceable>result</replaceable> + expressions must be coercible to a single output type. + See <xref linkend="typeconv-union-case"> for more detail. </para> <synopsis> diff --git a/doc/src/sgml/typeconv.sgml b/doc/src/sgml/typeconv.sgml index 5169999f980..cac31d8dfba 100644 --- a/doc/src/sgml/typeconv.sgml +++ b/doc/src/sgml/typeconv.sgml @@ -12,16 +12,17 @@ evaluating mixed-type expressions. In many cases a user will not need to understand the details of the type conversion mechanism. However, the implicit conversions done by <productname>Postgres</productname> -can affect the apparent results of a query, and these results +can affect the results of a query. When necessary, these results can be tailored by a user or programmer using <emphasis>explicit</emphasis> type coercion. </para> <para> This chapter introduces the <productname>Postgres</productname> - type conversion mechanisms and conventions. +type conversion mechanisms and conventions. Refer to the relevant sections in the User's Guide and Programmer's Guide -for more information on specific data types and allowed functions and operators. +for more information on specific data types and allowed functions and +operators. </para> <para> @@ -43,12 +44,13 @@ mixed-type expressions to be meaningful, even with user-defined types. </para> <para> -The <productname>Postgres</productname> scanner/parser decodes lexical elements -into only five fundamental categories: integers, floats, strings, names, and keywords. -Most extended types are first tokenized into strings. The <acronym>SQL</acronym> -language definition allows specifying type names with strings, and this mechanism -is used by <productname>Postgres</productname> -to start the parser down the correct path. For example, the query +The <productname>Postgres</productname> scanner/parser decodes lexical +elements into only five fundamental categories: integers, floats, strings, +names, and keywords. Most extended types are first tokenized into +strings. The <acronym>SQL</acronym> language definition allows specifying type +names with strings, and this mechanism can be used in +<productname>Postgres</productname> to start the parser down the correct +path. For example, the query <programlisting> tgl=> SELECT text 'Origin' AS "Label", point '(0,0)' AS "Value"; @@ -59,8 +61,9 @@ tgl=> SELECT text 'Origin' AS "Label", point '(0,0)' AS "Value"; </programlisting> has two strings, of type <type>text</type> and <type>point</type>. -If a type is not specified, then the placeholder type <type>unknown</type> -is assigned initially, to be resolved in later stages as described below. +If a type is not specified for a string, then the placeholder type +<firstterm>unknown</firstterm> is assigned initially, to be resolved in later +stages as described below. </para> <para> @@ -88,9 +91,13 @@ Function calls </term> <listitem> <para> -Much of the <productname>Postgres</productname> type system is built around a rich set of -functions. Function calls have one or more arguments which, for any specific query, -must be matched to the functions available in the system catalog. +Much of the <productname>Postgres</productname> type system is built around a +rich set of functions. Function calls have one or more arguments which, for +any specific query, must be matched to the functions available in the system +catalog. Since <productname>Postgres</productname> permits function +overloading, the function name alone does not uniquely identify the function +to be called --- the parser must select the right function based on the data +types of the supplied arguments. </para> </listitem> </varlistentry> @@ -100,19 +107,23 @@ Query targets </term> <listitem> <para> -<acronym>SQL</acronym> INSERT statements place the results of query into a table. The expressions -in the query must be matched up with, and perhaps converted to, the target columns of the insert. +<acronym>SQL</acronym> INSERT and UPDATE statements place the results of +expressions into a table. The expressions in the query must be matched up +with, and perhaps converted to, the types of the target columns. </para> </listitem> </varlistentry> <varlistentry> <term> -UNION queries +UNION and CASE constructs </term> <listitem> <para> -Since all select results from a UNION SELECT statement must appear in a single set of columns, the types +Since all select results from a UNION SELECT statement must appear in a single +set of columns, the types of the results of each SELECT clause must be matched up and converted to a uniform set. +Similarly, the result expressions of a CASE construct must be coerced to +a common type so that the CASE expression as a whole has a known output type. </para> </listitem> </varlistentry> @@ -129,7 +140,7 @@ conventions for the <acronym>SQL92</acronym> standard native types such as <para> The <productname>Postgres</productname> parser uses the convention that all type conversion functions take a single argument of the source type and are -named with the same name as the target type. Any function meeting this +named with the same name as the target type. Any function meeting these criteria is considered to be a valid conversion function, and may be used by the parser as such. This simple assumption gives the parser the power to explore type conversion possibilities without hardcoding, allowing @@ -139,19 +150,16 @@ extended user-defined types to use these same features transparently. <para> An additional heuristic is provided in the parser to allow better guesses at proper behavior for <acronym>SQL</acronym> standard types. There are -five categories of types defined: boolean, string, numeric, geometric, +several basic <firstterm>type categories</firstterm> defined: boolean, +numeric, string, bitstring, datetime, timespan, geometric, network, and user-defined. Each category, with the exception of user-defined, has -a "preferred type" which is used to resolve ambiguities in candidates. -Each "user-defined" type is its own "preferred type", so ambiguous -expressions (those with multiple candidate parsing solutions) -with only one user-defined type can resolve to a single best choice, while those with -multiple user-defined types will remain ambiguous and throw an error. -</para> - -<para> -Ambiguous expressions which have candidate solutions within only one type category are -likely to resolve, while ambiguous expressions with candidates spanning multiple -categories are likely to throw an error and ask for clarification from the user. +a <firstterm>preferred type</firstterm> which is preferentially selected +when there is ambiguity. +In the user-defined category, each type is its own preferred type. +Ambiguous expressions (those with multiple candidate parsing solutions) +can often be resolved when there are multiple possible built-in types, but +they will raise an error when there are multiple choices for user-defined +types. </para> <sect2> @@ -207,12 +215,8 @@ should use this new function and will no longer do the implicit conversion using <sect1 id="typeconv-oper"> <title>Operators</title> -<sect2> -<title>Conversion Procedure</title> - <procedure> -<title>Operator Evaluation</title> - +<title>Operator Type Resolution</title> <step performance="required"> <para> @@ -222,15 +226,10 @@ Check for an exact match in the pg_operator system catalog. <substeps> <step performance="optional"> <para> -If one argument of a binary operator is <type>unknown</type>, -then assume it is the same type as the other argument. -</para> -</step> -<step performance="required"> -<para> -Reverse the arguments, and look for an exact match with an operator which -points to itself as being commutative. -If found, then reverse the arguments in the parse tree and use this operator. +If one argument of a binary operator is <type>unknown</type> type, +then assume it is the same type as the other argument for this check. +Other cases involving <type>unknown</type> will never find a match at +this step. </para> </step> </substeps> @@ -241,46 +240,63 @@ If found, then reverse the arguments in the parse tree and use this operator. Look for the best match. </para> <substeps> -<step performance="optional"> +<step performance="required"> <para> -Make a list of all operators of the same name. +Make a list of all operators of the same name for which the input types +match or can be coerced to match. (<type>unknown</type> literals are +assumed to be coercible to anything for this purpose.) If there is only +one, use it; else continue to the next step. </para> </step> <step performance="required"> <para> -If only one operator is in the list, use it if the input type can be coerced, -and throw an error if the type cannot be coerced. +Run through all candidates and keep those with the most exact matches +on input types. Keep all candidates if none have any exact matches. +If only one candidate remains, use it; else continue to the next step. +</para> +<step performance="required"> +<para> +Run through all candidates and keep those with the most exact or +binary-compatible matches on input types. Keep all candidates if none have +any exact or binary-compatible matches. +If only one candidate remains, use it; else continue to the next step. </para> </step> <step performance="required"> <para> -Keep all operators with the most explicit matches for types. Keep all if there -are no explicit matches and move to the next step. -If only one candidate remains, use it if the type can be coerced. +Run through all candidates and keep those which accept preferred types at +the most positions where type coercion will be required. +Keep all candidates if none accept preferred types. +If only one candidate remains, use it; else continue to the next step. </para> </step> <step performance="required"> <para> -If any input arguments are "unknown", categorize the input candidates as -boolean, numeric, string, geometric, or user-defined. If there is a mix of -categories, or more than one user-defined type, throw an error because -the correct choice cannot be deduced without more clues. -If only one category is present, then assign the "preferred type" -to the input column which had been previously "unknown". +If any input arguments are "unknown", check the type categories accepted +at those argument positions by the remaining candidates. At each position, +select "string" +category if any candidate accepts that category (this bias towards string +is appropriate since an unknown-type literal does look like a string). +Otherwise, if all the remaining candidates accept the same type category, +select that category; otherwise raise an error because +the correct choice cannot be deduced without more clues. Also note whether +any of the candidates accept a preferred datatype within the selected category. +Now discard operator candidates that do not accept the selected type category; +furthermore, if any candidate accepts a preferred type at a given argument +position, discard candidates that accept non-preferred types for that +argument. </para> </step> <step performance="required"> <para> -Choose the candidate with the most exact type matches, and which matches -the "preferred type" for each column category from the previous step. -If there is still more than one candidate, or if there are none, -then throw an error. +If only one candidate remains, use it. If no candidate or more than one +candidate remains, +then raise an error. </para> </step> </substeps> </step> </procedure> -</sect2> <sect2> <title>Examples</title> @@ -372,17 +388,12 @@ tgl=> SELECT 'abc' || 'def' AS "Unspecified"; <para> In this case there is no initial hint for which type to use, since no types are specified in the query. So, the parser looks for all candidate operators -and finds that all arguments for all the candidates are string types. It chooses -the "preferred type" for strings, <type>text</type>, for this query. -</para> - -<note> -<para> -If a user defines a new type and defines an operator "<literal>||</literal>" to work -with it, then this query would no longer succeed as written. The parser would -now have candidate types from two categories, and could not decide which to use. +and finds that there are candidates accepting both string-category and +bitstring-category inputs. Since string category is preferred when available, +that category is selected, and then the +"preferred type" for strings, <type>text</type>, is used as the specific +type to resolve the unknown literals to. </para> -</note> </sect3> <sect3> @@ -423,11 +434,13 @@ will try to oblige. <title>Functions</title> <procedure> -<title>Function Evaluation</title> +<title>Function Call Type Resolution</title> <step performance="required"> <para> Check for an exact match in the pg_proc system catalog. +(Cases involving <type>unknown</type> will never find a match at +this step.) </para></step> <step performance="required"> <para> @@ -436,38 +449,63 @@ Look for the best match. <substeps> <step performance="required"> <para> -Make a list of all functions of the same name with the same number of arguments. -</para></step> +Make a list of all functions of the same name with the same number of +arguments for which the input types +match or can be coerced to match. (<type>unknown</type> literals are +assumed to be coercible to anything for this purpose.) If there is only +one, use it; else continue to the next step. +</para> +</step> <step performance="required"> <para> -If only one function is in the list, use it if the input types can be coerced, -and throw an error if the types cannot be coerced. -</para></step> +Run through all candidates and keep those with the most exact matches +on input types. Keep all candidates if none have any exact matches. +If only one candidate remains, use it; else continue to the next step. +</para> <step performance="required"> <para> -Keep all functions with the most explicit matches for types. Keep all if there -are no explicit matches and move to the next step. -If only one candidate remains, use it if the type can be coerced. -</para></step> +Run through all candidates and keep those with the most exact or +binary-compatible matches on input types. Keep all candidates if none have +any exact or binary-compatible matches. +If only one candidate remains, use it; else continue to the next step. +</para> +</step> <step performance="required"> <para> -If any input arguments are "unknown", categorize the input candidate arguments as -boolean, numeric, string, geometric, or user-defined. If there is a mix of -categories, or more than one user-defined type, throw an error because -the correct choice cannot be deduced without more clues. -If only one category is present, then assign the "preferred type" -to the input column which had been previously "unknown". -</para></step> +Run through all candidates and keep those which accept preferred types at +the most positions where type coercion will be required. +Keep all candidates if none accept preferred types. +If only one candidate remains, use it; else continue to the next step. +</para> +</step> <step performance="required"> <para> -Choose the candidate with the most exact type matches, and which matches -the "preferred type" for each column category from the previous step. -If there is still more than one candidate, or if there are none, -then throw an error. -</para></step> +If any input arguments are "unknown", check the type categories accepted +at those argument positions by the remaining candidates. At each position, +select "string" +category if any candidate accepts that category (this bias towards string +is appropriate since an unknown-type literal does look like a string). +Otherwise, if all the remaining candidates accept the same type category, +select that category; otherwise raise an error because +the correct choice cannot be deduced without more clues. Also note whether +any of the candidates accept a preferred datatype within the selected category. +Now discard operator candidates that do not accept the selected type category; +furthermore, if any candidate accepts a preferred type at a given argument +position, discard candidates that accept non-preferred types for that +argument. +</para> +</step> +<step performance="required"> +<para> +If only one candidate remains, use it. If no candidate or more than one +candidate remains, +then raise an error. +</para> +</step> </substeps> </step> </procedure> + <sect2> <title>Examples</title> @@ -539,10 +577,10 @@ tgl=> select substr(text(varchar '1234'), 3); </para> <note> <para> -There are some heuristics in the parser to optimize the relationship between the -<type>char</type>, <type>varchar</type>, and <type>text</type> types. -For this case, <function>substr</function> is called directly with the <type>varchar</type> string -rather than inserting an explicit conversion call. +Actually, the parser is aware that <type>text</type> and <type>varchar</type> +are "binary compatible", meaning that one can be passed to a function that +accepts the other without doing any physical conversion. Therefore, no +explicit type conversion call is really inserted in this case. </para> </note> @@ -564,6 +602,8 @@ tgl=> select substr(text(1234), 3); 34 (1 row) </programlisting> +This succeeds because there is a conversion function text(int4) in the +system catalog. </para> </sect3> </sect2> @@ -573,7 +613,7 @@ tgl=> select substr(text(1234), 3); <title>Query Targets</title> <procedure> -<title>Target Evaluation</title> +<title>Query Target Type Resolution</title> <step performance="required"> <para> @@ -581,15 +621,21 @@ Check for an exact match with the target. </para></step> <step performance="required"> <para> -Try to coerce the expression directly to the target type if necessary. +Otherwise, try to coerce the expression to the target type. This will succeed +if the two types are known binary-compatible, or if there is a conversion +function. If the expression is an unknown-type literal, the contents of +the literal string will be fed to the input conversion routine for the target +type. </para></step> <step performance="required"> <para> If the target is a fixed-length type (e.g. <type>char</type> or <type>varchar</type> -declared with a length) then try to find a sizing function of the same name -as the type taking two arguments, the first the type name and the second an -integer length. +declared with a length) then try to find a sizing function for the target +type. A sizing function is a function of the same name as the type, +taking two arguments of which the first is that type and the second is an +integer, and returning the same type. If one is found, it is applied, +passing the column's declared length as the second parameter. </para></step> </procedure> @@ -613,32 +659,62 @@ tgl=> SELECT * FROM vv; v ------ abcd -(1 row) +(1 row) </programlisting> + +What's really happened here is that the two unknown literals are resolved +to text by default, allowing the <literal>||</literal> operator to be +resolved as text concatenation. Then the text result of the operator +is coerced to varchar to match the target column type. (But, since the +parser knows that text and varchar are binary-compatible, this coercion +is implicit and does not insert any real function call.) Finally, the +sizing function <literal>varchar(varchar,int4)</literal> is found in the system +catalogs and applied to the operator's result and the stored column length. +This type-specific function performs the desired truncation. </para> </sect3> </sect2> </sect1> -<sect1 id="typeconv-union"> -<title>UNION Queries</title> +<sect1 id="typeconv-union-case"> +<title>UNION and CASE Constructs</title> <para> -The UNION construct is somewhat different in that it must match up -possibly dissimilar types to become a single result set. +The UNION and CASE constructs must match up possibly dissimilar types to +become a single result set. The resolution algorithm is applied separately to +each output column of a UNION. CASE uses the identical algorithm to match +up its result expressions. </para> <procedure> -<title>UNION Evaluation</title> +<title>UNION and CASE Type Resolution</title> + +<step performance="required"> +<para> +If all inputs are of type <type>unknown</type>, resolve as type +<type>text</type> (the preferred type for string category). +Otherwise, ignore the <type>unknown</type> inputs while choosing the type. +</para></step> + +<step performance="required"> +<para> +If the non-unknown inputs are not all of the same type category, raise an +error. +</para></step> <step performance="required"> <para> -Check for identical types for all results. +If one or more non-unknown inputs are of a preferred type in that category, +resolve as that type. </para></step> <step performance="required"> <para> -Coerce each result from the UNION clauses to match the type of the -first SELECT clause or the target column. +Otherwise, resolve as the type of the first non-unknown input. +</para></step> + +<step performance="required"> +<para> +Coerce all inputs to the selected type. </para></step> </procedure> @@ -657,6 +733,7 @@ tgl=> SELECT text 'a' AS "Text" UNION SELECT 'b'; b (2 rows) </programlisting> +Here, the unknown-type literal 'b' will be resolved as type text. </para> </sect3> @@ -679,43 +756,26 @@ tgl=> SELECT 1.2 AS "Float8" UNION SELECT 1; <title>Transposed UNION</title> <para> -The types of the union are forced to match the types of +Here the output type of the union is forced to match the type of the first/top clause in the union: <programlisting> tgl=> SELECT 1 AS "All integers" -tgl-> UNION SELECT '2.2'::float4 -tgl-> UNION SELECT 3.3; +tgl-> UNION SELECT '2.2'::float4; All integers -------------- 1 2 - 3 -(3 rows) +(2 rows) </programlisting> </para> <para> -An alternate parser strategy could be to choose the "best" type of the bunch, but -this is more difficult because of the nice recursion technique used in the -parser. However, the "best" type is used when selecting <emphasis>into</emphasis> -a table: - -<programlisting> -tgl=> CREATE TABLE ff (f float); -CREATE -tgl=> INSERT INTO ff -tgl-> SELECT 1 -tgl-> UNION SELECT '2.2'::float4 -tgl-> UNION SELECT 3.3; -INSERT 0 3 -tgl=> SELECT f AS "Floating point" from ff; - Floating point ------------------- - 1 - 2.20000004768372 - 3.3 -(3 rows) -</programlisting> +Since float4 is not a preferred type, the parser sees no reason to select it +over int4, and instead falls back on the use-the-first-alternative rule. +This example demonstrates that the preferred-type mechanism doesn't encode +as much information as we'd like. Future versions of +<productname>Postgres</productname> may support a more general notion of +type preferences. </para> </sect3> </sect2> -- GitLab