    src/backend/executor/README
    
    The Postgres Executor
    =====================
    
    The executor processes a tree of "plan nodes".  The plan tree is essentially
    a demand-pull pipeline of tuple processing operations.  Each node, when
    called, will produce the next tuple in its output sequence, or NULL if no
    more tuples are available.  If the node is not a primitive relation-scanning
    node, it will have child node(s) that it calls in turn to obtain input
    tuples.
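
    A minimal sketch of this demand-pull protocol in plain C (the types and
    function names here are hypothetical, for illustration only --- they are
    not the executor's actual interfaces):

    	#include <stdbool.h>
    	#include <stddef.h>

    	typedef struct Tuple Tuple;			/* opaque tuple for this sketch */

    	typedef struct Node Node;
    	struct Node
    	{
    		Node	   *child;				/* NULL for a primitive scan node */
    		Tuple	   *(*produce) (Node *self);	/* next tuple, or NULL */
    	};

    	static bool qual_passes(Tuple *tup);	/* hypothetical qual test */

    	/* A filter node: pull from the child until a tuple passes the qual */
    	static Tuple *
    	filter_produce(Node *self)
    	{
    		Tuple	   *tup;

    		while ((tup = self->child->produce(self->child)) != NULL)
    		{
    			if (qual_passes(tup))
    				return tup;				/* deliver next output tuple */
    		}
    		return NULL;					/* child exhausted, so are we */
    	}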
    
    Refinements on this basic model include:
    
    * Choice of scan direction (forwards or backwards).  Caution: this is not
    currently well-supported.  It works for primitive scan nodes, but not very
    well for joins, aggregates, etc.
    
    * Rescan command to reset a node and make it generate its output sequence
    over again.
    
    * Parameters that can alter a node's results.  After adjusting a parameter,
    the rescan command must be applied to that node and all nodes above it.
    There is a moderately intelligent scheme to avoid rescanning nodes
    unnecessarily (for example, Sort does not rescan its input if no parameters
    of the input have changed, since it can just reread its stored sorted data).
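
    A simplified sketch of that Sort behavior (paraphrased from nodeSort.c; the
    field and helper names here are stand-ins --- the real code tracks changed
    parameters in the node's chgParam set):

    	static void
    	sort_rescan(SortNode *node)
    	{
    		if (node->child_params_changed)	/* real code checks chgParam */
    		{
    			/* input may differ: discard stored output and re-sort */
    			discard_sorted_output(node);
    			node->sort_done = false;
    			rescan(node->child);
    		}
    		else
    		{
    			/* input unchanged: just rewind the stored sorted data */
    			rewind_sorted_output(node);
    		}
    	}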
    
    For a SELECT, it is only necessary to deliver the top-level result tuples
    to the client.  For INSERT/UPDATE/DELETE, the actual table modification
    operations happen in a top-level ModifyTable plan node.  If the query
    includes a RETURNING clause, the ModifyTable node delivers the computed
    RETURNING rows as output, otherwise it returns nothing.  Handling INSERT
    is pretty straightforward: the tuples returned from the plan tree below
    ModifyTable are inserted into the correct result relation.  For UPDATE,
    the plan tree returns the computed tuples to be updated, plus a "junk"
    (hidden) CTID column identifying which table row is to be replaced by each
    one.  For DELETE, the plan tree need only deliver a CTID column, and the
    ModifyTable node visits each of those rows and marks the row deleted.
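
    Concretely, ModifyTable digs the junk CTID out of each input row before
    acting on it; this fragment is adapted (and simplified) from
    nodeModifyTable.c:

    	Datum		datum;
    	bool		isNull;
    	ItemPointer tupleid;

    	/* fetch the resjunk "ctid" column identifying the target row */
    	datum = ExecGetJunkAttribute(slot, junkfilter->jf_junkAttNo, &isNull);
    	if (isNull)
    		elog(ERROR, "ctid is NULL");
    	tupleid = (ItemPointer) DatumGetPointer(datum);
    	/* tupleid now says which heap row to update or delete */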
    
    XXX a great deal more documentation needs to be written here...
    
    
    Plan Trees and State Trees
    --------------------------
    
    The plan tree delivered by the planner contains a tree of Plan nodes (struct
    types derived from struct Plan).  Each Plan node may have expression trees
    associated with it, to represent its target list, qualification conditions,
    etc.  During executor startup we build a parallel tree of identical structure
    containing executor state nodes --- every plan and expression node type has
    a corresponding executor state node type.  Each node in the state tree has a
    pointer to its corresponding node in the plan tree, plus executor state data
    as needed to implement that node type.  This arrangement allows the plan
    tree to be completely read-only as far as the executor is concerned: all data
    that is modified during execution is in the state tree.  Read-only plan trees
    make life much simpler for plan caching and reuse.
    
    Altogether there are four classes of nodes used in these trees: Plan nodes,
    their corresponding PlanState nodes, Expr nodes, and their corresponding
    ExprState nodes.  (Actually, there are also List nodes, which are used as
    "glue" in all four kinds of tree.)
    
    
    Memory Management
    -----------------
    
    A "per query" memory context is created during CreateExecutorState();
    all storage allocated during an executor invocation is allocated in that
    context or a child context.  This allows easy reclamation of storage
    during executor shutdown --- rather than messing with retail pfree's and
    probable storage leaks, we just destroy the memory context.
    
    In particular, the plan state trees and expression state trees described
    in the previous section are allocated in the per-query memory context.
    
    To avoid intra-query memory leaks, most processing while a query runs
    is done in "per tuple" memory contexts, which are so-called because they
    are typically reset to empty once per tuple.  Per-tuple contexts are usually
    associated with ExprContexts, and commonly each PlanState node has its own
    ExprContext to evaluate its qual and targetlist expressions in.
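
    The resulting discipline inside a node's ExecProcNode routine looks about
    like this (a hedged sketch: ResetExprContext, ExecQual, TupIsNull, and
    ecxt_scantuple are real executor facilities, but fetch_next_input_tuple
    stands in for whatever the node does to obtain a tuple, and the exact
    field paths vary by node type):

    	ExprContext *econtext = node->ps.ps_ExprContext;
    	TupleTableSlot *slot;

    	for (;;)
    	{
    		/* reclaim whatever the previous cycle left in per-tuple memory */
    		ResetExprContext(econtext);

    		slot = fetch_next_input_tuple(node);
    		if (TupIsNull(slot))
    			return NULL;			/* no more input */

    		econtext->ecxt_scantuple = slot;
    		if (ExecQual(node->ps.qual, econtext, false))
    			return slot;			/* qual passed: emit this tuple */
    	}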
    
    
    Query Processing Control Flow
    -----------------------------
    
    This is a sketch of control flow for full query processing:
    
    	CreateQueryDesc
    
    	ExecutorStart
    		CreateExecutorState
    			creates per-query context
    		switch to per-query context to run ExecInitNode
    		ExecInitNode --- recursively scans plan tree
    			CreateExprContext
    				creates per-tuple context
    			ExecInitExpr
    
    	ExecutorRun
    		ExecProcNode --- recursively called in per-query context
    			ExecEvalExpr --- called in per-tuple context
    			ResetExprContext --- to free memory
    
    	ExecutorEnd
    		ExecEndNode --- recursively releases resources
    		FreeExecutorState
    			frees per-query context and child contexts
    
    	FreeQueryDesc
    
    Per above comments, it's not really critical for ExecEndNode to free any
    memory; it'll all go away in FreeExecutorState anyway.  However, we do need to
    be careful to close relations, drop buffer pins, etc, so we do need to scan
    the plan state tree to find these sorts of resources.
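
    For reference, a caller driving one full cycle in C looks about like this
    (simplified; these are the real entry points circa this era, though the
    construction of plannedstmt, dest, and params is elided):

    	QueryDesc  *queryDesc;

    	queryDesc = CreateQueryDesc(plannedstmt, sourceText,
    								GetActiveSnapshot(), InvalidSnapshot,
    								dest, params, 0);

    	ExecutorStart(queryDesc, 0);		/* builds the state trees */
    	ExecutorRun(queryDesc, ForwardScanDirection, 0L);	/* 0 = run to end */
    	ExecutorEnd(queryDesc);				/* closes rels, drops pins, etc */

    	FreeQueryDesc(queryDesc);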
    
    
    The executor can also be used to evaluate simple expressions without any Plan
    tree ("simple" meaning "no aggregates and no sub-selects", though such might
    be hidden inside function calls).  This case has a flow of control like
    
    	CreateExecutorState
    		creates per-query context
    
    	CreateExprContext	-- or use GetPerTupleExprContext(estate)
    		creates per-tuple context
    
    	ExecPrepareExpr
    		temporarily switch to per-query context
    		run the expression through expression_planner
    		ExecInitExpr
    
    	Repeatedly do:
    		ExecEvalExprSwitchContext
    			ExecEvalExpr --- called in per-tuple context
    		ResetExprContext --- to free memory
    
    	FreeExecutorState
    		frees per-query context, as well as ExprContext
    		(a separate FreeExprContext call is not necessary)
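
    Rendered as C, the expression-only flow is (a sketch; "expr" is assumed to
    be an Expr tree you already have, and the final NULL passed to
    ExecEvalExprSwitchContext is the isDone argument, which simple expressions
    don't need):

    	EState	   *estate = CreateExecutorState();
    	ExprContext *econtext = GetPerTupleExprContext(estate);
    	ExprState  *exprstate;
    	Datum		result;
    	bool		isNull;

    	/* per-query context: plan the expression, build its ExprState tree */
    	exprstate = ExecPrepareExpr(expr, estate);

    	/* evaluate in the per-tuple context, then reset to free memory */
    	result = ExecEvalExprSwitchContext(exprstate, econtext, &isNull, NULL);
    	ResetExprContext(econtext);

    	FreeExecutorState(estate);	/* frees per-query context and ExprContext */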
    
    
    EvalPlanQual (READ COMMITTED Update Checking)
    ---------------------------------------------
    
    For simple SELECTs, the executor need only pay attention to tuples that are
    valid according to the snapshot seen by the current transaction (ie, they
    were inserted by a previously committed transaction, and not deleted by any
    previously committed transaction).  However, for UPDATE and DELETE it is not
    cool to modify or delete a tuple that's been modified by an open or
    concurrently-committed transaction.  If we are running in SERIALIZABLE
    isolation level then we just raise an error when this condition is seen to
    occur.  In READ COMMITTED isolation level, we must work a lot harder.
    
    The basic idea in READ COMMITTED mode is to take the modified tuple
    committed by the concurrent transaction (after waiting for it to commit,
    if need be) and re-evaluate the query qualifications to see if it would
    still meet the quals.  If so, we regenerate the updated tuple (if we are
    doing an UPDATE) from the modified tuple, and finally update/delete the
    modified tuple.  SELECT FOR UPDATE/SHARE behaves similarly, except that its
    action is just to lock the modified tuple and return results based on that
    version of the tuple.
    
    To implement this checking, we actually re-run the query from scratch for
    each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the
    relation scan nodes tweaked to return only the current tuples --- either
    the original ones, or the updated (and now locked) versions of the modified
    tuple(s).  If this query returns a tuple, then the modified tuple(s) pass
    the quals (and the query output is the suitably modified update tuple, if
    we're doing UPDATE).  If no tuple is returned, then the modified tuple(s)
    fail the quals, so we ignore the current result tuple and continue the
    original query.
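
    In outline, the recheck looks like this (pseudo-C with hypothetical names;
    the real machinery is the EvalPlanQual code in execMain.c):

    	/* returns the recomputed result row, or NULL if the quals now fail */
    	static TupleTableSlot *
    	recheck_modified_tuple(RecheckState *epq, Tuple *newest_version)
    	{
    		TupleTableSlot *slot;

    		/* make the target rel's scan node return only this tuple */
    		set_test_tuple(epq, newest_version);

    		/* re-run the plan from scratch against the substituted tuple */
    		start_recheck_plan(epq);
    		slot = fetch_one_recheck_row(epq);
    		end_recheck_plan(epq);

    		return slot;	/* NULL: skip this tuple, continue the query */
    	}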
    
    In UPDATE/DELETE, only the target relation needs to be handled this way.
    In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
    so we obtain lock on the current tuple version in each such relation before
    executing the recheck.
    
    It is also possible that there are relations in the query that are not
    to be locked (they are neither the UPDATE/DELETE target nor specified to
    be locked in SELECT FOR UPDATE/SHARE).  When re-running the test query
    we want to use the same rows from these relations that were joined to
    the locked rows.  For ordinary relations this can be implemented relatively
    cheaply by including the row TID in the join outputs and re-fetching that
    TID.  (The re-fetch is expensive, but we're trying to optimize the normal
    case where no re-test is needed.)  We also have to consider non-table
    relations, such as a ValuesScan or FunctionScan.  For these, since there
    is no equivalent of TID, the only practical solution seems to be to include
    the entire row value in the join output row.
    
    We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
    so as to ensure that at most one tuple can be returned for any particular
    set of scan tuples.  Otherwise we'd get duplicates due to the original
    query returning the same set of scan tuples multiple times.  (Note: there
    is no explicit prohibition on SRFs in UPDATE, but the net effect will be
    that only the first result row of an SRF counts, because all subsequent
    rows will result in attempts to re-update an already updated target row.
    This is historical behavior and seems not worth changing.)