Skip to content
Snippets Groups Projects
Select Git revision
  • benchmark-tools
  • postgres-lambda
  • master default
  • REL9_4_25
  • REL9_5_20
  • REL9_6_16
  • REL_10_11
  • REL_11_6
  • REL_12_1
  • REL_12_0
  • REL_12_RC1
  • REL_12_BETA4
  • REL9_4_24
  • REL9_5_19
  • REL9_6_15
  • REL_10_10
  • REL_11_5
  • REL_12_BETA3
  • REL9_4_23
  • REL9_5_18
  • REL9_6_14
  • REL_10_9
  • REL_11_4
23 results

transam

  • Clone with SSH
  • Clone with HTTPS
  • user avatar
    Alvaro Herrera authored
    unnecessary #include lines in it.  Also, move some tuple routine prototypes and
    macros to htup.h, which allows removal of heapam.h inclusion from some .c
    files.
    
    For this to work, a new header file access/sysattr.h needed to be created,
    initially containing attribute numbers of system columns, for pg_dump usage.
    
    While at it, make contrib ltree, intarray and hstore header files more
    consistent with our header style.
    f8c4d7db
    History
    $PostgreSQL: pgsql/src/backend/access/transam/README,v 1.11 2008/03/21 13:23:28 momjian Exp $
    
    The Transaction System
    ======================
    
    PostgreSQL's transaction system is a three-layer system.  The bottom layer
    implements low-level transactions and subtransactions, on top of which rests
    the mainloop's control code, which in turn implements user-visible
    transactions and savepoints.
    
    The middle layer of code is called by postgres.c before and after the
    processing of each query, or after detecting an error:
    
    		StartTransactionCommand
    		CommitTransactionCommand
    		AbortCurrentTransaction
    
    Meanwhile, the user can alter the system's state by issuing the SQL commands
    BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE.  The traffic cop
    redirects these calls to the toplevel routines
    
    		BeginTransactionBlock
    		EndTransactionBlock
    		UserAbortTransactionBlock
    		DefineSavepoint
    		RollbackToSavepoint
    		ReleaseSavepoint
    
    respectively.  Depending on the current state of the system, these functions
    call low level functions to activate the real transaction system:
    
    		StartTransaction
    		CommitTransaction
    		AbortTransaction
    		CleanupTransaction
    		StartSubTransaction
    		CommitSubTransaction
    		AbortSubTransaction
    		CleanupSubTransaction
    
    Additionally, within a transaction, CommandCounterIncrement is called to
    increment the command counter, which allows future commands to "see" the
    effects of previous commands within the same transaction.  Note that this is
    done automatically by CommitTransactionCommand after each query inside a
    transaction block, but some utility functions also do it internally to allow
    some operations (usually in the system catalogs) to be seen by future
    operations in the same utility command.  (For example, in DefineRelation it is
    done after creating the heap so the pg_class row is visible, to be able to
    lock it.)
    
    
    For example, consider the following sequence of user commands:
    
    1)		BEGIN
    2)		SELECT * FROM foo
    3)		INSERT INTO foo VALUES (...)
    4)		COMMIT
    
    In the main processing loop, this results in the following function call
    sequence:
    
    	 /	StartTransactionCommand;
    	/		StartTransaction;
    1) <		ProcessUtility;				<< BEGIN
    	\		BeginTransactionBlock;
    	 \	CommitTransactionCommand;
    
    	/	StartTransactionCommand;
    2) /		ProcessQuery;				<< SELECT ...
       \		CommitTransactionCommand;
    	\		CommandCounterIncrement;
    
    	/	StartTransactionCommand;
    3) /		ProcessQuery;				<< INSERT ...
       \		CommitTransactionCommand;
    	\		CommandCounterIncrement;
    
    	 /	StartTransactionCommand;
    	/	ProcessUtility;				<< COMMIT
    4) <			EndTransactionBlock;
    	\	CommitTransactionCommand;
    	 \		CommitTransaction;
    
    The point of this example is to demonstrate the need for
    StartTransactionCommand and CommitTransactionCommand to be state smart -- they
    should call CommandCounterIncrement between the calls to BeginTransactionBlock
    and EndTransactionBlock and outside these calls they need to do normal start,
    commit or abort processing.
    
    Furthermore, suppose the "SELECT * FROM foo" caused an abort condition.	In
    this case AbortCurrentTransaction is called, and the transaction is put in
    aborted state.  In this state, any user input is ignored except for
    transaction-termination statements, or ROLLBACK TO <savepoint> commands.
    
    Transaction aborts can occur in two ways:
    
    1)	system dies from some internal cause  (syntax error, etc)
    2)	user types ROLLBACK
    
    The reason we have to distinguish them is illustrated by the following two
    situations:
    
    	case 1					case 2
    	------					------
    1) user types BEGIN			1) user types BEGIN
    2) user does something			2) user does something
    3) user does not like what		3) system aborts for some reason
       she sees and types ABORT		   (syntax error, etc)
    
    In case 1, we want to abort the transaction and return to the default state.
    In case 2, there may be more commands coming our way which are part of the
    same transaction block; we have to ignore these commands until we see a COMMIT
    or ROLLBACK.
    
    Internal aborts are handled by AbortCurrentTransaction, while user aborts are
    handled by UserAbortTransactionBlock.  Both of them rely on AbortTransaction
    to do all the real work.  The only difference is what state we enter after
    AbortTransaction does its work:
    
    * AbortCurrentTransaction leaves us in TBLOCK_ABORT,
    * UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END
    
    Low-level transaction abort handling is divided in two phases:
    * AbortTransaction executes as soon as we realize the transaction has
      failed.  It should release all shared resources (locks etc) so that we do
      not delay other backends unnecessarily.
    * CleanupTransaction executes when we finally see a user COMMIT
      or ROLLBACK command; it cleans things up and gets us out of the transaction
      completely.  In particular, we mustn't destroy TopTransactionContext until
      this point.
    
    Also, note that when a transaction is committed, we don't close it right away.
    Rather it's put in TBLOCK_END state, which means that when
    CommitTransactionCommand is called after the query has finished processing,
    the transaction has to be closed.  The distinction is subtle but important,
    because it means that control will leave the xact.c code with the transaction
    open, and the main loop will be able to keep processing inside the same
    transaction.  So, in a sense, transaction commit is also handled in two
    phases, the first at EndTransactionBlock and the second at
    CommitTransactionCommand (which is where CommitTransaction is actually
    called).
    
    The rest of the code in xact.c are routines to support the creation and
    finishing of transactions and subtransactions.  For example, AtStart_Memory
    takes care of initializing the memory subsystem at main transaction start.
    
    
    Subtransaction Handling
    -----------------------
    
    Subtransactions are implemented using a stack of TransactionState structures,
    each of which has a pointer to its parent transaction's struct.  When a new
    subtransaction is to be opened, PushTransaction is called, which creates a new
    TransactionState, with its parent link pointing to the current transaction.
    StartSubTransaction is in charge of initializing the new TransactionState to
    sane values, and properly initializing other subsystems (AtSubStart routines).
    
    When closing a subtransaction, either CommitSubTransaction has to be called
    (if the subtransaction is committing), or AbortSubTransaction and
    CleanupSubTransaction (if it's aborting).  In either case, PopTransaction is
    called so the system returns to the parent transaction.
    
    One important point regarding subtransaction handling is that several may need
    to be closed in response to a single user command.  That's because savepoints
    have names, and we allow to commit or rollback a savepoint by name, which is
    not necessarily the one that was last opened.  Also a COMMIT or ROLLBACK
    command must be able to close out the entire stack.  We handle this by having
    the utility command subroutine mark all the state stack entries as commit-
    pending or abort-pending, and then when the main loop reaches
    CommitTransactionCommand, the real work is done.  The main point of doing
    things this way is that if we get an error while popping state stack entries,
    the remaining stack entries still show what we need to do to finish up.
    
    In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
    through the one identified by the savepoint name, and then re-create that
    subtransaction level with the same name.  So it's a completely new
    subtransaction as far as the internals are concerned.
    
    Other subsystems are allowed to start "internal" subtransactions, which are
    handled by BeginInternalSubtransaction.  This is to allow implementing
    exception handling, e.g. in PL/pgSQL.  ReleaseCurrentSubTransaction and
    RollbackAndReleaseCurrentSubTransaction allows the subsystem to close said
    subtransactions.  The main difference between this and the savepoint/release
    path is that we execute the complete state transition immediately in each
    subroutine, rather than deferring some work until CommitTransactionCommand.
    Another difference is that BeginInternalSubtransaction is allowed when no
    explicit transaction block has been established, while DefineSavepoint is not.
    
    
    Transaction and Subtransaction Numbering
    ----------------------------------------
    
    Transactions and subtransactions are assigned permanent XIDs only when/if
    they first do something that requires one --- typically, insert/update/delete
    a tuple, though there are a few other places that need an XID assigned.
    If a subtransaction requires an XID, we always first assign one to its
    parent.  This maintains the invariant that child transactions have XIDs later
    than their parents, which is assumed in a number of places.
    
    The subsidiary actions of obtaining a lock on the XID and and entering it into
    pg_subtrans and PG_PROC are done at the time it is assigned.
    
    A transaction that has no XID still needs to be identified for various
    purposes, notably holding locks.  For this purpose we assign a "virtual
    transaction ID" or VXID to each top-level transaction.  VXIDs are formed from
    two fields, the backendID and a backend-local counter; this arrangement allows
    assignment of a new VXID at transaction start without any contention for
    shared memory.  To ensure that a VXID isn't re-used too soon after backend
    exit, we store the last local counter value into shared memory at backend
    exit, and initialize it from the previous value for the same backendID slot
    at backend start.  All these counters go back to zero at shared memory
    re-initialization, but that's OK because VXIDs never appear anywhere on-disk.
    
    Internally, a backend needs a way to identify subtransactions whether or not
    they have XIDs; but this need only lasts as long as the parent top transaction
    endures.  Therefore, we have SubTransactionId, which is somewhat like
    CommandId in that it's generated from a counter that we reset at the start of
    each top transaction.  The top-level transaction itself has SubTransactionId 1,
    and subtransactions have IDs 2 and up.  (Zero is reserved for
    InvalidSubTransactionId.)  Note that subtransactions do not have their
    own VXIDs; they use the parent top transaction's VXID.
    
    
    Interlocking Transaction Begin, Transaction End, and Snapshots
    --------------------------------------------------------------
    
    We try hard to minimize the amount of overhead and lock contention involved
    in the frequent activities of beginning/ending a transaction and taking a
    snapshot.  Unfortunately, we must have some interlocking for this, because
    we must ensure consistency about the commit order of transactions.
    For example, suppose an UPDATE in xact A is blocked by xact B's prior
    update of the same row, and xact B is doing commit while xact C gets a
    snapshot.  Xact A can complete and commit as soon as B releases its locks.
    If xact C's GetSnapshotData sees xact B as still running, then it had
    better see xact A as still running as well, or it will be able to see two
    tuple versions - one deleted by xact B and one inserted by xact A.  Another
    reason why this would be bad is that C would see (in the row inserted by A)
    earlier changes by B, and it would be inconsistent for C not to see any
    of B's changes elsewhere in the database.
    
    Formally, the correctness requirement is "if a snapshot A considers
    transaction X as committed, and any of transaction X's snapshots considered
    transaction Y as committed, then snapshot A must consider transaction Y as
    committed".
    
    What we actually enforce is strict serialization of commits and rollbacks
    with snapshot-taking: we do not allow any transaction to exit the set of
    running transactions while a snapshot is being taken.  (This rule is
    stronger than necessary for consistency, but is relatively simple to
    enforce, and it assists with some other issues as explained below.)  The
    implementation of this is that GetSnapshotData takes the ProcArrayLock in
    shared mode (so that multiple backends can take snapshots in parallel),
    but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
    while clearing MyProc->xid at transaction end (either commit or abort).
    
    ProcArrayEndTransaction also holds the lock while advancing the shared
    latestCompletedXid variable.  This allows GetSnapshotData to use
    latestCompletedXid + 1 as xmax for its snapshot: there can be no
    transaction >= this xid value that the snapshot needs to consider as
    completed.
    
    In short, then, the rule is that no transaction may exit the set of
    currently-running transactions between the time we fetch latestCompletedXid
    and the time we finish building our snapshot.  However, this restriction
    only applies to transactions that have an XID --- read-only transactions
    can end without acquiring ProcArrayLock, since they don't affect anyone
    else's snapshot nor latestCompletedXid.
    
    Transaction start, per se, doesn't have any interlocking with these
    considerations, since we no longer assign an XID immediately at transaction
    start.  But when we do decide to allocate an XID, GetNewTransactionId must
    store the new XID into the shared ProcArray before releasing XidGenLock.
    This ensures that all top-level XIDs <= latestCompletedXid are either
    present in the ProcArray, or not running anymore.  (This guarantee doesn't
    apply to subtransaction XIDs, because of the possibility that there's not
    room for them in the subxid array; instead we guarantee that they are
    present or the overflow flag is set.)  If a backend released XidGenLock
    before storing its XID into MyProc, then it would be possible for another
    backend to allocate and commit a later XID, causing latestCompletedXid to
    pass the first backend's XID, before that value became visible in the
    ProcArray.  That would break GetOldestXmin, as discussed below.
    
    We allow GetNewTransactionId to store the XID into MyProc->xid (or the
    subxid array) without taking ProcArrayLock.  This was once necessary to
    avoid deadlock; while that is no longer the case, it's still beneficial for
    performance.  We are thereby relying on fetch/store of an XID to be atomic,
    else other backends might see a partially-set XID.  This also means that
    readers of the ProcArray xid fields must be careful to fetch a value only
    once, rather than assume they can read it multiple times and get the same
    answer each time.  (Use volatile-qualified pointers when doing this, to
    ensure that the C compiler does exactly what you tell it to.)
    
    Another important activity that uses the shared ProcArray is GetOldestXmin,
    which must determine a lower bound for the oldest xmin of any active MVCC
    snapshot, system-wide.  Each individual backend advertises the smallest
    xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
    live snapshots (eg, if it's between transactions or hasn't yet set a
    snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
    valid xmin fields.  It does this with only shared lock on ProcArrayLock,
    which means there is a potential race condition against other backends
    doing GetSnapshotData concurrently: we must be certain that a concurrent
    backend that is about to set its xmin does not compute an xmin less than
    what GetOldestXmin returns.  We ensure that by including all the active
    XIDs into the MIN() calculation, along with the valid xmins.  The rule that
    transactions can't exit without taking exclusive ProcArrayLock ensures that
    concurrent holders of shared ProcArrayLock will compute the same minimum of
    currently-active XIDs: no xact, in particular not the oldest, can exit
    while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
    active XID will be the same as that of any concurrent GetSnapshotData, and
    so it can't produce an overestimate.  If there is no active transaction at
    all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
    for the xmin that might be computed by concurrent or later GetSnapshotData
    calls.  (We know that no XID less than this could be about to appear in
    the ProcArray, because of the XidGenLock interlock discussed above.)
    
    GetSnapshotData also performs an oldest-xmin calculation (which had better
    match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
    for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
    too expensive.  Note that while it is certain that two concurrent
    executions of GetSnapshotData will compute the same xmin for their own
    snapshots, as argued above, it is not certain that they will arrive at the
    same estimate of RecentGlobalXmin.  This is because we allow XID-less
    transactions to clear their MyProc->xmin asynchronously (without taking
    ProcArrayLock), so one execution might see what had been the oldest xmin,
    and another not.  This is OK since RecentGlobalXmin need only be a valid
    lower bound.  As noted above, we are already assuming that fetch/store
    of the xid fields is atomic, so assuming it for xmin as well is no extra
    risk.
    
    
    pg_clog and pg_subtrans
    -----------------------
    
    pg_clog and pg_subtrans are permanent (on-disk) storage of transaction related
    information.  There is a limited number of pages of each kept in memory, so
    in many cases there is no need to actually read from disk.  However, if
    there's a long running transaction or a backend sitting idle with an open
    transaction, it may be necessary to be able to read and write this information
    from disk.  They also allow information to be permanent across server restarts.
    
    pg_clog records the commit status for each transaction that has been assigned
    an XID.  A transaction can be in progress, committed, aborted, or
    "sub-committed".  This last state means that it's a subtransaction that's no
    longer running, but its parent has not updated its state yet (either it is
    still running, or the backend crashed without updating its status).  A
    sub-committed transaction's status will be updated again to the final value as
    soon as the parent commits or aborts, or when the parent is detected to be
    aborted.
    
    Savepoints are implemented using subtransactions.  A subtransaction is a
    transaction inside a transaction; its commit or abort status is not only
    dependent on whether it committed itself, but also whether its parent
    transaction committed.  To implement multiple savepoints in a transaction we
    allow unlimited transaction nesting depth, so any particular subtransaction's
    commit state is dependent on the commit status of each and every ancestor
    transaction.
    
    The "subtransaction parent" (pg_subtrans) mechanism records, for each
    transaction with an XID, the TransactionId of its parent transaction.  This
    information is stored as soon as the subtransaction is assigned an XID.
    Top-level transactions do not have a parent, so they leave their pg_subtrans
    entries set to the default value of zero (InvalidTransactionId).
    
    pg_subtrans is used to check whether the transaction in question is still
    running --- the main Xid of a transaction is recorded in the PGPROC struct,
    but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
    in shared memory, so we have to store them on disk.  Note, however, that for
    each transaction we keep a "cache" of Xids that are known to be part of the
    transaction tree, so we can skip looking at pg_subtrans unless we know the
    cache has been overflowed.  See storage/ipc/procarray.c for the gory details.
    
    slru.c is the supporting mechanism for both pg_clog and pg_subtrans.  It
    implements the LRU policy for in-memory buffer pages.  The high-level routines
    for pg_clog are implemented in transam.c, while the low-level functions are in
    clog.c.  pg_subtrans is contained completely in subtrans.c.
    
    
    Write-Ahead Log Coding
    ----------------------
    
    The WAL subsystem (also called XLOG in the code) exists to guarantee crash
    recovery.  It can also be used to provide point-in-time recovery, as well as
    hot-standby replication via log shipping.  Here are some notes about
    non-obvious aspects of its design.
    
    A basic assumption of a write AHEAD log is that log entries must reach stable
    storage before the data-page changes they describe.  This ensures that
    replaying the log to its end will bring us to a consistent state where there
    are no partially-performed transactions.  To guarantee this, each data page
    (either heap or index) is marked with the LSN (log sequence number --- in
    practice, a WAL file location) of the latest XLOG record affecting the page.
    Before the bufmgr can write out a dirty page, it must ensure that xlog has
    been flushed to disk at least up to the page's LSN.  This low-level
    interaction improves performance by not waiting for XLOG I/O until necessary.
    The LSN check exists only in the shared-buffer manager, not in the local
    buffer manager used for temp tables; hence operations on temp tables must not
    be WAL-logged.
    
    During WAL replay, we can check the LSN of a page to detect whether the change
    recorded by the current log entry is already applied (it has been, if the page
    LSN is >= the log entry's WAL location).
    
    Usually, log entries contain just enough information to redo a single
    incremental update on a page (or small group of pages).  This will work only
    if the filesystem and hardware implement data page writes as atomic actions,
    so that a page is never left in a corrupt partly-written state.  Since that's
    often an untenable assumption in practice, we log additional information to
    allow complete reconstruction of modified pages.  The first WAL record
    affecting a given page after a checkpoint is made to contain a copy of the
    entire page, and we implement replay by restoring that page copy instead of
    redoing the update.  (This is more reliable than the data storage itself would
    be because we can check the validity of the WAL record's CRC.)  We can detect
    the "first change after checkpoint" by noting whether the page's old LSN
    precedes the end of WAL as of the last checkpoint (the RedoRecPtr).
    
    The general schema for executing a WAL-logged action is
    
    1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
    to be modified.
    
    2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
    PANIC because the shared buffers will contain unlogged changes, which we
    have to ensure don't get to disk.  Obviously, you should check conditions
    such as whether there's enough free space on the page before you start the
    critical section.)
    
    3. Apply the required changes to the shared buffer(s).
    
    4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
    happen before the WAL record is inserted; see notes in SyncOneBuffer().)
    
    5. Build a WAL log record and pass it to XLogInsert(); then update the page's
    LSN and TLI using the returned XLOG location.  For instance,
    
    		recptr = XLogInsert(rmgr_id, info, rdata);
    
    		PageSetLSN(dp, recptr);
    		PageSetTLI(dp, ThisTimeLineID);
    
    6. END_CRIT_SECTION()
    
    7. Unlock and unpin the buffer(s).
    
    XLogInsert's "rdata" argument is an array of pointer/size items identifying
    chunks of data to be written in the XLOG record, plus optional shared-buffer
    IDs for chunks that are in shared buffers rather than temporary variables.
    The "rdata" array must mention (at least once) each of the shared buffers
    being modified, unless the action is such that the WAL replay routine can
    reconstruct the entire page contents.  XLogInsert includes the logic that
    tests to see whether a shared buffer has been modified since the last
    checkpoint.  If not, the entire page contents are logged rather than just the
    portion(s) pointed to by "rdata".
    
    Because XLogInsert drops the rdata components associated with buffers it
    chooses to log in full, the WAL replay routines normally need to test to see
    which buffers were handled that way --- otherwise they may be misled about
    what the XLOG record actually contains.  XLOG records that describe multi-page
    changes therefore require some care to design: you must be certain that you
    know what data is indicated by each "BKP" bit.  An example of the trickiness
    is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
    page and BKP(2) is associated with the destination page --- but if these are
    the same page, only BKP(1) would have been set.
    
    For this reason as well as the risk of deadlocking on buffer locks, it's best
    to design WAL records so that they reflect small atomic actions involving just
    one or a few pages.  The current XLOG infrastructure cannot handle WAL records
    involving references to more than three shared buffers, anyway.
    
    In the case where the WAL record contains enough information to re-generate
    the entire contents of a page, do *not* show that page's buffer ID in the
    rdata array, even if some of the rdata items point into the buffer.  This is
    because you don't want XLogInsert to log the whole page contents.  The
    standard replay-routine pattern for this case is
    
    	reln = XLogOpenRelation(rnode);
    	buffer = XLogReadBuffer(reln, blkno, true);
    	Assert(BufferIsValid(buffer));
    	page = (Page) BufferGetPage(buffer);
    
    	... initialize the page ...
    
    	PageSetLSN(page, lsn);
    	PageSetTLI(page, ThisTimeLineID);
    	MarkBufferDirty(buffer);
    	UnlockReleaseBuffer(buffer);
    
    In the case where the WAL record provides only enough information to
    incrementally update the page, the rdata array *must* mention the buffer
    ID at least once; otherwise there is no defense against torn-page problems.
    The standard replay-routine pattern for this case is
    
    	if (record->xl_info & XLR_BKP_BLOCK_n)
    		<< do nothing, page was rewritten from logged copy >>;
    
    	reln = XLogOpenRelation(rnode);
    	buffer = XLogReadBuffer(reln, blkno, false);
    	if (!BufferIsValid(buffer))
    		<< do nothing, page has been deleted >>;
    	page = (Page) BufferGetPage(buffer);
    
    	if (XLByteLE(lsn, PageGetLSN(page)))
    	{
    		/* changes are already applied */
    		UnlockReleaseBuffer(buffer);
    		return;
    	}
    
    	... apply the change ...
    
    	PageSetLSN(page, lsn);
    	PageSetTLI(page, ThisTimeLineID);
    	MarkBufferDirty(buffer);
    	UnlockReleaseBuffer(buffer);
    
    As noted above, for a multi-page update you need to be able to determine
    which XLR_BKP_BLOCK_n flag applies to each page.  If a WAL record reflects
    a combination of fully-rewritable and incremental updates, then the rewritable
    pages don't count for the XLR_BKP_BLOCK_n numbering.  (XLR_BKP_BLOCK_n is
    associated with the n'th distinct buffer ID seen in the "rdata" array, and
    per the above discussion, fully-rewritable buffers shouldn't be mentioned in
    "rdata".)
    
    Due to all these constraints, complex changes (such as a multilevel index
    insertion) normally need to be described by a series of atomic-action WAL
    records.  What do you do if the intermediate states are not self-consistent?
    The answer is that the WAL replay logic has to be able to fix things up.
    In btree indexes, for example, a page split requires insertion of a new key in
    the parent btree level, but for locking reasons this has to be reflected by
    two separate WAL records.  The replay code has to remember "unfinished" split
    operations, and match them up to subsequent insertions in the parent level.
    If no matching insert has been found by the time the WAL replay ends, the
    replay code has to do the insertion on its own to restore the index to
    consistency.  Such insertions occur after WAL is operational, so they can
    and should write WAL records for the additional generated actions.
    
    
    Asynchronous Commit
    -------------------
    
    As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
    we don't wait while the WAL record for the commit is fsync'ed.
    We perform an asynchronous commit when synchronous_commit = off.  Instead
    of performing an XLogFlush() up to the LSN of the commit, we merely note
    the LSN in shared memory.  The backend then continues with other work.
    We record the LSN only for an asynchronous commit, not an abort; there's
    never any need to flush an abort record, since the presumption after a
    crash would be that the transaction aborted anyway.
    
    We always force synchronous commit when the transaction is deleting
    relations, to ensure the commit record is down to disk before the relations
    are removed from the filesystem.  Also, certain utility commands that have
    non-roll-backable side effects (such as filesystem changes) force sync
    commit to minimize the window in which the filesystem change has been made
    but the transaction isn't guaranteed committed.
    
    Every wal_writer_delay milliseconds, the walwriter process performs an
    XLogBackgroundFlush().  This checks the location of the last completely
    filled WAL page.  If that has moved forwards, then we write all the changed
    buffers up to that point, so that under full load we write only whole
    buffers.  If there has been a break in activity and the current WAL page is
    the same as before, then we find out the LSN of the most recent
    asynchronous commit, and flush up to that point, if required (i.e.,
    if it's in the current WAL page).  This arrangement in itself would
    guarantee that an async commit record reaches disk during at worst the
    second walwriter cycle after the transaction completes.  However, we also
    allow XLogFlush to flush full buffers "flexibly" (ie, not wrapping around
    at the end of the circular WAL buffer area), so as to minimize the number
    of writes issued under high load when multiple WAL pages are filled per
    walwriter cycle.  This makes the worst-case delay three walwriter cycles.
    
    There are some other subtle points to consider with asynchronous commits.
    First, for each page of CLOG we must remember the LSN of the latest commit
    affecting the page, so that we can enforce the same flush-WAL-before-write
    rule that we do for ordinary relation pages.  Otherwise the record of the
    commit might reach disk before the WAL record does.  Again, abort records
    need not factor into this consideration.
    
    In fact, we store more than one LSN for each clog page.  This relates to
    the way we set transaction status hint bits during visibility tests.
    We must not set a transaction-committed hint bit on a relation page and
    have that record make it to disk prior to the WAL record of the commit.
    Since visibility tests are normally made while holding buffer share locks,
    we do not have the option of changing the page's LSN to guarantee WAL
    synchronization.  Instead, we defer the setting of the hint bit if we have
    not yet flushed WAL as far as the LSN associated with the transaction.
    This requires tracking the LSN of each unflushed async commit.  It is
    convenient to associate this data with clog buffers: because we will flush
    WAL before writing a clog page, we know that we do not need to remember a
    transaction's LSN longer than the clog page holding its commit status
    remains in memory.  However, the naive approach of storing an LSN for each
    clog position is unattractive: the LSNs are 32x bigger than the two-bit
    commit status fields, and so we'd need 256K of additional shared memory for
    each 8K clog buffer page.  We choose instead to store a smaller number of
    LSNs per page, where each LSN is the highest LSN associated with any
    transaction commit in a contiguous range of transaction IDs on that page.
    This saves storage at the price of some possibly-unnecessary delay in
    setting transaction hint bits.
    
    How many transactions should share the same cached LSN (N)?  If the
    system's workload consists only of small async-commit transactions, then
    it's reasonable to have N similar to the number of transactions per
    walwriter cycle, since that is the granularity with which transactions will
    become truly committed (and thus hintable) anyway.  The worst case is where
    a sync-commit xact shares a cached LSN with an async-commit xact that
    commits a bit later; even though we paid to sync the first xact to disk,
    we won't be able to hint its outputs until the second xact is sync'd, up to
    three walwriter cycles later.  This argues for keeping N (the group size)
    as small as possible.  For the moment we are setting the group size to 32,
    which makes the LSN cache space the same size as the actual clog buffer
    space (independently of BLCKSZ).
    
    It is useful that we can run both synchronous and asynchronous commit
    transactions concurrently, but the safety of this is perhaps not
    immediately obvious.  Assume we have two transactions, T1 and T2.  The Log
    Sequence Number (LSN) is the point in the WAL sequence where a transaction
    commit is recorded, so LSN1 and LSN2 are the commit records of those
    transactions.  If T2 can see changes made by T1 then when T2 commits it
    must be true that LSN2 follows LSN1.  Thus when T2 commits it is certain
    that all of the changes made by T1 are also now recorded in the WAL.  This
    is true whether T1 was asynchronous or synchronous.  As a result, it is
    safe for asynchronous commits and synchronous commits to work concurrently
    without endangering data written by synchronous commits.  Sub-transactions
    are not important here since the final write to disk only occurs at the
    commit of the top level transaction.
    
    Changes to data blocks cannot reach disk unless WAL is flushed up to the
    point of the LSN of the data blocks.  Any attempt to write unsafe data to
    disk will trigger a write which ensures the safety of all data written by
    that and prior transactions.  Data blocks and clog pages are both protected
    by LSNs.
    
    Changes to a temp table are not WAL-logged, hence could reach disk in
    advance of T1's commit, but we don't care since temp table contents don't
    survive crashes anyway.
    
    Database writes made via any of the paths we have introduced to avoid WAL
    overhead for bulk updates are also safe.  In these cases it's entirely
    possible for the data to reach disk before T1's commit, because T1 will
    fsync it down to disk without any sort of interlock, as soon as it finishes
    the bulk update.  However, all these paths are designed to write data that
    no other transaction can see until after T1 commits.  The situation is thus
    not different from ordinary WAL-logged updates.