diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index a204ad4af08041caabb70aead54aec07d5c5ddc6..9ae596ab23f46dcac6383f2bf60e18ae14871dbe 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1,68 +1,175 @@
-$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $
+$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
 
 This directory contains a correct implementation of Lehman and Yao's
-btree management algorithm that supports concurrent access for Postgres.
+high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
+Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
+on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
+
 We have made the following changes in order to incorporate their algorithm
 into Postgres:
 
-	+  The requirement that all btree keys be unique is too onerous,
-	   but the algorithm won't work correctly without it.  As a result,
-	   this implementation adds an OID (guaranteed to be unique) to
-	   every key in the index.  This guarantees uniqueness within a set
-	   of duplicates.  Space overhead is four bytes.
-
-	   For this reason, when we're passed an index tuple to store by the
-	   common access method code, we allocate a larger one and copy the
-	   supplied tuple into it.  No Postgres code outside of the btree
-	   access method knows about this xid or sequence number.
-
-	+  Lehman and Yao don't require read locks, but assume that in-
-	   memory copies of tree nodes are unshared.  Postgres shares
-	   in-memory buffers among backends.  As a result, we do page-
-	   level read locking on btree nodes in order to guarantee that
-	   no record is modified while we are examining it.  This reduces
-	   concurrency but guaranteees correct behavior.
-
-	+  Read locks on a page are held for as long as a scan has a pointer
-	   to the page.  However, locks are always surrendered before the
-	   sibling page lock is acquired (for readers), so we remain deadlock-
-	   free.  I will do a formal proof if I get bored anytime soon.
++  The requirement that all btree keys be unique is too onerous,
+   but the algorithm won't work correctly without it.  Fortunately, it is
+   only necessary that keys be unique on a single tree level, because L&Y
+   only use the assumption of key uniqueness when re-finding a key in a
+   parent node (to determine where to insert the key for a split page).
+   Therefore, we can use the link field to disambiguate multiple
+   occurrences of the same user key: only one entry in the parent level
+   will be pointing at the page we had split.  (Indeed we need not look at
+   the real "key" at all, just at the link field.)  We can distinguish
+   items at the leaf level in the same way, by examining their links to
+   heap tuples; we'd never have two items for the same heap tuple.
+
++  Lehman and Yao assume that the key range for a subtree S is described
+   by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
+   node.  This does not work for nonunique keys (for example, if we have
+   enough equal keys to spread across several leaf pages, there *must* be
+   some equal bounding keys in the first level up).  Therefore we assume
+   Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
+   bounding key in an upper tree level must descend to the left of that
+   key to ensure it finds any equal keys in the preceding page.  An
+   insertion that sees the high key of its target page is equal to the key
+   to be inserted has a choice whether or not to move right, since the new
+   key could go on either page.  (Currently, we try to find a page where
+   there is room for the new key without a split.)
+
++  Lehman and Yao don't require read locks, but assume that in-memory
+   copies of tree nodes are unshared.  Postgres shares in-memory buffers
+   among backends.  As a result, we do page-level read locking on btree
+   nodes in order to guarantee that no record is modified while we are
+   examining it.  This reduces concurrency but guarantees correct
+   behavior.  An advantage is that when trading in a read lock for a
+   write lock, we need not re-read the page after getting the write lock.
+   Since we're also holding a pin on the shared buffer containing the
+   page, we know that buffer still contains the page and is up-to-date.
+
++  We support the notion of an ordered "scan" of an index as well as
+   insertions, deletions, and simple lookups.  A scan in the forward
+   direction is no problem: we just use the right-sibling pointers that
+   L&Y require anyway.  (Thus, once we have descended the tree to the
+   correct start point for the scan, the scan looks only at leaf pages
+   and never at higher tree levels.)  To support scans in the backward
+   direction, we also store a "left sibling" link much like the "right
+   sibling".  (This adds an extra step to the L&Y split algorithm: while
+   holding the write lock on the page being split, we also lock its former
+   right sibling to update that page's left-link.  This is safe since no
+   writer of that page can be interested in acquiring a write lock on our
+   page.)  A backwards scan has one additional bit of complexity: after
+   following the left-link we must account for the possibility that the
+   left sibling page got split before we could read it.  So, we have to
+   move right until we find a page whose right-link matches the page we
+   came from.
+
++  Read locks on a page are held for as long as a scan has a pointer
+   to the page.  However, locks are always surrendered before the
+   sibling page lock is acquired (for readers), so we remain deadlock-
+   free.  I will do a formal proof if I get bored anytime soon.
+   NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
+   on the current page of a scan before control leaves nbtree.  When we
+   come back to resume the scan, we have to re-grab the read lock and
+   then move right if the current item moved (see _bt_restscan()).
+
++  Lehman and Yao fail to discuss what must happen when the root page
+   becomes full and must be split.  Our implementation is to split the
+   root in the same way that any other page would be split, then construct
+   a new root page holding pointers to both of the resulting pages (which
+   now become siblings on level 2 of the tree).  The new root page is then
+   installed by altering the root pointer in the meta-data page (see
+   below).  This works because the root is not treated specially in any
+   other way --- in particular, searches will move right using its link
+   pointer if the link is set.  Therefore, searches will find the data
+   that's been moved into the right sibling even if they read the metadata
+   page before it got updated.  This is the same reasoning that makes a
+   split of a non-root page safe.  The locking considerations are similar too.
+
++  Lehman and Yao assume fixed-size keys, but we must deal with
+   variable-size keys.  Therefore there is not a fixed maximum number of
+   keys per page; we just stuff in as many as will fit.  When we split a
+   page, we try to equalize the number of bytes, not items, assigned to
+   each of the resulting pages.  Note we must include the incoming item in
+   this calculation, otherwise it is possible to find that the incoming
+   item doesn't fit on the split page where it needs to go!
 
 In addition, the following things are handy to know:
 
-	+  Page zero of every btree is a meta-data page.  This page stores
-	   the location of the root page, a pointer to a list of free
-	   pages, and other stuff that's handy to know.
-
-	+  This algorithm doesn't really work, since it requires ordered
-	   writes, and UNIX doesn't support ordered writes.
-
-	+  There's one other case where we may screw up in this
-	   implementation.  When we start a scan, we descend the tree
-	   to the key nearest the one in the qual, and once we get there,
-	   position ourselves correctly for the qual type (eg, <, >=, etc).
-	   If we happen to step off a page, decide we want to get back to
-	   it, and fetch the page again, and if some bad person has split
-	   the page and moved the last tuple we saw off of it, then the
-	   code complains about botched concurrency in an elog(WARN, ...)
-	   and gives up the ghost.  This is the ONLY violation of Lehman
-	   and Yao's guarantee of correct behavior that I am aware of in
-	   this code.
++  Page zero of every btree is a meta-data page.  This page stores
+   the location of the root page, a pointer to a list of free
+   pages, and other stuff that's handy to know.  (Currently, we
+   never shrink btree indexes so there are never any free pages.)
+
++  The algorithm assumes we can fit at least three items per page
+   (a "high key" and two real data items).  Therefore it's unsafe
+   to accept items larger than 1/3 of the page size.  Larger items would
+   work sometimes, but could cause failures later on depending on
+   what else gets put on their page.
+
++  This algorithm doesn't guarantee btree consistency after a kernel crash
+   or hardware failure.  To do that, we'd need ordered writes, and UNIX
+   doesn't support ordered writes (short of fsync'ing every update, which
+   is too high a price).  Rebuilding corrupted indexes during restart
+   seems more attractive.
+
++  On deletions, we need to adjust the position of active scans on
+   the index.  The code in nbtscan.c handles this.  We don't need to
+   do this for insertions or splits because _bt_restscan can find the
+   new position of the previously-found item.  NOTE that nbtscan.c
+   only copes with deletions issued by the current backend.  This
+   essentially means that concurrent deletions are not supported, but
+   that's true already in the Lehman and Yao algorithm.  nbtscan.c
+   exists only to support VACUUM and allow it to delete items while
+   it's scanning the index.
+
+Notes about data representation:
+
++  The right-sibling link required by L&Y is kept in the page "opaque
+   data" area, as is the left-sibling link and some flags.
+
++  We also keep a parent link in the opaque data, but this link is not
+   very trustworthy because it is not updated when the parent page splits.
+   Thus, it points to some page on the parent level, but possibly a page
+   well to the left of the page's actual current parent.  In most cases
+   we do not need this link at all.  Normally we return to a parent page
+   using a stack of entries that are made as we descend the tree, as in L&Y.
+   There is exactly one case where the stack will not help: concurrent
+   root splits.  If an inserter process needs to split what had been the
+   root when it started its descent, but finds that that page is no longer
+   the root (because someone else split it meanwhile), then it uses the
+   parent link to move up to the next level.  This is OK because we do fix
+   the parent link in a former root page when splitting it.  This logic
+   will work even if the root is split multiple times (even up to creation
+   of multiple new levels) before an inserter returns to it.  The same
+   could not be said of finding the new root via the metapage, since that
+   would work only for a single level of added root.
+
++  The Postgres disk block data format (an array of items) doesn't fit
+   Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
+   so we have to play some games.
+
++  On a page that is not rightmost in its tree level, the "high key" is
+   kept in the page's first item, and real data items start at item 2.
+   The link portion of the "high key" item goes unused.  A page that is
+   rightmost has no "high key", so data items start with the first item.
+   Putting the high key at the left, rather than the right, may seem odd,
+   but it avoids moving the high key as we add data items.
+
++  On a leaf page, the data items are simply links to (TIDs of) tuples
+   in the relation being indexed, with the associated key values.
+
++  On a non-leaf page, the data items are down-links to child pages with
+   bounding keys.  The key in each data item is the *lower* bound for
+   keys on that child page, so logically the key is to the left of that
+   downlink.  The high key (if present) is the upper bound for the last
+   downlink.  The first data item on each such page has no lower bound
+   --- or lower bound of minus infinity, if you prefer.  The comparison
+   routines must treat it accordingly.  The actual key stored in the
+   item is irrelevant, and need not be stored at all.  This arrangement
+   corresponds to the fact that an L&Y non-leaf page has one more pointer
+   than key.
 
 Notes to operator class implementors:
 
-	With this implementation, we require the user to supply us with
-	a procedure for pg_amproc.  This procedure should take two keys
-	A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
-	respectively.  See the contents of that relation for the btree
-	access method for some samples.
-
-Notes to mao for implementation document:
-
-	On deletions, we need to adjust the position of active scans on
-	the index.  The code in nbtscan.c handles this.  We don't need to
-	do this for splits because of the way splits are handled; if they
-	happen behind us, we'll automatically go to the next page, and if
-	they happen in front of us, we're not affected by them.  For
-	insertions, if we inserted a tuple behind the current scan location
-	on the current scan page, we move one space ahead.
++  With this implementation, we require the user to supply us with
+   a procedure for pg_amproc.  This procedure should take two keys
+   A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
+   respectively.  See the contents of that relation for the btree
+   access method for some samples.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7d65c63dc805d7ae1dee3fa8999f43d9bb5524e3..6be8e97b50851298ab77a06e6de4dc6fc3368175 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.59 2000/06/08 22:36:52 momjian Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.60 2000/07/21 06:42:32 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,53 +19,76 @@
 #include "access/nbtree.h"
 
 
-static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf, BTStack stack, int keysz, ScanKey scankey, BTItem btitem, BTItem afteritem);
-static Buffer _bt_split(Relation rel, Size keysz, ScanKey scankey,
-		  Buffer buf, OffsetNumber firstright);
-static OffsetNumber _bt_findsplitloc(Relation rel, Size keysz, ScanKey scankey,
-				 Page page, OffsetNumber start,
-				 OffsetNumber maxoff, Size llimit);
+typedef struct
+{
+	/* context data for _bt_checksplitloc */
+	Size	newitemsz;			/* size of new item to be inserted */
+	bool	non_leaf;			/* T if splitting an internal node */
+
+	bool	have_split;			/* found a valid split? */
+
+	/* these fields valid only if have_split is true */
+	bool	newitemonleft;		/* new item on left or right of best split */
+	OffsetNumber firstright;	/* best split point */
+	int		best_delta;			/* best size delta so far */
+} FindSplitData;
+
+
+static TransactionId _bt_check_unique(Relation rel, BTItem btitem,
+									  Relation heapRel, Buffer buf,
+									  ScanKey itup_scankey);
+static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf,
+										BTStack stack,
+										int keysz, ScanKey scankey,
+										BTItem btitem,
+										OffsetNumber afteritem);
+static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+						OffsetNumber newitemoff, Size newitemsz,
+						BTItem newitem, bool newitemonleft,
+						OffsetNumber *itup_off, BlockNumber *itup_blkno);
+static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+									 OffsetNumber newitemoff,
+									 Size newitemsz,
+									 bool *newitemonleft);
+static void _bt_checksplitloc(FindSplitData *state, OffsetNumber firstright,
+							  int leftfree, int rightfree,
+							  bool newitemonleft, Size firstrightitemsz);
+static Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 static void _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
-static OffsetNumber _bt_pgaddtup(Relation rel, Buffer buf, int keysz, ScanKey itup_scankey, Size itemsize, BTItem btitem, BTItem afteritem);
-static bool _bt_goesonpg(Relation rel, Buffer buf, Size keysz, ScanKey scankey, BTItem afteritem);
-static void _bt_updateitem(Relation rel, Size keysz, Buffer buf, BTItem oldItem, BTItem newItem);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, int keysz, ScanKey scankey);
-static int32 _bt_tuplecompare(Relation rel, Size keysz, ScanKey scankey,
-				 IndexTuple tuple1, IndexTuple tuple2);
+static void _bt_pgaddtup(Relation rel, Page page,
+						 Size itemsize, BTItem btitem,
+						 OffsetNumber itup_off, const char *where);
+static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
+						int keysz, ScanKey scankey);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single btitem in the tree.
  *
  *		This routine is called by the public interface routines, btbuild
- *		and btinsert.  By here, btitem is filled in, and has a unique
- *		(xid, seqno) pair.
+ *		and btinsert.  By here, btitem is filled in, including the TID.
  */
 InsertIndexResult
-_bt_doinsert(Relation rel, BTItem btitem, bool index_is_unique, Relation heapRel)
+_bt_doinsert(Relation rel, BTItem btitem,
+			 bool index_is_unique, Relation heapRel)
 {
+	IndexTuple	itup = &(btitem->bti_itup);
+	int			natts = rel->rd_rel->relnatts;
 	ScanKey		itup_scankey;
-	IndexTuple	itup;
 	BTStack		stack;
 	Buffer		buf;
-	BlockNumber blkno;
-	int			natts = rel->rd_rel->relnatts;
 	InsertIndexResult res;
-	Buffer		buffer;
-
-	itup = &(btitem->bti_itup);
 
 	/* we need a scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
 
+top:
 	/* find the page containing this key */
-	stack = _bt_search(rel, natts, itup_scankey, &buf);
+	stack = _bt_search(rel, natts, itup_scankey, &buf, BT_WRITE);
 
 	/* trade in our read lock for a write lock */
 	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	LockBuffer(buf, BT_WRITE);
 
-l1:
-
 	/*
 	 * If the page was split between the time that we surrendered our read
 	 * lock and acquired our write lock, then this page may no longer be
@@ -73,176 +96,212 @@ l1:
 	 * need to move right in the tree.	See Lehman and Yao for an
 	 * excruciatingly precise description.
 	 */
-
 	buf = _bt_moveright(rel, buf, natts, itup_scankey, BT_WRITE);
-	blkno = BufferGetBlockNumber(buf);
 
-	/* if we're not allowing duplicates, make sure the key isn't */
-	/* already in the node */
+	/*
+	 * If we're not allowing duplicates, make sure the key isn't
+	 * already in the index.  XXX this belongs somewhere else, likely
+	 */
 	if (index_is_unique)
 	{
-		OffsetNumber offset,
-					maxoff;
-		Page		page;
+		TransactionId xwait;
 
-		page = BufferGetPage(buf);
-		maxoff = PageGetMaxOffsetNumber(page);
+		xwait = _bt_check_unique(rel, btitem, heapRel, buf, itup_scankey);
+
+		if (TransactionIdIsValid(xwait))
+		{
+			/* Have to wait for the other guy ... */
+			_bt_relbuf(rel, buf, BT_WRITE);
+			XactLockTableWait(xwait);
+			/* start over... */
+			_bt_freestack(stack);
+			goto top;
+		}
+	}
+
+	/* do the insertion */
+	res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey, btitem, 0);
+
+	/* be tidy */
+	_bt_freestack(stack);
+	_bt_freeskey(itup_scankey);
+
+	return res;
+}
+
+/*
+ *	_bt_check_unique() -- Check for violation of unique index constraint
+ *
+ * Returns NullTransactionId if there is no conflict, else an xact ID we
+ * must wait for to see if it commits a conflicting tuple.  If an actual
+ * conflict is detected, no return --- just elog().
+ */
+static TransactionId
+_bt_check_unique(Relation rel, BTItem btitem, Relation heapRel,
+				 Buffer buf, ScanKey itup_scankey)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			natts = rel->rd_rel->relnatts;
+	OffsetNumber offset,
+				maxoff;
+	Page		page;
+	BTPageOpaque opaque;
+	Buffer		nbuf = InvalidBuffer;
+	bool		chtup = true;
+
+	page = BufferGetPage(buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Find first item >= proposed new item.  Note we could also get
+	 * a pointer to end-of-page here.
+	 */
+	offset = _bt_binsrch(rel, buf, natts, itup_scankey);
 
-		offset = _bt_binsrch(rel, buf, natts, itup_scankey, BT_DESCENT);
+	/*
+	 * Scan over all equal tuples, looking for live conflicts.
+	 */
+	for (;;)
+	{
+		HeapTupleData htup;
+		Buffer		buffer;
+		BTItem		cbti;
+		BlockNumber nblkno;
 
-		/* make sure the offset we're given points to an actual */
-		/* key on the page before trying to compare it */
-		if (!PageIsEmpty(page) && offset <= maxoff)
+		/*
+		 * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this is
+		 * how we handle NULLs - and so we must not use _bt_compare
+		 * for real comparisons, but only for ordering/finding items on
+		 * pages. - vadim 03/24/97
+		 *
+		 * make sure the offset points to an actual key
+		 * before trying to compare it...
+		 */
+		if (offset <= maxoff)
 		{
-			TupleDesc	itupdesc;
-			BTItem		cbti;
-			HeapTupleData htup;
-			BTPageOpaque opaque;
-			Buffer		nbuf;
-			BlockNumber nblkno;
-			bool		chtup = true;
-
-			itupdesc = RelationGetDescr(rel);
-			nbuf = InvalidBuffer;
-			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			if (! _bt_isequal(itupdesc, page, offset, natts, itup_scankey))
+				break;			/* we're past all the equal tuples */
 
 			/*
-			 * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's
-			 * how we handling NULLs - and so we must not use _bt_compare
-			 * in real comparison, but only for ordering/finding items on
-			 * pages. - vadim 03/24/97
-			 *
-			 * while ( !_bt_compare (rel, itupdesc, page, natts,
-			 * itup_scankey, offset) )
+			 * Have to check whether the heap tuple being inserted is a
+			 * deleted one (i.e. just moved to another place by vacuum)!
+			 * We need to do this only once, but don't want to do it at
+			 * all unless we see equal tuples, to keep the unequal case fast.
 			 */
-			while (_bt_isequal(itupdesc, page, offset, natts, itup_scankey))
-			{					/* they're equal */
-
-				/*
-				 * Have to check is inserted heap tuple deleted one (i.e.
-				 * just moved to another place by vacuum)!
-				 */
-				if (chtup)
-				{
-					htup.t_self = btitem->bti_itup.t_tid;
-					heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
-					if (htup.t_data == NULL)	/* YES! */
-						break;
-					/* Live tuple was inserted */
-					ReleaseBuffer(buffer);
-					chtup = false;
-				}
-				cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
-				htup.t_self = cbti->bti_itup.t_tid;
+			if (chtup)
+			{
+				htup.t_self = btitem->bti_itup.t_tid;
 				heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
-				if (htup.t_data != NULL)		/* it is a duplicate */
-				{
-					TransactionId xwait =
+				if (htup.t_data == NULL)		/* YES! */
+					break;
+				/* Live tuple is being inserted, so continue checking */
+				ReleaseBuffer(buffer);
+				chtup = false;
+			}
+
+			cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
+			htup.t_self = cbti->bti_itup.t_tid;
+			heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
+			if (htup.t_data != NULL)		/* it is a duplicate */
+			{
+				TransactionId xwait =
 					(TransactionIdIsValid(SnapshotDirty->xmin)) ?
 					SnapshotDirty->xmin : SnapshotDirty->xmax;
 
-					/*
-					 * If this tuple is being updated by other transaction
-					 * then we have to wait for its commit/abort.
-					 */
-					ReleaseBuffer(buffer);
-					if (TransactionIdIsValid(xwait))
-					{
-						if (nbuf != InvalidBuffer)
-							_bt_relbuf(rel, nbuf, BT_READ);
-						_bt_relbuf(rel, buf, BT_WRITE);
-						XactLockTableWait(xwait);
-						buf = _bt_getbuf(rel, blkno, BT_WRITE);
-						goto l1;/* continue from the begin */
-					}
-					elog(ERROR, "Cannot insert a duplicate key into unique index %s", RelationGetRelationName(rel));
-				}
-				/* htup null so no buffer to release */
-				/* get next offnum */
-				if (offset < maxoff)
-					offset = OffsetNumberNext(offset);
-				else
-				{				/* move right ? */
-					if (P_RIGHTMOST(opaque))
-						break;
-					if (!_bt_isequal(itupdesc, page, P_HIKEY,
-									 natts, itup_scankey))
-						break;
-
-					/*
-					 * min key of the right page is the same, ooh - so
-					 * many dead duplicates...
-					 */
-					nblkno = opaque->btpo_next;
+				/*
+				 * If this tuple is being updated by another transaction
+				 * then we have to wait for its commit/abort.
+				 */
+				ReleaseBuffer(buffer);
+				if (TransactionIdIsValid(xwait))
+				{
 					if (nbuf != InvalidBuffer)
 						_bt_relbuf(rel, nbuf, BT_READ);
-					for (nbuf = InvalidBuffer;;)
-					{
-						nbuf = _bt_getbuf(rel, nblkno, BT_READ);
-						page = BufferGetPage(nbuf);
-						maxoff = PageGetMaxOffsetNumber(page);
-						opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-						offset = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-						if (!PageIsEmpty(page) && offset <= maxoff)
-						{		/* Found some key */
-							break;
-						}
-						else
-						{		/* Empty or "pseudo"-empty page - get next */
-							nblkno = opaque->btpo_next;
-							_bt_relbuf(rel, nbuf, BT_READ);
-							nbuf = InvalidBuffer;
-							if (nblkno == P_NONE)
-								break;
-						}
-					}
-					if (nbuf == InvalidBuffer)
-						break;
+					/* Tell _bt_doinsert to wait... */
+					return xwait;
 				}
+				/*
+				 * Otherwise we have a definite conflict.
+				 */
+				elog(ERROR, "Cannot insert a duplicate key into unique index %s",
+					 RelationGetRelationName(rel));
 			}
+			/* htup null so no buffer to release */
+		}
+
+		/*
+		 * Advance to next tuple to continue checking.
+		 */
+		if (offset < maxoff)
+			offset = OffsetNumberNext(offset);
+		else
+		{
+			/* If scankey == hikey we gotta check the next page too */
+			if (P_RIGHTMOST(opaque))
+				break;
+			if (!_bt_isequal(itupdesc, page, P_HIKEY,
+							 natts, itup_scankey))
+				break;
+			nblkno = opaque->btpo_next;
 			if (nbuf != InvalidBuffer)
 				_bt_relbuf(rel, nbuf, BT_READ);
+			nbuf = _bt_getbuf(rel, nblkno, BT_READ);
+			page = BufferGetPage(nbuf);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			maxoff = PageGetMaxOffsetNumber(page);
+			offset = P_FIRSTDATAKEY(opaque);
 		}
 	}
 
-	/* do the insertion */
-	res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey,
-						 btitem, (BTItem) NULL);
+	if (nbuf != InvalidBuffer)
+		_bt_relbuf(rel, nbuf, BT_READ);
 
-	/* be tidy */
-	_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
-
-	return res;
+	return NullTransactionId;
 }
 
-/*
+/*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page.
- *			+  finds the right place to insert the tuple (taking into
- *			   account any changes induced by a split).
+ *			+  finds the right place to insert the tuple.
+ *			+  if necessary, splits the target page (making sure that the
+ *			   split is equitable as far as post-insert free space goes).
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
- *			+  invoking itself with the appropriate tuple for the right
+ *			+  invokes itself with the appropriate tuple for the right
  *			   child page on the parent.
  *
  *		On entry, we must have the right buffer on which to do the
  *		insertion, and the buffer must be pinned and locked.  On return,
  *		we will have dropped both the pin and the write lock on the buffer.
  *
+ *		If 'afteritem' is >0 then the new tuple must be inserted after the
+ *		existing item of that number, noplace else.  If 'afteritem' is 0
+ *		then the procedure finds the exact spot to insert it by searching.
+ *		(keysz and scankey parameters are used ONLY if afteritem == 0.)
+ *
+ *		NOTE: if the new key is equal to one or more existing keys, we can
+ *		legitimately place it anywhere in the series of equal keys --- in fact,
+ *		if the new key is equal to the page's "high key" we can place it on
+ *		the next page.  If it is equal to the high key, and there's not room
+ *		to insert the new tuple on the current page without splitting, then
+ *		we move right hoping to find more free space and avoid a split.
+ *		Ordinarily, though, we'll insert it before the existing equal keys
+ *		because of the way _bt_binsrch() works.
+ *
  *		The locking interactions in this code are critical.  You should
  *		grok Lehman and Yao's paper before making any changes.  In addition,
  *		you need to understand how we disambiguate duplicate keys in this
  *		implementation, in order to be able to find our location using
  *		L&Y "move right" operations.  Since we may insert duplicate user
- *		keys, and since these dups may propogate up the tree, we use the
+ *		keys, and since these dups may propagate up the tree, we use the
  *		'afteritem' parameter to position ourselves correctly for the
  *		insertion on internal pages.
+ *----------
  */
 static InsertIndexResult
 _bt_insertonpg(Relation rel,
@@ -251,17 +310,16 @@ _bt_insertonpg(Relation rel,
 			   int keysz,
 			   ScanKey scankey,
 			   BTItem btitem,
-			   BTItem afteritem)
+			   OffsetNumber afteritem)
 {
 	InsertIndexResult res;
 	Page		page;
 	BTPageOpaque lpageop;
-	BlockNumber itup_blkno;
 	OffsetNumber itup_off;
+	BlockNumber itup_blkno;
+	OffsetNumber newitemoff;
 	OffsetNumber firstright = InvalidOffsetNumber;
 	Size		itemsz;
-	bool		do_split = false;
-	bool		keys_equal = false;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -285,355 +343,117 @@ _bt_insertonpg(Relation rel,
 			 (PageGetPageSize(page) - sizeof(PageHeaderData) - MAXALIGN(sizeof(BTPageOpaqueData))) /3 - sizeof(ItemIdData));
 
 	/*
-	 * If we have to insert item on the leftmost page which is the first
-	 * page in the chain of duplicates then: 1. if scankey == hikey (i.e.
-	 * - new duplicate item) then insert it here; 2. if scankey < hikey
-	 * then: 2.a if there is duplicate key(s) here - we force splitting;
-	 * 2.b else - we may "eat" this page from duplicates chain.
+	 * Determine exactly where new item will go.
 	 */
-	if (lpageop->btpo_flags & BTP_CHAIN)
+	if (afteritem > 0)
 	{
-		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
-		ItemId		hitemid;
-		BTItem		hitem;
-
-		Assert(!P_RIGHTMOST(lpageop));
-		hitemid = PageGetItemId(page, P_HIKEY);
-		hitem = (BTItem) PageGetItem(page, hitemid);
-		if (maxoff > P_HIKEY &&
-			!_bt_itemcmp(rel, keysz, scankey, hitem,
-			 (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY)),
-						 BTEqualStrategyNumber))
-			elog(FATAL, "btree: bad key on the page in the chain of duplicates");
-
-		if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
-						 BTEqualStrategyNumber))
-		{
-			if (!P_LEFTMOST(lpageop))
-				elog(FATAL, "btree: attempt to insert bad key on the non-leftmost page in the chain of duplicates");
-			if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
-							 BTLessStrategyNumber))
-				elog(FATAL, "btree: attempt to insert higher key on the leftmost page in the chain of duplicates");
-			if (maxoff > P_HIKEY)		/* have duplicate(s) */
-			{
-				firstright = P_FIRSTKEY;
-				do_split = true;
-			}
-			else
-/* "eat" page */
-			{
-				Buffer		pbuf;
-				Page		ppage;
-
-				itup_blkno = BufferGetBlockNumber(buf);
-				itup_off = PageAddItem(page, (Item) btitem, itemsz,
-									   P_FIRSTKEY, LP_USED);
-				if (itup_off == InvalidOffsetNumber)
-					elog(FATAL, "btree: failed to add item");
-				lpageop->btpo_flags &= ~BTP_CHAIN;
-				pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-				ppage = BufferGetPage(pbuf);
-				PageIndexTupleDelete(ppage, stack->bts_offset);
-				pfree(stack->bts_btitem);
-				stack->bts_btitem = _bt_formitem(&(btitem->bti_itup));
-				ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-							   itup_blkno, P_HIKEY);
-				_bt_wrtbuf(rel, buf);
-				res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
-									 keysz, scankey, stack->bts_btitem,
-									 NULL);
-				ItemPointerSet(&(res->pointerData), itup_blkno, itup_off);
-				return res;
-			}
-		}
-		else
-		{
-			keys_equal = true;
-			if (PageGetFreeSpace(page) < itemsz)
-				do_split = true;
-		}
+		newitemoff = afteritem + 1;
 	}
-	else if (PageGetFreeSpace(page) < itemsz)
-		do_split = true;
-	else if (PageGetFreeSpace(page) < 3 * itemsz + 2 * sizeof(ItemIdData))
-	{
-		OffsetNumber offnum = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY;
-		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
-
-		if (offnum < maxoff)	/* can't split unless at least 2 items... */
-		{
-			ItemId		itid;
-			BTItem		previtem,
-						chkitem;
-			Size		maxsize;
-			Size		currsize;
-
-			/* find largest group of identically-keyed items on page */
-			itid = PageGetItemId(page, offnum);
-			previtem = (BTItem) PageGetItem(page, itid);
-			maxsize = currsize = (ItemIdGetLength(itid) + sizeof(ItemIdData));
-			for (offnum = OffsetNumberNext(offnum);
-				 offnum <= maxoff; offnum = OffsetNumberNext(offnum))
-			{
-				itid = PageGetItemId(page, offnum);
-				chkitem = (BTItem) PageGetItem(page, itid);
-				if (!_bt_itemcmp(rel, keysz, scankey,
-								 previtem, chkitem,
-								 BTEqualStrategyNumber))
-				{
-					if (currsize > maxsize)
-						maxsize = currsize;
-					currsize = 0;
-					previtem = chkitem;
-				}
-				currsize += (ItemIdGetLength(itid) + sizeof(ItemIdData));
-			}
-			if (currsize > maxsize)
-				maxsize = currsize;
-			/* Decide to split if largest group is > 1/2 page size */
-			maxsize += sizeof(PageHeaderData) +
-				MAXALIGN(sizeof(BTPageOpaqueData));
-			if (maxsize >= PageGetPageSize(page) / 2)
-				do_split = true;
-		}
-	}
-
-	if (do_split)
+	else
 	{
-		Buffer		rbuf;
-		Page		rpage;
-		BTItem		ritem;
-		BlockNumber rbknum;
-		BTPageOpaque rpageop;
-		Buffer		pbuf;
-		Page		ppage;
-		BTPageOpaque ppageop;
-		BlockNumber bknum = BufferGetBlockNumber(buf);
-		BTItem		lowLeftItem;
-		OffsetNumber maxoff;
-		bool		shifted = false;
-		bool		left_chained = (lpageop->btpo_flags & BTP_CHAIN) ? true : false;
-		bool		is_root = lpageop->btpo_flags & BTP_ROOT;
-
 		/*
-		 * Instead of splitting leaf page in the chain of duplicates by
-		 * new duplicate, insert it into some right page.
+		 * If we will need to split the page to put the item here,
+		 * check whether we can put the tuple somewhere to the right,
+		 * instead.  Keep scanning until we find enough free space or
+		 * reach the last page where the tuple can legally go.
 		 */
-		if ((lpageop->btpo_flags & BTP_CHAIN) &&
-			(lpageop->btpo_flags & BTP_LEAF) && keys_equal)
+		while (PageGetFreeSpace(page) < itemsz &&
+			   !P_RIGHTMOST(lpageop) &&
+			   _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0)
 		{
-			rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
-			rpage = BufferGetPage(rbuf);
-			rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+			/* step right one page */
+			BlockNumber		rblkno = lpageop->btpo_next;
 
-			/*
-			 * some checks
-			 */
-			if (!P_RIGHTMOST(rpageop))	/* non-rightmost page */
-			{					/* If we have the same hikey here then
-								 * it's yet another page in chain. */
-				if (_bt_skeycmp(rel, keysz, scankey, rpage,
-								PageGetItemId(rpage, P_HIKEY),
-								BTEqualStrategyNumber))
-				{
-					if (!(rpageop->btpo_flags & BTP_CHAIN))
-						elog(FATAL, "btree: lost page in the chain of duplicates");
-				}
-				else if (_bt_skeycmp(rel, keysz, scankey, rpage,
-									 PageGetItemId(rpage, P_HIKEY),
-									 BTGreaterStrategyNumber))
-					elog(FATAL, "btree: hikey is out of order");
-				else if (rpageop->btpo_flags & BTP_CHAIN)
-
-					/*
-					 * If hikey > scankey then it's last page in chain and
-					 * BTP_CHAIN must be OFF
-					 */
-					elog(FATAL, "btree: lost last page in the chain of duplicates");
-			}
-			else
-/* rightmost page */
-				Assert(!(rpageop->btpo_flags & BTP_CHAIN));
 			_bt_relbuf(rel, buf, BT_WRITE);
-			return (_bt_insertonpg(rel, rbuf, stack, keysz,
-								   scankey, btitem, afteritem));
+			buf = _bt_getbuf(rel, rblkno, BT_WRITE);
+			page = BufferGetPage(buf);
+			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 		}
-
 		/*
-		 * If after splitting un-chained page we'll got chain of pages
-		 * with duplicates then we want to know 1. on which of two pages
-		 * new btitem will go (current _bt_findsplitloc is quite bad); 2.
-		 * what parent (if there's one) thinking about it (remember about
-		 * deletions)
+		 * This is it, so find the position...
 		 */
-		else if (!(lpageop->btpo_flags & BTP_CHAIN))
-		{
-			OffsetNumber start = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY;
-			Size		llimit;
-
-			maxoff = PageGetMaxOffsetNumber(page);
-			llimit = PageGetPageSize(page) - sizeof(PageHeaderData) -
-				MAXALIGN(sizeof(BTPageOpaqueData))
-				+sizeof(ItemIdData);
-			llimit /= 2;
-			firstright = _bt_findsplitloc(rel, keysz, scankey,
-										  page, start, maxoff, llimit);
-
-			if (_bt_itemcmp(rel, keysz, scankey,
-				  (BTItem) PageGetItem(page, PageGetItemId(page, start)),
-			 (BTItem) PageGetItem(page, PageGetItemId(page, firstright)),
-							BTEqualStrategyNumber))
-			{
-				if (_bt_skeycmp(rel, keysz, scankey, page,
-								PageGetItemId(page, firstright),
-								BTLessStrategyNumber))
-
-					/*
-					 * force moving current items to the new page: new
-					 * item will go on the current page.
-					 */
-					firstright = start;
-				else
-
-					/*
-					 * new btitem >= firstright, start item == firstright
-					 * - new chain of duplicates: if this non-leftmost
-					 * leaf page and parent item < start item then force
-					 * moving all items to the new page - current page
-					 * will be "empty" after it.
-					 */
-				{
-					if (!P_LEFTMOST(lpageop) &&
-						(lpageop->btpo_flags & BTP_LEAF))
-					{
-						ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-									   bknum, P_HIKEY);
-						pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-						if (_bt_itemcmp(rel, keysz, scankey,
-										stack->bts_btitem,
-										(BTItem) PageGetItem(page,
-											 PageGetItemId(page, start)),
-										BTLessStrategyNumber))
-						{
-							firstright = start;
-							shifted = true;
-						}
-						_bt_relbuf(rel, pbuf, BT_WRITE);
-					}
-				}
-			}					/* else - no new chain if start item <
-								 * firstright one */
-		}
+		newitemoff = _bt_binsrch(rel, buf, keysz, scankey);
+	}
 
-		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, keysz, scankey, buf, firstright);
+	/*
+	 * Do we need to split the page to fit the item on it?
+	 */
+	if (PageGetFreeSpace(page) < itemsz)
+	{
+		Buffer		rbuf;
+		BlockNumber bknum = BufferGetBlockNumber(buf);
+		BlockNumber rbknum;
+		bool		is_root = P_ISROOT(lpageop);
+		bool		newitemonleft;
 
-		/* which new page (left half or right half) gets the tuple? */
-		if (_bt_goesonpg(rel, buf, keysz, scankey, afteritem))
-		{
-			/* left page */
-			itup_off = _bt_pgaddtup(rel, buf, keysz, scankey,
-									itemsz, btitem, afteritem);
-			itup_blkno = BufferGetBlockNumber(buf);
-		}
-		else
-		{
-			/* right page */
-			itup_off = _bt_pgaddtup(rel, rbuf, keysz, scankey,
-									itemsz, btitem, afteritem);
-			itup_blkno = BufferGetBlockNumber(rbuf);
-		}
+		/* Choose the split point */
+		firstright = _bt_findsplitloc(rel, page,
+									  newitemoff, itemsz,
+									  &newitemonleft);
 
-		maxoff = PageGetMaxOffsetNumber(page);
-		if (shifted)
-		{
-			if (maxoff > P_FIRSTKEY)
-				elog(FATAL, "btree: shifted page is not empty");
-			lowLeftItem = (BTItem) NULL;
-		}
-		else
-		{
-			if (maxoff < P_FIRSTKEY)
-				elog(FATAL, "btree: un-shifted page is empty");
-			lowLeftItem = (BTItem) PageGetItem(page,
-										PageGetItemId(page, P_FIRSTKEY));
-			if (_bt_itemcmp(rel, keysz, scankey, lowLeftItem,
-				(BTItem) PageGetItem(page, PageGetItemId(page, P_HIKEY)),
-							BTEqualStrategyNumber))
-				lpageop->btpo_flags |= BTP_CHAIN;
-		}
+		/* split the buffer into left and right halves */
+		rbuf = _bt_split(rel, buf, firstright,
+						 newitemoff, itemsz, btitem, newitemonleft,
+						 &itup_off, &itup_blkno);
 
-		/*
+		/*----------
 		 * By here,
 		 *
-		 * +  our target page has been split; +  the original tuple has been
-		 * inserted; +	we have write locks on both the old (left half)
-		 * and new (right half) buffers, after the split; and +  we have
-		 * the key we want to insert into the parent.
+		 *		+  our target page has been split;
+		 *		+  the original tuple has been inserted;
+		 *		+  we have write locks on both the old (left half)
+		 *		   and new (right half) buffers, after the split; and
+		 *		+  we know the key we want to insert into the parent
+		 *		   (it's the "high key" on the left child page).
+		 *
+		 * We're ready to do the parent insertion.  We need to hold onto the
+		 * locks for the child pages until we locate the parent, but we can
+		 * release them before doing the actual insertion (see Lehman and Yao
+		 * for the reasoning).
 		 *
-		 * Do the parent insertion.  We need to hold onto the locks for the
-		 * child pages until we locate the parent, but we can release them
-		 * before doing the actual insertion (see Lehman and Yao for the
-		 * reasoning).
+		 * Here we have to do something Lehman and Yao don't talk about:
+		 * deal with a root split and construction of a new root.  If our
+		 * stack is empty then we have just split a node on what had been
+		 * the root level when we descended the tree.  If it is still the
+		 * root then we perform a new-root construction.  If it *isn't*
+		 * the root anymore, use the parent pointer to get up to the root
+		 * level that someone constructed meanwhile, and find the right
+		 * place to insert as for the normal case.
+		 *----------
 		 */
 
-l_spl:	;
-		if (stack == (BTStack) NULL)
+		if (is_root)
 		{
-			if (!is_root)		/* if this page was not root page */
-			{
-				elog(DEBUG, "btree: concurrent ROOT page split");
-				stack = (BTStack) palloc(sizeof(BTStackData));
-				stack->bts_blkno = lpageop->btpo_parent;
-				stack->bts_offset = InvalidOffsetNumber;
-				stack->bts_btitem = (BTItem) palloc(sizeof(BTItemData));
-				/* bts_btitem will be initialized below */
-				stack->bts_parent = NULL;
-				goto l_spl;
-			}
+			Assert(stack == (BTStack) NULL);
 			/* create a new root node and release the split buffers */
 			_bt_newroot(rel, buf, rbuf);
 		}
 		else
 		{
-			ScanKey		newskey;
 			InsertIndexResult newres;
 			BTItem		new_item;
-			OffsetNumber upditem_offset = P_HIKEY;
-			bool		do_update = false;
-			bool		update_in_place = true;
-			bool		parent_chained;
+			BTStackData	fakestack;
+			BTItem		ritem;
+			Buffer		pbuf;
 
-			/* form a index tuple that points at the new right page */
-			rbknum = BufferGetBlockNumber(rbuf);
-			rpage = BufferGetPage(rbuf);
-			rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
-			/*
-			 * By convention, the first entry (1) on every non-rightmost
-			 * page is the high key for that page.	In order to get the
-			 * lowest key on the new right page, we actually look at its
-			 * second (2) entry.
-			 */
-
-			if (!P_RIGHTMOST(rpageop))
+			/* Set up a phony stack entry if we haven't got a real one */
+			if (stack == (BTStack) NULL)
 			{
-				ritem = (BTItem) PageGetItem(rpage,
-									   PageGetItemId(rpage, P_FIRSTKEY));
-				if (_bt_itemcmp(rel, keysz, scankey,
-								ritem,
-								(BTItem) PageGetItem(rpage,
-										  PageGetItemId(rpage, P_HIKEY)),
-								BTEqualStrategyNumber))
-					rpageop->btpo_flags |= BTP_CHAIN;
+				elog(DEBUG, "btree: concurrent ROOT page split");
+				stack = &fakestack;
+				stack->bts_blkno = lpageop->btpo_parent;
+				stack->bts_offset = InvalidOffsetNumber;
+				/* bts_btitem will be initialized below */
+				stack->bts_parent = NULL;
 			}
-			else
-				ritem = (BTItem) PageGetItem(rpage,
-										  PageGetItemId(rpage, P_HIKEY));
 
-			/* get a unique btitem for this key */
-			new_item = _bt_formitem(&(ritem->bti_itup));
+			/* get high key from left page == lowest key on new right page */
+			ritem = (BTItem) PageGetItem(page,
+										 PageGetItemId(page, P_HIKEY));
 
+			/* form an index tuple that points at the new right page */
+			new_item = _bt_formitem(&(ritem->bti_itup));
+			rbknum = BufferGetBlockNumber(rbuf);
 			ItemPointerSet(&(new_item->bti_itup.t_tid), rbknum, P_HIKEY);
 
 			/*
@@ -642,192 +462,39 @@ l_spl:	;
 			 * Oops - if we were moved right then we need to change stack
 			 * item! We want to find parent pointing to where we are,
 			 * right ?	  - vadim 05/27/97
-			 */
-			ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-						   bknum, P_HIKEY);
-			pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-			ppage = BufferGetPage(pbuf);
-			ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
-			parent_chained = ((ppageop->btpo_flags & BTP_CHAIN)) ? true : false;
-
-			if (parent_chained && !left_chained)
-				elog(FATAL, "nbtree: unexpected chained parent of unchained page");
-
-			/*
-			 * If the key of new_item is < than the key of the item in the
-			 * parent page pointing to the left page (stack->bts_btitem),
-			 * we have to update the latter key; otherwise the keys on the
-			 * parent page wouldn't be monotonically increasing after we
-			 * inserted the new pointer to the right page (new_item). This
-			 * only happens if our left page is the leftmost page and a
-			 * new minimum key had been inserted before, which is not
-			 * reflected in the parent page but didn't matter so far. If
-			 * there are duplicate keys and this new minimum key spills
-			 * over to our new right page, we get an inconsistency if we
-			 * don't update the left key in the parent page.
 			 *
-			 * Also, new duplicates handling code require us to update parent
-			 * item if some smaller items left on the left page (which is
-			 * possible in splitting leftmost page) and current parent
-			 * item == new_item.		- vadim 05/27/97
+			 * Interestingly, this means we didn't *really* need to stack
+			 * the parent key at all; all we really care about is the
+			 * saved block and offset as a starting point for our search...
 			 */
-			if (_bt_itemcmp(rel, keysz, scankey,
-							stack->bts_btitem, new_item,
-							BTGreaterStrategyNumber) ||
-				(!shifted &&
-				 _bt_itemcmp(rel, keysz, scankey,
-							 stack->bts_btitem, new_item,
-							 BTEqualStrategyNumber) &&
-				 _bt_itemcmp(rel, keysz, scankey,
-							 lowLeftItem, new_item,
-							 BTLessStrategyNumber)))
-			{
-				do_update = true;
-
-				/*
-				 * figure out which key is leftmost (if the parent page is
-				 * rightmost, too, it must be the root)
-				 */
-				if (P_RIGHTMOST(ppageop))
-					upditem_offset = P_HIKEY;
-				else
-					upditem_offset = P_FIRSTKEY;
-				if (!P_LEFTMOST(lpageop) ||
-					stack->bts_offset != upditem_offset)
-					elog(FATAL, "btree: items are out of order (leftmost %d, stack %u, update %u)",
-						 P_LEFTMOST(lpageop), stack->bts_offset, upditem_offset);
-			}
-
-			if (do_update)
-			{
-				if (shifted)
-					elog(FATAL, "btree: attempt to update parent for shifted page");
-
-				/*
-				 * Try to update in place. If out parent page is chained
-				 * then we must forse insertion.
-				 */
-				if (!parent_chained &&
-					MAXALIGN(IndexTupleDSize(lowLeftItem->bti_itup)) ==
-				  MAXALIGN(IndexTupleDSize(stack->bts_btitem->bti_itup)))
-				{
-					_bt_updateitem(rel, keysz, pbuf,
-								   stack->bts_btitem, lowLeftItem);
-					_bt_wrtbuf(rel, buf);
-					_bt_wrtbuf(rel, rbuf);
-				}
-				else
-				{
-					update_in_place = false;
-					PageIndexTupleDelete(ppage, upditem_offset);
-
-					/*
-					 * don't write anything out yet--we still have the
-					 * write lock, and now we call another _bt_insertonpg
-					 * to insert the correct key. First, make a new item,
-					 * using the tuple data from lowLeftItem. Point it to
-					 * the left child. Update it on the stack at the same
-					 * time.
-					 */
-					pfree(stack->bts_btitem);
-					stack->bts_btitem = _bt_formitem(&(lowLeftItem->bti_itup));
-					ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-								   bknum, P_HIKEY);
-
-					/*
-					 * Unlock the children before doing this
-					 */
-					_bt_wrtbuf(rel, buf);
-					_bt_wrtbuf(rel, rbuf);
-
-					/*
-					 * A regular _bt_binsrch should find the right place
-					 * to put the new entry, since it should be lower than
-					 * any other key on the page. Therefore set afteritem
-					 * to NULL.
-					 */
-					newskey = _bt_mkscankey(rel, &(stack->bts_btitem->bti_itup));
-					newres = _bt_insertonpg(rel, pbuf, stack->bts_parent,
-									   keysz, newskey, stack->bts_btitem,
-											NULL);
-
-					pfree(newres);
-					pfree(newskey);
-
-					/*
-					 * we have now lost our lock on the parent buffer, and
-					 * need to get it back.
-					 */
-					pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-				}
-			}
-			else
-			{
-				_bt_wrtbuf(rel, buf);
-				_bt_wrtbuf(rel, rbuf);
-			}
+			ItemPointerSet(&(stack->bts_btitem.bti_itup.t_tid),
+						   bknum, P_HIKEY);
 
-			newskey = _bt_mkscankey(rel, &(new_item->bti_itup));
+			pbuf = _bt_getstackbuf(rel, stack);
 
-			afteritem = stack->bts_btitem;
-			if (parent_chained && !update_in_place)
-			{
-				ppage = BufferGetPage(pbuf);
-				ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
-				if (ppageop->btpo_flags & BTP_CHAIN)
-					elog(FATAL, "btree: unexpected BTP_CHAIN flag in parent after update");
-				if (P_RIGHTMOST(ppageop))
-					elog(FATAL, "btree: chained parent is RIGHTMOST after update");
-				maxoff = PageGetMaxOffsetNumber(ppage);
-				if (maxoff != P_FIRSTKEY)
-					elog(FATAL, "btree: FIRSTKEY was unexpected in parent after update");
-				if (_bt_skeycmp(rel, keysz, newskey, ppage,
-								PageGetItemId(ppage, P_FIRSTKEY),
-								BTLessEqualStrategyNumber))
-					elog(FATAL, "btree: parent FIRSTKEY is >= duplicate key after update");
-				if (!_bt_skeycmp(rel, keysz, newskey, ppage,
-								 PageGetItemId(ppage, P_HIKEY),
-								 BTEqualStrategyNumber))
-					elog(FATAL, "btree: parent HIGHKEY is not equal duplicate key after update");
-				afteritem = (BTItem) NULL;
-			}
-			else if (left_chained && !update_in_place)
-			{
-				ppage = BufferGetPage(pbuf);
-				ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
-				if (!P_RIGHTMOST(ppageop) &&
-					_bt_skeycmp(rel, keysz, newskey, ppage,
-								PageGetItemId(ppage, P_HIKEY),
-								BTGreaterStrategyNumber))
-					afteritem = (BTItem) NULL;
-			}
-			if (afteritem == (BTItem) NULL)
-			{
-				rbuf = _bt_getbuf(rel, ppageop->btpo_next, BT_WRITE);
-				_bt_relbuf(rel, pbuf, BT_WRITE);
-				pbuf = rbuf;
-			}
+			/* Now we can write and unlock the children */
+			_bt_wrtbuf(rel, rbuf);
+			_bt_wrtbuf(rel, buf);
 
+			/* Recursively update the parent */
 			newres = _bt_insertonpg(rel, pbuf, stack->bts_parent,
-									keysz, newskey, new_item,
-									afteritem);
+									0, NULL, new_item, stack->bts_offset);
 
 			/* be tidy */
 			pfree(newres);
-			pfree(newskey);
 			pfree(new_item);
 		}
 	}
 	else
 	{
-		itup_off = _bt_pgaddtup(rel, buf, keysz, scankey,
-								itemsz, btitem, afteritem);
+		_bt_pgaddtup(rel, page, itemsz, btitem, newitemoff, "page");
+		itup_off = newitemoff;
 		itup_blkno = BufferGetBlockNumber(buf);
-
-		_bt_relbuf(rel, buf, BT_WRITE);
+		/* Write out the updated page and release pin/lock */
+		_bt_wrtbuf(rel, buf);
 	}
 
-	/* by here, the new tuple is inserted */
+	/* by here, the new tuple is inserted at itup_blkno/itup_off */
 	res = (InsertIndexResult) palloc(sizeof(InsertIndexResultData));
 	ItemPointerSet(&(res->pointerData), itup_blkno, itup_off);
 
@@ -838,12 +505,19 @@ l_spl:	;
  *	_bt_split() -- split a page in the btree.
  *
  *		On entry, buf is the page to split, and is write-locked and pinned.
- *		Returns the new right sibling of buf, pinned and write-locked.	The
- *		pin and lock on buf are maintained.
+ *		firstright is the item index of the first item to be moved to the
+ *		new right page.  newitemoff etc. tell us about the new item that
+ *		must be inserted along with the data from the old page.
+ *
+ *		Returns the new right sibling of buf, pinned and write-locked.
+ *		The pin and lock on buf are maintained.  *itup_off and *itup_blkno
+ *		are set to the exact location where newitem was inserted.
  */
 static Buffer
-_bt_split(Relation rel, Size keysz, ScanKey scankey,
-		  Buffer buf, OffsetNumber firstright)
+_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+		  OffsetNumber newitemoff, Size newitemsz, BTItem newitem,
+		  bool newitemonleft,
+		  OffsetNumber *itup_off, BlockNumber *itup_blkno)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -860,7 +534,6 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
 	BTItem		item;
 	OffsetNumber leftoff,
 				rightoff;
-	OffsetNumber start;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 
@@ -869,8 +542,8 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
 	leftpage = PageGetTempPage(origpage, sizeof(BTPageOpaqueData));
 	rightpage = BufferGetPage(rbuf);
 
-	_bt_pageinit(rightpage, BufferGetPageSize(rbuf));
 	_bt_pageinit(leftpage, BufferGetPageSize(buf));
+	_bt_pageinit(rightpage, BufferGetPageSize(rbuf));
 
 	/* init btree private data */
 	oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage);
@@ -879,106 +552,130 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
 
 	/* if we're splitting this page, it won't be the root when we're done */
 	oopaque->btpo_flags &= ~BTP_ROOT;
-	oopaque->btpo_flags &= ~BTP_CHAIN;
 	lopaque->btpo_flags = ropaque->btpo_flags = oopaque->btpo_flags;
 	lopaque->btpo_prev = oopaque->btpo_prev;
-	ropaque->btpo_prev = BufferGetBlockNumber(buf);
 	lopaque->btpo_next = BufferGetBlockNumber(rbuf);
+	ropaque->btpo_prev = BufferGetBlockNumber(buf);
 	ropaque->btpo_next = oopaque->btpo_next;
 
+	/*
+	 * Must copy the original parent link into both new pages, even though
+	 * it might be quite obsolete by now.  We might need it if this level
+	 * is or recently was the root (see README).
+	 */
 	lopaque->btpo_parent = ropaque->btpo_parent = oopaque->btpo_parent;
 
 	/*
 	 * If the page we're splitting is not the rightmost page at its level
-	 * in the tree, then the first (0) entry on the page is the high key
+	 * in the tree, then the first entry on the page is the high key
 	 * for the page.  We need to copy that to the right half.  Otherwise
-	 * (meaning the rightmost page case), we should treat the line
-	 * pointers beginning at zero as user data.
-	 *
-	 * We leave a blank space at the start of the line table for the left
-	 * page.  We'll come back later and fill it in with the high key item
-	 * we get from the right key.
+	 * (meaning the rightmost page case), all the items on the right half
+	 * will be user data.
 	 */
+	rightoff = P_HIKEY;
 
-	leftoff = P_FIRSTKEY;
-	ropaque->btpo_next = oopaque->btpo_next;
 	if (!P_RIGHTMOST(oopaque))
 	{
-		/* splitting a non-rightmost page, start at the first data item */
-		start = P_FIRSTKEY;
-
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (BTItem) PageGetItem(origpage, itemid);
-		if (PageAddItem(rightpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
+		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
+						LP_USED) == InvalidOffsetNumber)
 			elog(FATAL, "btree: failed to add hikey to the right sibling");
-		rightoff = P_FIRSTKEY;
+		rightoff = OffsetNumberNext(rightoff);
 	}
-	else
-	{
-		/* splitting a rightmost page, "high key" is the first data item */
-		start = P_HIKEY;
 
-		/* the new rightmost page will not have a high key */
-		rightoff = P_HIKEY;
+	/*
+	 * The "high key" for the new left page will be the first key that's
+	 * going to go into the new right page.  This might be either the
+	 * existing data item at position firstright, or the incoming tuple.
+	 */
+	leftoff = P_HIKEY;
+	if (!newitemonleft && newitemoff == firstright)
+	{
+		/* incoming tuple will become first on right page */
+		itemsz = newitemsz;
+		item = newitem;
 	}
-	maxoff = PageGetMaxOffsetNumber(origpage);
-	if (firstright == InvalidOffsetNumber)
+	else
 	{
-		Size		llimit = PageGetFreeSpace(leftpage) / 2;
-
-		firstright = _bt_findsplitloc(rel, keysz, scankey,
-									  origpage, start, maxoff, llimit);
+		/* existing item at firstright will become first on right page */
+		itemid = PageGetItemId(origpage, firstright);
+		itemsz = ItemIdGetLength(itemid);
+		item = (BTItem) PageGetItem(origpage, itemid);
 	}
+	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+					LP_USED) == InvalidOffsetNumber)
+		elog(FATAL, "btree: failed to add hikey to the left sibling");
+	leftoff = OffsetNumberNext(leftoff);
 
-	for (i = start; i <= maxoff; i = OffsetNumberNext(i))
+	/*
+	 * Now transfer all the data items to the appropriate page
+	 */
+	maxoff = PageGetMaxOffsetNumber(origpage);
+
+	for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
 	{
 		itemid = PageGetItemId(origpage, i);
 		itemsz = ItemIdGetLength(itemid);
 		item = (BTItem) PageGetItem(origpage, itemid);
 
+		/* does new item belong before this one? */
+		if (i == newitemoff)
+		{
+			if (newitemonleft)
+			{
+				_bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff,
+							 "left sibling");
+				*itup_off = leftoff;
+				*itup_blkno = BufferGetBlockNumber(buf);
+				leftoff = OffsetNumberNext(leftoff);
+			}
+			else
+			{
+				_bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
+							 "right sibling");
+				*itup_off = rightoff;
+				*itup_blkno = BufferGetBlockNumber(rbuf);
+				rightoff = OffsetNumberNext(rightoff);
+			}
+		}
+
 		/* decide which page to put it on */
 		if (i < firstright)
 		{
-			if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
-							LP_USED) == InvalidOffsetNumber)
-				elog(FATAL, "btree: failed to add item to the left sibling");
+			_bt_pgaddtup(rel, leftpage, itemsz, item, leftoff,
+						 "left sibling");
 			leftoff = OffsetNumberNext(leftoff);
 		}
 		else
 		{
-			if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
-							LP_USED) == InvalidOffsetNumber)
-				elog(FATAL, "btree: failed to add item to the right sibling");
+			_bt_pgaddtup(rel, rightpage, itemsz, item, rightoff,
+						 "right sibling");
 			rightoff = OffsetNumberNext(rightoff);
 		}
 	}
 
-	/*
-	 * Okay, page has been split, high key on right page is correct.  Now
-	 * set the high key on the left page to be the min key on the right
-	 * page.
-	 */
-
-	if (P_RIGHTMOST(ropaque))
-		itemid = PageGetItemId(rightpage, P_HIKEY);
-	else
-		itemid = PageGetItemId(rightpage, P_FIRSTKEY);
-	itemsz = ItemIdGetLength(itemid);
-	item = (BTItem) PageGetItem(rightpage, itemid);
-
-	/*
-	 * We left a hole for the high key on the left page; fill it.  The
-	 * modal crap is to tell the page manager to put the new item on the
-	 * page and not screw around with anything else.  Whoever designed
-	 * this interface has presumably crawled back into the dung heap they
-	 * came from.  No one here will admit to it.
-	 */
-
-	PageManagerModeSet(OverwritePageManagerMode);
-	if (PageAddItem(leftpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add hikey to the left sibling");
-	PageManagerModeSet(ShufflePageManagerMode);
+	/* cope with possibility that newitem goes at the end */
+	if (i <= newitemoff)
+	{
+		if (newitemonleft)
+		{
+			_bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff,
+						 "left sibling");
+			*itup_off = leftoff;
+			*itup_blkno = BufferGetBlockNumber(buf);
+			leftoff = OffsetNumberNext(leftoff);
+		}
+		else
+		{
+			_bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
+						 "right sibling");
+			*itup_off = rightoff;
+			*itup_blkno = BufferGetBlockNumber(rbuf);
+			rightoff = OffsetNumberNext(rightoff);
+		}
+	}
 
 	/*
 	 * By here, the original data page has been split into two new halves,
@@ -992,14 +689,10 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
 
 	PageRestoreTempPage(leftpage, origpage);
 
-	/* write these guys out */
-	_bt_wrtnorelbuf(rel, rbuf);
-	_bt_wrtnorelbuf(rel, buf);
-
 	/*
 	 * Finally, we need to grab the right sibling (if any) and fix the
 	 * prev pointer there.	We are guaranteed that this is deadlock-free
-	 * since no other writer will be moving holding a lock on that page
+	 * since no other writer will be holding a lock on that page
 	 * and trying to move left, and all readers release locks on a page
 	 * before trying to fetch its neighbors.
 	 */
@@ -1020,87 +713,214 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey,
 }
 
 /*
- *	_bt_findsplitloc() -- find a safe place to split a page.
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The idea here is to equalize the free space that will be on each split
+ * page, *after accounting for the inserted tuple*.  (If we fail to account
+ * for it, we might find ourselves with too little room on the page that
+ * it needs to go into!)
  *
- *		In order to guarantee the proper handling of searches for duplicate
- *		keys, the first duplicate in the chain must either be the first
- *		item on the page after the split, or the entire chain must be on
- *		one of the two pages.  That is,
- *				[1 2 2 2 3 4 5]
- *		must become
- *				[1] [2 2 2 3 4 5]
- *		or
- *				[1 2 2 2] [3 4 5]
- *		but not
- *				[1 2 2] [2 3 4 5].
- *		However,
- *				[2 2 2 2 2 3 4]
- *		may be split as
- *				[2 2 2 2] [2 3 4].
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of.  (This could be
+ * maxoff+1 if the tuple is to go at the end.)
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
  */
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
-				 Size keysz,
-				 ScanKey scankey,
 				 Page page,
-				 OffsetNumber start,
-				 OffsetNumber maxoff,
-				 Size llimit)
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 bool *newitemonleft)
 {
-	OffsetNumber i;
-	OffsetNumber saferight;
-	ItemId		nxtitemid,
-				safeitemid;
-	BTItem		safeitem,
-				nxtitem;
-	Size		nbytes;
-
-	if (start >= maxoff)
-		elog(FATAL, "btree: cannot split if start (%d) >= maxoff (%d)",
-			 start, maxoff);
-	saferight = start;
-	safeitemid = PageGetItemId(page, saferight);
-	nbytes = ItemIdGetLength(safeitemid) + sizeof(ItemIdData);
-	safeitem = (BTItem) PageGetItem(page, safeitemid);
-
-	i = OffsetNumberNext(start);
-
-	while (nbytes < llimit)
+	BTPageOpaque opaque;
+	OffsetNumber offnum;
+	OffsetNumber maxoff;
+	ItemId		itemid;
+	FindSplitData state;
+	int			leftspace,
+				rightspace,
+				dataitemtotal,
+				dataitemstoleft;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	state.newitemsz = newitemsz;
+	state.non_leaf = !P_ISLEAF(opaque);
+	state.have_split = false;
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - sizeof(PageHeaderData) -
+		MAXALIGN(sizeof(BTPageOpaqueData))
+		+ sizeof(ItemIdData);
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
 	{
-		/* check the next item on the page */
-		nxtitemid = PageGetItemId(page, i);
-		nbytes += (ItemIdGetLength(nxtitemid) + sizeof(ItemIdData));
-		nxtitem = (BTItem) PageGetItem(page, nxtitemid);
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (ItemIdGetLength(itemid) + sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items without actually scanning 'em */
+	dataitemtotal = rightspace - (int) PageGetFreeSpace(page);
+
+	/*
+	 * Scan through the data items and calculate space usage for a split
+	 * at each possible position.  XXX we could probably stop somewhere
+	 * near the middle...
+	 */
+	dataitemstoleft = 0;
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+		int			leftfree,
+					rightfree;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = ItemIdGetLength(itemid) + sizeof(ItemIdData);
 
 		/*
-		 * Test against last known safe item: if the tuple we're looking
-		 * at isn't equal to the last safe one we saw, then it's our new
-		 * safe tuple.
+		 * We have to allow for the current item becoming the high key of
+		 * the left page; therefore it counts against left space.
 		 */
-		if (!_bt_itemcmp(rel, keysz, scankey,
-						 safeitem, nxtitem, BTEqualStrategyNumber))
+		leftfree = leftspace - dataitemstoleft - (int) itemsz;
+		rightfree = rightspace - (dataitemtotal - dataitemstoleft);
+		if (offnum < newitemoff)
+			_bt_checksplitloc(&state, offnum, leftfree, rightfree,
+							  false, itemsz);
+		else if (offnum > newitemoff)
+			_bt_checksplitloc(&state, offnum, leftfree, rightfree,
+							  true, itemsz);
+		else
 		{
-			safeitem = nxtitem;
-			saferight = i;
+			/* need to try it both ways!! */
+			_bt_checksplitloc(&state, offnum, leftfree, rightfree,
+							  false, newitemsz);
+			_bt_checksplitloc(&state, offnum, leftfree, rightfree,
+							  true, itemsz);
 		}
-		if (i < maxoff)
-			i = OffsetNumberNext(i);
-		else
-			break;
+
+		dataitemstoleft += itemsz;
 	}
 
+	if (! state.have_split)
+		elog(FATAL, "_bt_findsplitloc: can't find a feasible split point for %s",
+			 RelationGetRelationName(rel));
+	*newitemonleft = state.newitemonleft;
+	return state.firstright;
+}
+
+static void
+_bt_checksplitloc(FindSplitData *state, OffsetNumber firstright,
+				  int leftfree, int rightfree,
+				  bool newitemonleft, Size firstrightitemsz)
+{
+	if (newitemonleft)
+		leftfree -= (int) state->newitemsz;
+	else
+		rightfree -= (int) state->newitemsz;
+	/*
+	 * If we are not on the leaf level, we will be able to discard the
+	 * key data from the first item that winds up on the right page.
+	 */
+	if (state->non_leaf)
+		rightfree += (int) firstrightitemsz -
+			(int) (sizeof(BTItemData) + sizeof(ItemIdData));
 	/*
-	 * If the chain of dups starts at the beginning of the page and
-	 * extends past the halfway mark, we can split it in the middle.
+	 * If feasible split point, remember best delta.
 	 */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		int		delta = leftfree - rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+		if (!state->have_split || delta < state->best_delta)
+		{
+			state->have_split = true;
+			state->newitemonleft = newitemonleft;
+			state->firstright = firstright;
+			state->best_delta = delta;
+		}
+	}
+}
+
+/*
+ *	_bt_getstackbuf() -- Walk back up the tree one step, and find the item
+ *						 we last looked at in the parent.
+ *
+ *		This is possible because we save a bit image of the last item
+ *		we looked at in the parent, and the update algorithm guarantees
+ *		that if items above us in the tree move, they only move right.
+ *
+ *		Also, re-set bts_blkno & bts_offset if changed.
+ */
+static Buffer
+_bt_getstackbuf(Relation rel, BTStack stack)
+{
+	BlockNumber blkno;
+	Buffer		buf;
+	OffsetNumber start,
+				offnum,
+				maxoff;
+	Page		page;
+	ItemId		itemid;
+	BTItem		item;
+	BTPageOpaque opaque;
+
+	blkno = stack->bts_blkno;
+	buf = _bt_getbuf(rel, blkno, BT_WRITE);
+	page = BufferGetPage(buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
 
-	if (saferight == start)
-		saferight = i;
+	start = stack->bts_offset;
+	/*
+	 * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
+	 * case of a concurrent ROOT page split.  Also, watch out for the
+	 * possibility that the page has a high key now when it didn't before.
+	 */
+	if (start < P_FIRSTDATAKEY(opaque))
+		start = P_FIRSTDATAKEY(opaque);
 
-	if (saferight == maxoff && (maxoff - start) > 1)
-		saferight = start + (maxoff - start) / 2;
+	for (;;)
+	{
+		/* see if it's on this page */
+		for (offnum = start;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			itemid = PageGetItemId(page, offnum);
+			item = (BTItem) PageGetItem(page, itemid);
+			if (BTItemSame(item, &stack->bts_btitem))
+			{
+				/* Return accurate pointer to where link is now */
+				stack->bts_blkno = blkno;
+				stack->bts_offset = offnum;
+				return buf;
+			}
+		}
+		/* by here, the item we're looking for moved right at least one page */
+		if (P_RIGHTMOST(opaque))
+			elog(FATAL, "_bt_getstackbuf: my bits moved right off the end of the world!"
+				 "\n\tRecreate index %s.", RelationGetRelationName(rel));
 
-	return saferight;
+		blkno = opaque->btpo_next;
+		_bt_relbuf(rel, buf, BT_WRITE);
+		buf = _bt_getbuf(rel, blkno, BT_WRITE);
+		page = BufferGetPage(buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+		start = P_FIRSTDATAKEY(opaque);
+	}
 }
 
 /*
@@ -1116,9 +936,9 @@ _bt_findsplitloc(Relation rel,
  *		graph.
  *
  *		On entry, lbuf (the old root) and rbuf (its new peer) are write-
- *		locked.  We don't drop the locks in this routine; that's done by
- *		the caller.  On exit, a new root page exists with entries for the
- *		two new children.  The new root page is neither pinned nor locked.
+ *		locked.  On exit, a new root page exists with entries for the
+ *		two new children.  The new root page is neither pinned nor locked, and
+ *		we have also written out lbuf and rbuf and dropped their pins/locks.
  */
 static void
 _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
@@ -1140,52 +960,52 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
 	rootpage = BufferGetPage(rootbuf);
 	rootbknum = BufferGetBlockNumber(rootbuf);
-	_bt_pageinit(rootpage, BufferGetPageSize(rootbuf));
 
 	/* set btree special data */
 	rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
 	rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
 	rootopaque->btpo_flags |= BTP_ROOT;
 
-	/*
-	 * Insert the internal tuple pointers.
-	 */
-
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
 	lpage = BufferGetPage(lbuf);
 	rpage = BufferGetPage(rbuf);
 
+	/*
+	 * Make sure pages in old root level have valid parent links --- we will
+	 * need this in _bt_insertonpg() if a concurrent root split happens (see
+	 * README).
+	 */
 	((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_parent =
 		((BTPageOpaque) PageGetSpecialPointer(rpage))->btpo_parent =
 		rootbknum;
 
 	/*
-	 * step over the high key on the left page while building the left
-	 * page pointer.
+	 * Create downlink item for left page (old root).  Since this will be
+	 * the first item in a non-leaf page, it implicitly has minus-infinity
+	 * key value, so we need not store any actual key in it.
 	 */
-	itemid = PageGetItemId(lpage, P_FIRSTKEY);
-	itemsz = ItemIdGetLength(itemid);
-	item = (BTItem) PageGetItem(lpage, itemid);
-	new_item = _bt_formitem(&(item->bti_itup));
+	itemsz = sizeof(BTItemData);
+	new_item = (BTItem) palloc(itemsz);
+	new_item->bti_itup.t_info = itemsz;
 	ItemPointerSet(&(new_item->bti_itup.t_tid), lbkno, P_HIKEY);
 
 	/*
-	 * insert the left page pointer into the new root page.  the root page
-	 * is the rightmost page on its level so the "high key" item is the
-	 * first data item.
+	 * Insert the left page pointer into the new root page.  The root page
+	 * is the rightmost page on its level so there is no "high key" in it;
+	 * the two items will go into positions P_HIKEY and P_FIRSTKEY.
 	 */
 	if (PageAddItem(rootpage, (Item) new_item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
 		elog(FATAL, "btree: failed to add leftkey to new root page");
 	pfree(new_item);
 
 	/*
-	 * the right page is the rightmost page on the second level, so the
-	 * "high key" item is the first data item on that page as well.
+	 * Create downlink item for right page.  The key for it is obtained from
+	 * the "high key" position in the left page.
 	 */
-	itemid = PageGetItemId(rpage, P_HIKEY);
+	itemid = PageGetItemId(lpage, P_HIKEY);
 	itemsz = ItemIdGetLength(itemid);
-	item = (BTItem) PageGetItem(rpage, itemid);
+	item = (BTItem) PageGetItem(lpage, itemid);
 	new_item = _bt_formitem(&(item->bti_itup));
 	ItemPointerSet(&(new_item->bti_itup.t_tid), rbkno, P_HIKEY);
 
@@ -1196,497 +1016,101 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		elog(FATAL, "btree: failed to add rightkey to new root page");
 	pfree(new_item);
 
-	/* write and let go of the root buffer */
+	/* write and let go of the new root buffer */
 	_bt_wrtbuf(rel, rootbuf);
 
 	/* update metadata page with new root block number */
 	_bt_metaproot(rel, rootbknum, 0);
 
-	_bt_wrtbuf(rel, lbuf);
+	/* update and release new sibling, and finally the old root */
 	_bt_wrtbuf(rel, rbuf);
+	_bt_wrtbuf(rel, lbuf);
 }
 
 /*
  *	_bt_pgaddtup() -- add a tuple to a particular page in the index.
  *
- *		This routine adds the tuple to the page as requested, and keeps the
- *		write lock and reference associated with the page's buffer.  It is
- *		an error to call pgaddtup() without a write lock and reference.  If
- *		afteritem is non-null, it's the item that we expect our new item
- *		to follow.	Otherwise, we do a binary search for the correct place
- *		and insert the new item there.
+ *		This routine adds the tuple to the page as requested.  It does
+ *		not affect pin/lock status, but you'd better have a write lock
+ *		and pin on the target buffer!  Don't forget to write and release
+ *		the buffer afterwards, either.
+ *
+ *		The main difference between this routine and a bare PageAddItem call
+ *		is that this code knows that the leftmost data item on a non-leaf
+ *		btree page doesn't need to have a key.  Therefore, it strips such
+ *		items down to just the item header.  CAUTION: this works ONLY if
+ *		we insert the items in order, so that the given itup_off does
+ *		represent the final position of the item!
  */
-static OffsetNumber
+static void
 _bt_pgaddtup(Relation rel,
-			 Buffer buf,
-			 int keysz,
-			 ScanKey itup_scankey,
+			 Page page,
 			 Size itemsize,
 			 BTItem btitem,
-			 BTItem afteritem)
-{
-	OffsetNumber itup_off;
-	OffsetNumber first;
-	Page		page;
-	BTPageOpaque opaque;
-	BTItem		chkitem;
-
-	page = BufferGetPage(buf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	first = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-	if (afteritem == (BTItem) NULL)
-		itup_off = _bt_binsrch(rel, buf, keysz, itup_scankey, BT_INSERTION);
-	else
-	{
-		itup_off = first;
-
-		do
-		{
-			chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, itup_off));
-			itup_off = OffsetNumberNext(itup_off);
-		} while (!BTItemSame(chkitem, afteritem));
-	}
-
-	if (PageAddItem(page, (Item) btitem, itemsize, itup_off, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add item to the page");
-
-	/* write the buffer, but hold our lock */
-	_bt_wrtnorelbuf(rel, buf);
-
-	return itup_off;
-}
-
-/*
- *	_bt_goesonpg() -- Does a new tuple belong on this page?
- *
- *		This is part of the complexity introduced by allowing duplicate
- *		keys into the index.  The tuple belongs on this page if:
- *
- *				+ there is no page to the right of this one; or
- *				+ it is less than the high key on the page; or
- *				+ the item it is to follow ("afteritem") appears on this
- *				  page.
- */
-static bool
-_bt_goesonpg(Relation rel,
-			 Buffer buf,
-			 Size keysz,
-			 ScanKey scankey,
-			 BTItem afteritem)
-{
-	Page		page;
-	ItemId		hikey;
-	BTPageOpaque opaque;
-	BTItem		chkitem;
-	OffsetNumber offnum,
-				maxoff;
-	bool		found;
-
-	page = BufferGetPage(buf);
-
-	/* no right neighbor? */
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	if (P_RIGHTMOST(opaque))
-		return true;
-
-	/*
-	 * this is a non-rightmost page, so it must have a high key item.
-	 *
-	 * If the scan key is < the high key (the min key on the next page), then
-	 * it for sure belongs here.
-	 */
-	hikey = PageGetItemId(page, P_HIKEY);
-	if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTLessStrategyNumber))
-		return true;
-
-	/*
-	 * If the scan key is > the high key, then it for sure doesn't belong
-	 * here.
-	 */
-
-	if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTGreaterStrategyNumber))
-		return false;
-
-	/*
-	 * If we have no adjacency information, and the item is equal to the
-	 * high key on the page (by here it is), then the item does not belong
-	 * on this page.
-	 *
-	 * Now it's not true in all cases.		- vadim 06/10/97
-	 */
-
-	if (afteritem == (BTItem) NULL)
-	{
-		if (opaque->btpo_flags & BTP_LEAF)
-			return false;
-		if (opaque->btpo_flags & BTP_CHAIN)
-			return true;
-		if (_bt_skeycmp(rel, keysz, scankey, page,
-						PageGetItemId(page, P_FIRSTKEY),
-						BTEqualStrategyNumber))
-			return true;
-		return false;
-	}
-
-	/* damn, have to work for it.  i hate that. */
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/*
-	 * Search the entire page for the afteroid.  We need to do this,
-	 * rather than doing a binary search and starting from there, because
-	 * if the key we're searching for is the leftmost key in the tree at
-	 * this level, then a binary search will do the wrong thing.  Splits
-	 * are pretty infrequent, so the cost isn't as bad as it could be.
-	 */
-
-	found = false;
-	for (offnum = P_FIRSTKEY;
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-
-		if (BTItemSame(chkitem, afteritem))
-		{
-			found = true;
-			break;
-		}
-	}
-
-	return found;
-}
-
-/*
- *		_bt_tuplecompare() -- compare two IndexTuples,
- *							  return -1, 0, or +1
- *
- */
-static int32
-_bt_tuplecompare(Relation rel,
-				 Size keysz,
-				 ScanKey scankey,
-				 IndexTuple tuple1,
-				 IndexTuple tuple2)
+			 OffsetNumber itup_off,
+			 const char *where)
 {
-	TupleDesc	tupDes;
-	int			i;
-	int32		compare = 0;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	BTItemData truncitem;
 
-	tupDes = RelationGetDescr(rel);
-
-	for (i = 1; i <= (int) keysz; i++)
-	{
-		ScanKey		entry = &scankey[i - 1];
-		Datum		attrDatum1,
-					attrDatum2;
-		bool		isFirstNull,
-					isSecondNull;
-
-		attrDatum1 = index_getattr(tuple1, i, tupDes, &isFirstNull);
-		attrDatum2 = index_getattr(tuple2, i, tupDes, &isSecondNull);
-
-		/* see comments about NULLs handling in btbuild */
-		if (isFirstNull)		/* attr in tuple1 is NULL */
-		{
-			if (isSecondNull)	/* attr in tuple2 is NULL too */
-				compare = 0;
-			else
-				compare = 1;	/* NULL ">" not-NULL */
-		}
-		else if (isSecondNull)	/* attr in tuple1 is NOT_NULL and */
-		{						/* attr in tuple2 is NULL */
-			compare = -1;		/* not-NULL "<" NULL */
-		}
-		else
-		{
-			compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
-												  attrDatum1, attrDatum2));
-		}
-
-		if (compare != 0)
-			break;				/* done when we find unequal attributes */
-	}
-
-	return compare;
-}
-
-/*
- *		_bt_itemcmp() -- compare two BTItems using a requested
- *						 strategy (<, <=, =, >=, >)
- *
- */
-bool
-_bt_itemcmp(Relation rel,
-			Size keysz,
-			ScanKey scankey,
-			BTItem item1,
-			BTItem item2,
-			StrategyNumber strat)
-{
-	int32		compare;
-
-	compare = _bt_tuplecompare(rel, keysz, scankey,
-							   &(item1->bti_itup),
-							   &(item2->bti_itup));
-
-	switch (strat)
+	if (! P_ISLEAF(opaque) && itup_off == P_FIRSTDATAKEY(opaque))
 	{
-		case BTLessStrategyNumber:
-			return (bool) (compare < 0);
-		case BTLessEqualStrategyNumber:
-			return (bool) (compare <= 0);
-		case BTEqualStrategyNumber:
-			return (bool) (compare == 0);
-		case BTGreaterEqualStrategyNumber:
-			return (bool) (compare >= 0);
-		case BTGreaterStrategyNumber:
-			return (bool) (compare > 0);
+		memcpy(&truncitem, btitem, sizeof(BTItemData));
+		truncitem.bti_itup.t_info = sizeof(BTItemData);
+		btitem = &truncitem;
+		itemsize = sizeof(BTItemData);
 	}
 
-	elog(ERROR, "_bt_itemcmp: bogus strategy %d", (int) strat);
-	return false;
-}
-
-/*
- *		_bt_updateitem() -- updates the key of the item identified by the
- *							oid with the key of newItem (done in place if
- *							possible)
- *
- */
-static void
-_bt_updateitem(Relation rel,
-			   Size keysz,
-			   Buffer buf,
-			   BTItem oldItem,
-			   BTItem newItem)
-{
-	Page		page;
-	OffsetNumber maxoff;
-	OffsetNumber i;
-	ItemPointerData itemPtrData;
-	BTItem		item;
-	IndexTuple	oldIndexTuple,
-				newIndexTuple;
-	int			first;
-
-	page = BufferGetPage(buf);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/* locate item on the page */
-	first = P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page))
-		? P_HIKEY : P_FIRSTKEY;
-	i = first;
-	do
-	{
-		item = (BTItem) PageGetItem(page, PageGetItemId(page, i));
-		i = OffsetNumberNext(i);
-	} while (i <= maxoff && !BTItemSame(item, oldItem));
-
-	/* this should never happen (in theory) */
-	if (!BTItemSame(item, oldItem))
-		elog(FATAL, "_bt_getstackbuf was lying!!");
-
-	/*
-	 * It's  defined by caller (_bt_insertonpg)
-	 */
-
-	/*
-	 * if(IndexTupleDSize(newItem->bti_itup) >
-	 * IndexTupleDSize(item->bti_itup)) { elog(NOTICE, "trying to
-	 * overwrite a smaller value with a bigger one in _bt_updateitem");
-	 * elog(ERROR, "this is not good."); }
-	 */
-
-	oldIndexTuple = &(item->bti_itup);
-	newIndexTuple = &(newItem->bti_itup);
-
-	/* keep the original item pointer */
-	ItemPointerCopy(&(oldIndexTuple->t_tid), &itemPtrData);
-	CopyIndexTuple(newIndexTuple, &oldIndexTuple);
-	ItemPointerCopy(&itemPtrData, &(oldIndexTuple->t_tid));
-
+	if (PageAddItem(page, (Item) btitem, itemsize, itup_off,
+					LP_USED) == InvalidOffsetNumber)
+		elog(FATAL, "btree: failed to add item to the %s for %s",
+			 where, RelationGetRelationName(rel));
 }
 
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
+ * This is very similar to _bt_compare, except for NULL handling.
  * Rule is simple: NOT_NULL not equal NULL, NULL not_equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 			int keysz, ScanKey scankey)
 {
-	Datum		datum;
 	BTItem		btitem;
 	IndexTuple	itup;
-	ScanKey		entry;
-	AttrNumber	attno;
-	int32		result;
 	int			i;
-	bool		null;
+
+	/* Better be comparing to a leaf item */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 
 	btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
 	itup = &(btitem->bti_itup);
 
 	for (i = 1; i <= keysz; i++)
 	{
-		entry = &scankey[i - 1];
+		ScanKey		entry = &scankey[i - 1];
+		AttrNumber	attno;
+		Datum		datum;
+		bool		isNull;
+		int32		result;
+
 		attno = entry->sk_attno;
 		Assert(attno == i);
-		datum = index_getattr(itup, attno, itupdesc, &null);
+		datum = index_getattr(itup, attno, itupdesc, &isNull);
 
-		/* NULLs are not equal */
-		if (entry->sk_flags & SK_ISNULL || null)
+		/* NULLs are never equal to anything */
+		if (entry->sk_flags & SK_ISNULL || isNull)
 			return false;
 
 		result = DatumGetInt32(FunctionCall2(&entry->sk_func,
-											 entry->sk_argument, datum));
+											 entry->sk_argument,
+											 datum));
+
 		if (result != 0)
 			return false;
 	}
 
-	/* by here, the keys are equal */
+	/* if we get here, the keys are equal */
 	return true;
 }
-
-#ifdef NOT_USED
-/*
- * _bt_shift - insert btitem on the passed page after shifting page
- *			   to the right in the tree.
- *
- * NOTE: tested for shifting leftmost page only, having btitem < hikey.
- */
-static InsertIndexResult
-_bt_shift(Relation rel, Buffer buf, BTStack stack, int keysz,
-		  ScanKey scankey, BTItem btitem, BTItem hikey)
-{
-	InsertIndexResult res;
-	int			itemsz;
-	Page		page;
-	BlockNumber bknum;
-	BTPageOpaque pageop;
-	Buffer		rbuf;
-	Page		rpage;
-	BTPageOpaque rpageop;
-	Buffer		pbuf;
-	Page		ppage;
-	BTPageOpaque ppageop;
-	Buffer		nbuf;
-	Page		npage;
-	BTPageOpaque npageop;
-	BlockNumber nbknum;
-	BTItem		nitem;
-	OffsetNumber afteroff;
-
-	btitem = _bt_formitem(&(btitem->bti_itup));
-	hikey = _bt_formitem(&(hikey->bti_itup));
-
-	page = BufferGetPage(buf);
-
-	/* grab new page */
-	nbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
-	nbknum = BufferGetBlockNumber(nbuf);
-	npage = BufferGetPage(nbuf);
-	_bt_pageinit(npage, BufferGetPageSize(nbuf));
-	npageop = (BTPageOpaque) PageGetSpecialPointer(npage);
-
-	/* copy content of the passed page */
-	memmove((char *) npage, (char *) page, BufferGetPageSize(buf));
-
-	/* re-init old (passed) page */
-	_bt_pageinit(page, BufferGetPageSize(buf));
-	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* init old page opaque */
-	pageop->btpo_flags = npageop->btpo_flags;	/* restore flags */
-	pageop->btpo_flags &= ~BTP_CHAIN;
-	if (_bt_itemcmp(rel, keysz, scankey, hikey, btitem, BTEqualStrategyNumber))
-		pageop->btpo_flags |= BTP_CHAIN;
-	pageop->btpo_prev = npageop->btpo_prev;		/* restore prev */
-	pageop->btpo_next = nbknum; /* next points to the new page */
-	pageop->btpo_parent = npageop->btpo_parent;
-
-	/* init shifted page opaque */
-	npageop->btpo_prev = bknum = BufferGetBlockNumber(buf);
-
-	/* shifted page is ok, populate old page */
-
-	/* add passed hikey */
-	itemsz = IndexTupleDSize(hikey->bti_itup)
-		+ (sizeof(BTItemData) - sizeof(IndexTupleData));
-	itemsz = MAXALIGN(itemsz);
-	if (PageAddItem(page, (Item) hikey, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add hikey in _bt_shift");
-	pfree(hikey);
-
-	/* add btitem */
-	itemsz = IndexTupleDSize(btitem->bti_itup)
-		+ (sizeof(BTItemData) - sizeof(IndexTupleData));
-	itemsz = MAXALIGN(itemsz);
-	if (PageAddItem(page, (Item) btitem, itemsz, P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add firstkey in _bt_shift");
-	pfree(btitem);
-	nitem = (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
-	btitem = _bt_formitem(&(nitem->bti_itup));
-	ItemPointerSet(&(btitem->bti_itup.t_tid), bknum, P_HIKEY);
-
-	/* ok, write them out */
-	_bt_wrtnorelbuf(rel, nbuf);
-	_bt_wrtnorelbuf(rel, buf);
-
-	/* fix btpo_prev on right sibling of old page */
-	if (!P_RIGHTMOST(npageop))
-	{
-		rbuf = _bt_getbuf(rel, npageop->btpo_next, BT_WRITE);
-		rpage = BufferGetPage(rbuf);
-		rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-		rpageop->btpo_prev = nbknum;
-		_bt_wrtbuf(rel, rbuf);
-	}
-
-	/* get parent pointing to the old page */
-	ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-				   bknum, P_HIKEY);
-	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-	ppage = BufferGetPage(pbuf);
-	ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
-
-	_bt_relbuf(rel, nbuf, BT_WRITE);
-	_bt_relbuf(rel, buf, BT_WRITE);
-
-	/* re-set parent' pointer - we shifted our page to the right ! */
-	nitem = (BTItem) PageGetItem(ppage,
-								 PageGetItemId(ppage, stack->bts_offset));
-	ItemPointerSet(&(nitem->bti_itup.t_tid), nbknum, P_HIKEY);
-	ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), nbknum, P_HIKEY);
-	_bt_wrtnorelbuf(rel, pbuf);
-
-	/*
-	 * Now we want insert into the parent pointer to our old page. It has
-	 * to be inserted before the pointer to new page. You may get problems
-	 * here (in the _bt_goesonpg and/or _bt_pgaddtup), but may be not - I
-	 * don't know. It works if old page is leftmost (nitem is NULL) and
-	 * btitem < hikey and it's all what we need currently. - vadim
-	 * 05/30/97
-	 */
-	nitem = NULL;
-	afteroff = P_FIRSTKEY;
-	if (!P_RIGHTMOST(ppageop))
-		afteroff = OffsetNumberNext(afteroff);
-	if (stack->bts_offset >= afteroff)
-	{
-		afteroff = OffsetNumberPrev(stack->bts_offset);
-		nitem = (BTItem) PageGetItem(ppage, PageGetItemId(ppage, afteroff));
-		nitem = _bt_formitem(&(nitem->bti_itup));
-	}
-	res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
-						 keysz, scankey, btitem, nitem);
-	pfree(btitem);
-
-	ItemPointerSet(&(res->pointerData), nbknum, P_HIKEY);
-
-	return res;
-}
-
-#endif
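Note on the new _bt_findsplitloc()/_bt_checksplitloc() pair in nbtinsert.c
above: the split point is chosen by trying every existing item as the first
item of the right page and keeping the feasible candidate whose left and
right free space differ the least.  The following is a minimal standalone
sketch of that selection rule, not the patch's code.  The names SplitChoice,
consider_split, choose_split and page_space are hypothetical, positions are
0-based array indexes rather than OffsetNumbers, and the sketch assumes a
single usable-space figure for both halves.  It deliberately ignores the
right page's inherited high key, per-item line pointer overhead, and the
non-leaf key truncation credit that the real routine accounts for.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical result type: best split found so far. */
typedef struct SplitChoice
{
	bool		found;				/* any feasible split seen? */
	bool		new_item_on_left;	/* does the new item go on the left page? */
	size_t		first_right;		/* index of first existing item going right */
	int			best_delta;			/* |leftfree - rightfree| of best split */
} SplitChoice;

/* Evaluate one candidate and remember it if it is feasible and more balanced. */
static void
consider_split(SplitChoice *best, size_t first_right,
			   int left_free, int right_free,
			   bool new_item_on_left, int new_item_size)
{
	/* Charge the new item to whichever side it would land on. */
	if (new_item_on_left)
		left_free -= new_item_size;
	else
		right_free -= new_item_size;

	if (left_free >= 0 && right_free >= 0)
	{
		int			delta = abs(left_free - right_free);

		if (!best->found || delta < best->best_delta)
		{
			best->found = true;
			best->new_item_on_left = new_item_on_left;
			best->first_right = first_right;
			best->best_delta = delta;
		}
	}
}

/*
 * item_sizes[0..nitems-1] are the existing items in key order;
 * new_item_pos is the index the new item must go in front of
 * (nitems means "at the end").
 */
SplitChoice
choose_split(const int *item_sizes, size_t nitems,
			 size_t new_item_pos, int new_item_size, int page_space)
{
	SplitChoice best = {false, false, 0, 0};
	int			total = 0;
	int			to_left = 0;
	size_t		i;

	for (i = 0; i < nitems; i++)
		total += item_sizes[i];

	for (i = 0; i < nitems; i++)
	{
		/* Item i becomes the left page's high key, so it counts on the left. */
		int			left_free = page_space - to_left - item_sizes[i];
		int			right_free = page_space - (total - to_left);

		if (i < new_item_pos)
			consider_split(&best, i, left_free, right_free, false, new_item_size);
		else if (i > new_item_pos)
			consider_split(&best, i, left_free, right_free, true, new_item_size);
		else
		{
			/* Split falls exactly at the insert position: try both sides. */
			consider_split(&best, i, left_free, right_free, false, new_item_size);
			consider_split(&best, i, left_free, right_free, true, new_item_size);
		}

		to_left += item_sizes[i];
	}

	return best;
}
```

With four existing items of size 10, a new item of size 10 appended at the
end, and page_space = 40, the sketch picks first_right = 2 with the new item
on the right page.  That leaves 10 units free on each side once the left
page's high key is counted.  A result with found == false corresponds to the
"can't find a feasible split point" elog in the patch.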
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 1a623698f57c34711c62225384e7c73feb510281..40604dbc25830dfbaea2f03b4f39a3290ad0e477 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.36 2000/04/12 17:14:49 momjian Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.37 2000/07/21 06:42:32 tgl Exp $
  *
  *	NOTES
  *	   Postgres btree pages look like ordinary relation pages.	The opaque
@@ -90,7 +90,7 @@ _bt_metapinit(Relation rel)
 	metad.btm_version = BTREE_VERSION;
 	metad.btm_root = P_NONE;
 	metad.btm_level = 0;
-	memmove((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
+	memcpy((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
 
 	op = (BTPageOpaque) PageGetSpecialPointer(pg);
 	op->btpo_flags = BTP_META;
@@ -102,52 +102,6 @@ _bt_metapinit(Relation rel)
 		UnlockRelation(rel, AccessExclusiveLock);
 }
 
-#ifdef NOT_USED
-/*
- *	_bt_checkmeta() -- Verify that the metadata stored in a btree are
- *					   reasonable.
- */
-void
-_bt_checkmeta(Relation rel)
-{
-	Buffer		metabuf;
-	Page		metap;
-	BTMetaPageData *metad;
-	BTPageOpaque op;
-	int			nblocks;
-
-	/* if the relation is empty, this is init time; don't complain */
-	if ((nblocks = RelationGetNumberOfBlocks(rel)) == 0)
-		return;
-
-	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-	metap = BufferGetPage(metabuf);
-	op = (BTPageOpaque) PageGetSpecialPointer(metap);
-	if (!(op->btpo_flags & BTP_META))
-	{
-		elog(ERROR, "Invalid metapage for index %s",
-			 RelationGetRelationName(rel));
-	}
-	metad = BTPageGetMeta(metap);
-
-	if (metad->btm_magic != BTREE_MAGIC)
-	{
-		elog(ERROR, "Index %s is not a btree",
-			 RelationGetRelationName(rel));
-	}
-
-	if (metad->btm_version != BTREE_VERSION)
-	{
-		elog(ERROR, "Version mismatch on %s:  version %d file, version %d code",
-			 RelationGetRelationName(rel),
-			 metad->btm_version, BTREE_VERSION);
-	}
-
-	_bt_relbuf(rel, metabuf, BT_READ);
-}
-
-#endif
-
 /*
  *	_bt_getroot() -- Get the root page of the btree.
  *
@@ -157,11 +111,15 @@ _bt_checkmeta(Relation rel)
  *		standard class of race conditions exists here; I think I covered
  *		them all in the Hopi Indian rain dance of lock requests below.
  *
- *		We pass in the access type (BT_READ or BT_WRITE), and return the
- *		root page's buffer with the appropriate lock type set.  Reference
- *		count on the root page gets bumped by ReadBuffer.  The metadata
- *		page is unlocked and unreferenced by this process when this routine
- *		returns.
+ *		The access type parameter (BT_READ or BT_WRITE) controls whether
+ *		a new root page will be created or not.  If access = BT_READ,
+ *		and no root page exists, we just return InvalidBuffer.  For
+ *		BT_WRITE, we try to create the root page if it doesn't exist.
+ *		NOTE that the returned root page will have only a read lock set
+ *		on it even if access = BT_WRITE!
+ *
+ *		On successful return, the root page is pinned and read-locked.
+ *		The metadata page is not locked or pinned on exit.
  */
 Buffer
 _bt_getroot(Relation rel, int access)
@@ -178,78 +136,71 @@ _bt_getroot(Relation rel, int access)
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-	Assert(metaopaque->btpo_flags & BTP_META);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_magic != BTREE_MAGIC)
-	{
+	if (!(metaopaque->btpo_flags & BTP_META) ||
+		metad->btm_magic != BTREE_MAGIC)
 		elog(ERROR, "Index %s is not a btree",
 			 RelationGetRelationName(rel));
-	}
 
 	if (metad->btm_version != BTREE_VERSION)
-	{
-		elog(ERROR, "Version mismatch on %s:  version %d file, version %d code",
+		elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
 			 RelationGetRelationName(rel),
 			 metad->btm_version, BTREE_VERSION);
-	}
 
 	/* if no root page initialized yet, do it */
 	if (metad->btm_root == P_NONE)
 	{
+		/* If access = BT_READ, caller doesn't want us to create root yet */
+		if (access == BT_READ)
+		{
+			_bt_relbuf(rel, metabuf, BT_READ);
+			return InvalidBuffer;
+		}
 
-		/* turn our read lock in for a write lock */
-		_bt_relbuf(rel, metabuf, BT_READ);
-		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		Assert(metaopaque->btpo_flags & BTP_META);
-		metad = BTPageGetMeta(metapg);
+		/* trade in our read lock for a write lock */
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+		LockBuffer(metabuf, BT_WRITE);
 
 		/*
 		 * Race condition:	if someone else initialized the metadata
 		 * between the time we released the read lock and acquired the
-		 * write lock, above, we want to avoid doing it again.
+		 * write lock, above, we must avoid doing it again.
 		 */
-
 		if (metad->btm_root == P_NONE)
 		{
 
 			/*
 			 * Get, initialize, write, and leave a lock of the appropriate
 			 * type on the new root page.  Since this is the first page in
-			 * the tree, it's a leaf.
+			 * the tree, it's a leaf as well as the root.
 			 */
-
 			rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
 			rootblkno = BufferGetBlockNumber(rootbuf);
 			rootpg = BufferGetPage(rootbuf);
+
 			metad->btm_root = rootblkno;
 			metad->btm_level = 1;
+
 			_bt_pageinit(rootpg, BufferGetPageSize(rootbuf));
 			rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
 			rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT);
 			_bt_wrtnorelbuf(rel, rootbuf);
 
-			/* swap write lock for read lock, if appropriate */
-			if (access != BT_WRITE)
-			{
-				LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
-				LockBuffer(rootbuf, BT_READ);
-			}
+			/* swap write lock for read lock */
+			LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
+			LockBuffer(rootbuf, BT_READ);
 
-			/* okay, metadata is correct */
+			/* okay, metadata is correct, write and release it */
 			_bt_wrtbuf(rel, metabuf);
 		}
 		else
 		{
-
 			/*
 			 * Metadata initialized by someone else.  In order to
 			 * guarantee no deadlocks, we have to release the metadata
 			 * page and start all over again.
 			 */
-
 			_bt_relbuf(rel, metabuf, BT_WRITE);
 			return _bt_getroot(rel, access);
 		}
@@ -259,22 +210,21 @@ _bt_getroot(Relation rel, int access)
 		rootblkno = metad->btm_root;
 		_bt_relbuf(rel, metabuf, BT_READ);		/* done with the meta page */
 
-		rootbuf = _bt_getbuf(rel, rootblkno, access);
+		rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
 	}
 
 	/*
 	 * Race condition:	If the root page split between the time we looked
 	 * at the metadata page and got the root buffer, then we got the wrong
-	 * buffer.
+	 * buffer.  Release it and try again.
 	 */
-
 	rootpg = BufferGetPage(rootbuf);
 	rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
-	if (!(rootopaque->btpo_flags & BTP_ROOT))
-	{
 
+	if (! P_ISROOT(rootopaque))
+	{
 		/* it happened, try again */
-		_bt_relbuf(rel, rootbuf, access);
+		_bt_relbuf(rel, rootbuf, BT_READ);
 		return _bt_getroot(rel, access);
 	}
 
@@ -283,7 +233,6 @@ _bt_getroot(Relation rel, int access)
 	 * count is correct, and we have no lock set on the metadata page.
 	 * Return the root block.
 	 */
-
 	return rootbuf;
 }
 
@@ -291,33 +240,38 @@ _bt_getroot(Relation rel, int access)
  *	_bt_getbuf() -- Get a buffer by block number for read or write.
  *
  *		When this routine returns, the appropriate lock is set on the
- *		requested buffer its reference count is correct.
+ *		requested buffer and its reference count has been incremented
+ *		(ie, the buffer is "locked and pinned").
  */
 Buffer
 _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 {
 	Buffer		buf;
-	Page		page;
 
 	if (blkno != P_NEW)
 	{
+		/* Read an existing block of the relation */
 		buf = ReadBuffer(rel, blkno);
 		LockBuffer(buf, access);
 	}
 	else
 	{
+		Page		page;
 
 		/*
-		 * Extend bufmgr code is unclean and so we have to use locking
+		 * Extend the relation by one page.
+		 *
+		 * The bufmgr code for extending a relation is unclean, so we have
+		 * to use extra locking here.
 		 */
 		LockPage(rel, 0, ExclusiveLock);
 		buf = ReadBuffer(rel, blkno);
+		LockBuffer(buf, access);
 		UnlockPage(rel, 0, ExclusiveLock);
-		blkno = BufferGetBlockNumber(buf);
+
+		/* Initialize the new page before returning it */
 		page = BufferGetPage(buf);
 		_bt_pageinit(page, BufferGetPageSize(buf));
-		LockBuffer(buf, access);
 	}
 
 	/* ref count and lock type are correct */
@@ -326,6 +280,8 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 
 /*
  *	_bt_relbuf() -- release a locked buffer.
+ *
+ * Lock and pin (refcount) are both dropped.
  */
 void
 _bt_relbuf(Relation rel, Buffer buf, int access)
@@ -337,9 +293,15 @@ _bt_relbuf(Relation rel, Buffer buf, int access)
 /*
  *	_bt_wrtbuf() -- write a btree page to disk.
  *
- *		This routine releases the lock held on the buffer and our reference
- *		to it.	It is an error to call _bt_wrtbuf() without a write lock
- *		or a reference to the buffer.
+ *		This routine releases the lock held on the buffer and our refcount
+ *		for it.  It is an error to call _bt_wrtbuf() without a write lock
+ *		and a pin on the buffer.
+ *
+ * NOTE: actually, the buffer manager just marks the shared buffer page
+ * dirty here; the real I/O happens later.  Since we can't persuade the
+ * Unix kernel to schedule disk writes in a particular order, there's not
+ * much point in worrying about this.  The most we can say is that all the
+ * writes will occur before commit.
  */
 void
 _bt_wrtbuf(Relation rel, Buffer buf)
@@ -353,7 +315,9 @@ _bt_wrtbuf(Relation rel, Buffer buf)
  *						 our reference or lock.
  *
  *		It is an error to call _bt_wrtnorelbuf() without a write lock
- *		or a reference to the buffer.
+ *		and a pin on the buffer.
+ *
+ * See above NOTE.
  */
 void
 _bt_wrtnorelbuf(Relation rel, Buffer buf)
@@ -389,10 +353,10 @@ _bt_pageinit(Page page, Size size)
  *		we split the root page, we record the new parent in the metadata page
  *		for the relation.  This routine does the work.
  *
- *		No direct preconditions, but if you don't have the a write lock on
+ *		No direct preconditions, but if you don't have the write lock on
  *		at least the old root page when you call this, you're making a big
  *		mistake.  On exit, metapage data is correct and we no longer have
- *		a reference to or lock on the metapage.
+ *		a pin or lock on the metapage.
  */
 void
 _bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
@@ -416,127 +380,8 @@ _bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
 }
 
 /*
- *	_bt_getstackbuf() -- Walk back up the tree one step, and find the item
- *						 we last looked at in the parent.
- *
- *		This is possible because we save a bit image of the last item
- *		we looked at in the parent, and the update algorithm guarantees
- *		that if items above us in the tree move, they only move right.
- *
- *		Also, re-set bts_blkno & bts_offset if changed and
- *		bts_btitem (it may be changed - see _bt_insertonpg).
+ * Delete an item from a btree.  It had better be a leaf item...
  */
-Buffer
-_bt_getstackbuf(Relation rel, BTStack stack, int access)
-{
-	Buffer		buf;
-	BlockNumber blkno;
-	OffsetNumber start,
-				offnum,
-				maxoff;
-	OffsetNumber i;
-	Page		page;
-	ItemId		itemid;
-	BTItem		item;
-	BTPageOpaque opaque;
-	BTItem		item_save;
-	int			item_nbytes;
-
-	blkno = stack->bts_blkno;
-	buf = _bt_getbuf(rel, blkno, access);
-	page = BufferGetPage(buf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	if (stack->bts_offset == InvalidOffsetNumber ||
-		maxoff >= stack->bts_offset)
-	{
-
-		/*
-		 * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
-		 * case of concurrent ROOT page split
-		 */
-		if (stack->bts_offset == InvalidOffsetNumber)
-			i = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-		else
-		{
-			itemid = PageGetItemId(page, stack->bts_offset);
-			item = (BTItem) PageGetItem(page, itemid);
-
-			/* if the item is where we left it, we're done */
-			if (BTItemSame(item, stack->bts_btitem))
-			{
-				pfree(stack->bts_btitem);
-				item_nbytes = ItemIdGetLength(itemid);
-				item_save = (BTItem) palloc(item_nbytes);
-				memmove((char *) item_save, (char *) item, item_nbytes);
-				stack->bts_btitem = item_save;
-				return buf;
-			}
-			i = OffsetNumberNext(stack->bts_offset);
-		}
-
-		/* if the item has just moved right on this page, we're done */
-		for (;
-			 i <= maxoff;
-			 i = OffsetNumberNext(i))
-		{
-			itemid = PageGetItemId(page, i);
-			item = (BTItem) PageGetItem(page, itemid);
-
-			/* if the item is where we left it, we're done */
-			if (BTItemSame(item, stack->bts_btitem))
-			{
-				stack->bts_offset = i;
-				pfree(stack->bts_btitem);
-				item_nbytes = ItemIdGetLength(itemid);
-				item_save = (BTItem) palloc(item_nbytes);
-				memmove((char *) item_save, (char *) item, item_nbytes);
-				stack->bts_btitem = item_save;
-				return buf;
-			}
-		}
-	}
-
-	/* by here, the item we're looking for moved right at least one page */
-	for (;;)
-	{
-		blkno = opaque->btpo_next;
-		if (P_RIGHTMOST(opaque))
-			elog(FATAL, "my bits moved right off the end of the world!\
-\n\tRecreate index %s.", RelationGetRelationName(rel));
-
-		_bt_relbuf(rel, buf, access);
-		buf = _bt_getbuf(rel, blkno, access);
-		page = BufferGetPage(buf);
-		maxoff = PageGetMaxOffsetNumber(page);
-		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-		/* if we have a right sibling, step over the high key */
-		start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-		/* see if it's on this page */
-		for (offnum = start;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			itemid = PageGetItemId(page, offnum);
-			item = (BTItem) PageGetItem(page, itemid);
-			if (BTItemSame(item, stack->bts_btitem))
-			{
-				stack->bts_offset = offnum;
-				stack->bts_blkno = blkno;
-				pfree(stack->bts_btitem);
-				item_nbytes = ItemIdGetLength(itemid);
-				item_save = (BTItem) palloc(item_nbytes);
-				memmove((char *) item_save, (char *) item, item_nbytes);
-				stack->bts_btitem = item_save;
-				return buf;
-			}
-		}
-	}
-}
-
 void
 _bt_pagedel(Relation rel, ItemPointer tid)
 {
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b174d303176e95f04db9429992b673b42d7125f8..072d4000705dd0e8ec87fc662f4571133d368929 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -12,7 +12,7 @@
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.61 2000/07/14 22:17:33 tgl Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.62 2000/07/21 06:42:32 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -26,6 +26,7 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 
+
 bool		BuildingBtree = false;		/* see comment in btbuild() */
 bool		FastBuild = true;	/* use sort/build instead of insertion
 								 * build */
@@ -206,8 +207,8 @@ btbuild(PG_FUNCTION_ARGS)
 		 * btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE.
 		 * Sure, it's just rule for placing/finding items and no more -
 		 * keytest'll return FALSE for a = 5 for items having 'a' isNULL.
-		 * Look at _bt_skeycmp, _bt_compare and _bt_itemcmp for how it
-		 * works.				 - vadim 03/23/97
+		 * Look at _bt_compare for how it works.
+		 *				 - vadim 03/23/97
 		 *
 		 * if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; }
 		 */
@@ -321,14 +322,6 @@ btinsert(PG_FUNCTION_ARGS)
 	/* generate an index tuple */
 	itup = index_formtuple(RelationGetDescr(rel), datum, nulls);
 	itup->t_tid = *ht_ctid;
-
-	/*
-	 * See comments in btbuild.
-	 *
-	 * if (itup->t_info & INDEX_NULL_MASK)
-	 *		PG_RETURN_POINTER((InsertIndexResult) NULL);
-	 */
-
 	btitem = _bt_formitem(itup);
 
 	res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel);
@@ -357,10 +350,10 @@ btgettuple(PG_FUNCTION_ARGS)
 
 	if (ItemPointerIsValid(&(scan->currentItemData)))
 	{
-
 		/*
 		 * Restore scan position using heap TID returned by previous call
-		 * to btgettuple(). _bt_restscan() locks buffer.
+		 * to btgettuple(). _bt_restscan() re-grabs the read lock on
+		 * the buffer, too.
 		 */
 		_bt_restscan(scan);
 		res = _bt_next(scan, dir);
@@ -369,8 +362,9 @@ btgettuple(PG_FUNCTION_ARGS)
 		res = _bt_first(scan, dir);
 
 	/*
-	 * Save heap TID to use it in _bt_restscan. Unlock buffer before
-	 * leaving index !
+	 * Save heap TID to use it in _bt_restscan.  Then release the read
+	 * lock on the buffer so that we aren't blocking other backends.
+	 * NOTE: we do keep the pin on the buffer!
 	 */
 	if (res)
 	{
@@ -419,7 +413,18 @@ btrescan(PG_FUNCTION_ARGS)
 
 	so = (BTScanOpaque) scan->opaque;
 
-	/* we don't hold a read lock on the current page in the scan */
+	if (so == NULL)				/* if called from btbeginscan */
+	{
+		so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
+		so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
+		so->keyData = (ScanKey) NULL;
+		if (scan->numberOfKeys > 0)
+			so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
+		scan->opaque = so;
+		scan->flags = 0x0;
+	}
+
+	/* we aren't holding any read locks, but gotta drop the pins */
 	if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
 	{
 		ReleaseBuffer(so->btso_curbuf);
@@ -427,7 +432,6 @@ btrescan(PG_FUNCTION_ARGS)
 		ItemPointerSetInvalid(iptr);
 	}
 
-	/* and we don't hold a read lock on the last marked item in the scan */
 	if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
 	{
 		ReleaseBuffer(so->btso_mrkbuf);
@@ -435,17 +439,6 @@ btrescan(PG_FUNCTION_ARGS)
 		ItemPointerSetInvalid(iptr);
 	}
 
-	if (so == NULL)				/* if called from btbeginscan */
-	{
-		so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
-		so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
-		so->keyData = (ScanKey) NULL;
-		if (scan->numberOfKeys > 0)
-			so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
-		scan->opaque = so;
-		scan->flags = 0x0;
-	}
-
 	/*
 	 * Reset the scan keys. Note that keys ordering stuff moved to
 	 * _bt_first.	   - vadim 05/05/97
@@ -472,7 +465,7 @@ btmovescan(IndexScanDesc scan, Datum v)
 
 	so = (BTScanOpaque) scan->opaque;
 
-	/* we don't hold a read lock on the current page in the scan */
+	/* we aren't holding any read locks, but gotta drop the pin */
 	if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
 	{
 		ReleaseBuffer(so->btso_curbuf);
@@ -480,7 +473,6 @@ btmovescan(IndexScanDesc scan, Datum v)
 		ItemPointerSetInvalid(iptr);
 	}
 
-/*	  scan->keyData[0].sk_argument = v; */
 	so->keyData[0].sk_argument = v;
 }
 
@@ -496,7 +488,7 @@ btendscan(PG_FUNCTION_ARGS)
 
 	so = (BTScanOpaque) scan->opaque;
 
-	/* we don't hold any read locks */
+	/* we aren't holding any read locks, but gotta drop the pins */
 	if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
 	{
 		if (BufferIsValid(so->btso_curbuf))
@@ -534,7 +526,7 @@ btmarkpos(PG_FUNCTION_ARGS)
 
 	so = (BTScanOpaque) scan->opaque;
 
-	/* we don't hold any read locks */
+	/* we aren't holding any read locks, but gotta drop the pin */
 	if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
 	{
 		ReleaseBuffer(so->btso_mrkbuf);
@@ -542,7 +534,7 @@ btmarkpos(PG_FUNCTION_ARGS)
 		ItemPointerSetInvalid(iptr);
 	}
 
-	/* bump pin on current buffer */
+	/* bump pin on current buffer for assignment to mark buffer */
 	if (ItemPointerIsValid(&(scan->currentItemData)))
 	{
 		so->btso_mrkbuf = ReadBuffer(scan->relation,
@@ -566,7 +558,7 @@ btrestrpos(PG_FUNCTION_ARGS)
 
 	so = (BTScanOpaque) scan->opaque;
 
-	/* we don't hold any read locks */
+	/* we aren't holding any read locks, but gotta drop the pin */
 	if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
 	{
 		ReleaseBuffer(so->btso_curbuf);
@@ -579,7 +571,6 @@ btrestrpos(PG_FUNCTION_ARGS)
 	{
 		so->btso_curbuf = ReadBuffer(scan->relation,
 								  BufferGetBlockNumber(so->btso_mrkbuf));
-
 		scan->currentItemData = scan->currentMarkData;
 		so->curHeapIptr = so->mrkHeapIptr;
 	}
@@ -603,6 +594,9 @@ btdelete(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+/*
+ * Restore scan position when btgettuple is called to continue a scan.
+ */
 static void
 _bt_restscan(IndexScanDesc scan)
 {
@@ -618,7 +612,12 @@ _bt_restscan(IndexScanDesc scan)
 	BTItem		item;
 	BlockNumber blkno;
 
-	LockBuffer(buf, BT_READ);	/* lock buffer first! */
+	/*
+	 * Get back the read lock we were holding on the buffer.
+	 * (We still have a reference-count pin on it, though.)
+	 */
+	LockBuffer(buf, BT_READ);
+
 	page = BufferGetPage(buf);
 	maxoff = PageGetMaxOffsetNumber(page);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -631,43 +630,40 @@ _bt_restscan(IndexScanDesc scan)
 	 */
 	if (!ItemPointerIsValid(&target))
 	{
-		ItemPointerSetOffsetNumber(&(scan->currentItemData),
-		   OffsetNumberPrev(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY));
+		ItemPointerSetOffsetNumber(current,
+								   OffsetNumberPrev(P_FIRSTDATAKEY(opaque)));
 		return;
 	}
 
-	if (maxoff >= offnum)
+	/*
+	 * The item we were on may have moved right due to insertions.
+	 * Find it again.
+	 */
+	for (;;)
 	{
-
-		/*
-		 * if the item is where we left it or has just moved right on this
-		 * page, we're done
-		 */
+		/* Check for item on this page */
 		for (;
 			 offnum <= maxoff;
 			 offnum = OffsetNumberNext(offnum))
 		{
 			item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-			if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
-				target.ip_blkid.bi_hi && \
-				item->bti_itup.t_tid.ip_blkid.bi_lo == \
-				target.ip_blkid.bi_lo && \
+			if (item->bti_itup.t_tid.ip_blkid.bi_hi ==
+				target.ip_blkid.bi_hi &&
+				item->bti_itup.t_tid.ip_blkid.bi_lo ==
+				target.ip_blkid.bi_lo &&
 				item->bti_itup.t_tid.ip_posid == target.ip_posid)
 			{
 				current->ip_posid = offnum;
 				return;
 			}
 		}
-	}
 
-	/*
-	 * By here, the item we're looking for moved right at least one page
-	 */
-	for (;;)
-	{
+		/*
+		 * By here, the item we're looking for moved right at least one page
+		 */
 		if (P_RIGHTMOST(opaque))
-			elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!\
-\n\tRecreate index %s.", RelationGetRelationName(rel));
+			elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!"
+				 "\n\tRecreate index %s.", RelationGetRelationName(rel));
 
 		blkno = opaque->btpo_next;
 		_bt_relbuf(rel, buf, BT_READ);
@@ -675,23 +671,8 @@ _bt_restscan(IndexScanDesc scan)
 		page = BufferGetPage(buf);
 		maxoff = PageGetMaxOffsetNumber(page);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-		/* see if it's on this page */
-		for (offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-			if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
-				target.ip_blkid.bi_hi && \
-				item->bti_itup.t_tid.ip_blkid.bi_lo == \
-				target.ip_blkid.bi_lo && \
-				item->bti_itup.t_tid.ip_posid == target.ip_posid)
-			{
-				ItemPointerSet(current, blkno, offnum);
-				so->btso_curbuf = buf;
-				return;
-			}
-		}
+		offnum = P_FIRSTDATAKEY(opaque);
+		ItemPointerSet(current, blkno, offnum);
+		so->btso_curbuf = buf;
 	}
 }
diff --git a/src/backend/access/nbtree/nbtscan.c b/src/backend/access/nbtree/nbtscan.c
index 37469365bcd42e2b37c06346341d6a9bdf9e95ab..5d48895c1ad28ca3ed76ae6c6f9f49d9258581d5 100644
--- a/src/backend/access/nbtree/nbtscan.c
+++ b/src/backend/access/nbtree/nbtscan.c
@@ -8,22 +8,25 @@
  *
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.31 2000/04/12 17:14:49 momjian Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.32 2000/07/21 06:42:32 tgl Exp $
  *
  *
  * NOTES
  *	 Because we can be doing an index scan on a relation while we update
  *	 it, we need to avoid missing data that moves around in the index.
- *	 The routines and global variables in this file guarantee that all
- *	 scans in the local address space stay correctly positioned.  This
- *	 is all we need to worry about, since write locking guarantees that
- *	 no one else will be on the same page at the same time as we are.
+ *	 Insertions and page splits are no problem because _bt_restscan()
+ *	 can figure out where the current item moved to, but if a deletion
+ *	 happens at or before the current scan position, we'd better do
+ *	 something to stay in sync.
+ *
+ *	 The routines in this file handle the problem for deletions issued
+ *	 by the current backend.  Currently, that's all we need, since
+ *	 deletions are only done by VACUUM and it gets an exclusive lock.
  *
  *	 The scheme is to manage a list of active scans in the current backend.
- *	 Whenever we add or remove records from an index, or whenever we
- *	 split a leaf page, we check the list of active scans to see if any
- *	 has been affected.  A scan is affected only if it is on the same
- *	 relation, and the same page, as the update.
+ *	 Whenever we remove a record from an index, we check the list of active
+ *	 scans to see if any has been affected.  A scan is affected only if it
+ *	 is on the same relation, and the same page, as the update.
  *
  *-------------------------------------------------------------------------
  */
@@ -111,7 +114,7 @@ _bt_dropscan(IndexScanDesc scan)
 
 /*
  *	_bt_adjscans() -- adjust all scans in the scan list to compensate
- *					  for a given deletion or insertion
+ *					  for a given deletion
  */
 void
 _bt_adjscans(Relation rel, ItemPointer tid)
@@ -153,7 +156,7 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
 	{
 		page = BufferGetPage(buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-		start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+		start = P_FIRSTDATAKEY(opaque);
 		if (ItemPointerGetOffsetNumber(current) == start)
 			ItemPointerSetInvalid(&(so->curHeapIptr));
 		else
@@ -165,7 +168,6 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
 			 */
 			LockBuffer(buf, BT_READ);
 			_bt_step(scan, &buf, BackwardScanDirection);
-			so->btso_curbuf = buf;
 			if (ItemPointerIsValid(current))
 			{
 				Page		pg = BufferGetPage(buf);
@@ -183,10 +185,9 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
 		&& ItemPointerGetBlockNumber(current) == blkno
 		&& ItemPointerGetOffsetNumber(current) >= offno)
 	{
-
 		page = BufferGetPage(so->btso_mrkbuf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-		start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+		start = P_FIRSTDATAKEY(opaque);
 
 		if (ItemPointerGetOffsetNumber(current) == start)
 			ItemPointerSetInvalid(&(so->mrkHeapIptr));
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 54c15b2f6a9b57272c5057e91c18b8c0ae315a03..49aec3b23d9b05788b9b9302d41df90c94fa11ed 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1,14 +1,14 @@
 /*-------------------------------------------------------------------------
  *
- * btsearch.c
+ * nbtsearch.c
  *	  search code for postgres btrees.
  *
+ *
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.60 2000/05/30 04:24:33 tgl Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.61 2000/07/21 06:42:32 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,102 +19,96 @@
 #include "access/nbtree.h"
 
 
+static RetrieveIndexResult _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 
-static BTStack _bt_searchr(Relation rel, int keysz, ScanKey scankey,
-			Buffer *bufP, BTStack stack_in);
-static int32 _bt_compare(Relation rel, TupleDesc itupdesc, Page page,
-						 int keysz, ScanKey scankey, OffsetNumber offnum);
-static bool
-			_bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
-static RetrieveIndexResult
-			_bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 
 /*
- *	_bt_search() -- Search for a scan key in the index.
+ *	_bt_search() -- Search the tree for a particular scankey,
+ *		or more precisely for the first leaf page it could be on.
+ *
+ * Return value is a stack of parent-page pointers.  *bufP is set to the
+ * address of the leaf-page buffer, which is read-locked and pinned.
+ * No locks are held on the parent pages, however!
  *
- *		This routine is actually just a helper that sets things up and
- *		calls a recursive-descent search routine on the tree.
+ * NOTE that the returned buffer is read-locked regardless of the access
+ * parameter.  However, access = BT_WRITE will allow an empty root page
+ * to be created and returned.  When access = BT_READ, an empty index
+ * will result in *bufP being set to InvalidBuffer.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, Buffer *bufP)
-{
-	*bufP = _bt_getroot(rel, BT_READ);
-	return _bt_searchr(rel, keysz, scankey, bufP, (BTStack) NULL);
-}
-
-/*
- *	_bt_searchr() -- Search the tree recursively for a particular scankey.
- */
-static BTStack
-_bt_searchr(Relation rel,
-			int keysz,
-			ScanKey scankey,
-			Buffer *bufP,
-			BTStack stack_in)
+_bt_search(Relation rel, int keysz, ScanKey scankey,
+		   Buffer *bufP, int access)
 {
-	BTStack		stack;
-	OffsetNumber offnum;
-	Page		page;
-	BTPageOpaque opaque;
-	BlockNumber par_blkno;
-	BlockNumber blkno;
-	ItemId		itemid;
-	BTItem		btitem;
-	BTItem		item_save;
-	int			item_nbytes;
-	IndexTuple	itup;
+	BTStack stack_in = NULL;
 
-	/* if this is a leaf page, we're done */
-	page = BufferGetPage(*bufP);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	if (opaque->btpo_flags & BTP_LEAF)
-		return stack_in;
+	/* Get the root page to start with */
+	*bufP = _bt_getroot(rel, access);
 
-	/*
-	 * Find the appropriate item on the internal page, and get the child
-	 * page that it points to.
-	 */
+	/* If index is empty and access = BT_READ, no root page is created. */
+	if (! BufferIsValid(*bufP))
+		return (BTStack) NULL;
 
-	par_blkno = BufferGetBlockNumber(*bufP);
-	offnum = _bt_binsrch(rel, *bufP, keysz, scankey, BT_DESCENT);
-	itemid = PageGetItemId(page, offnum);
-	btitem = (BTItem) PageGetItem(page, itemid);
-	itup = &(btitem->bti_itup);
-	blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+	/* Loop iterates once per level descended in the tree */
+	for (;;)
+	{
+		Page		page;
+		BTPageOpaque opaque;
+		OffsetNumber offnum;
+		ItemId		itemid;
+		BTItem		btitem;
+		IndexTuple	itup;
+		BlockNumber blkno;
+		BlockNumber par_blkno;
+		BTStack		new_stack;
+
+		/* if this is a leaf page, we're done */
+		page = BufferGetPage(*bufP);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		if (P_ISLEAF(opaque))
+			break;
 
-	/*
-	 * We need to save the bit image of the index entry we chose in the
-	 * parent page on a stack.	In case we split the tree, we'll use this
-	 * bit image to figure out what our real parent page is, in case the
-	 * parent splits while we're working lower in the tree.  See the paper
-	 * by Lehman and Yao for how this is detected and handled.	(We use
-	 * unique OIDs to disambiguate duplicate keys in the index -- Lehman
-	 * and Yao disallow duplicate keys).
-	 */
+		/*
+		 * Find the appropriate item on the internal page, and get the
+		 * child page that it points to.
+		 */
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey);
+		itemid = PageGetItemId(page, offnum);
+		btitem = (BTItem) PageGetItem(page, itemid);
+		itup = &(btitem->bti_itup);
+		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+		par_blkno = BufferGetBlockNumber(*bufP);
 
-	item_nbytes = ItemIdGetLength(itemid);
-	item_save = (BTItem) palloc(item_nbytes);
-	memmove((char *) item_save, (char *) btitem, item_nbytes);
-	stack = (BTStack) palloc(sizeof(BTStackData));
-	stack->bts_blkno = par_blkno;
-	stack->bts_offset = offnum;
-	stack->bts_btitem = item_save;
-	stack->bts_parent = stack_in;
+		/*
+		 * We need to save the bit image of the index entry we chose in the
+		 * parent page on a stack. In case we split the tree, we'll use this
+		 * bit image to figure out what our real parent page is, in case the
+		 * parent splits while we're working lower in the tree.  See the paper
+		 * by Lehman and Yao for how this is detected and handled. (We use the
+		 * child link to disambiguate duplicate keys in the index -- Lehman
+		 * and Yao disallow duplicate keys.)
+		 */
+		new_stack = (BTStack) palloc(sizeof(BTStackData));
+		new_stack->bts_blkno = par_blkno;
+		new_stack->bts_offset = offnum;
+		memcpy(&new_stack->bts_btitem, btitem, sizeof(BTItemData));
+		new_stack->bts_parent = stack_in;
 
-	/* drop the read lock on the parent page and acquire one on the child */
-	_bt_relbuf(rel, *bufP, BT_READ);
-	*bufP = _bt_getbuf(rel, blkno, BT_READ);
+		/* drop the read lock on the parent page, acquire one on the child */
+		_bt_relbuf(rel, *bufP, BT_READ);
+		*bufP = _bt_getbuf(rel, blkno, BT_READ);
 
-	/*
-	 * Race -- the page we just grabbed may have split since we read its
-	 * pointer in the parent.  If it has, we may need to move right to its
-	 * new sibling.  Do that.
-	 */
+		/*
+		 * Race -- the page we just grabbed may have split since we read its
+		 * pointer in the parent.  If it has, we may need to move right to its
+		 * new sibling.  Do that.
+		 */
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ);
 
-	*bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ);
+		/* okay, all set to move down a level */
+		stack_in = new_stack;
+	}
 
-	/* okay, all set to move down a level */
-	return _bt_searchr(rel, keysz, scankey, bufP, stack);
+	return stack_in;
 }
 
 /*
@@ -133,7 +127,7 @@ _bt_searchr(Relation rel,
  *
  *		On entry, we have the buffer pinned and a lock of the proper type.
  *		If we move right, we release the buffer and lock and acquire the
- *		same on the right sibling.
+ *		same on the right sibling.  Return value is the buffer we stop at.
  */
 Buffer
 _bt_moveright(Relation rel,
@@ -144,231 +138,81 @@ _bt_moveright(Relation rel,
 {
 	Page		page;
 	BTPageOpaque opaque;
-	ItemId		hikey;
-	BlockNumber rblkno;
-	int			natts = rel->rd_rel->relnatts;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	/* if we're on a rightmost page, we don't need to move right */
-	if (P_RIGHTMOST(opaque))
-		return buf;
-
-	/* by convention, item 0 on non-rightmost pages is the high key */
-	hikey = PageGetItemId(page, P_HIKEY);
-
 	/*
-	 * If the scan key that brought us to this page is >= the high key
+	 * If the scan key that brought us to this page is > the high key
 	 * stored on the page, then the page has split and we need to move
-	 * right.
+	 * right.  (If the scan key is equal to the high key, we might or
+	 * might not need to move right; have to scan the page first anyway.)
+	 * It could even have split more than once, so scan as far as needed.
 	 */
-
-	if (_bt_skeycmp(rel, keysz, scankey, page, hikey,
-					BTGreaterEqualStrategyNumber))
+	while (!P_RIGHTMOST(opaque) &&
+		   _bt_compare(rel, keysz, scankey, page, P_HIKEY) > 0)
 	{
-		/* move right as long as we need to */
-		do
-		{
-			OffsetNumber offmax = PageGetMaxOffsetNumber(page);
-
-			/*
-			 * If this page consists of all duplicate keys (hikey and
-			 * first key on the page have the same value), then we don't
-			 * need to step right.
-			 *
-			 * NOTE for multi-column indices: we may do scan using keys not
-			 * for all attrs. But we handle duplicates using all attrs in
-			 * _bt_insert/_bt_spool code. And so we've to compare scankey
-			 * with _last_ item on this page to do not lose "good" tuples
-			 * if number of attrs > keysize. Example: (2,0) - last items
-			 * on this page, (2,1) - first item on next page (hikey), our
-			 * scankey is x = 2. Scankey == (2,1) because of we compare
-			 * first attrs only, but we shouldn't to move right of here. -
-			 * vadim 04/15/97
-			 *
-			 * Also, if this page is not LEAF one (and # of attrs > keysize)
-			 * then we can't move too. - vadim 10/22/97
-			 */
-
-			if (_bt_skeycmp(rel, keysz, scankey, page, hikey,
-							BTEqualStrategyNumber))
-			{
-				if (opaque->btpo_flags & BTP_CHAIN)
-				{
-					Assert((opaque->btpo_flags & BTP_LEAF) || offmax > P_HIKEY);
-					break;
-				}
-				if (offmax > P_HIKEY)
-				{
-					if (natts == keysz) /* sanity checks */
-					{
-						if (_bt_skeycmp(rel, keysz, scankey, page,
-										PageGetItemId(page, P_FIRSTKEY),
-										BTEqualStrategyNumber))
-							elog(FATAL, "btree: BTP_CHAIN flag was expected in %s (access = %s)",
-								 RelationGetRelationName(rel), access ? "bt_write" : "bt_read");
-						if (_bt_skeycmp(rel, keysz, scankey, page,
-										PageGetItemId(page, offmax),
-										BTEqualStrategyNumber))
-							elog(FATAL, "btree: unexpected equal last item");
-						if (_bt_skeycmp(rel, keysz, scankey, page,
-										PageGetItemId(page, offmax),
-										BTLessStrategyNumber))
-							elog(FATAL, "btree: unexpected greater last item");
-						/* move right */
-					}
-					else if (!(opaque->btpo_flags & BTP_LEAF))
-						break;
-					else if (_bt_skeycmp(rel, keysz, scankey, page,
-										 PageGetItemId(page, offmax),
-										 BTLessEqualStrategyNumber))
-						break;
-				}
-			}
+		/* step right one page */
+		BlockNumber		rblkno = opaque->btpo_next;
 
-			/* step right one page */
-			rblkno = opaque->btpo_next;
-			_bt_relbuf(rel, buf, access);
-			buf = _bt_getbuf(rel, rblkno, access);
-			page = BufferGetPage(buf);
-			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-			hikey = PageGetItemId(page, P_HIKEY);
-
-		} while (!P_RIGHTMOST(opaque)
-				 && _bt_skeycmp(rel, keysz, scankey, page, hikey,
-								BTGreaterEqualStrategyNumber));
+		_bt_relbuf(rel, buf, access);
+		buf = _bt_getbuf(rel, rblkno, access);
+		page = BufferGetPage(buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	}
+
 	return buf;
 }
 
 /*
- *	_bt_skeycmp() -- compare a scan key to a particular item on a page using
- *					 a requested strategy (<, <=, =, >=, >).
+ *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- *		We ignore the unique OIDs stored in the btree item here.  Those
- *		numbers are intended for use internally only, in repositioning a
- *		scan after a page split.  They do not impose any meaningful ordering.
+ * The scankey we get has the compare function stored in the procedure
+ * entry of each data struct.  We invoke this regproc to do the
+ * comparison for every key in the scankey.
  *
- *		The comparison is A <op> B, where A is the scan key and B is the
- *		tuple pointed at by itemid on page.
- */
-bool
-_bt_skeycmp(Relation rel,
-			Size keysz,
-			ScanKey scankey,
-			Page page,
-			ItemId itemid,
-			StrategyNumber strat)
-{
-	BTItem		item;
-	IndexTuple	indexTuple;
-	TupleDesc	tupDes;
-	int			i;
-	int32		compare = 0;
-
-	item = (BTItem) PageGetItem(page, itemid);
-	indexTuple = &(item->bti_itup);
-
-	tupDes = RelationGetDescr(rel);
-
-	for (i = 1; i <= (int) keysz; i++)
-	{
-		ScanKey		entry = &scankey[i - 1];
-		Datum		attrDatum;
-		bool		isNull;
-
-		Assert(entry->sk_attno == i);
-		attrDatum = index_getattr(indexTuple,
-								  entry->sk_attno,
-								  tupDes,
-								  &isNull);
-
-		/* see comments about NULLs handling in btbuild */
-		if (entry->sk_flags & SK_ISNULL)		/* key is NULL */
-		{
-			if (isNull)
-				compare = 0;	/* NULL key "=" NULL datum */
-			else
-				compare = 1;	/* NULL key ">" not-NULL datum */
-		}
-		else if (isNull)		/* key is NOT_NULL and item is NULL */
-		{
-			compare = -1;		/* not-NULL key "<" NULL datum */
-		}
-		else
-			compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
-												  entry->sk_argument,
-												  attrDatum));
-
-		if (compare != 0)
-			break;				/* done when we find unequal attributes */
-	}
-
-	switch (strat)
-	{
-		case BTLessStrategyNumber:
-			return (bool) (compare < 0);
-		case BTLessEqualStrategyNumber:
-			return (bool) (compare <= 0);
-		case BTEqualStrategyNumber:
-			return (bool) (compare == 0);
-		case BTGreaterEqualStrategyNumber:
-			return (bool) (compare >= 0);
-		case BTGreaterStrategyNumber:
-			return (bool) (compare > 0);
-	}
-
-	elog(ERROR, "_bt_skeycmp: bogus strategy %d", (int) strat);
-	return false;
-}
-
-/*
- *	_bt_binsrch() -- Do a binary search for a key on a particular page.
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey.  (NOTE: in particular, this means it is possible
+ * to return a value 1 greater than the number of keys on the page,
+ * if the scankey is > all keys on the page.)
  *
- *		The scankey we get has the compare function stored in the procedure
- *		entry of each data struct.	We invoke this regproc to do the
- *		comparison for every key in the scankey.  _bt_binsrch() returns
- *		the OffsetNumber of the first matching key on the page, or the
- *		OffsetNumber at which the matching key would appear if it were
- *		on this page.  (NOTE: in particular, this means it is possible to
- *		return a value 1 greater than the number of keys on the page, if
- *		the scankey is > all keys on the page.)
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey.  (Since _bt_compare treats the first
+ * data key of such a page as minus infinity, there will be at least one
+ * key < scankey, so the result always points at one of the keys on the
+ * page.)  This key indicates the right place to descend to be sure we
+ * find all leaf keys >= given scankey.
  *
- *		By the time this procedure is called, we're sure we're looking
- *		at the right page -- don't need to walk right.  _bt_binsrch() has
- *		no lock or refcount side effects on the buffer.
+ * This procedure is not responsible for walking right; it just examines
+ * the given page.  _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
-			ScanKey scankey,
-			int srchtype)
+			ScanKey scankey)
 {
 	TupleDesc	itupdesc;
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
 				high;
-	bool		haveEq;
-	int			natts = rel->rd_rel->relnatts;
 	int32		result;
 
 	itupdesc = RelationGetDescr(rel);
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	/* by convention, item 1 on any non-rightmost page is the high key */
-	low = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
+	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
 
 	/*
 	 * If there are no keys on the page, return the first available slot.
 	 * Note this covers two cases: the page is really empty (no keys), or
 	 * it contains only a high key.  The latter case is possible after
-	 * vacuuming.
+	 * vacuuming.  This can never happen on an internal page, however,
+	 * since such pages are never empty (an internal page must have children).
 	 */
 	if (high < low)
 		return low;
@@ -376,11 +220,9 @@ _bt_binsrch(Relation rel,
 	/*
 	 * Binary search to find the first key on the page >= scan key. Loop
 	 * invariant: all slots before 'low' are < scan key, all slots at or
-	 * after 'high' are >= scan key.  Also, haveEq is true if the tuple at
-	 * 'high' is == scan key. We can fall out when high == low.
+	 * after 'high' are >= scan key.  We can fall out when high == low.
 	 */
 	high++;						/* establish the loop invariant for high */
-	haveEq = false;
 
 	while (high > low)
 	{
@@ -388,175 +230,77 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, itupdesc, page, keysz, scankey, mid);
+		result = _bt_compare(rel, keysz, scankey, page, mid);
 
 		if (result > 0)
 			low = mid + 1;
 		else
-		{
 			high = mid;
-			haveEq = (result == 0);
-		}
 	}
 
 	/*--------------------
 	 * At this point we have high == low, but be careful: they could point
-	 * past the last slot on the page.	We also know that haveEq is true
-	 * if and only if there is an equal key (in which case high&low point
-	 * at the first equal key).
+	 * past the last slot on the page.
 	 *
 	 * On a leaf page, we always return the first key >= scan key
 	 * (which could be the last slot + 1).
 	 *--------------------
 	 */
-
-	if (opaque->btpo_flags & BTP_LEAF)
+	if (P_ISLEAF(opaque))
 		return low;
 
 	/*--------------------
-	 * On a non-leaf page, there are special cases:
-	 *
-	 * For an insertion (srchtype != BT_DESCENT and natts == keysz)
-	 * always return first key >= scan key (which could be off the end).
-	 *
-	 * For a standard search (srchtype == BT_DESCENT and natts == keysz)
-	 * return the first equal key if one exists, else the last lesser key
-	 * if one exists, else the first slot on the page.
-	 *
-	 * For a partial-match search (srchtype == BT_DESCENT and natts > keysz)
-	 * return the last lesser key if one exists, else the first slot.
-	 *
-	 * Old comments:
-	 *	For multi-column indices, we may scan using keys
-	 *	not for all attrs. But we handle duplicates using all attrs
-	 *	in _bt_insert/_bt_spool code. And so while searching on
-	 *	internal pages having number of attrs > keysize we want to
-	 *	point at the last item < the scankey, not at the first item
-	 *	= the scankey (!!!), and let _bt_moveright decide later
-	 *	whether to move right or not (see comments and example
-	 *	there). Note also that INSERTions are not affected by this
-	 *	code (since natts == keysz for inserts).		   - vadim 04/15/97
+	 * On a non-leaf page, return the last key < scan key.
+	 * There must be one if _bt_compare() is playing by the rules.
 	 *--------------------
 	 */
-
-	if (haveEq)
-	{
-
-		/*
-		 * There is an equal key.  We return either the first equal key
-		 * (which we just found), or the last lesser key.
-		 *
-		 * We need not check srchtype != BT_DESCENT here, since if that is
-		 * true then natts == keysz by assumption.
-		 */
-		if (natts == keysz)
-			return low;			/* return first equal key */
-	}
-	else
-	{
-
-		/*
-		 * There is no equal key.  We return either the first greater key
-		 * (which we just found), or the last lesser key.
-		 */
-		if (srchtype != BT_DESCENT)
-			return low;			/* return first greater key */
-	}
-
-
-	if (low == (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY))
-		return low;				/* there is no prior item */
+	Assert(low > P_FIRSTDATAKEY(opaque));
 
 	return OffsetNumberPrev(low);
 }
 
-/*
+/*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
+ *	keysz: number of key conditions to be checked (might be less than the
+ *	total length of the scan key!)
+ *	page/offnum: location of btree item to be compared to.
+ *
  *		This routine returns:
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
+ *		NULLs in the keys are treated as sortable values.  Therefore
+ *		"equality" does not necessarily mean that the item should be
+ *		returned to the caller as a matching key!
  *
- *		-- Old comments:
- *		In order to avoid having to propagate changes up the tree any time
- *		a new minimal key is inserted, the leftmost entry on the leftmost
- *		page is less than all possible keys, by definition.
- *
- *		-- New ones:
- *		New insertion code (fix against updating _in_place_ if new minimal
- *		key has bigger size than old one) may delete P_HIKEY entry on the
- *		root page in order to insert new minimal key - and so this definition
- *		does not work properly in this case and breaks key' order on root
- *		page. BTW, this propagation occures only while page' splitting,
- *		but not "any time a new min key is inserted" (see _bt_insertonpg).
- *				- vadim 12/05/96
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey.  The actual key value stored (if any, which there probably isn't)
+ * does not matter.  This convention allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first key.
+ * See backend/access/nbtree/README for details.
+ *----------
  */
-static int32
+int32
 _bt_compare(Relation rel,
-			TupleDesc itupdesc,
-			Page page,
 			int keysz,
 			ScanKey scankey,
+			Page page,
 			OffsetNumber offnum)
 {
-	Datum		datum;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	BTItem		btitem;
 	IndexTuple	itup;
-	BTPageOpaque opaque;
-	ScanKey		entry;
-	AttrNumber	attno;
-	int32		result;
 	int			i;
-	bool		null;
 
 	/*
-	 * If this is a leftmost internal page, and if our comparison is with
-	 * the first key on the page, then the item at that position is by
-	 * definition less than the scan key.
-	 *
-	 * - see new comments above...
+	 * Force result ">" if target item is first data item on an internal
+	 * page --- see NOTE above.
 	 */
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	if (!(opaque->btpo_flags & BTP_LEAF)
-		&& P_LEFTMOST(opaque)
-		&& offnum == P_HIKEY)
-	{
-
-		/*
-		 * we just have to believe that this will only be called with
-		 * offnum == P_HIKEY when P_HIKEY is the OffsetNumber of the first
-		 * actual data key (i.e., this is also a rightmost page).  there
-		 * doesn't seem to be any code that implies that the leftmost page
-		 * is normally missing a high key as well as the rightmost page.
-		 * but that implies that this code path only applies to the root
-		 * -- which seems unlikely..
-		 *
-		 * - see new comments above...
-		 */
-		if (!P_RIGHTMOST(opaque))
-			elog(ERROR, "_bt_compare: invalid comparison to high key");
-
-#ifdef NOT_USED
-
-		/*
-		 * We just have to belive that right answer will not break
-		 * anything. I've checked code and all seems to be ok. See new
-		 * comments above...
-		 *
-		 * -- Old comments If the item on the page is equal to the scankey,
-		 * that's okay to admit.  We just can't claim that the first key
-		 * on the page is greater than anything.
-		 */
-
-		if (_bt_skeycmp(rel, keysz, scankey, page, PageGetItemId(page, offnum),
-						BTEqualStrategyNumber))
-			return 0;
+	if (! P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
-#endif
-	}
 
 	btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
 	itup = &(btitem->bti_itup);
@@ -568,37 +312,45 @@ _bt_compare(Relation rel,
 	 * they be in order.  If you think about how multi-key ordering works,
 	 * you'll understand why this is.
 	 *
-	 * We don't test for violation of this condition here.
+	 * We don't test for violation of this condition here, however.  The
+	 * initial setup for the index scan had better have gotten it right
+	 * (see _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	for (i = 0; i < keysz; i++)
 	{
-		entry = &scankey[i - 1];
-		attno = entry->sk_attno;
-		datum = index_getattr(itup, attno, itupdesc, &null);
+		ScanKey		entry = &scankey[i];
+		Datum		datum;
+		bool		isNull;
+		int32		result;
+
+		datum = index_getattr(itup, entry->sk_attno, itupdesc, &isNull);
 
 		/* see comments about NULLs handling in btbuild */
-		if (entry->sk_flags & SK_ISNULL)		/* key is NULL */
+		if (entry->sk_flags & SK_ISNULL)	/* key is NULL */
 		{
-			if (null)
+			if (isNull)
 				result = 0;		/* NULL "=" NULL */
 			else
 				result = 1;		/* NULL ">" NOT_NULL */
 		}
-		else if (null)			/* key is NOT_NULL and item is NULL */
+		else if (isNull)		/* key is NOT_NULL and item is NULL */
 		{
 			result = -1;		/* NOT_NULL "<" NULL */
 		}
 		else
+		{
 			result = DatumGetInt32(FunctionCall2(&entry->sk_func,
-												 entry->sk_argument, datum));
+												 entry->sk_argument,
+												 datum));
+		}
 
 		/* if the keys are unequal, return the difference */
 		if (result != 0)
 			return result;
 	}
 
-	/* by here, the keys are equal */
+	/* if we get here, the keys are equal */
 	return 0;
 }
 
@@ -606,10 +358,10 @@ _bt_compare(Relation rel,
  *	_bt_next() -- Get the next item in a scan.
  *
  *		On entry, we have a valid currentItemData in the scan, and a
- *		read lock on the page that contains that item.	We do not have
- *		the page pinned.  We return the next item in the scan.	On
- *		exit, we have the page containing the next item locked but not
- *		pinned.
+ *		read lock and pin count on the page that contains that item.
+ *		We return the next item in the scan, or NULL if no more.
+ *		On successful exit, the page containing the new item is locked
+ *		and pinned; on NULL exit, no lock or pin is held.
  */
 RetrieveIndexResult
 _bt_next(IndexScanDesc scan, ScanDirection dir)
@@ -618,7 +370,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	OffsetNumber offnum;
-	RetrieveIndexResult res;
 	ItemPointer current;
 	BTItem		btitem;
 	IndexTuple	itup;
@@ -629,10 +380,9 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 	so = (BTScanOpaque) scan->opaque;
 	current = &(scan->currentItemData);
 
-	Assert(BufferIsValid(so->btso_curbuf));
-
 	/* we still have the buffer pinned and locked */
 	buf = so->btso_curbuf;
+	Assert(BufferIsValid(buf));
 
 	do
 	{
@@ -640,7 +390,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 		if (!_bt_step(scan, &buf, dir))
 			return (RetrieveIndexResult) NULL;
 
-		/* by here, current is the tuple we want to return */
+		/* current is the next candidate tuple to return */
 		offnum = ItemPointerGetOffsetNumber(current);
 		page = BufferGetPage(buf);
 		btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
@@ -648,17 +398,16 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 
 		if (_bt_checkkeys(scan, itup, &keysok))
 		{
+			/* tuple passes all scan key conditions, so return it */
 			Assert(keysok == so->numberOfKeys);
-			res = FormRetrieveIndexResult(current, &(itup->t_tid));
-
-			/* remember which buffer we have pinned and locked */
-			so->btso_curbuf = buf;
-			return res;
+			return FormRetrieveIndexResult(current, &(itup->t_tid));
 		}
 
+		/* This tuple doesn't pass, but there might be more that do */
 	} while (keysok >= so->numberOfFirstKeys ||
 			 (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)));
 
+	/* No more items, so close down the current-item info */
 	ItemPointerSetInvalid(current);
 	so->btso_curbuf = InvalidBuffer;
 	_bt_relbuf(rel, buf, BT_READ);
@@ -680,14 +429,10 @@ RetrieveIndexResult
 _bt_first(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel;
-	TupleDesc	itupdesc;
 	Buffer		buf;
 	Page		page;
-	BTPageOpaque pop;
 	BTStack		stack;
-	OffsetNumber offnum,
-				maxoff;
-	bool		offGmax = false;
+	OffsetNumber offnum;
 	BTItem		btitem;
 	IndexTuple	itup;
 	ItemPointer current;
@@ -698,7 +443,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	int32		result;
 	BTScanOpaque so;
 	Size		keysok;
-
 	bool		strategyCheck;
 	ScanKey		scankeys = 0;
 	int			keysCount = 0;
@@ -784,20 +528,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		return _bt_endpoint(scan, dir);
 	}
 
-	itupdesc = RelationGetDescr(rel);
-	current = &(scan->currentItemData);
-
 	/*
 	 * Okay, we want something more complicated.  What we'll do is use the
 	 * first item in the scan key passed in (which has been correctly
 	 * ordered to take advantage of index ordering) to position ourselves
 	 * at the right place in the scan.
 	 */
-	/* _bt_orderkeys disallows it, but it's place to add some code latter */
 	scankeys = (ScanKey) palloc(keysCount * sizeof(ScanKeyData));
 	for (i = 0; i < keysCount; i++)
 	{
 		j = nKeyIs[i];
+		/* _bt_orderkeys disallows it, but it's a place to add some code later */
 		if (so->keyData[j].sk_flags & SK_ISNULL)
 		{
 			pfree(nKeyIs);
@@ -812,234 +553,213 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (nKeyIs)
 		pfree(nKeyIs);
 
-	stack = _bt_search(rel, keysCount, scankeys, &buf);
-	_bt_freestack(stack);
-
-	blkno = BufferGetBlockNumber(buf);
-	page = BufferGetPage(buf);
+	current = &(scan->currentItemData);
 
 	/*
-	 * This will happen if the tree we're searching is entirely empty, or
-	 * if we're doing a search for a key that would appear on an entirely
-	 * empty internal page.  In either case, there are no matching tuples
-	 * in the index.
+	 * Use the manufactured scan key to descend the tree and position
+	 * ourselves on the target leaf page.
 	 */
+	stack = _bt_search(rel, keysCount, scankeys, &buf, BT_READ);
 
-	if (PageIsEmpty(page))
+	/* don't need to keep the stack around... */
+	_bt_freestack(stack);
+
+	if (! BufferIsValid(buf))
 	{
+		/* Only get here if index is completely empty */
 		ItemPointerSetInvalid(current);
 		so->btso_curbuf = InvalidBuffer;
-		_bt_relbuf(rel, buf, BT_READ);
 		pfree(scankeys);
 		return (RetrieveIndexResult) NULL;
 	}
-	maxoff = PageGetMaxOffsetNumber(page);
-	pop = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * Now _bt_moveright doesn't move from non-rightmost leaf page if
-	 * scankey == hikey and there is only hikey there. It's good for
-	 * insertion, but we need to do work for scan here. - vadim 05/27/97
-	 */
-
-	while (maxoff == P_HIKEY && !P_RIGHTMOST(pop) &&
-		   _bt_skeycmp(rel, keysCount, scankeys, page,
-					   PageGetItemId(page, P_HIKEY),
-					   BTGreaterEqualStrategyNumber))
-	{
-		/* step right one page */
-		blkno = pop->btpo_next;
-		_bt_relbuf(rel, buf, BT_READ);
-		buf = _bt_getbuf(rel, blkno, BT_READ);
-		page = BufferGetPage(buf);
-		if (PageIsEmpty(page))
-		{
-			ItemPointerSetInvalid(current);
-			so->btso_curbuf = InvalidBuffer;
-			_bt_relbuf(rel, buf, BT_READ);
-			pfree(scankeys);
-			return (RetrieveIndexResult) NULL;
-		}
-		maxoff = PageGetMaxOffsetNumber(page);
-		pop = (BTPageOpaque) PageGetSpecialPointer(page);
-	}
-
 
-	/* find the nearest match to the manufactured scan key on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, BT_DESCENT);
+	/* remember which buffer we have pinned */
+	so->btso_curbuf = buf;
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
 
-	if (offnum > maxoff)
-	{
-		offnum = maxoff;
-		offGmax = true;
-	}
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys);
 
 	ItemPointerSet(current, blkno, offnum);
 
-	/*
-	 * Now find the right place to start the scan.	Result is the value
-	 * we're looking for minus the value we're looking at in the index.
+	/*----------
+	 * At this point we are positioned at the first item >= scan key,
+	 * or possibly at the end of a page on which all the existing items
+	 * are < scan key and we know that everything on later pages is
+	 * >= scan key.  We could step forward in the latter case, but that'd
+	 * be a waste of time if we want to scan backwards.  So, it's now time to
+	 * examine the scan strategy to find the exact place to start the scan.
+	 *
+	 * Note: if _bt_step fails (meaning we fell off the end of the index
+	 * in one direction or the other), we either return NULL (no matches) or
+	 * call _bt_endpoint() to set up a scan starting at that index endpoint,
+	 * as appropriate for the desired scan type.
+	 *
+	 * This is yet another place to add some code later for IS [NOT] NULL ...
+	 *----------
 	 */
 
-	result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
-
-	/* it's yet other place to add some code latter for is(not)null */
-
-	strat = strat_total;
-	switch (strat)
+	switch (strat_total)
 	{
 		case BTLessStrategyNumber:
-			if (result <= 0)
+			/*
+			 * Back up one to arrive at last item < scankey
+			 */
+			if (!_bt_step(scan, &buf, BackwardScanDirection))
 			{
-				do
-				{
-					if (!_bt_twostep(scan, &buf, BackwardScanDirection))
-						break;
-
-					offnum = ItemPointerGetOffsetNumber(current);
-					page = BufferGetPage(buf);
-					result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
-				} while (result <= 0);
-
+				pfree(scankeys);
+				return (RetrieveIndexResult) NULL;
 			}
 			break;
 
 		case BTLessEqualStrategyNumber:
-			if (result >= 0)
+			/*
+			 * We need to find the last item <= scankey, so step forward
+			 * till we find one > scankey, then step back one.
+			 */
+			if (offnum > PageGetMaxOffsetNumber(page))
 			{
-				do
+				if (!_bt_step(scan, &buf, ForwardScanDirection))
 				{
-					if (!_bt_twostep(scan, &buf, ForwardScanDirection))
-						break;
-
-					offnum = ItemPointerGetOffsetNumber(current);
-					page = BufferGetPage(buf);
-					result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
-				} while (result >= 0);
+					pfree(scankeys);
+					return _bt_endpoint(scan, dir);
+				}
+			}
+			for (;;)
+			{
+				offnum = ItemPointerGetOffsetNumber(current);
+				page = BufferGetPage(buf);
+				result = _bt_compare(rel, keysCount, scankeys, page, offnum);
+				if (result < 0)
+					break;
+				if (!_bt_step(scan, &buf, ForwardScanDirection))
+				{
+					pfree(scankeys);
+					return _bt_endpoint(scan, dir);
+				}
+			}
+			if (!_bt_step(scan, &buf, BackwardScanDirection))
+			{
+				pfree(scankeys);
+				return (RetrieveIndexResult) NULL;
 			}
-			if (result < 0)
-				_bt_twostep(scan, &buf, BackwardScanDirection);
 			break;
 
 		case BTEqualStrategyNumber:
-			if (result != 0)
+			/*
+			 * Make sure we are on the first equal item; might have to step
+			 * forward if currently at end of page.
+			 */
+			if (offnum > PageGetMaxOffsetNumber(page))
 			{
-				_bt_relbuf(scan->relation, buf, BT_READ);
-				so->btso_curbuf = InvalidBuffer;
-				ItemPointerSetInvalid(&(scan->currentItemData));
-				pfree(scankeys);
-				return (RetrieveIndexResult) NULL;
+				if (!_bt_step(scan, &buf, ForwardScanDirection))
+				{
+					pfree(scankeys);
+					return (RetrieveIndexResult) NULL;
+				}
+				offnum = ItemPointerGetOffsetNumber(current);
+				page = BufferGetPage(buf);
 			}
-			else if (ScanDirectionIsBackward(dir))
+			result = _bt_compare(rel, keysCount, scankeys, page, offnum);
+			if (result != 0)
+				goto nomatches;	/* no equal items! */
+			/*
+			 * If a backward scan was specified, we need to start with the
+			 * last equal item, not the first one.
+			 */
+			if (ScanDirectionIsBackward(dir))
 			{
 				do
 				{
-					if (!_bt_twostep(scan, &buf, ForwardScanDirection))
-						break;
-
+					if (!_bt_step(scan, &buf, ForwardScanDirection))
+					{
+						pfree(scankeys);
+						return _bt_endpoint(scan, dir);
+					}
 					offnum = ItemPointerGetOffsetNumber(current);
 					page = BufferGetPage(buf);
-					result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
+					result = _bt_compare(rel, keysCount, scankeys, page, offnum);
 				} while (result == 0);
-
-				if (result < 0)
-					_bt_twostep(scan, &buf, BackwardScanDirection);
+				if (!_bt_step(scan, &buf, BackwardScanDirection))
+					elog(ERROR, "_bt_first: equal items disappeared?");
 			}
 			break;
 
 		case BTGreaterEqualStrategyNumber:
-			if (offGmax)
+			/*
+			 * We want the first item >= scankey, which is where we are...
+			 * unless we're not anywhere at all...
+			 */
+			if (offnum > PageGetMaxOffsetNumber(page))
 			{
-				if (result < 0)
+				if (!_bt_step(scan, &buf, ForwardScanDirection))
 				{
-					Assert(!P_RIGHTMOST(pop) && maxoff == P_HIKEY);
-					if (!_bt_step(scan, &buf, ForwardScanDirection))
-					{
-						_bt_relbuf(scan->relation, buf, BT_READ);
-						so->btso_curbuf = InvalidBuffer;
-						ItemPointerSetInvalid(&(scan->currentItemData));
-						pfree(scankeys);
-						return (RetrieveIndexResult) NULL;
-					}
-				}
-				else if (result > 0)
-				{				/* Just remember:  _bt_binsrch() returns
-								 * the OffsetNumber of the first matching
-								 * key on the page, or the OffsetNumber at
-								 * which the matching key WOULD APPEAR IF
-								 * IT WERE on this page. No key on this
-								 * page, but offnum from _bt_binsrch()
-								 * greater maxoff - have to move right. -
-								 * vadim 12/06/96 */
-					_bt_twostep(scan, &buf, ForwardScanDirection);
+					pfree(scankeys);
+					return (RetrieveIndexResult) NULL;
 				}
 			}
-			else if (result < 0)
-			{
-				do
-				{
-					if (!_bt_twostep(scan, &buf, BackwardScanDirection))
-						break;
-
-					page = BufferGetPage(buf);
-					offnum = ItemPointerGetOffsetNumber(current);
-					result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
-				} while (result < 0);
-
-				if (result > 0)
-					_bt_twostep(scan, &buf, ForwardScanDirection);
-			}
 			break;
 
 		case BTGreaterStrategyNumber:
-			/* offGmax helps as above */
-			if (result >= 0 || offGmax)
+			/*
+			 * We want the first item > scankey, so make sure we are on
+			 * an item and then step over any equal items.
+			 */
+			if (offnum > PageGetMaxOffsetNumber(page))
 			{
-				do
+				if (!_bt_step(scan, &buf, ForwardScanDirection))
 				{
-					if (!_bt_twostep(scan, &buf, ForwardScanDirection))
-						break;
-
-					offnum = ItemPointerGetOffsetNumber(current);
-					page = BufferGetPage(buf);
-					result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
-				} while (result >= 0);
+					pfree(scankeys);
+					return (RetrieveIndexResult) NULL;
+				}
+				offnum = ItemPointerGetOffsetNumber(current);
+				page = BufferGetPage(buf);
+			}
+			result = _bt_compare(rel, keysCount, scankeys, page, offnum);
+			while (result == 0)
+			{
+				if (!_bt_step(scan, &buf, ForwardScanDirection))
+				{
+					pfree(scankeys);
+					return (RetrieveIndexResult) NULL;
+				}
+				offnum = ItemPointerGetOffsetNumber(current);
+				page = BufferGetPage(buf);
+				result = _bt_compare(rel, keysCount, scankeys, page, offnum);
 			}
 			break;
 	}
 
-	pfree(scankeys);
 	/* okay, current item pointer for the scan is right */
 	offnum = ItemPointerGetOffsetNumber(current);
 	page = BufferGetPage(buf);
 	btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
 	itup = &btitem->bti_itup;
 
+	/* is the first item actually acceptable? */
 	if (_bt_checkkeys(scan, itup, &keysok))
 	{
+		/* yes, return it */
 		res = FormRetrieveIndexResult(current, &(itup->t_tid));
-
-		/* remember which buffer we have pinned */
-		so->btso_curbuf = buf;
-	}
-	else if (keysok >= so->numberOfFirstKeys)
-	{
-		so->btso_curbuf = buf;
-		return _bt_next(scan, dir);
 	}
-	else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))
+	else if (keysok >= so->numberOfFirstKeys ||
+			 (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)))
 	{
-		so->btso_curbuf = buf;
-		return _bt_next(scan, dir);
+		/* no, but there might be another one that is */
+		res = _bt_next(scan, dir);
 	}
 	else
 	{
+		/* no tuples in the index match this scan key */
+nomatches:
 		ItemPointerSetInvalid(current);
 		so->btso_curbuf = InvalidBuffer;
 		_bt_relbuf(rel, buf, BT_READ);
 		res = (RetrieveIndexResult) NULL;
 	}
 
+	pfree(scankeys);
+
 	return res;
 }
 
@@ -1047,276 +767,128 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
  *	_bt_step() -- Step one item in the requested direction in a scan on
  *				  the tree.
  *
- *		If no adjacent record exists in the requested direction, return
- *		false.	Else, return true and set the currentItemData for the
- *		scan to the right thing.
+ *		*bufP is the current buffer (read-locked and pinned).  If we change
+ *		pages, it's updated appropriately.
+ *
+ *		If successful, update scan's currentItemData and return true.
+ *		If no adjacent record exists in the requested direction,
+ *		release buffer pin/locks and return false.
  */
 bool
 _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 {
+	Relation	rel = scan->relation;
+	ItemPointer current = &(scan->currentItemData);
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber offnum,
 				maxoff;
-	OffsetNumber start;
 	BlockNumber blkno;
 	BlockNumber obknum;
-	BTScanOpaque so;
-	ItemPointer current;
-	Relation	rel;
-
-	rel = scan->relation;
-	current = &(scan->currentItemData);
 
 	/*
 	 * Don't use ItemPointerGetOffsetNumber or you risk to get assertion
 	 * due to ability of ip_posid to be equal 0.
 	 */
 	offnum = current->ip_posid;
+
 	page = BufferGetPage(*bufP);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	so = (BTScanOpaque) scan->opaque;
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	/* get the next tuple */
 	if (ScanDirectionIsForward(dir))
 	{
 		if (!PageIsEmpty(page) && offnum < maxoff)
 			offnum = OffsetNumberNext(offnum);
 		else
 		{
-
-			/* if we're at end of scan, release the buffer and return */
-			blkno = opaque->btpo_next;
-			if (P_RIGHTMOST(opaque))
-			{
-				_bt_relbuf(rel, *bufP, BT_READ);
-				ItemPointerSetInvalid(current);
-				*bufP = so->btso_curbuf = InvalidBuffer;
-				return false;
-			}
-			else
+			/* walk right to the next page with data */
+			for (;;)
 			{
-
-				/* walk right to the next page with data */
-				_bt_relbuf(rel, *bufP, BT_READ);
-				for (;;)
+				/* if we're at end of scan, release the buffer and return */
+				if (P_RIGHTMOST(opaque))
 				{
-					*bufP = _bt_getbuf(rel, blkno, BT_READ);
-					page = BufferGetPage(*bufP);
-					opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-					maxoff = PageGetMaxOffsetNumber(page);
-					start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-					if (!PageIsEmpty(page) && start <= maxoff)
-						break;
-					else
-					{
-						blkno = opaque->btpo_next;
-						_bt_relbuf(rel, *bufP, BT_READ);
-						if (blkno == P_NONE)
-						{
-							*bufP = so->btso_curbuf = InvalidBuffer;
-							ItemPointerSetInvalid(current);
-							return false;
-						}
-					}
+					_bt_relbuf(rel, *bufP, BT_READ);
+					ItemPointerSetInvalid(current);
+					*bufP = so->btso_curbuf = InvalidBuffer;
+					return false;
 				}
-				offnum = start;
+				/* step right one page */
+				blkno = opaque->btpo_next;
+				_bt_relbuf(rel, *bufP, BT_READ);
+				*bufP = _bt_getbuf(rel, blkno, BT_READ);
+				page = BufferGetPage(*bufP);
+				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+				maxoff = PageGetMaxOffsetNumber(page);
+				/* done if it's not empty */
+				offnum = P_FIRSTDATAKEY(opaque);
+				if (!PageIsEmpty(page) && offnum <= maxoff)
+					break;
 			}
 		}
 	}
-	else if (ScanDirectionIsBackward(dir))
+	else
 	{
-
-		/* remember that high key is item zero on non-rightmost pages */
-		start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-		if (offnum > start)
+		if (offnum > P_FIRSTDATAKEY(opaque))
 			offnum = OffsetNumberPrev(offnum);
 		else
 		{
-
-			/* if we're at end of scan, release the buffer and return */
-			blkno = opaque->btpo_prev;
-			if (P_LEFTMOST(opaque))
-			{
-				_bt_relbuf(rel, *bufP, BT_READ);
-				*bufP = so->btso_curbuf = InvalidBuffer;
-				ItemPointerSetInvalid(current);
-				return false;
-			}
-			else
+			/* walk left to the next page with data */
+			for (;;)
 			{
-
+				/* if we're at end of scan, release the buffer and return */
+				if (P_LEFTMOST(opaque))
+				{
+					_bt_relbuf(rel, *bufP, BT_READ);
+					ItemPointerSetInvalid(current);
+					*bufP = so->btso_curbuf = InvalidBuffer;
+					return false;
+				}
+				/* step left */
 				obknum = BufferGetBlockNumber(*bufP);
-
-				/* walk right to the next page with data */
+				blkno = opaque->btpo_prev;
 				_bt_relbuf(rel, *bufP, BT_READ);
-				for (;;)
+				*bufP = _bt_getbuf(rel, blkno, BT_READ);
+				page = BufferGetPage(*bufP);
+				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+				/*
+				 * If the adjacent page just split, then we have to walk
+				 * right to find the block that's now adjacent to where
+				 * we were.  Because pages only split right, we don't have
+				 * to worry about this failing to terminate.
+				 */
+				while (opaque->btpo_next != obknum)
 				{
+					blkno = opaque->btpo_next;
+					_bt_relbuf(rel, *bufP, BT_READ);
 					*bufP = _bt_getbuf(rel, blkno, BT_READ);
 					page = BufferGetPage(*bufP);
 					opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-					maxoff = PageGetMaxOffsetNumber(page);
-
-					/*
-					 * If the adjacent page just split, then we may have
-					 * the wrong block.  Handle this case.	Because pages
-					 * only split right, we don't have to worry about this
-					 * failing to terminate.
-					 */
-
-					while (opaque->btpo_next != obknum)
-					{
-						blkno = opaque->btpo_next;
-						_bt_relbuf(rel, *bufP, BT_READ);
-						*bufP = _bt_getbuf(rel, blkno, BT_READ);
-						page = BufferGetPage(*bufP);
-						opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-						maxoff = PageGetMaxOffsetNumber(page);
-					}
-
-					/* don't consider the high key */
-					start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-					/* anything to look at here? */
-					if (!PageIsEmpty(page) && maxoff >= start)
-						break;
-					else
-					{
-						blkno = opaque->btpo_prev;
-						obknum = BufferGetBlockNumber(*bufP);
-						_bt_relbuf(rel, *bufP, BT_READ);
-						if (blkno == P_NONE)
-						{
-							*bufP = so->btso_curbuf = InvalidBuffer;
-							ItemPointerSetInvalid(current);
-							return false;
-						}
-					}
 				}
-				offnum = maxoff;/* XXX PageIsEmpty? */
+				/* done if it's not empty */
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = maxoff;
+				if (!PageIsEmpty(page) && maxoff >= P_FIRSTDATAKEY(opaque))
+					break;
 			}
 		}
 	}
-	blkno = BufferGetBlockNumber(*bufP);
+
+	/* Update scan state */
 	so->btso_curbuf = *bufP;
+	blkno = BufferGetBlockNumber(*bufP);
 	ItemPointerSet(current, blkno, offnum);
 
 	return true;
 }
 
-/*
- *	_bt_twostep() -- Move to an adjacent record in a scan on the tree,
- *					 if an adjacent record exists.
- *
- *		This is like _bt_step, except that if no adjacent record exists
- *		it restores us to where we were before trying the step.  This is
- *		only hairy when you cross page boundaries, since the page you cross
- *		from could have records inserted or deleted, or could even split.
- *		This is unlikely, but we try to handle it correctly here anyway.
- *
- *		This routine contains the only case in which our changes to Lehman
- *		and Yao's algorithm.
- *
- *		Like step, this routine leaves the scan's currentItemData in the
- *		proper state and acquires a lock and pin on *bufP.	If the twostep
- *		succeeded, we return true; otherwise, we return false.
- */
-static bool
-_bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Page		page;
-	BTPageOpaque opaque;
-	OffsetNumber offnum,
-				maxoff;
-	OffsetNumber start;
-	ItemPointer current;
-	ItemId		itemid;
-	int			itemsz;
-	BTItem		btitem;
-	BTItem		svitem;
-	BlockNumber blkno;
-
-	blkno = BufferGetBlockNumber(*bufP);
-	page = BufferGetPage(*bufP);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-	current = &(scan->currentItemData);
-	offnum = ItemPointerGetOffsetNumber(current);
-
-	start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-	/* if we're safe, just do it */
-	if (ScanDirectionIsForward(dir) && offnum < maxoff)
-	{							/* XXX PageIsEmpty? */
-		ItemPointerSet(current, blkno, OffsetNumberNext(offnum));
-		return true;
-	}
-	else if (ScanDirectionIsBackward(dir) && offnum > start)
-	{
-		ItemPointerSet(current, blkno, OffsetNumberPrev(offnum));
-		return true;
-	}
-
-	/* if we've hit end of scan we don't have to do any work */
-	if (ScanDirectionIsForward(dir) && P_RIGHTMOST(opaque))
-		return false;
-	else if (ScanDirectionIsBackward(dir) && P_LEFTMOST(opaque))
-		return false;
-
-	/*
-	 * Okay, it's off the page; let _bt_step() do the hard work, and we'll
-	 * try to remember where we were.  This is not guaranteed to work;
-	 * this is the only place in the code where concurrency can screw us
-	 * up, and it's because we want to be able to move in two directions
-	 * in the scan.
-	 */
-
-	itemid = PageGetItemId(page, offnum);
-	itemsz = ItemIdGetLength(itemid);
-	btitem = (BTItem) PageGetItem(page, itemid);
-	svitem = (BTItem) palloc(itemsz);
-	memmove((char *) svitem, (char *) btitem, itemsz);
-
-	if (_bt_step(scan, bufP, dir))
-	{
-		pfree(svitem);
-		return true;
-	}
-
-	/* try to find our place again */
-	*bufP = _bt_getbuf(scan->relation, blkno, BT_READ);
-	page = BufferGetPage(*bufP);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	while (offnum <= maxoff)
-	{
-		itemid = PageGetItemId(page, offnum);
-		btitem = (BTItem) PageGetItem(page, itemid);
-		if (BTItemSame(btitem, svitem))
-		{
-			pfree(svitem);
-			ItemPointerSet(current, blkno, offnum);
-			return false;
-		}
-	}
-
-	/*
-	 * XXX crash and burn -- can't find our place.  We can be a little
-	 * smarter -- walk to the next page to the right, for example, since
-	 * that's the only direction that splits happen in.  Deletions screw
-	 * us up less often since they're only done by the vacuum daemon.
-	 */
-
-	elog(ERROR, "btree synchronization error:  concurrent update botched scan");
-
-	return false;
-}
-
 /*
  *	_bt_endpoint() -- Find the first or last key in the index.
+ *
+ * This is used by _bt_first() to set up a scan when we've determined
+ * that the scan must start at the beginning or end of the index (for
+ * a forward or backward scan respectively).
  */
 static RetrieveIndexResult
 _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
@@ -1328,7 +900,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	ItemPointer current;
 	OffsetNumber offnum,
 				maxoff;
-	OffsetNumber start = 0;
+	OffsetNumber start;
 	BlockNumber blkno;
 	BTItem		btitem;
 	IndexTuple	itup;
@@ -1340,38 +912,50 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	current = &(scan->currentItemData);
 	so = (BTScanOpaque) scan->opaque;
 
+	/*
+	 * Scan down to the leftmost or rightmost leaf page.  This is a
+	 * simplified version of _bt_search().  We don't maintain a stack
+	 * since we know we won't need it.
+	 */
 	buf = _bt_getroot(rel, BT_READ);
+
+	if (! BufferIsValid(buf))
+	{
+		/* empty index... */
+		ItemPointerSetInvalid(current);
+		so->btso_curbuf = InvalidBuffer;
+		return (RetrieveIndexResult) NULL;
+	}
+
 	blkno = BufferGetBlockNumber(buf);
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	for (;;)
 	{
-		if (opaque->btpo_flags & BTP_LEAF)
+		if (P_ISLEAF(opaque))
 			break;
 
 		if (ScanDirectionIsForward(dir))
-			offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+			offnum = P_FIRSTDATAKEY(opaque);
 		else
 			offnum = PageGetMaxOffsetNumber(page);
 
 		btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
 		itup = &(btitem->bti_itup);
-
 		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
 
 		_bt_relbuf(rel, buf, BT_READ);
 		buf = _bt_getbuf(rel, blkno, BT_READ);
+
 		page = BufferGetPage(buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
 		/*
-		 * Race condition: If the child page we just stepped onto is in
-		 * the process of being split, we need to make sure we're all the
-		 * way at the right edge of the tree.  See the paper by Lehman and
-		 * Yao.
+		 * Race condition: If the child page we just stepped onto was just
+		 * split, we need to make sure we're all the way at the right edge
+		 * of the tree.  See the paper by Lehman and Yao.
 		 */
-
 		if (ScanDirectionIsBackward(dir) && !P_RIGHTMOST(opaque))
 		{
 			do
@@ -1390,101 +974,39 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	if (ScanDirectionIsForward(dir))
 	{
-		if (!P_LEFTMOST(opaque))/* non-leftmost page ? */
-			elog(ERROR, "_bt_endpoint: leftmost page (%u) has not leftmost flag", blkno);
-		start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-		/*
-		 * I don't understand this stuff! It doesn't work for
-		 * non-rightmost pages with only one element (P_HIKEY) which we
-		 * have after deletion itups by vacuum (it's case of start >
-		 * maxoff). Scanning in BackwardScanDirection is not
-		 * understandable at all. Well - new stuff. - vadim 12/06/96
-		 */
-#ifdef NOT_USED
-		if (PageIsEmpty(page) || start > maxoff)
-		{
-			ItemPointerSet(current, blkno, maxoff);
-			if (!_bt_step(scan, &buf, BackwardScanDirection))
-				return (RetrieveIndexResult) NULL;
-
-			start = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(buf);
-		}
-#endif
-		if (PageIsEmpty(page))
-		{
-			if (start != P_HIKEY)		/* non-rightmost page */
-				elog(ERROR, "_bt_endpoint: non-rightmost page (%u) is empty", blkno);
+		Assert(P_LEFTMOST(opaque));
 
-			/*
-			 * It's left- & right- most page - root page, - and it's
-			 * empty...
-			 */
-			_bt_relbuf(rel, buf, BT_READ);
-			ItemPointerSetInvalid(current);
-			so->btso_curbuf = InvalidBuffer;
-			return (RetrieveIndexResult) NULL;
-		}
-		if (start > maxoff)		/* start == 2 && maxoff == 1 */
-		{
-			ItemPointerSet(current, blkno, maxoff);
-			if (!_bt_step(scan, &buf, ForwardScanDirection))
-				return (RetrieveIndexResult) NULL;
-
-			start = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(buf);
-		}
-		/* new stuff ends here */
-		else
-			ItemPointerSet(current, blkno, start);
+		start = P_FIRSTDATAKEY(opaque);
 	}
 	else if (ScanDirectionIsBackward(dir))
 	{
+		Assert(P_RIGHTMOST(opaque));
 
-		/*
-		 * I don't understand this stuff too! If RIGHT-most leaf page is
-		 * empty why do scanning in ForwardScanDirection ??? Well - new
-		 * stuff. - vadim 12/06/96
-		 */
-#ifdef NOT_USED
-		if (PageIsEmpty(page))
-		{
-			ItemPointerSet(current, blkno, FirstOffsetNumber);
-			if (!_bt_step(scan, &buf, ForwardScanDirection))
-				return (RetrieveIndexResult) NULL;
-
-			start = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(buf);
-		}
-#endif
-		if (PageIsEmpty(page))
-		{
-			/* If it's leftmost page too - it's empty root page... */
-			if (P_LEFTMOST(opaque))
-			{
-				_bt_relbuf(rel, buf, BT_READ);
-				ItemPointerSetInvalid(current);
-				so->btso_curbuf = InvalidBuffer;
-				return (RetrieveIndexResult) NULL;
-			}
-			/* Go back ! */
-			ItemPointerSet(current, blkno, FirstOffsetNumber);
-			if (!_bt_step(scan, &buf, BackwardScanDirection))
-				return (RetrieveIndexResult) NULL;
-
-			start = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(buf);
-		}
-		/* new stuff ends here */
-		else
-		{
-			start = PageGetMaxOffsetNumber(page);
-			ItemPointerSet(current, blkno, start);
-		}
+		start = PageGetMaxOffsetNumber(page);
+		if (start < P_FIRSTDATAKEY(opaque))	/* watch out for empty page */
+			start = P_FIRSTDATAKEY(opaque);
 	}
 	else
+	{
 		elog(ERROR, "Illegal scan direction %d", dir);
+		start = 0;				/* keep compiler quiet */
+	}
+
+	ItemPointerSet(current, blkno, start);
+	/* remember which buffer we have pinned */
+	so->btso_curbuf = buf;
+
+	/*
+	 * The left/rightmost page could be empty due to deletions;
+	 * if so, step until we find a nonempty page.
+	 */
+	if (start > maxoff)
+	{
+		if (!_bt_step(scan, &buf, dir))
+			return (RetrieveIndexResult) NULL;
+		start = ItemPointerGetOffsetNumber(current);
+		page = BufferGetPage(buf);
+	}
 
 	btitem = (BTItem) PageGetItem(page, PageGetItemId(page, start));
 	itup = &(btitem->bti_itup);
@@ -1492,23 +1014,18 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* see if we picked a winner */
 	if (_bt_checkkeys(scan, itup, &keysok))
 	{
+		/* yes, return it */
 		res = FormRetrieveIndexResult(current, &(itup->t_tid));
-
-		/* remember which buffer we have pinned */
-		so->btso_curbuf = buf;
-	}
-	else if (keysok >= so->numberOfFirstKeys)
-	{
-		so->btso_curbuf = buf;
-		return _bt_next(scan, dir);
 	}
-	else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))
+	else if (keysok >= so->numberOfFirstKeys ||
+			 (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)))
 	{
-		so->btso_curbuf = buf;
-		return _bt_next(scan, dir);
+		/* no, but there might be another one that is */
+		res = _bt_next(scan, dir);
 	}
 	else
 	{
+		/* no tuples in the index match this scan key */
 		ItemPointerSetInvalid(current);
 		so->btso_curbuf = InvalidBuffer;
 		_bt_relbuf(rel, buf, BT_READ);
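Note on the _bt_step() changes above: the forward-step logic can be sketched
outside the backend as follows. This is a toy model, not the real code.
ToyPage, TOY_NONE, toy_first_data_key() and toy_step_forward() are
hypothetical names, and the offsets 1 (high key slot) and 2 (first key slot)
are assumptions standing in for P_HIKEY and P_FIRSTKEY. Locking, buffer pins
and concurrent splits are left out entirely. The sketch only shows why a
single "first data key" helper lets the loop skip pages that hold nothing but
a high key.

```c
#include <stdbool.h>

#define TOY_NONE (-1)			/* stand-in for P_NONE (assumption) */

/* Toy stand-in for a leaf page: only the fields the step logic needs. */
typedef struct ToyPage
{
	int			next;			/* index of right sibling, or TOY_NONE */
	int			maxoff;			/* highest used offset; 0 if nothing stored */
} ToyPage;

/*
 * First offset that holds a real data item.  A non-rightmost page keeps its
 * high key at offset 1, so its data starts at offset 2; the rightmost page
 * has no high key, so its data starts at offset 1.
 */
static int
toy_first_data_key(const ToyPage *p)
{
	return (p->next == TOY_NONE) ? 1 : 2;
}

/*
 * Advance (*blkno, *offnum) to the next data item in a forward scan.
 * Returns false when the scan runs off the right end of the level; in that
 * case the cursor is left on the last page visited.
 */
static bool
toy_step_forward(const ToyPage *pages, int *blkno, int *offnum)
{
	const ToyPage *p = &pages[*blkno];

	/* Easy case: another item remains on the current page. */
	if (*offnum < p->maxoff)
	{
		(*offnum)++;
		return true;
	}

	/* Otherwise walk right until we find a page with data. */
	for (;;)
	{
		if (p->next == TOY_NONE)
			return false;		/* end of scan */
		*blkno = p->next;
		p = &pages[*blkno];
		*offnum = toy_first_data_key(p);
		if (*offnum <= p->maxoff)
			return true;		/* done if it's not empty */
	}
}
```

With three pages where the middle one contains only a high key, a cursor at
the last item of the first page steps straight to the first data item of the
third page.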
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 458abe7754ce76f6bec0589eb3762cf4e70cd9cb..1981f5546930e31328df5bee9cdcb0e60e215c65 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -6,8 +6,12 @@
  *
  * We use tuplesort.c to sort the given index tuples into order.
  * Then we scan the index tuples in order and build the btree pages
- * for each level.	When we have only one page on a level, it must be the
- * root -- it can be attached to the btree metapage and we are done.
+ * for each level.  We load source tuples into leaf-level pages.
+ * Whenever we fill a page at one level, we add a link to it to its
+ * parent level (starting a new parent level if necessary).  When
+ * done, we write out each final page on each level, adding it to
+ * its parent level.  When we have only one page on a level, it must be
+ * the root -- it can be attached to the btree metapage and we are done.
  *
  * this code is moderately slow (~10% slower) compared to the regular
  * btree (insertion) build code on sorted or well-clustered data.  on
@@ -23,12 +27,20 @@
  * something like the standard 70% steady-state load factor for btrees
  * would probably be better.
  *
+ * Another limitation is that we currently load full copies of all keys
+ * into upper tree levels.  The leftmost data key in each non-leaf node
+ * could be omitted as far as normal btree operations are concerned
+ * (see README for more info).  However, because we build the tree from
+ * the bottom up, we need that data key to insert into the node's parent.
+ * This could be fixed by keeping a spare copy of the minimum key in the
+ * state stack, but I haven't time for that right now.
+ *
  *
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.54 2000/06/15 04:09:36 momjian Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.55 2000/07/21 06:42:33 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -57,6 +69,20 @@ struct BTSpool
 	bool		isunique;
 };
 
+/*
+ * Status record for a btree page being built.  We have one of these
+ * for each active tree level.
+ */
+typedef struct BTPageState
+{
+	Buffer		btps_buf;		/* current buffer & page */
+	Page		btps_page;
+	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	int			btps_level;
+	struct BTPageState *btps_next; /* link to parent level, if any */
+} BTPageState;
+
+
 #define BTITEMSZ(btitem) \
 	((btitem) ? \
 	 (IndexTupleDSize((btitem)->bti_itup) + \
@@ -65,13 +91,11 @@ struct BTSpool
 
 
 static void _bt_load(Relation index, BTSpool *btspool);
-static BTItem _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
-			 BTPageState *state, BTItem bti, int flags);
+static void _bt_buildadd(Relation index, BTPageState *state,
+						 BTItem bti, int flags);
 static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend);
-static BTPageState *_bt_pagestate(Relation index, int flags,
-			  int level, bool doupper);
-static void _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
-				  BTPageState *state);
+static BTPageState *_bt_pagestate(Relation index, int flags, int level);
+static void _bt_uppershutdown(Relation index, BTPageState *state);
 
 
 /*
@@ -159,9 +183,6 @@ _bt_blnewpage(Relation index, Buffer *buf, Page *page, int flags)
 	BTPageOpaque opaque;
 
 	*buf = _bt_getbuf(index, P_NEW, BT_WRITE);
-#ifdef NOT_USED
-	printf("\tblk=%d\n", BufferGetBlockNumber(*buf));
-#endif
 	*page = BufferGetPage(*buf);
 	_bt_pageinit(*page, BufferGetPageSize(*buf));
 	opaque = (BTPageOpaque) PageGetSpecialPointer(*page);
@@ -202,18 +223,15 @@ _bt_slideleft(Relation index, Buffer buf, Page page)
  * is suitable for immediate use by _bt_buildadd.
  */
 static BTPageState *
-_bt_pagestate(Relation index, int flags, int level, bool doupper)
+_bt_pagestate(Relation index, int flags, int level)
 {
 	BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState));
 
 	MemSet((char *) state, 0, sizeof(BTPageState));
 	_bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags);
-	state->btps_firstoff = InvalidOffsetNumber;
 	state->btps_lastoff = P_HIKEY;
-	state->btps_lastbti = (BTItem) NULL;
 	state->btps_next = (BTPageState *) NULL;
 	state->btps_level = level;
-	state->btps_doupper = doupper;
 
 	return state;
 }
@@ -240,31 +258,27 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
 }
 
 /*
- * add an item to a disk page from a merge tape block.
+ * add an item to a disk page from the sort output.
  *
  * we must be careful to observe the following restrictions, placed
  * upon us by the conventions in nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at
  *	 P_FIRSTKEY.
- * - duplicates cannot be split among pages unless the chain of
- *	 duplicates starts at the first data item.
  *
  * a leaf page being built looks like:
  *
  * +----------------+---------------------------------+
  * | PageHeaderData | linp0 linp1 linp2 ...			  |
  * +-----------+----+---------------------------------+
- * | ... linpN |				  ^ first			  |
+ * | ... linpN |									  |
  * +-----------+--------------------------------------+
  * |	 ^ last										  |
  * |												  |
- * |			   v last							  |
  * +-------------+------------------------------------+
  * |			 | itemN ...						  |
  * +-------------+------------------+-----------------+
  * |		  ... item3 item2 item1 | "special space" |
  * +--------------------------------+-----------------+
- *						^ first
  *
  * contrast this with the diagram in bufpage.h; note the mismatch
  * between linps and items.  this is because we reserve linp0 as a
@@ -272,30 +286,20 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
  * filled up the page, we will set linp0 to point to itemN and clear
  * linpN.
  *
- * 'last' pointers indicate the last offset/item added to the page.
- * 'first' pointers indicate the first offset/item that is part of a
- * chain of duplicates extending from 'first' to 'last'.
- *
- * if all keys are unique, 'first' will always be the same as 'last'.
+ * 'last' pointer indicates the last offset added to the page.
  */
-static BTItem
-_bt_buildadd(Relation index, Size keysz, ScanKey scankey,
-			 BTPageState *state, BTItem bti, int flags)
+static void
+_bt_buildadd(Relation index, BTPageState *state, BTItem bti, int flags)
 {
 	Buffer		nbuf;
 	Page		npage;
-	BTItem		last_bti;
-	OffsetNumber first_off;
 	OffsetNumber last_off;
-	OffsetNumber off;
 	Size		pgspc;
 	Size		btisz;
 
 	nbuf = state->btps_buf;
 	npage = state->btps_page;
-	first_off = state->btps_firstoff;
 	last_off = state->btps_lastoff;
-	last_bti = state->btps_lastbti;
 
 	pgspc = PageGetFreeSpace(npage);
 	btisz = BTITEMSZ(bti);
@@ -319,75 +323,55 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
 
 	if (pgspc < btisz)
 	{
+		/*
+		 * Item won't fit on this page, so finish off the page and
+		 * write it out.
+		 */
 		Buffer		obuf = nbuf;
 		Page		opage = npage;
-		OffsetNumber o,
-					n;
 		ItemId		ii;
 		ItemId		hii;
+		BTItem		nbti;
 
 		_bt_blnewpage(index, &nbuf, &npage, flags);
 
 		/*
-		 * if 'last' is part of a chain of duplicates that does not start
-		 * at the beginning of the old page, the entire chain is copied to
-		 * the new page; we delete all of the duplicates from the old page
-		 * except the first, which becomes the high key item of the old
-		 * page.
+		 * We copy the last item on the page into the new page, and then
+		 * rearrange the old page so that the 'last item' becomes its high
+		 * key rather than a true data item.
 		 *
-		 * if the chain starts at the beginning of the page or there is no
-		 * chain ('first' == 'last'), we need only copy 'last' to the new
-		 * page.  again, 'first' (== 'last') becomes the high key of the
-		 * old page.
-		 *
-		 * note that in either case, we copy at least one item to the new
-		 * page, so 'last_bti' will always be valid.  'bti' will never be
-		 * the first data item on the new page.
+		 * note that since we always copy an item to the new page,
+		 * 'bti' will never be the first data item on the new page.
 		 */
-		if (first_off == P_FIRSTKEY)
+		ii = PageGetItemId(opage, last_off);
+		if (PageAddItem(npage, PageGetItem(opage, ii), ii->lp_len,
+						P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
+			elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
+#ifdef FASTBUILD_DEBUG
 		{
-			Assert(last_off != P_FIRSTKEY);
-			first_off = last_off;
+			bool		isnull;
+			BTItem		tmpbti =
+				(BTItem) PageGetItem(npage, PageGetItemId(npage, P_FIRSTKEY));
+			Datum		d = index_getattr(&(tmpbti->bti_itup), 1,
+										  index->rd_att, &isnull);
+
+			printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
+				   d, P_FIRSTKEY, state->btps_level);
 		}
-		for (o = first_off, n = P_FIRSTKEY;
-			 o <= last_off;
-			 o = OffsetNumberNext(o), n = OffsetNumberNext(n))
-		{
-			ii = PageGetItemId(opage, o);
-			if (PageAddItem(npage, PageGetItem(opage, ii),
-						  ii->lp_len, n, LP_USED) == InvalidOffsetNumber)
-				elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
-#ifdef FASTBUILD_DEBUG
-			{
-				bool		isnull;
-				BTItem		tmpbti =
-				(BTItem) PageGetItem(npage, PageGetItemId(npage, n));
-				Datum		d = index_getattr(&(tmpbti->bti_itup), 1,
-											  index->rd_att, &isnull);
-
-				printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
-					   d, n, state->btps_level);
-			}
 #endif
-		}
 
 		/*
-		 * this loop is backward because PageIndexTupleDelete shuffles the
-		 * tuples to fill holes in the page -- by starting at the end and
-		 * working back, we won't create holes (and thereby avoid
-		 * shuffling).
+		 * Move 'last' into the high key position on opage
 		 */
-		for (o = last_off; o > first_off; o = OffsetNumberPrev(o))
-			PageIndexTupleDelete(opage, o);
 		hii = PageGetItemId(opage, P_HIKEY);
-		ii = PageGetItemId(opage, first_off);
 		*hii = *ii;
 		ii->lp_flags &= ~LP_USED;
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		first_off = P_FIRSTKEY;
+		/*
+		 * Reset last_off to point to new page
+		 */
 		last_off = PageGetMaxOffsetNumber(npage);
-		last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, last_off));
 
 		/*
 		 * set the page (side link) pointers.
@@ -399,32 +383,21 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
 			oopaque->btpo_next = BufferGetBlockNumber(nbuf);
 			nopaque->btpo_prev = BufferGetBlockNumber(obuf);
 			nopaque->btpo_next = P_NONE;
-
-			if (_bt_itemcmp(index, keysz, scankey,
-			  (BTItem) PageGetItem(opage, PageGetItemId(opage, P_HIKEY)),
-			(BTItem) PageGetItem(opage, PageGetItemId(opage, P_FIRSTKEY)),
-							BTEqualStrategyNumber))
-				oopaque->btpo_flags |= BTP_CHAIN;
 		}
 
 		/*
-		 * copy the old buffer's minimum key to its parent.  if we don't
-		 * have a parent, we have to create one; this adds a new btree
-		 * level.
+		 * Link the old buffer into its parent, using its minimum key.
+		 * If we don't have a parent, we have to create one;
+		 * this adds a new btree level.
 		 */
-		if (state->btps_doupper)
+		if (state->btps_next == (BTPageState *) NULL)
 		{
-			BTItem		nbti;
-
-			if (state->btps_next == (BTPageState *) NULL)
-			{
-				state->btps_next =
-					_bt_pagestate(index, 0, state->btps_level + 1, true);
-			}
-			nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
-			_bt_buildadd(index, keysz, scankey, state->btps_next, nbti, 0);
-			pfree((void *) nbti);
+			state->btps_next =
+				_bt_pagestate(index, 0, state->btps_level + 1);
 		}
+		nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
+		_bt_buildadd(index, state->btps_next, nbti, 0);
+		pfree((void *) nbti);
 
 		/*
 		 * write out the old stuff.  we never want to see it again, so we
@@ -435,11 +408,11 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
 	}
 
 	/*
-	 * if this item is different from the last item added, we start a new
-	 * chain of duplicates.
+	 * Add the new item into the current page.
 	 */
-	off = OffsetNumberNext(last_off);
-	if (PageAddItem(npage, (Item) bti, btisz, off, LP_USED) == InvalidOffsetNumber)
+	last_off = OffsetNumberNext(last_off);
+	if (PageAddItem(npage, (Item) bti, btisz,
+					last_off, LP_USED) == InvalidOffsetNumber)
 		elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");
 #ifdef FASTBUILD_DEBUG
 	{
@@ -447,65 +420,57 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
 		Datum		d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull);
 
 		printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n",
-			   d, off, state->btps_level);
+			   d, last_off, state->btps_level);
 	}
 #endif
-	if (last_bti == (BTItem) NULL)
-		first_off = P_FIRSTKEY;
-	else if (!_bt_itemcmp(index, keysz, scankey,
-						  bti, last_bti, BTEqualStrategyNumber))
-		first_off = off;
-	last_off = off;
-	last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, off));
 
 	state->btps_buf = nbuf;
 	state->btps_page = npage;
-	state->btps_lastbti = last_bti;
 	state->btps_lastoff = last_off;
-	state->btps_firstoff = first_off;
-
-	return last_bti;
 }
 
+/*
+ * Finish writing out the completed btree.
+ */
 static void
-_bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
-				  BTPageState *state)
+_bt_uppershutdown(Relation index, BTPageState *state)
 {
 	BTPageState *s;
 	BlockNumber blkno;
 	BTPageOpaque opaque;
 	BTItem		bti;
 
+	/*
+	 * Each iteration of this loop completes one more level of the tree.
+	 */
 	for (s = state; s != (BTPageState *) NULL; s = s->btps_next)
 	{
 		blkno = BufferGetBlockNumber(s->btps_buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page);
 
 		/*
-		 * if this is the root, attach it to the metapage.	otherwise,
-		 * stick the minimum key of the last page on this level (which has
-		 * not been split, or else it wouldn't be the last page) into its
-		 * parent.	this may cause the last page of upper levels to split,
-		 * but that's not a problem -- we haven't gotten to them yet.
+		 * We have to link the last page on this level to somewhere.
+		 *
+		 * If we're at the top, it's the root, so attach it to the metapage.
+		 * Otherwise, add an entry for it to its parent using its minimum
+		 * key.  This may cause the last page of the parent level to split,
+		 * but that's not a problem -- we haven't gotten to it yet.
 		 */
-		if (s->btps_doupper)
+		if (s->btps_next == (BTPageState *) NULL)
 		{
-			if (s->btps_next == (BTPageState *) NULL)
-			{
-				opaque->btpo_flags |= BTP_ROOT;
-				_bt_metaproot(index, blkno, s->btps_level + 1);
-			}
-			else
-			{
-				bti = _bt_minitem(s->btps_page, blkno, 0);
-				_bt_buildadd(index, keysz, scankey, s->btps_next, bti, 0);
-				pfree((void *) bti);
-			}
+			opaque->btpo_flags |= BTP_ROOT;
+			_bt_metaproot(index, blkno, s->btps_level + 1);
+		}
+		else
+		{
+			bti = _bt_minitem(s->btps_page, blkno, 0);
+			_bt_buildadd(index, s->btps_next, bti, 0);
+			pfree((void *) bti);
 		}
 
 		/*
-		 * this is the rightmost page, so the ItemId array needs to be
-		 * slid back one slot.
+		 * This is the rightmost page, so the ItemId array needs to be
+		 * slid back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(index, s->btps_buf, s->btps_page);
 		_bt_wrtbuf(index, s->btps_buf);
@@ -519,32 +484,27 @@ _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
 static void
 _bt_load(Relation index, BTSpool *btspool)
 {
-	BTPageState *state;
-	ScanKey		skey;
-	int			natts;
-	BTItem		bti;
-	bool		should_free;
-
-	/*
-	 * initialize state needed for the merge into the btree leaf pages.
-	 */
-	state = _bt_pagestate(index, BTP_LEAF, 0, true);
-
-	skey = _bt_mkscankey_nodata(index);
-	natts = RelationGetNumberOfAttributes(index);
+	BTPageState *state = NULL;
 
 	for (;;)
 	{
+		BTItem		bti;
+		bool		should_free;
+
 		bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true,
 											   &should_free);
 		if (bti == (BTItem) NULL)
 			break;
-		_bt_buildadd(index, natts, skey, state, bti, BTP_LEAF);
+
+		/* When we see first tuple, create first index page */
+		if (state == NULL)
+			state = _bt_pagestate(index, BTP_LEAF, 0);
+
+		_bt_buildadd(index, state, bti, BTP_LEAF);
 		if (should_free)
 			pfree((void *) bti);
 	}
 
-	_bt_uppershutdown(index, natts, skey, state);
-
-	_bt_freeskey(skey);
+	if (state != NULL)
+		_bt_uppershutdown(index, state);
 }
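Note on the nbtsort.c changes above: the bottom-up build (one state record
per active level, a parent level created on demand, and a final pass that
links the last page of each level into its parent) can be sketched as a
small standalone program. This is a simplified model under stated
assumptions, not the real implementation. LevelState, PAGE_CAPACITY,
build_add(), upper_shutdown() and toy_build() are hypothetical names, keys
are plain ints, a page is "full" after a fixed number of keys rather than by
byte size, and the high-key handling done by _bt_buildadd() is omitted.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_CAPACITY 4			/* toy limit: keys per page (assumption) */
#define MAX_LEVELS 16			/* ample for int-sized inputs at this capacity */

/* Toy analogue of BTPageState: one of these per active tree level. */
typedef struct LevelState
{
	int			keys[PAGE_CAPACITY];	/* keys on the page being filled */
	int			nkeys;
	int			level;
	struct LevelState *next;	/* link to parent level, if any */
} LevelState;

static int	pages_written[MAX_LEVELS];	/* finished pages per level */

static LevelState *
level_new(int level)
{
	LevelState *s = calloc(1, sizeof(LevelState));

	if (s == NULL)
		abort();
	s->level = level;
	return s;
}

/* Add a key to the current page at this level, finishing the page if full. */
static void
build_add(LevelState *s, int key)
{
	if (s->nkeys == PAGE_CAPACITY)
	{
		/* Link the full page into its parent, using its minimum key. */
		if (s->next == NULL)
			s->next = level_new(s->level + 1);	/* adds a tree level */
		build_add(s->next, s->keys[0]);
		pages_written[s->level]++;
		s->nkeys = 0;			/* start a fresh page */
	}
	s->keys[s->nkeys++] = key;
}

/* Finish the last page on every level; returns the tree height. */
static int
upper_shutdown(LevelState *s)
{
	int			height = 0;

	while (s != NULL)
	{
		LevelState *parent = s->next;

		/* No parent means this page is the root; else link it upward. */
		if (parent != NULL)
			build_add(parent, s->keys[0]);
		pages_written[s->level]++;
		height = s->level + 1;
		free(s);
		s = parent;
	}
	return height;
}

/* Load already-sorted keys; returns tree height (0 for an empty input). */
static int
toy_build(const int *sorted_keys, int n)
{
	LevelState *leaf = NULL;
	int			i;

	memset(pages_written, 0, sizeof(pages_written));
	for (i = 0; i < n; i++)
	{
		/* When we see the first key, create the first leaf page. */
		if (leaf == NULL)
			leaf = level_new(0);
		build_add(leaf, sorted_keys[i]);
	}
	return (leaf != NULL) ? upper_shutdown(leaf) : 0;
}
```

Calling toy_build() on ten sorted keys with PAGE_CAPACITY set to 4 yields
three leaf pages and one root page. An empty input creates no pages at all,
which mirrors the "state == NULL" handling added to _bt_load().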
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5853267670fe4f103b6e81d80ceb6703e0a21ae5..aabdf80900797ca0cc2398656b7df983147381c2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.37 2000/05/30 04:24:33 tgl Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.38 2000/07/21 06:42:33 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -20,16 +20,13 @@
 #include "access/nbtree.h"
 #include "executor/execdebug.h"
 
-extern int	NIndexTupleProcessed;
-
 
 /*
  * _bt_mkscankey
  *		Build a scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_skeycmp() or _bt_compare(),
- *		although it could be used with _bt_itemcmp() or _bt_tuplecompare().
+ *		The result is intended for use with _bt_compare().
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -68,8 +65,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
  *		Build a scan key that contains comparator routines appropriate to
  *		the key datatypes, but no comparison data.
  *
- *		The result can be used with _bt_itemcmp() or _bt_tuplecompare(),
- *		but not with _bt_skeycmp() or _bt_compare().
+ *		The result cannot be used with _bt_compare().  Currently this
+ *		routine is only called by utils/sort/tuplesort.c, which has its
+ *		own comparison routine.
  */
 ScanKey
 _bt_mkscankey_nodata(Relation rel)
@@ -114,7 +112,6 @@ _bt_freestack(BTStack stack)
 	{
 		ostack = stack;
 		stack = stack->bts_parent;
-		pfree(ostack->bts_btitem);
 		pfree(ostack);
 	}
 }
@@ -331,55 +328,16 @@ _bt_formitem(IndexTuple itup)
 	Size		tuplen;
 	extern Oid	newoid();
 
-	/*
-	 * see comments in btbuild
-	 *
-	 * if (itup->t_info & INDEX_NULL_MASK) elog(ERROR, "btree indices cannot
-	 * include null keys");
-	 */
-
 	/* make a copy of the index tuple with room for the sequence number */
 	tuplen = IndexTupleSize(itup);
 	nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData));
 
 	btitem = (BTItem) palloc(nbytes_btitem);
-	memmove((char *) &(btitem->bti_itup), (char *) itup, tuplen);
+	memcpy((char *) &(btitem->bti_itup), (char *) itup, tuplen);
 
 	return btitem;
 }
 
-#ifdef NOT_USED
-bool
-_bt_checkqual(IndexScanDesc scan, IndexTuple itup)
-{
-	BTScanOpaque so;
-
-	so = (BTScanOpaque) scan->opaque;
-	if (so->numberOfKeys > 0)
-		return (index_keytest(itup, RelationGetDescr(scan->relation),
-							  so->numberOfKeys, so->keyData));
-	else
-		return true;
-}
-
-#endif
-
-#ifdef NOT_USED
-bool
-_bt_checkforkeys(IndexScanDesc scan, IndexTuple itup, Size keysz)
-{
-	BTScanOpaque so;
-
-	so = (BTScanOpaque) scan->opaque;
-	if (keysz > 0 && so->numberOfKeys >= keysz)
-		return (index_keytest(itup, RelationGetDescr(scan->relation),
-							  keysz, so->keyData));
-	else
-		return true;
-}
-
-#endif
-
 bool
 _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok)
 {
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 43cabceba141bfa224ec9e6a4ed42bd492ba3db5..1a970a137514a3bf5c8ae010ed837e25bef32030 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.30 2000/07/03 02:54:16 vadim Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.31 2000/07/21 06:42:33 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,10 +19,10 @@
 
 #include "storage/bufpage.h"
 
+
 static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr,
 									   char *location, Size size);
 
-static bool PageManagerShuffle = true;	/* default is shuffle mode */
 
 /* ----------------------------------------------------------------
  *						Page support functions
@@ -53,21 +53,17 @@ PageInit(Page page, Size pageSize, Size specialSize)
 /* ----------------
  *		PageAddItem
  *
- *		add an item to a page.
- *
- *   !!! ELOG(ERROR) IS DISALLOWED HERE !!!
+ *		Add an item to a page.  Return value is offset at which it was
+ *		inserted, or InvalidOffsetNumber if there's not room to insert.
  *
- *	 Notes on interface:
- *		If offsetNumber is valid, shuffle ItemId's down to make room
- *		to use it, if PageManagerShuffle is true.  If PageManagerShuffle is
- *		false, then overwrite the specified ItemId.  (PageManagerShuffle is
- *		true by default, and is modified by calling PageManagerModeSet.)
+ *		If offsetNumber is valid and <= current max offset in the page,
+ *		insert item into the array at that position by shuffling ItemId's
+ *		down to make room.
  *		If offsetNumber is not valid, then assign one by finding the first
  *		one that is both unused and deallocated.
  *
- *	 NOTE: If offsetNumber is valid, and PageManagerShuffle is true, it
- *		is assumed that there is room on the page to shuffle the ItemId's
- *		down by one.
+ *   !!! ELOG(ERROR) IS DISALLOWED HERE !!!
+ *
  * ----------------
  */
 OffsetNumber
@@ -82,11 +78,8 @@ PageAddItem(Page page,
 	Offset		lower;
 	Offset		upper;
 	ItemId		itemId;
-	ItemId		fromitemId,
-				toitemId;
 	OffsetNumber limit;
-
-	bool		shuffled = false;
+	bool		needshuffle = false;
 
 	/*
 	 * Find first unallocated offsetNumber
@@ -96,31 +89,12 @@ PageAddItem(Page page,
 	/* was offsetNumber passed in? */
 	if (OffsetNumberIsValid(offsetNumber))
 	{
-		if (PageManagerShuffle == true)
-		{
-			/* shuffle ItemId's (Do the PageManager Shuffle...) */
-			for (i = (limit - 1); i >= offsetNumber; i--)
-			{
-				fromitemId = &((PageHeader) page)->pd_linp[i - 1];
-				toitemId = &((PageHeader) page)->pd_linp[i];
-				*toitemId = *fromitemId;
-			}
-			shuffled = true;	/* need to increase "lower" */
-		}
-		else
-		{						/* overwrite mode */
-			itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
-			if (((*itemId).lp_flags & LP_USED) ||
-				((*itemId).lp_len != 0))
-			{
-				elog(NOTICE, "PageAddItem: tried overwrite of used ItemId");
-				return InvalidOffsetNumber;
-			}
-		}
+		needshuffle = true;		/* need to increase "lower" */
+		/* don't actually do the shuffle till we've checked free space! */
 	}
 	else
-	{							/* offsetNumber was not passed in, so find
-								 * one */
+	{
+		/* offsetNumber was not passed in, so find one */
 		/* look for "recyclable" (unused & deallocated) ItemId */
 		for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
 		{
@@ -130,9 +104,13 @@ PageAddItem(Page page,
 				break;
 		}
 	}
+
+	/*
+	 * Compute new lower and upper pointers for page; see if it'll fit
+	 */
 	if (offsetNumber > limit)
 		lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page));
-	else if (offsetNumber == limit || shuffled == true)
+	else if (offsetNumber == limit || needshuffle)
 		lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData);
 	else
 		lower = ((PageHeader) page)->pd_lower;
@@ -144,6 +122,23 @@ PageAddItem(Page page,
 	if (lower > upper)
 		return InvalidOffsetNumber;
 
+	/*
+	 * OK to insert the item.  First, shuffle the existing pointers if needed.
+	 */
+	if (needshuffle)
+	{
+		/* shuffle ItemId's (Do the PageManager Shuffle...) */
+		for (i = (limit - 1); i >= offsetNumber; i--)
+		{
+			ItemId		fromitemId,
+						toitemId;
+
+			fromitemId = &((PageHeader) page)->pd_linp[i - 1];
+			toitemId = &((PageHeader) page)->pd_linp[i];
+			*toitemId = *fromitemId;
+		}
+	}
+
 	itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
 	(*itemId).lp_off = upper;
 	(*itemId).lp_len = size;
@@ -168,9 +163,7 @@ PageGetTempPage(Page page, Size specialSize)
 	PageHeader	thdr;
 
 	pageSize = PageGetPageSize(page);
-
-	if ((temp = (Page) palloc(pageSize)) == (Page) NULL)
-		elog(FATAL, "Cannot allocate %d bytes for temp page.", pageSize);
+	temp = (Page) palloc(pageSize);
 	thdr = (PageHeader) temp;
 
 	/* copy old page in */
@@ -327,23 +320,6 @@ PageGetFreeSpace(Page page)
 	return space;
 }
 
-/*
- * PageManagerModeSet
- *
- *	 Sets mode to either: ShufflePageManagerMode (the default) or
- *	 OverwritePageManagerMode.	For use by access methods code
- *	 for determining semantics of PageAddItem when the offsetNumber
- *	 argument is passed in.
- */
-void
-PageManagerModeSet(PageManagerMode mode)
-{
-	if (mode == ShufflePageManagerMode)
-		PageManagerShuffle = true;
-	else if (mode == OverwritePageManagerMode)
-		PageManagerShuffle = false;
-}
-
 /*
  *----------------------------------------------------------------
  * PageIndexTupleDelete
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 49d9dd07dcbf96895672d5ed503b13034e986ff4..3f8eebc3b3614d3e58bafa201110c7f11a33e4a6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $Id: nbtree.h,v 1.38 2000/06/15 03:32:31 momjian Exp $
+ * $Id: nbtree.h,v 1.39 2000/07/21 06:42:35 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -24,14 +24,9 @@
  *	info.  In addition, we need to know what sort of page this is
  *	(leaf or internal), and whether the page is available for reuse.
  *
- *	Lehman and Yao's algorithm requires a ``high key'' on every page.
- *	The high key on a page is guaranteed to be greater than or equal
- *	to any key that appears on this page.  Our insertion algorithm
- *	guarantees that we can use the initial least key on our right
- *	sibling as the high key.  We allocate space for the line pointer
- *	to the high key in the opaque data at the end of the page.
- *
- *	Rightmost pages in the tree have no high key.
+ *	We also store a back-link to the parent page, but this cannot be trusted
+ *	very far since it does not get updated when the parent is split.
+ *	See backend/access/nbtree/README for details.
  */
 
 typedef struct BTPageOpaqueData
@@ -41,11 +36,11 @@ typedef struct BTPageOpaqueData
 	BlockNumber btpo_parent;
 	uint16		btpo_flags;
 
-#define BTP_LEAF		(1 << 0)
-#define BTP_ROOT		(1 << 1)
-#define BTP_FREE		(1 << 2)
-#define BTP_META		(1 << 3)
-#define BTP_CHAIN		(1 << 4)
+/* Bits defined in btpo_flags */
+#define BTP_LEAF		(1 << 0)	/* It's a leaf page */
+#define BTP_ROOT		(1 << 1)	/* It's the root page (has no parent) */
+#define BTP_FREE		(1 << 2)	/* not currently used... */
+#define BTP_META		(1 << 3)	/* Set in the meta-page only */
 
 } BTPageOpaqueData;
 
@@ -84,21 +79,24 @@ typedef struct BTScanOpaqueData
 typedef BTScanOpaqueData *BTScanOpaque;
 
 /*
- *	BTItems are what we store in the btree.  Each item has an index
- *	tuple, including key and pointer values.  In addition, we must
- *	guarantee that all tuples in the index are unique, in order to
- *	satisfy some assumptions in Lehman and Yao.  The way that we do
- *	this is by generating a new OID for every insertion that we do in
- *	the tree.  This adds eight bytes to the size of btree index
- *	tuples.  Note that we do not use the OID as part of a composite
- *	key; the OID only serves as a unique identifier for a given index
- *	tuple (logical position within a page).
+ *	BTItems are what we store in the btree.  Each item is an index tuple,
+ *	including key and pointer values.  (In some cases either the key or the
+ *	pointer may go unused, see backend/access/nbtree/README for details.)
+ *
+ *	Old comments:
+ *	In addition, we must guarantee that all tuples in the index are unique,
+ *	in order to satisfy some assumptions in Lehman and Yao.  The way that we
+ *	do this is by generating a new OID for every insertion that we do in the
+ *	tree.  This adds eight bytes to the size of btree index tuples.  Note
+ *	that we do not use the OID as part of a composite key; the OID only
+ *	serves as a unique identifier for a given index tuple (logical position
+ *	within a page).
  *
  *	New comments:
  *	actually, we must guarantee that all tuples in A LEVEL
  *	are unique, not in ALL INDEX. So, we can use bti_itup->t_tid
  *	as unique identifier for a given index tuple (logical position
- *	within a level).	- vadim 04/09/97
+ *	within a level). - vadim 04/09/97
  */
 
 typedef struct BTItemData
@@ -108,12 +106,13 @@ typedef struct BTItemData
 
 typedef BTItemData *BTItem;
 
-#define BTItemSame(i1, i2)	  ( i1->bti_itup.t_tid.ip_blkid.bi_hi == \
-								i2->bti_itup.t_tid.ip_blkid.bi_hi && \
-								i1->bti_itup.t_tid.ip_blkid.bi_lo == \
-								i2->bti_itup.t_tid.ip_blkid.bi_lo && \
-								i1->bti_itup.t_tid.ip_posid == \
-								i2->bti_itup.t_tid.ip_posid )
+/* Test whether items are the "same" per the above notes */
+#define BTItemSame(i1, i2)	  ( (i1)->bti_itup.t_tid.ip_blkid.bi_hi == \
+								(i2)->bti_itup.t_tid.ip_blkid.bi_hi && \
+								(i1)->bti_itup.t_tid.ip_blkid.bi_lo == \
+								(i2)->bti_itup.t_tid.ip_blkid.bi_lo && \
+								(i1)->bti_itup.t_tid.ip_posid == \
+								(i2)->bti_itup.t_tid.ip_posid )
 
 /*
  *	BTStackData -- As we descend a tree, we push the (key, pointer)
@@ -129,24 +128,12 @@ typedef struct BTStackData
 {
 	BlockNumber bts_blkno;
 	OffsetNumber bts_offset;
-	BTItem		bts_btitem;
+	BTItemData	bts_btitem;
 	struct BTStackData *bts_parent;
 } BTStackData;
 
 typedef BTStackData *BTStack;
 
-typedef struct BTPageState
-{
-	Buffer		btps_buf;
-	Page		btps_page;
-	BTItem		btps_lastbti;
-	OffsetNumber btps_lastoff;
-	OffsetNumber btps_firstoff;
-	int			btps_level;
-	bool		btps_doupper;
-	struct BTPageState *btps_next;
-} BTPageState;
-
 /*
  *	We need to be able to tell the difference between read and write
  *	requests for pages, in order to do locking correctly.
@@ -155,31 +142,49 @@ typedef struct BTPageState
 #define BT_READ			BUFFER_LOCK_SHARE
 #define BT_WRITE		BUFFER_LOCK_EXCLUSIVE
 
-/*
- *	Similarly, the difference between insertion and non-insertion binary
- *	searches on a given page makes a difference when we're descending the
- *	tree.
- */
-
-#define BT_INSERTION	0
-#define BT_DESCENT		1
-
 /*
  *	In general, the btree code tries to localize its knowledge about
  *	page layout to a couple of routines.  However, we need a special
  *	value to indicate "no page number" in those places where we expect
- *	page numbers.
+ *	page numbers.  We can use zero for this because we never need to
+ *	make a pointer to the metadata page.
  */
 
 #define P_NONE			0
+
+/*
+ * Macros to test whether a page is leftmost or rightmost on its tree level,
+ * as well as other state info kept in the opaque data.
+ */
 #define P_LEFTMOST(opaque)		((opaque)->btpo_prev == P_NONE)
 #define P_RIGHTMOST(opaque)		((opaque)->btpo_next == P_NONE)
+#define P_ISLEAF(opaque)		((opaque)->btpo_flags & BTP_LEAF)
+#define P_ISROOT(opaque)		((opaque)->btpo_flags & BTP_ROOT)
+
+/*
+ *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
+ *	page.  The high key is not a data key, but gives info about what range of
+ *	keys is supposed to be on this page.  The high key on a page is required
+ *	to be greater than or equal to any data key that appears on the page.
+ *	If we find ourselves trying to insert a key > high key, we know we need
+ *	to move right (this should only happen if the page was split since we
+ *	examined the parent page).
+ *
+ *	Our insertion algorithm guarantees that we can use the initial least key
+ *	on our right sibling as the high key.  Once a page is created, its high
+ *	key changes only if the page is split.
+ *
+ *	On a non-rightmost page, the high key lives in item 1 and data items
+ *	start in item 2.  Rightmost pages have no high key, so we store data
+ *	items beginning in item 1.
+ */
 
 #define P_HIKEY			((OffsetNumber) 1)
 #define P_FIRSTKEY		((OffsetNumber) 2)
+#define P_FIRSTDATAKEY(opaque)  (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
 
 /*
- *	Strategy numbers -- ordering of these is <, <=, =, >=, >
+ *	Operator strategy numbers -- ordering of these is <, <=, =, >=, >
  */
 
 #define BTLessStrategyNumber			1
@@ -199,13 +204,27 @@ typedef struct BTPageState
 
 #define BTORDER_PROC	1
 
+/*
+ * prototypes for functions in nbtree.c (external entry points for btree)
+ */
+extern bool BuildingBtree;		/* in nbtree.c */
+
+extern Datum btbuild(PG_FUNCTION_ARGS);
+extern Datum btinsert(PG_FUNCTION_ARGS);
+extern Datum btgettuple(PG_FUNCTION_ARGS);
+extern Datum btbeginscan(PG_FUNCTION_ARGS);
+extern Datum btrescan(PG_FUNCTION_ARGS);
+extern void btmovescan(IndexScanDesc scan, Datum v);
+extern Datum btendscan(PG_FUNCTION_ARGS);
+extern Datum btmarkpos(PG_FUNCTION_ARGS);
+extern Datum btrestrpos(PG_FUNCTION_ARGS);
+extern Datum btdelete(PG_FUNCTION_ARGS);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
 extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem,
 			 bool index_is_unique, Relation heapRel);
-extern bool _bt_itemcmp(Relation rel, Size keysz, ScanKey scankey,
-			BTItem item1, BTItem item2, StrategyNumber strat);
 
 /*
  * prototypes for functions in nbtpage.c
@@ -218,25 +237,8 @@ extern void _bt_wrtbuf(Relation rel, Buffer buf);
 extern void _bt_wrtnorelbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level);
-extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_pagedel(Relation rel, ItemPointer tid);
 
-/*
- * prototypes for functions in nbtree.c
- */
-extern bool BuildingBtree;		/* in nbtree.c */
-
-extern Datum btbuild(PG_FUNCTION_ARGS);
-extern Datum btinsert(PG_FUNCTION_ARGS);
-extern Datum btgettuple(PG_FUNCTION_ARGS);
-extern Datum btbeginscan(PG_FUNCTION_ARGS);
-extern Datum btrescan(PG_FUNCTION_ARGS);
-extern void btmovescan(IndexScanDesc scan, Datum v);
-extern Datum btendscan(PG_FUNCTION_ARGS);
-extern Datum btmarkpos(PG_FUNCTION_ARGS);
-extern Datum btrestrpos(PG_FUNCTION_ARGS);
-extern Datum btdelete(PG_FUNCTION_ARGS);
-
 /*
  * prototypes for functions in nbtscan.c
  */
@@ -249,13 +251,13 @@ extern void AtEOXact_nbtree(void);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel, int keysz, ScanKey scankey,
-		   Buffer *bufP);
+						  Buffer *bufP, int access);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 			  ScanKey scankey, int access);
-extern bool _bt_skeycmp(Relation rel, Size keysz, ScanKey scankey,
-			Page page, ItemId itemid, StrategyNumber strat);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, int srchtype);
+								ScanKey scankey);
+extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
+						 Page page, OffsetNumber offnum);
 extern RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 30b5a93ad6497e001c26cf8d8d8be95dbb9bf22c..8498c783a1168c3978f974b82a349b445ad639d6 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $Id: bufpage.h,v 1.30 2000/07/03 02:54:21 vadim Exp $
+ * $Id: bufpage.h,v 1.31 2000/07/21 06:42:39 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -309,7 +309,6 @@ extern Page PageGetTempPage(Page page, Size specialSize);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
-extern void PageManagerModeSet(PageManagerMode mode);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);