  1. Jul 05, 2017
  2. Jul 03, 2017
    • Fix race condition in recovery/t/009_twophase.pl test. · 64767522
      Tom Lane authored
      Since reducing pg_ctl's reaction time in commit c61559ec, some
      slower buildfarm members have shown erratic failures in this test.
      The reason turns out to be that the test assumes synchronous
      replication (because it does not provide any lag time for a commit
      to replicate before shutting down the servers), but it had only
      enabled sync rep in one direction.  The observed symptoms correspond
      to failure to replicate the last committed transaction in the other
      direction, which can be expected to happen if the shutdown command
      is issued soon enough and we are providing no synchronous-commit
      guarantees.
      
      Fix that, and add a bit more paranoid state checking at the bottom
      of the script.
      
      Michael Paquier and myself
      
      Discussion: https://postgr.es/m/908.1498965681@sss.pgh.pa.us
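      The fix described here amounts to providing synchronous-commit guarantees in
      both directions. As a rough sketch (node names illustrative, not taken from
      the test itself), each server names the other as its synchronous standby:

      ```
      # postgresql.conf on node A, naming node B as its synchronous standby
      # (and symmetrically on node B, naming node A):
      synchronous_standby_names = 'node_b'
      synchronous_commit = on   # the default; commits wait for the standby's reply
      ```

      With this in place a commit cannot be acknowledged before it has replicated,
      so an immediately following shutdown cannot lose it.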
  3. Jul 02, 2017
    • Try to improve readability of recovery/t/009_twophase.pl test. · 4e15387d
      Tom Lane authored
      The original coding here was very confusing, because it named the
      two servers it set up "master" and "slave" even though it swapped
      their replication roles multiple times.  At any given point in the
      script it was very unobvious whether "$node_master" actually referred
      to the server named "master" or the other one.  Instead, pick arbitrary
      names for the two servers --- I used "london" and "paris" --- and
      distinguish those permanent names from the nonce references $cur_master
      and $cur_slave.  Add logging to help distinguish which is which at
      any given point.  Also, use distinct data and transaction names to
      make all the prepared transactions easily distinguishable in the
      postmaster logs.  (There was one place where we intentionally tested
      that the server could cope with re-use of a transaction name, but
      it seems like one place is sufficient for that purpose.)
      
      Also, add checks at the end to make sure that all the transactions
      that were supposed to be committed did survive.
      
      Discussion: https://postgr.es/m/28238.1499010855@sss.pgh.pa.us
    • Improve TAP test function PostgresNode::poll_query_until(). · de3de0af
      Tom Lane authored
      Add an optional "expected" argument to override the default assumption
      that we're waiting for the query to return "t".  This allows replacing
      a handwritten polling loop in recovery/t/007_sync_rep.pl with use of
      poll_query_until(); AFAICS that's the only remaining ad-hoc polling
      loop in our TAP tests.
      
      Change poll_query_until() to probe ten times per second not once per
      second.  Like some similar changes I've been making recently, the
      one-second interval seems to be rooted in ancient traditions rather
      than the actual likely wait duration on modern machines.  I'd consider
      reducing it further if there were a convenient way to spawn just one
      psql for the whole loop rather than one per probe attempt.
      
      Discussion: https://postgr.es/m/12486.1498938782@sss.pgh.pa.us
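      The kind of hand-written polling loop this helper replaces can be sketched in
      shell; the probe command, interval, and timeout below are illustrative
      assumptions, not the actual Perl implementation:

      ```shell
      #!/bin/sh
      # Run a probe command repeatedly until its output matches an expected
      # string, probing ten times per second, with an overall probe limit.
      # Returns 0 on success, 1 on timeout.
      poll_until() {
          cmd=$1
          expected=$2
          max_probes=${3:-1800}    # ~180 seconds at 0.1 s per probe
          i=0
          while [ "$i" -lt "$max_probes" ]; do
              out=$(eval "$cmd")
              if [ "$out" = "$expected" ]; then
                  return 0
              fi
              sleep 0.1
              i=$((i + 1))
          done
          return 1                 # timed out without seeing the expected value
      }

      # In the real tests the probe would be a psql query, e.g.:
      #   psql -XAt -c "SELECT pg_is_in_recovery()"
      poll_until "echo t" "t" && echo "condition met"
      ```
      
      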
  4. Jul 01, 2017
    • Clean up misuse and nonuse of poll_query_until(). · b0f069d9
      Tom Lane authored
      Several callers of PostgresNode::poll_query_until() neglected to check
      for failure; I do not think that's optional.  Also, rewrite one place
      that had reinvented poll_query_until() for no very good reason.
  5. Jun 29, 2017
    • Eat XIDs more efficiently in recovery TAP test. · 08aed660
      Tom Lane authored
      The point of this loop is to insert 1000 rows into the test table
      and consume 1000 XIDs.  I can't see any good reason why it's useful
      to launch 1000 psqls and 1000 backend processes to accomplish that.
      Pushing the looping into a plpgsql DO block shaves about 10 seconds
      off the runtime of the src/test/recovery TAP tests on my machine;
      that's over 10% of the runtime of that test suite.
      
      It is, in fact, sufficiently more efficient that we now demonstrably
      need wait_slot_xmins() afterwards, or the slaves' xmins may not have
      moved yet.
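      A sketch of the DO-block technique described above (table name illustrative):
      each exception block opens a subtransaction, and a subtransaction that writes
      a row is assigned its own XID, so a single backend still consumes roughly a
      thousand XIDs without launching a thousand psql processes:

      ```sql
      DO $$
      BEGIN
          FOR i IN 1..1000 LOOP
              BEGIN
                  INSERT INTO test_tab VALUES (i);
              EXCEPTION
                  WHEN division_by_zero THEN NULL;  -- never raised; the block
                                                    -- only forces a subtransaction
              END;
          END LOOP;
      END
      $$;
      ```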
  6. Jun 26, 2017
    • Improve wait logic in TAP tests for streaming replication. · 5c77690f
      Tom Lane authored
      Remove hard-wired sleep(2) delays in 001_stream_rep.pl in favor of using
      poll_query_until to check for the desired state to appear.  In addition,
      add such a wait before the last test in the script, as it's possible
      to demonstrate failures there after upcoming improvements in pg_ctl.
      
      (We might end up adding polling before each of the get_slot_xmins calls in
      this script, but I feel no great need to do that until shown necessary.)
      
      In passing, clarify the description strings for some of the test cases.
      
      Michael Paquier and Craig Ringer, pursuant to a complaint from me
      
      Discussion: https://postgr.es/m/8962.1498425057@sss.pgh.pa.us
  7. May 18, 2017
  8. May 12, 2017
    • Avoid tests which crash the calling process on Windows · 734cb4c2
      Andrew Dunstan authored
      Certain recovery tests use the Perl IPC::Run module's start/kill_kill
      style of process control. On at least some versions of Perl this causes the
      whole process and its caller to crash. If we ever find a better way of
      doing these tests they can be re-enabled on this platform. This does not
      affect Mingw or Cygwin builds, which use a different perl and a
      different shell and so are not affected.
  9. May 11, 2017
  10. Apr 27, 2017
  11. Apr 25, 2017
    • Set the priorities of all quorum synchronous standbys to 1. · 346199dc
      Fujii Masao authored
      In quorum-based synchronous replication, all the standbys listed in
      synchronous_standby_names have an equal chance of being chosen
      as synchronous standbys, so they should have the same priority.
      However, previously, quorum standbys whose names appeared earlier
      in the list were given higher priority values, even though the difference in
      those priority values didn't affect the selection of synchronous standbys.
      Users could see those "meaningless" priority values in pg_stat_replication
      and this was confusing.
      
      This commit gives all the quorum synchronous standbys the same
      highest priority, i.e., 1, in order to remove such confusion.
      
      Author: Fujii Masao
      Reviewed-by: Masahiko Sawada, Kyotaro Horiguchi
      Discussion: http://postgr.es/m/CAHGQGwEKOw=SmPLxJzkBsH6wwDBgOnVz46QjHbtsiZ-d-2RGUg@mail.gmail.com
  12. Apr 22, 2017
    • Make PostgresNode::append_conf append a newline automatically. · 8a19c1a3
      Tom Lane authored
      Although the documentation for append_conf said clearly that it didn't
      add a newline, many test authors seem to have forgotten that ... or maybe
      they just consulted the example at the top of the POD documentation,
      which clearly shows adding a config entry without bothering to add a
      trailing newline.  The worst part of that is that it works, as long as
      you don't do it more than once, since the backend isn't picky about
      whether config files end with newlines.  So there's not a strong forcing
      function reminding test authors not to do it like that.  Upshot is that
      this is a terribly fragile way to go about things, and there's at least
      one existing test case that is demonstrably broken and not testing what
      it thinks it is.
      
      Let's just make append_conf append a newline, instead; that is clearly
      way safer than the old definition.
      
      I also cleaned up a few call sites that were unnecessarily ugly.
      (I left things alone in places where it's plausible that additional
      config lines would need to be added someday.)
      
      Back-patch the change in append_conf itself to 9.6 where it was added,
      as having a definitional inconsistency between branches would obviously
      be pretty hazardous for back-patching TAP tests.  The other changes are
      just cosmetic and don't need to be back-patched.
      
      Discussion: https://postgr.es/m/19751.1492892376@sss.pgh.pa.us
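      The hazard described above is easy to reproduce outside Perl; a minimal shell
      sketch (setting names illustrative) of appending two entries without trailing
      newlines, versus the fixed behaviour:

      ```shell
      #!/bin/sh
      # Simulate the old append_conf: append without a newline.  Two successive
      # appends fuse into ONE garbled config line.
      conf=$(mktemp)
      printf '%s' 'wal_level = replica'   >> "$conf"
      printf '%s' 'max_wal_senders = 5'   >> "$conf"
      cat "$conf"; echo    # wal_level = replicamax_wal_senders = 5

      # The fixed definition appends the newline for the caller, so each
      # setting lands on its own line.
      conf2=$(mktemp)
      printf '%s\n' 'wal_level = replica' >> "$conf2"
      printf '%s\n' 'max_wal_senders = 5' >> "$conf2"
      wc -l < "$conf2"     # 2 lines
      rm -f "$conf" "$conf2"
      ```
      
      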
  13. Mar 29, 2017
    • Change 'diag' to 'note' in TAP tests · 2e74e636
      Peter Eisentraut authored
      Reduce noise from TAP tests by changing 'diag' to 'note', so output only
      goes to the test's log file not stdout, unless in verbose mode.  This
      also removes the junk on screen when running the TAP tests in parallel.
      
      Author: Craig Ringer <craig@2ndquadrant.com>
  14. Mar 28, 2017
    • Cleanup slots during drop database · ff539da3
      Simon Riggs authored
      Automatically drop all logical replication slots associated with a
      database when the database is dropped. Previously we threw an ERROR
      if a slot existed. Now we throw ERROR only if a slot is active in
      the database being dropped.
      
      Craig Ringer
  15. Mar 25, 2017
    • Report catalog_xmin separately in hot_standby_feedback · 5737c12d
      Simon Riggs authored
      If the upstream walsender is using a physical replication slot, store the
      catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
      slot and has only a PGPROC entry, behaviour doesn't change, as we store the
      combined xmin and catalog_xmin in the PGPROC entry.
      
      Author: Craig Ringer
    • Fix recovery test hang · cd07f73d
      Peter Eisentraut authored
      The test would hang if a sufficient ~/.psqlrc was present.  Fix by using
      psql -X.
  16. Mar 24, 2017
  17. Mar 22, 2017
    • Teach xlogreader to follow timeline switches · 1148e22a
      Simon Riggs authored
      Uses a page-based mechanism to ensure we're using the correct timeline.
      
      Tests are included to exercise the functionality using a cold disk-level copy
      of the master that's started up as a replica with slots intact, but the
      intended use of the functionality is with later features.
      
      Craig Ringer, reviewed by Simon Riggs and Andres Freund
    • Avoid Perl warning · 9ca2dd57
      Peter Eisentraut authored
      Perl versions before 5.12 would warn "Use of implicit split to @_ is
      deprecated".
      
      Author: Jeff Janes <jeff.janes@gmail.com>
  18. Mar 21, 2017
    • Add a pg_recvlogical wrapper to PostgresNode · eb2a6131
      Simon Riggs authored
      Allows testing of logical decoding using the SQL interface and/or pg_recvlogical.
      Most logical decoding tests are in contrib/test_decoding. This module
      is for work that doesn't fit well there, like where server restarts
      are required.
      
      Craig Ringer
  19. Feb 26, 2017
  20. Feb 21, 2017
  21. Feb 09, 2017
  22. Jan 26, 2017
    • Reset hot standby xmin on master after restart · ec4b9750
      Simon Riggs authored
      hot_standby_feedback could be reset by a reload and worked correctly, but if
      the server was restarted rather than reloaded, the xmin was not reset.
      Always force a reset if hot_standby_feedback is enabled at startup.
      
      Ants Aasma, Craig Ringer
      
      Reported-by: Ants Aasma
  23. Jan 14, 2017
  24. Jan 04, 2017
    • Add 18 new recovery TAP tests · 0813216c
      Simon Riggs authored
      Add new tests for physical repl slots and hot standby feedback.
      
      Craig Ringer, reviewed by Aleksander Alekseev and Simon Riggs
    • Allow PostgresNode.pm tests to wait for catchup · fb093e4c
      Simon Riggs authored
      Add methods to the core test framework PostgresNode.pm to allow us to
      test that standby nodes have caught up with the master, as well as
      basic LSN handling.  Used in tests recovery/t/001_stream_rep.pl and
      recovery/t/004_timeline_switch.pl
      
      Craig Ringer, reviewed by Aleksander Alekseev and Simon Riggs
  25. Jan 03, 2017
  26. Dec 19, 2016
    • Support quorum-based synchronous replication. · 3901fd70
      Fujii Masao authored
      This feature is also known as "quorum commit" especially in discussion
      on pgsql-hackers.
      
      This commit adds the following new syntaxes to the synchronous_standby_names
      GUC. By using the FIRST and ANY keywords, users can specify the method to
      choose synchronous standbys from the listed servers.
      
        FIRST num_sync (standby_name [, ...])
        ANY num_sync (standby_name [, ...])
      
      The keyword FIRST specifies a priority-based synchronous replication
      which was available also in 9.6 or before. This method makes transaction
      commits wait until their WAL records are replicated to num_sync
      synchronous standbys chosen based on their priorities.
      
      The keyword ANY specifies a quorum-based synchronous replication
      and makes transaction commits wait until their WAL records are
      replicated to *at least* num_sync listed standbys. In this method,
      the sync_state values shown in pg_stat_replication for the listed standbys
      are reported as "quorum". The priority is still assigned to each standby,
      but not used in this method.
      
      The existing syntaxes having neither the FIRST nor the ANY keyword are still
      supported. They are the same as the new syntax with the FIRST keyword, i.e.,
      priority-based synchronous replication.
      
      Author: Masahiko Sawada
      Reviewed-By: Michael Paquier, Amit Kapila and me
      Discussion: <CAD21AoAACi9NeC_ecm+Vahm+MMA6nYh=Kqs3KB3np+MBOS_gZg@mail.gmail.com>
      
      Many thanks to the various individuals who were involved in
      discussing and developing this feature.
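      The two syntaxes described above look like this in postgresql.conf (standby
      names illustrative):

      ```
      # Priority-based: wait for the 2 highest-priority connected standbys
      # among s1, s2, s3 (equivalent to the older syntax without a keyword).
      synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'

      # Quorum-based: wait until ANY 2 of the listed standbys have replied;
      # their sync_state in pg_stat_replication is reported as "quorum".
      synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
      ```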
  27. Oct 27, 2016
    • Fix possible pg_basebackup failure on standby with "include WAL". · f267c1c2
      Robert Haas authored
      If a restartpoint flushed no dirty buffers, it could fail to update
      the minimum recovery point, leading to a minimum recovery point prior
      to the starting REDO location.  perform_base_backup() would interpret
      that as meaning that no WAL files at all needed to be included in the
      backup, failing an internal sanity check.  To fix, have restartpoints
      always update the minimum recovery point to just after the checkpoint
      record itself, so that the file (or files) containing the checkpoint
      record will always be included in the backup.
      
      Code by Amit Kapila, per a design suggestion by me, with some
      additional work on the code comment by me.  Test case by Michael
      Paquier.  Report by Kyotaro Horiguchi.
  28. Oct 19, 2016
    • Use pg_ctl promote -w in TAP tests · e5a9bcb5
      Peter Eisentraut authored
      Switch TAP tests to use the new wait mode of pg_ctl promote.  This
      allows avoiding extra logic with poll_query_until() to be sure that a
      promoted standby is ready for read-write queries.
      
      From: Michael Paquier <michael.paquier@gmail.com>
    • Fix WAL-logging of FSM and VM truncation. · 917dc7d2
      Heikki Linnakangas authored
      When a relation is truncated, it is important that the FSM is truncated as
      well. Otherwise, after recovery, the FSM can return a page that has been
      truncated away, leading to errors like:
      
      ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
      of 8192 bytes
      
      We were using MarkBufferDirtyHint() to dirty the buffer holding the last
      remaining page of the FSM, but during recovery, that might in fact not
      dirty the page, and the FSM update might be lost.
      
      To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
      requires us to do WAL-logging ourselves, to protect from a torn page, if
      checksumming is enabled.
      
      Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
      when checksumming is enabled.
      
      Analysis by Pavan Deolasee.
      
      Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
  29. Sep 05, 2016
    • Dirty replication slots when using sql interface · d851bef2
      Simon Riggs authored
      When pg_logical_slot_get_changes(...) sets confirmed_flush_lsn to the point at
      which replay stopped, it doesn't dirty the replication slot.  So if the replay
      didn't cause restart_lsn or catalog_xmin to change as well, this change will
      not get written out to disk. Even on a clean shutdown.
      
      If Pg crashes or restarts, a subsequent pg_logical_slot_get_changes(...) call
      will see the same changes already replayed since it uses the slot's
      confirmed_flush_lsn as the start point for fetching changes. The caller can't
      specify a start LSN when using the SQL interface.
      
      Mark the slot as dirty after reading changes using the SQL interface so that
      users won't see repeated changes after a clean shutdown. Repeated changes still
      occur when using the walsender interface or after an unclean shutdown.
      
      Craig Ringer
  30. Sep 03, 2016
  31. Aug 15, 2016
  32. Aug 03, 2016
    • Fix assorted problems in recovery tests · b26f7fa6
      Alvaro Herrera authored
      In test 001_stream_rep we're using pg_stat_replication.write_location to
      determine catch-up status, but we care about xlog having been applied,
      not just received, so change that to apply_location.
      
      In test 003_recovery_targets, we query the database for a recovery
      target specification and later for the xlog position supposedly
      corresponding to that recovery specification.  If for whatever reason
      more WAL is written between the two queries, the recovery specification
      is earlier than the xlog position used by the query in the test harness,
      so we wait forever, leading to test failures.  Deal with this by using a
      single query to extract both items.  In 2a0f89cd we tried to deal
      with it by giving them more tests to run, but in hindsight that was
      obviously doomed to failure (no revert of that, though).
      
      Per hamster buildfarm failures.
      
      Author: Michaël Paquier
  33. Aug 02, 2016
  34. Jun 12, 2016
  35. May 04, 2016