From 7b9dc71405931f698d953378676d7c51ab2d9591 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Wed, 24 Jan 2001 23:15:19 +0000
Subject: [PATCH] WAL documentation, from Oliver Elphick and Vadim Mikheev.

---
 doc/src/sgml/admin.sgml    |   3 +-
 doc/src/sgml/filelist.sgml |   3 +-
 doc/src/sgml/runtime.sgml  |  53 +++++-
 doc/src/sgml/wal.sgml      | 321 +++++++++++++++++++++++++++++++++++++
 4 files changed, 377 insertions(+), 3 deletions(-)
 create mode 100644 doc/src/sgml/wal.sgml

diff --git a/doc/src/sgml/admin.sgml b/doc/src/sgml/admin.sgml
index 3379eba1b93..fc8fe193236 100644
--- a/doc/src/sgml/admin.sgml
+++ b/doc/src/sgml/admin.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.30 2001/01/24 19:42:46 momjian Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.31 2001/01/24 23:15:19 petere Exp $
 -->
 
 <book id="admin">
@@ -58,6 +58,7 @@ $Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.30 2001/01/24 19:42:46
  &manage-ag;
  &user-manag;
  &backup;
+ &wal;
  &recovery;
  &regress;
  &release;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 5d784d7dcc4..21174ae420c 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -1,4 +1,4 @@
-<!-- $Header: /cvsroot/pgsql/doc/src/sgml/filelist.sgml,v 1.5 2001/01/22 23:34:32 petere Exp $ -->
+<!-- $Header: /cvsroot/pgsql/doc/src/sgml/filelist.sgml,v 1.6 2001/01/24 23:15:19 petere Exp $ -->
 
 <!entity about      SYSTEM "about.sgml">
 <!entity history    SYSTEM "history.sgml">
@@ -54,6 +54,7 @@
 <!entity release    SYSTEM "release.sgml">
 <!entity runtime    SYSTEM "runtime.sgml">
 <!entity user-manag SYSTEM "user-manag.sgml">
+<!entity wal        SYSTEM "wal.sgml">
 
 <!-- programmer's guide -->
 <!entity arch-pg    SYSTEM "arch-pg.sgml">
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 17dcc10f84d..3f68431c2cd 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.47 2001/01/24 15:19:36 momjian Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.48 2001/01/24 23:15:19 petere Exp $
 -->
 
 <Chapter Id="runtime">
@@ -1159,6 +1159,57 @@ env PGOPTIONS='-c geqo=off' psql
      </para>
     </sect2>
 
+    <sect2 id="runtime-config-wal">
+     <title>WAL</title>
+
+     <para>
+      See also <xref linkend="wal-configuration"> for details on WAL
+      tuning.
+
+      <variablelist>
+       <varlistentry>
+        <term>CHECKPOINT_TIMEOUT (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Frequency of automatic WAL checkpoints, in seconds.
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term>WAL_BUFFERS (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Number of buffers for WAL. This option can only be set at
+          server start.
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term>WAL_DEBUG (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          If non-zero, turn on WAL-related debugging output on standard
+          error.
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term>WAL_FILES (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Number of log files that are created in advance at checkpoint
+          time. This option can only be set at server start.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
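+
+     <para>
+      As an illustration only, these options might be set in the
+      <filename>postgresql.conf</filename> configuration file described
+      earlier in this chapter. The values shown are examples, not
+      recommendations; see <xref linkend="wal-configuration"> for
+      tuning advice.
+
+<programlisting>
+CHECKPOINT_TIMEOUT = 300   # seconds between automatic WAL checkpoints
+WAL_BUFFERS = 8            # WAL buffers in shared memory; set at server start
+WAL_FILES = 8              # log files created in advance; set at server start
+WAL_DEBUG = 0              # non-zero writes WAL debugging output to stderr
+</programlisting>
+     </para>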
+    </sect2>
+
+
     <sect2 id="runtime-config-short">
      <title>Short options</title>
      <para>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
new file mode 100644
index 00000000000..06198bd6e1c
--- /dev/null
+++ b/doc/src/sgml/wal.sgml
@@ -0,0 +1,321 @@
+<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.1 2001/01/24 23:15:19 petere Exp $ -->
+
+<chapter id="wal">
+ <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
+
+ <note>
+  <title>Author</title>
+  <para>
+   Vadim Mikheev and Oliver Elphick
+  </para>
+ </note>
+
+ <sect1 id="wal-general">
+  <title>General Description</title>
+
+  <para>
+   <firstterm>Write Ahead Logging</firstterm> (<acronym>WAL</acronym>)
+   is a standard approach to transaction logging. Its detailed
+   description may be found in most (if not all) books about
+   transaction processing. Briefly, <acronym>WAL</acronym>'s central
+   concept is that changes to data files (where tables and indices
+   reside) must be written only after those changes have been logged,
+   that is, after the log records have been flushed to permanent
+   storage. When we follow this procedure, we do not need to flush
+   data pages to disk on every transaction commit, because we know
+   that in the event of a crash we will be able to recover the
+   database using the log: any changes that have not been applied to
+   the data pages will first be redone from the log records (this is
+   roll-forward recovery, also known as REDO), and then changes made
+   by uncommitted transactions will be removed from the data pages
+   (roll-backward recovery, also known as UNDO).
+  </para>
+
+  <sect2 id="wal-benefits-now">
+   <title>Immediate Benefits of <acronym>WAL</acronym></title>
+
+   <para>
+    The first obvious benefit of using <acronym>WAL</acronym> is a
+    significantly reduced number of disk writes, since only the log
+    file needs to be flushed to disk at the time of transaction
+    commit; in multi-user environments, commits of many transactions
+    may be accomplished with a single <function>fsync()</function> of
+    the log file. Furthermore, the log file is written sequentially,
+    so the cost of syncing the log is much less than the cost of
+    flushing the data pages.
+   </para>
+
+   <para>
+    The next benefit is consistency of the data pages. In truth,
+    before <acronym>WAL</acronym>,
+    <productname>PostgreSQL</productname> was never able to guarantee
+    consistency in the case of a crash. Before
+    <acronym>WAL</acronym>, any crash during writing could result in:
+
+    <orderedlist>
+     <listitem>
+      <simpara>index tuples pointing to non-existent table rows</simpara>
+     </listitem>
+
+     <listitem>
+      <simpara>index tuples lost in split operations</simpara>
+     </listitem>
+
+     <listitem>
+      <simpara>totally corrupted table or index page content, because
+      of partially written data pages</simpara>
+     </listitem>
+    </orderedlist>
+
+    The index problems (items 1 and 2) could possibly have been fixed
+    by additional <function>fsync()</function> calls, but it is not
+    obvious how to handle the last case without
+    <acronym>WAL</acronym>; <acronym>WAL</acronym> saves the entire
+    data page content in the log if that is required to ensure page
+    consistency for after-crash recovery.
+   </para>
+  </sect2>
+
+  <sect2 id="wal-benefits-later">
+   <title>Future Benefits</title>
+
+   <para>
+    In this first release of <acronym>WAL</acronym>, the UNDO
+    operation is not implemented, due to lack of time.
+    This means that changes made by aborted transactions will still
+    occupy disk space, and that we still need a permanent
+    <filename>pg_log</filename> file to hold the status of
+    transactions, since we are not able to re-use transaction
+    identifiers. Once UNDO is implemented,
+    <filename>pg_log</filename> will no longer be required to be
+    permanent; it will be possible to remove
+    <filename>pg_log</filename> at shutdown, split it into segments,
+    and remove old segments.
+   </para>
+
+   <para>
+    With UNDO, it will also be possible to implement
+    <firstterm>savepoints</firstterm>, allowing partial rollback of
+    invalid transaction operations (parser errors caused by mistyped
+    commands, insertion of duplicate primary/unique keys, and so on)
+    with the ability to continue or commit valid operations made by
+    the transaction before the error. At present, any error
+    invalidates the whole transaction and requires a transaction
+    abort.
+   </para>
+
+   <para>
+    <acronym>WAL</acronym> offers the opportunity for a new method of
+    database on-line backup and restore (<acronym>BAR</acronym>). To
+    use this method, one would have to make periodic saves of data
+    files to another disk, a tape, or another host, and also archive
+    the <acronym>WAL</acronym> log files. The database file copy and
+    the archived log files could be used to restore just as if one
+    were restoring after a crash. Each time a new database file copy
+    was made, the old log files could be removed. Implementing this
+    facility will require the logging of data file and index creation
+    and deletion; it will also require development of a method for
+    copying the data files (operating system copy commands are not
+    suitable).
+   </para>
+  </sect2>
+ </sect1>
+
+ <sect1 id="wal-implementation">
+  <title>Implementation</title>
+
+  <para>
+   <acronym>WAL</acronym> is automatically enabled from release 7.1
+   onwards. No action is required from the administrator except
+   ensuring that the additional disk-space requirements of the
+   <acronym>WAL</acronym> logs are met, and that any necessary tuning
+   is done (see <xref linkend="wal-configuration">).
+  </para>
+
+  <para>
+   <acronym>WAL</acronym> logs are stored in the directory
+   <filename><replaceable>$PGDATA</replaceable>/pg_xlog</filename>, as
+   a set of segment files, each 16 MB in size. Each segment is
+   divided into 8 kB pages. The log record headers are described in
+   <filename>access/xlog.h</filename>; record content depends on the
+   type of event that is being logged. Segment files are given
+   sequential numbers as names, starting at
+   <filename>0000000000000000</filename>. The numbers do not wrap, at
+   present, but it should take a very long time to exhaust the
+   available stock of numbers.
+  </para>
+
+  <para>
+   The <acronym>WAL</acronym> buffers and control structure are in
+   shared memory and are handled by the backends; they are protected
+   by spinlocks. The demand on shared memory is dependent on the
+   number of buffers; the default size of the <acronym>WAL</acronym>
+   buffers is 64 kB (eight 8 kB buffers).
+  </para>
+
+  <para>
+   It is advantageous if the log is located on a different disk from
+   the main database files. This may be achieved by moving the
+   <filename>pg_xlog</filename> directory to another location (while
+   the postmaster is shut down, of course) and creating a symbolic
+   link from the original location in
+   <replaceable>$PGDATA</replaceable> to the new location.
+  </para>
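+
+  <para>
+   For example (the target directory name is purely illustrative, and
+   the commands assume a Unix shell with
+   <envar>PGDATA</envar> set), the relocation might be done like
+   this:
+
+<programlisting>
+$ pg_ctl stop                             # postmaster must not be running
+$ mv $PGDATA/pg_xlog /mnt/fastdisk/pg_xlog
+$ ln -s /mnt/fastdisk/pg_xlog $PGDATA/pg_xlog
+$ pg_ctl start
+</programlisting>
+  </para>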
+  <para>
+   The aim of <acronym>WAL</acronym>, to ensure that the log is
+   written before database records are altered, may be subverted by
+   disk drives that falsely report a successful write to the kernel
+   when, in fact, they have only cached the data and not yet stored
+   it on the disk. A power failure in such a situation may still lead
+   to irrecoverable data corruption; administrators should try to
+   ensure that disks holding <productname>PostgreSQL</productname>'s
+   data and log files do not make such false reports.
+  </para>
+
+  <sect2 id="wal-recovery">
+   <title>Database Recovery with <acronym>WAL</acronym></title>
+
+   <para>
+    After a checkpoint has been made and the log flushed, the
+    checkpoint's position is saved in the file
+    <filename>pg_control</filename>. Therefore, when recovery is to
+    be done, the backend first reads <filename>pg_control</filename>
+    and then the checkpoint record; next it reads the redo record,
+    whose position is saved in the checkpoint, and begins the REDO
+    operation. Because the entire content of each page is saved in
+    the log on the first modification of that page after a
+    checkpoint, the pages will first be restored to a consistent
+    state.
+   </para>
+
+   <para>
+    Using <filename>pg_control</filename> to get the checkpoint
+    position speeds up the recovery process, but to handle possible
+    corruption of <filename>pg_control</filename>, we should really
+    implement reading the existing log segments in reverse order,
+    newest to oldest, in order to find the last checkpoint. This has
+    not yet been done in release 7.1.
+   </para>
+  </sect2>
+ </sect1>
+
+ <sect1 id="wal-configuration">
+  <title><acronym>WAL</acronym> Configuration</title>
+
+  <para>
+   There are several <acronym>WAL</acronym>-related parameters that
+   affect database performance. This section explains their use.
+   Consult <xref linkend="runtime-config"> for details about setting
+   configuration parameters.
+  </para>
+
+  <para>
+   There are two commonly used <acronym>WAL</acronym> functions:
+   <function>LogInsert</function> and <function>LogFlush</function>.
+   <function>LogInsert</function> is used to place a new record into
+   the <acronym>WAL</acronym> buffers in shared memory. If there is
+   no space for the new record, <function>LogInsert</function> will
+   have to write (move to kernel cache) a few filled
+   <acronym>WAL</acronym> buffers. This is undesirable because
+   <function>LogInsert</function> is used on every low-level database
+   modification (for example, tuple insertion) at a time when an
+   exclusive lock is held on the affected data pages, and the
+   operation is supposed to be as fast as possible; what is worse,
+   writing <acronym>WAL</acronym> buffers may also cause the creation
+   of a new log segment, which takes even more time. Normally,
+   <acronym>WAL</acronym> buffers should be written and flushed by a
+   <function>LogFlush</function> request, which is made, for the most
+   part, at transaction commit time to ensure that transaction
+   records are flushed to permanent storage. On systems with high log
+   output, <function>LogFlush</function> requests may not occur often
+   enough to prevent <acronym>WAL</acronym> buffers from being
+   written by <function>LogInsert</function>. On such systems one
+   should increase the number of <acronym>WAL</acronym> buffers by
+   modifying the <varname>WAL_BUFFERS</varname> parameter. The
+   default number of <acronym>WAL</acronym> buffers is 8. Increasing
+   this value will correspondingly increase shared memory usage.
+  </para>
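+
+  <para>
+   As a purely illustrative sketch (the value 16 is arbitrary, and
+   this assumes the command-line <option>-c</option> mechanism for
+   setting run-time options described in
+   <xref linkend="runtime-config">), the number of buffers could be
+   raised when the postmaster is started:
+
+<programlisting>
+$ postmaster -c wal_buffers=16 -D /usr/local/pgsql/data
+</programlisting>
+
+   Alternatively, <varname>WAL_BUFFERS</varname> can be set in the
+   configuration file before the server is started.
+  </para>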
+  <para>
+   <firstterm>Checkpoints</firstterm> are points in the sequence of
+   transactions at which it is guaranteed that the data files have
+   been updated with all information logged before the checkpoint. At
+   checkpoint time, all dirty data pages are flushed to disk and a
+   special checkpoint record is written to the log file. As a result,
+   in the event of a crash, the recoverer knows from which record in
+   the log (known as the redo record) it should start the REDO
+   operation, since any changes made to data files before that record
+   are already on disk. After a checkpoint has been made, any log
+   segments written before the redo record are removed, so
+   checkpoints are used to free disk space in the
+   <acronym>WAL</acronym> directory. (When
+   <acronym>WAL</acronym>-based <acronym>BAR</acronym> is
+   implemented, the log segments can be archived instead of just
+   being removed.) The checkpoint maker is also able to create a few
+   log segments for future use, so as to avoid the need for
+   <function>LogInsert</function> or <function>LogFlush</function> to
+   spend time creating them.
+  </para>
+
+  <para>
+   The <acronym>WAL</acronym> log is held on the disk as a set of 16
+   MB files called <firstterm>segments</firstterm>. By default, a new
+   segment is created only if more than 75% of the current segment
+   has been used. One can instruct the server to create up to 64 log
+   segments at checkpoint time by modifying the
+   <varname>WAL_FILES</varname> configuration parameter.
+  </para>
+
+  <para>
+   For faster after-crash recovery, it would be better to create
+   checkpoints more often. However, one should balance this against
+   the cost of flushing dirty data pages; in addition, to ensure data
+   page consistency, the first modification of a data page after each
+   checkpoint results in logging the entire page content, thus
+   increasing both the volume of log output and the size of the log.
+  </para>
+
+  <para>
+   By default, the postmaster spawns a special backend process to
+   create the next checkpoint 300 seconds after the previous
+   checkpoint's creation. One can change this interval by modifying
+   the <varname>CHECKPOINT_TIMEOUT</varname> parameter. It is also
+   possible to force a checkpoint with the SQL command
+   <command>CHECKPOINT</command>, as shown in the example at the end
+   of this section.
+  </para>
+
+  <para>
+   Setting the <varname>WAL_DEBUG</varname> parameter to any non-zero
+   value will result in each <function>LogInsert</function> and
+   <function>LogFlush</function> <acronym>WAL</acronym> call being
+   logged to standard error. At present, it makes no difference what
+   the non-zero value is. This option may be replaced by a more
+   general mechanism in the future.
+  </para>
+
+  <para>
+   The <varname>COMMIT_DELAY</varname> parameter defines how long the
+   backend will be forced to sleep after writing a commit record to
+   the log with a <function>LogInsert</function> call but before
+   performing a <function>LogFlush</function>. This delay allows
+   other backends to add their commit records to the log, so that all
+   of them can be flushed with a single log sync. Unfortunately, this
+   mechanism is not fully implemented as of release 7.1, so there is
+   at present no point in changing this parameter from its default
+   value of 5 microseconds.
+  </para>
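+
+  <para>
+   As a simple illustration (the database name
+   <literal>mydb</literal> is only a placeholder), a checkpoint can
+   be forced from <application>psql</application>:
+
+<programlisting>
+$ psql -c "CHECKPOINT" mydb
+</programlisting>
+  </para>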
+ </sect1>
+</chapter>
+
+<!-- Keep this comment at the end of the file
+Local variables:
+mode:sgml
+sgml-omittag:nil
+sgml-shorttag:t
+sgml-minimize-attributes:nil
+sgml-always-quote-attributes:t
+sgml-indent-step:1
+sgml-indent-data:t
+sgml-parent-document:nil
+sgml-default-dtd-file:"./reference.ced"
+sgml-exposed-tags:nil
+sgml-local-catalogs:("/usr/lib/sgml/catalog")
+sgml-local-ecat-files:nil
+End:
+-->
-- 
GitLab