Locating the recovery point just before a dropped table

A common example when talking about why it's a good thing to be able to do PITR (Point In Time Recovery) is the scenario where somebody or some thing (operator or buggy application) dropped a table, and we want to do a recover to right before the table was dropped, to keep as much valid data as possible.

PostgreSQL comes with nice functionality to decide exactly what point to perform a recovery to, which can be specified at millisecond granularity, or that of an individual transaction. But what if we don't know exactly when the table was dropped? (Exactly being at the level of specific transaction or at least millisecond).

On way to handle that is to "step forward" through the log one transaction at a time until the table is gone. This is obviously very time-consuming.

Assuming that DROP TABLE is not something we do very frequently in our system, we can also use the pg_xlogdump tool to help us find the exact spot to perform the recovery to, in much less time. Unfortunately, the dropping of temporary tables (implicit or explicit) is included in this, so if your application uses a lot of temporary tables this approach will not work out of the box. But for applications without them, it can save a lot of time.

Let's show an example. This assumes you have already set up the system for log archiving, you have a base backup that you have restored, and you have a log archive.

The first thing we do is try to determine the point where a DROP TABLE happened. We can do this by scanning for entries where rows have been deleted from the pg_class table, as this will always happen as part of the drop.

In this example I will scan the entire log archive. In a system where the log archive is large, which is probably true of most production systems, this can take some time. In this case it may be a good idea to limit the amount of logs scanned by looking at the timestamps of the archive files that are close to the time of the drop (presumably you know approximately when the table was dropped).

The below simple shellscript will scan for the transaction id's of all dropped tables in the log archives that were dropped from the database with oid being 12379 (in my example this is the postgres database):

DBOID=12379
ARCHIVE=/tmp/archive

START=$(ls -1 $ARCHIVE | grep -v partial | head -1)
END=$(ls -1 $ARCHIVE | grep -v partial | tail -1)

pg_xlogdump -r heap -p $ARCHIVE $START $END \
  |sed -n -r -e "s/.*tx:\s+([0-9]+), lsn.*desc: DELETE.*rel [0-9]+\/$DBOID\/1259 .*/\1/p" 

The expression in the sed command looks for any DELETE from the heap (table) of a relation with oid=1259, which is pg_class, that occurred in the correct database, and just prints this to standard output. If multiple tables have been dropped within the period, it cannot identify which is which, and the recovery process will just have to try them all.

In my example, it returned transaction id 50851.

Armed with this transaction id, I restored my base backup, and then created a recovery.conf file that looked like this:

restore_command='cp /tmp/archive/%f %p'
recovery_target_xid=50851
recovery_target_inclusive=false

This tells the system to restore up to but not including transaction 50851. It is very important that we do not include this transaction as this would then drop the table again for us. Once this is done, the recovery ran for less than a second (this is a test system locally on SSD after all), and at the end of the log I got:

2016-07-11 14:17:51 CEST LOG:  starting point-in-time recovery to XID 50851
2016-07-11 14:17:51 CEST LOG:  restored log file "000000010000000000000060" from archive
2016-07-11 14:17:51 CEST LOG:  redo starts at 0/60000028
2016-07-11 14:17:51 CEST LOG:  consistent recovery state reached at 0/600000F8
2016-07-11 14:17:51 CEST LOG:  database system is ready to accept read only connections
2016-07-11 14:17:51 CEST LOG:  restored log file "000000010000000000000061" from archive
2016-07-11 14:17:51 CEST LOG:  restored log file "000000010000000000000062" from archive
2016-07-11 14:17:51 CEST LOG:  recovery stopping before commit of transaction 50851, time 2000-01-01 01:00:00+01
2016-07-11 14:17:51 CEST LOG:  recovery has paused
2016-07-11 14:17:51 CEST HINT:  Execute pg_xlog_replay_resume() to continue.

At this point I could connect to the system and see that my table exists:

postgres=# \dt mytesttable
          List of relations
 Schema |    Name     | Type  | Owner 
--------+-------------+-------+-------
 public | mytesttable | table | mha
(1 row)

At this point, attempting to read from the table will hang, even though it exists:

postgres=# table mytesttable;
^CCancel request sent
ERROR:  canceling statement due to user request

This is because the recovery process was paused just before the transaction commits. The system has already replayed the actual DROP TABLE command, meaning that the recovery process is holding an exclusive lock on the table, so we can't access it. The fact that it's blocking reads to the specific table we were trying to restore is an indication that we are looking at the correct transaction. We can view the state of these locks in pg_locks:

postgres=# SELECT locktype, pid, granted FROM pg_locks
postgres-#   WHERE relation=(SELECT oid FROM pg_class WHERE relname='mytesttable');
 locktype | pid  | granted 
----------+------+---------
 relation | 1795 | t
(1 row)

The pid 1795 is not present in the pg_stat_activity view, but we can see it in the operating system:

$ ps 1795
  PID TTY      STAT   TIME COMMAND
 1795 ?        Ss     0:00 postgres: startup process   recovering 000000010000000000000062

At this point, we can execute the pg_xlog_replay_resume function to open up the system for access, and once that is done we can read our table:

postgres=# select pg_xlog_replay_resume();
 pg_xlog_replay_resume 
-----------------------

(1 row)

postgres=# select count(*) from mytesttable;
 count 
-------
  1400
(1 row)

Once the system is open, we can no longer change our mind and go back - the system is now started for normal operation, and has switched to a new timeline.

If we go back to a point right before we opened the system, we can try another approach. If, when the TABLE query hangs, we instead edit the recovery.conf file to say that we want to recover up to and including the specific transaction like this:

restore_command='cp /tmp/archive/%f %p'
recovery_target_xid=50851
recovery_target_inclusive=true

and then restart the recovery with pg_ctl restart, we will recover including the drop of the table (recovery_target_inclusive=true is the default, but I include it to be extra clear). At this point, we can connect to the system and see that our table is gone:

postgres=# \d mytesttable
Did not find any relation named "mytesttable".

Of course, that means that we have "gone too far", and need to go back one transaction. Unfortunately, there is no way to do that other than to restart the recovery from the beginning. When you do that, don't forget to rm -rf the data directory you are restoring into (or use a tool such as rsync which will clear out extra files from it for you), or you may run into conflicting files later on.

If the table still exists after this change then it means that it was some other table being dropped, and we have to try again with another transaction id (the pg_xlogdump step should have returned multiple transactions in this case).

Some PostgreSQL documentation references:


Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

SCALE+PGDays
Mar 2-5, 2017
Pasadena, California, USA
Open Source Infrastructure @ SCALE
Mar 2, 2017
Pasadena, California, USA
Confoo Montreal 2017
Mar 8-10, 2017
Montreal, Canada
Nordic PGDay 2017
Mar 21, 2017
Stockholm, Sweden
pgDay.paris 2017
Mar 23, 2017
Paris, France
PGCon 2017
May 23-26, 2017
Ottawa, Canada

Past

FOSDEM + PGDay 2017
Feb 2-4, 2017
Brussels, Belgium
PGConf.Asia 2016
Dec 2-3, 2016
Tokyo, Japan
Berlin PUG
Nov 17, 2016
Berlin, Germany
PGConf.EU 2016
Nov 1-4, 2016
Tallinn, Estonia
Stockholm PUG 2016/5
Oct 25, 2016
Stockholm, Sweden
More past conferences