Locating the recovery point just before a dropped table

A common example when talking about why it's a good thing to be able to do PITR (Point In Time Recovery) is the scenario where somebody or some thing (operator or buggy application) dropped a table, and we want to do a recover to right before the table was dropped, to keep as much valid data as possible.

PostgreSQL comes with nice functionality to decide exactly what point to perform a recovery to, which can be specified at millisecond granularity, or that of an individual transaction. But what if we don't know exactly when the table was dropped? (Exactly being at the level of specific transaction or at least millisecond).

On way to handle that is to "step forward" through the log one transaction at a time until the table is gone. This is obviously very time-consuming.

Assuming that DROP TABLE is not something we do very frequently in our system, we can also use the pg_xlogdump tool to help us find the exact spot to perform the recovery to, in much less time. Unfortunately, the dropping of temporary tables (implicit or explicit) is included in this, so if your application uses a lot of temporary tables this approach will not work out of the box. But for applications without them, it can save a lot of time.

Let's show an example. This assumes you have already set up the system for log archiving, you have a base backup that you have restored, and you have a log archive.

The first thing we do is try to determine the point where a DROP TABLE happened. We can do this by scanning for entries where rows have been deleted from the pg_class table, as this will always happen as part of the drop.

In this example I will scan the entire log archive. In a system where the log archive is large, which is probably true of most production systems, this can take some time. In this case it may be a good idea to limit the amount of logs scanned by looking at the timestamps of the archive files that are close to the time of the drop (presumably you know approximately when the table was dropped).

The below simple shellscript will scan for the transaction id's of all dropped tables in the log archives that were dropped from the database with oid being 12379 (in my example this is the postgres database):


START=$(ls -1 $ARCHIVE | grep -v partial | head -1)
END=$(ls -1 $ARCHIVE | grep -v partial | tail -1)

pg_xlogdump -r heap -p $ARCHIVE $START $END \
  |sed -n -r -e "s/.*tx:\s+([0-9]+), lsn.*desc: DELETE.*rel [0-9]+\/$DBOID\/1259 .*/\1/p" 

The expression in the sed command looks for any DELETE from the heap (table) of a relation with oid=1259, which is pg_class, that occurred in the correct database, and just prints this to standard output. If multiple tables have been dropped within the period, it cannot identify which is which, and the recovery process will just have to try them all.

In my example, it returned transaction id 50851.

Armed with this transaction id, I restored my base backup, and then created a recovery.conf file that looked like this:

restore_command='cp /tmp/archive/%f %p'

This tells the system to restore up to but not including transaction 50851. It is very important that we do not include this transaction as this would then drop the table again for us. Once this is done, the recovery ran for less than a second (this is a test system locally on SSD after all), and at the end of the log I got:

2016-07-11 14:17:51 CEST LOG:  starting point-in-time recovery to XID 50851
2016-07-11 14:17:51 CEST LOG:  restored log file "000000010000000000000060" from archive
2016-07-11 14:17:51 CEST LOG:  redo starts at 0/60000028
2016-07-11 14:17:51 CEST LOG:  consistent recovery state reached at 0/600000F8
2016-07-11 14:17:51 CEST LOG:  database system is ready to accept read only connections
2016-07-11 14:17:51 CEST LOG:  restored log file "000000010000000000000061" from archive
2016-07-11 14:17:51 CEST LOG:  restored log file "000000010000000000000062" from archive
2016-07-11 14:17:51 CEST LOG:  recovery stopping before commit of transaction 50851, time 2000-01-01 01:00:00+01
2016-07-11 14:17:51 CEST LOG:  recovery has paused
2016-07-11 14:17:51 CEST HINT:  Execute pg_xlog_replay_resume() to continue.

At this point I could connect to the system and see that my table exists:

postgres=# \dt mytesttable
          List of relations
 Schema |    Name     | Type  | Owner 
 public | mytesttable | table | mha
(1 row)

At this point, attempting to read from the table will hang, even though it exists:

postgres=# table mytesttable;
^CCancel request sent
ERROR:  canceling statement due to user request

This is because the recovery process was paused just before the transaction commits. The system has already replayed the actual DROP TABLE command, meaning that the recovery process is holding an exclusive lock on the table, so we can't access it. The fact that it's blocking reads to the specific table we were trying to restore is an indication that we are looking at the correct transaction. We can view the state of these locks in pg_locks:

postgres=# SELECT locktype, pid, granted FROM pg_locks
postgres-#   WHERE relation=(SELECT oid FROM pg_class WHERE relname='mytesttable');
 locktype | pid  | granted 
 relation | 1795 | t
(1 row)

The pid 1795 is not present in the pg_stat_activity view, but we can see it in the operating system:

$ ps 1795
 1795 ?        Ss     0:00 postgres: startup process   recovering 000000010000000000000062

At this point, we can execute the pg_xlog_replay_resume function to open up the system for access, and once that is done we can read our table:

postgres=# select pg_xlog_replay_resume();

(1 row)

postgres=# select count(*) from mytesttable;
(1 row)

Once the system is open, we can no longer change our mind and go back - the system is now started for normal operation, and has switched to a new timeline.

If we go back to a point right before we opened the system, we can try another approach. If, when the TABLE query hangs, we instead edit the recovery.conf file to say that we want to recover up to and including the specific transaction like this:

restore_command='cp /tmp/archive/%f %p'

and then restart the recovery with pg_ctl restart, we will recover including the drop of the table (recovery_target_inclusive=true is the default, but I include it to be extra clear). At this point, we can connect to the system and see that our table is gone:

postgres=# \d mytesttable
Did not find any relation named "mytesttable".

Of course, that means that we have "gone too far", and need to go back one transaction. Unfortunately, there is no way to do that other than to restart the recovery from the beginning. When you do that, don't forget to rm -rf the data directory you are restoring into (or use a tool such as rsync which will clear out extra files from it for you), or you may run into conflicting files later on.

If the table still exists after this change then it means that it was some other table being dropped, and we have to try again with another transaction id (the pg_xlogdump step should have returned multiple transactions in this case).

Some PostgreSQL documentation references:

Add comment

New comments can no longer be posted on this entry.


I speak at and organize conferences around Open Source in general and PostgreSQL in particular.



PGConf.DEV 2024
May 28-31, 2024
Vancouver, Canada
PGDay Chicago 2024
Apr 26, 2024
Chicago, USA
SCaLE 2024
Mar 14-17, 2024
Pasadena, USA
Nordic PGDay 2024
Mar 12, 2024
Oslo, Norway
Feb 2-4, 2024
Brussels, Belgium
More past conferences