Addressing slow updates and inserts

Our mapping and enrichment activities read large numbers of individual RDF aggregations from Marmotta, modify them, and save them. This results in a high volume of UPDATE and INSERT queries against PostgreSQL. Especially when we run multiple mappings or enrichments at the same time, we see significant lock contention. With shared buffers being dirtied at a high rate, we have also seen frequent flushing of the buffer pool to disk, accompanied by some degree of I/O wait. We have tried to strike a balance between vacuuming and analyzing frequently enough and not causing too much I/O wait while that happens. We are still experimenting with the best settings for locking, checkpoints, and autovacuuming.

Attempts to reduce the impact of locking

Two examples from our logs:

2015-09-10 22:23:55 EDT ERROR:  deadlock detected
2015-09-10 22:23:55 EDT DETAIL:  Process 21581 waits for ShareLock on transaction 151266786; blocked by process 16586.
2015-09-10 22:23:55 EDT HINT:  See server log for query details.
2015-09-10 22:23:55 EDT STATEMENT:  UPDATE triples SET deleted = true, deletedAt = now() WHERE id = $1

And

2015-09-16 13:26:12 EDT ERROR: deadlock detected
2015-09-16 13:26:12 EDT DETAIL: Process 17679 waits for ShareLock on transaction 152476120; blocked by process 18601.
Process 18601 waits for ShareLock on transaction 152476596; blocked by process 19148.
Process 19148 waits for ShareLock on transaction 152476471; blocked by process 17679.
Process 17679: INSERT INTO versions_removed (version_id,triple_id) VALUES ($1,$2)
Process 18601: UPDATE triples SET deleted = true, deletedAt = now() WHERE id = $1
Process 19148: UPDATE triples SET deleted = true, deletedAt = now() WHERE id = $1

Most of our queries that wait for locks do not deadlock, of course. With log_lock_waits turned on, we get fairly frequent messages about queries waiting for locks and then acquiring them. The majority of locks that we see in our logs are ShareLocks; we also see ExclusiveLocks and the occasional AccessExclusiveLock. In a logfile excerpt covering 2015-09-16 08:56:41 to 2015-09-16 16:17:39 (about 7 hours, 20 minutes), we observed 141 ShareLocks, 45 ExclusiveLocks, and one AccessExclusiveLock. One of our mapping activities was running for roughly the first half of that period, and a second mapping activity kicked in after that point. Most of the "still waiting" messages occurred in the 12:00 – 13:00 hour, when the second mapping kicked off and we adjusted some parameters like deadlock_timeout.
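
For reference, the "still waiting" messages shown below come from lock-wait logging along these lines; deadlock_timeout doubles as the threshold after which a lock wait is logged, and 8s is the value we eventually settled on (see further down):

log_lock_waits = on
deadlock_timeout = 8s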

Here is a filtered log excerpt from a period of one hour with two mapping activities running:

2015-09-16 15:02:47 EDT LOG: process 18607 still waiting for ShareLock on transaction 152525044 after 8000.110 ms
2015-09-16 15:03:11 EDT LOG: process 18606 still waiting for ShareLock on transaction 152525657 after 8000.118 ms
2015-09-16 15:03:48 EDT LOG: process 570 still waiting for ShareLock on transaction 152525763 after 8000.124 ms
2015-09-16 15:04:48 EDT LOG: process 17678 still waiting for ShareLock on transaction 152526087 after 8000.109 ms
2015-09-16 15:05:05 EDT LOG: process 18601 still waiting for ShareLock on transaction 152526597 after 8000.110 ms
2015-09-16 15:20:37 EDT LOG: process 18607 still waiting for ShareLock on transaction 152535353 after 8000.123 ms
2015-09-16 15:21:37 EDT LOG: process 17679 still waiting for ExclusiveLock on tuple (1909589,5) of relation 16869 of database 16726 after 8000.125 ms
2015-09-16 15:22:37 EDT LOG: process 570 still waiting for ExclusiveLock on tuple (1909589,5) of relation 16869 of database 16726 after 8000.098 ms
2015-09-16 15:23:18 EDT LOG: process 570 still waiting for ShareLock on transaction 152536090 after 8000.094 ms
2015-09-16 15:23:18 EDT LOG: process 17679 still waiting for ShareLock on transaction 152536090 after 8000.108 ms
2015-09-16 15:24:38 EDT LOG: process 18601 still waiting for ShareLock on transaction 152537188 after 8000.109 ms
2015-09-16 16:04:27 EDT LOG: process 15234 still waiting for ShareLock on transaction 152563196 after 8000.109 ms

It may be interesting to observe that the ExclusiveLock waits tend to come in clusters: several client processes try to get a lock on the same tuple over a span of a few minutes, one minute apart:

2015-09-16 14:11:58 EDT LOG: process 17678 still waiting for ExclusiveLock on tuple (1883389,95) of relation 16869 of database 16726 after 8000.098 ms
2015-09-16 14:12:58 EDT LOG: process 17679 still waiting for ExclusiveLock on tuple (1883389,95) of relation 16869 of database 16726 after 8000.091 ms
2015-09-16 14:13:58 EDT LOG: process 18578 still waiting for ExclusiveLock on tuple (1883389,95) of relation 16869 of database 16726 after 8000.095 ms
[...]
2015-09-16 16:05:27 EDT LOG: process 18601 still waiting for ExclusiveLock on tuple (1932368,25) of relation 16869 of database 16726 after 8000.102 ms
2015-09-16 16:06:27 EDT LOG: process 18603 still waiting for ExclusiveLock on tuple (1932368,25) of relation 16869 of database 16726 after 8000.115 ms
2015-09-16 16:07:27 EDT LOG: process 17678 still waiting for ExclusiveLock on tuple (1932368,25) of relation 16869 of database 16726 after 8000.099 ms

Here is a closer look at a typical set of lines about a ShareLock:

2015-09-16 16:35:22 EDT LOG: process 17679 still waiting for ShareLock on transaction 152578365 after 8000.095 ms
2015-09-16 16:35:22 EDT CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."triples" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"
2015-09-16 16:35:22 EDT STATEMENT: INSERT INTO versions_removed (version_id,triple_id) VALUES ($1,$2)

The query associated with a wait for an ExclusiveLock is typically:

UPDATE triples SET deleted = true, deletedAt = now() WHERE context = $1
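
For context, the "SELECT 1 ... FOR SHARE" statement in the CONTEXT line above is not one of our own queries: it is the row lock that PostgreSQL's referential-integrity trigger takes on the referenced triples row while inserting into versions_removed. Below is a simplified sketch of the relationship we believe is involved; the column lists are assumptions inferred from the queries above, not the actual Marmotta-defined schema.

CREATE TABLE triples (
    id        bigint PRIMARY KEY,
    deleted   boolean NOT NULL DEFAULT false,
    deletedAt timestamp with time zone
    -- other columns (subject, predicate, object, context, ...) elided
);

CREATE TABLE versions_removed (
    version_id bigint NOT NULL,
    -- this foreign key is what triggers the implicit
    -- "SELECT 1 ... FOR SHARE" row lock on triples during INSERT
    triple_id  bigint NOT NULL REFERENCES triples (id)
);

If that reading is right, the FOR SHARE row locks taken by the inserts collide with concurrent UPDATEs of the same triples rows, which would also help explain why several backends end up queueing on the same tuple in the ExclusiveLock waits shown earlier.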

We've gradually increased our deadlock_timeout from the default of one second to eight seconds:

deadlock_timeout = 8s

We are watching to see whether this helps reduce the I/O wait incurred when PostgreSQL checks whether a deadlock has occurred.
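
Beyond what shows up in the log, we can inspect contention live while the mappings are running. Here is a sketch of a pg_locks self-join that shows which backends hold the locks that other backends are waiting for; it only reflects the instant at which it runs, and it may need adjustment for our PostgreSQL version:

SELECT waiting.pid                AS waiting_pid,
       waiting.mode               AS waiting_mode,
       waiting.locktype,
       waiting.transactionid,
       waiting.relation::regclass AS relation,
       holder.pid                 AS holding_pid,
       holder.mode                AS holding_mode
FROM pg_locks waiting
JOIN pg_locks holder
  ON  holder.granted
  AND NOT waiting.granted
  AND holder.pid <> waiting.pid
  AND holder.locktype = waiting.locktype
  AND holder.transactionid IS NOT DISTINCT FROM waiting.transactionid
  AND holder.relation      IS NOT DISTINCT FROM waiting.relation
  AND holder.page          IS NOT DISTINCT FROM waiting.page
  AND holder.tuple         IS NOT DISTINCT FROM waiting.tuple;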

Attempts to improve the efficiency of shared buffer writes

To reduce the frequency of WAL checkpointing, we have increased checkpoint_segments and checkpoint_timeout and adjusted checkpoint_completion_target, eventually arriving at the following values:

checkpoint_segments = 128
checkpoint_timeout = 20min
checkpoint_completion_target = 0.9

This appears to have reduced the amount of I/O wait while a single mapping activity is running. All three values are higher than their defaults, and we are still watching how they work out in practice. We look forward to any feedback that can be provided on these values.
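
For a rough sense of what these values imply, assuming the standard 16 MB WAL segment size:

# checkpoint triggered by WAL volume after 128 x 16 MB = 2 GB of WAL
# worst-case WAL kept on disk is roughly
#   (2 + checkpoint_completion_target) x checkpoint_segments x 16 MB
#   = 2.9 x 128 x 16 MB, or about 5.8 GB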

In addition, we have the following parameters set for PostgreSQL's bgwriter:

bgwriter_delay = 50ms          # decreased gradually from the default of 200ms
bgwriter_lru_maxpages = 500    # increased from default of 100
bgwriter_lru_multiplier = 4.0  # increased from default of 2.0

We increased bgwriter_lru_maxpages in the hope that the background writer will take over some of the work that checkpoints would otherwise have to do; our hardware seems good enough for it, and the default value of 100 dates from about eight years ago. We would appreciate feedback on these settings as well.

I/O Wait Questions

What is the relationship between high levels of locking and I/O wait as reported by the kernel? It seems that a process waiting for a lock should be sleeping, not waiting for I/O. (When we've traced processes that were waiting for locks, they were waiting for a semop() system call to complete.) When we have two mapping activities running concurrently, with many concurrent UPDATEs and INSERTs, we encounter excessive I/O wait, in the 30-40% range. When we halt one of the mappings, I/O wait drops off, sometimes immediately, sometimes after a few minutes. Lock waits are reported in the log both with and without these concurrent mappings, and do not actually seem to correlate directly with I/O wait, which seems to be the real enemy of good performance on our system. Have we been looking in the wrong place in treating the lock problem as the culprit behind bad write performance? Perhaps it is related but not causal.
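
One simple check we can make during a high-I/O-wait period is to count how many backends are actually blocked on locks at that moment, to see whether the two symptoms even coincide. A minimal sketch, assuming a pre-9.6 release where pg_stat_activity has the boolean waiting column:

-- how many backends are blocked on a lock right now, versus not
SELECT waiting, count(*)
FROM pg_stat_activity
GROUP BY waiting;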

We noticed that we had too many buffers being written out by backend processes, instead of through checkpoints or the bgwriter:

-- at about 23:00, mapping activities having been running the whole time
marmotta=# select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+------------------------------
checkpoints_timed     | 43
checkpoints_req       | 4
buffers_checkpoint    | 328309
buffers_clean         | 1809303
maxwritten_clean      | 4382
buffers_backend       | 1474337
buffers_backend_fsync | 0
buffers_alloc         | 2139585003
stats_reset           | 2015-09-16 11:15:54.700449-04

We tried lowering bgwriter_delay to 50ms so that the bgwriter runs often enough to keep up, and increased bgwriter_lru_maxpages to 500. That made an improvement for a while, but after 12 hours we still had 830K buffers by checkpoint vs. 340K buffers by the backends. (A small amount of that time was spent with bgwriter_lru_multiplier already raised from 2.0 to 4.0.) With bgwriter_lru_multiplier now at 4.0 and the statistics reset, we're seeing the following numbers:

marmotta=# select now(), b.* from pg_stat_bgwriter b;
-[ RECORD 1 ]---------+------------------------------
now                   | 2015-09-17 17:20:32.4809-04
checkpoints_timed     | 19
checkpoints_req       | 0
buffers_checkpoint    | 129429
buffers_clean         | 171436
maxwritten_clean      | 9
buffers_backend       | 33902
buffers_backend_fsync | 0
buffers_alloc         | 43884970
stats_reset           | 2015-09-17 11:16:43.987407-04
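
To track how the split between checkpoint, bgwriter, and backend writes evolves without eyeballing raw counters, a quick ratio query over the same view can help (a sketch; the percentages are of all buffers written since the last statistics reset):

SELECT round(100.0 * buffers_checkpoint / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS pct_checkpoint,
       round(100.0 * buffers_clean      / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS pct_bgwriter,
       round(100.0 * buffers_backend    / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0), 1) AS pct_backend,
       stats_reset
FROM pg_stat_bgwriter;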

Is it an issue that we're hitting checkpoint_segments before hitting checkpoint_timeout most of the time?
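
One way we could answer that for ourselves is to turn on checkpoint logging (assuming we do not already have it set), which records whether each checkpoint was started because of WAL volume ("checkpoint starting: xlog") or because of the timeout ("checkpoint starting: time"):

log_checkpoints = on

The checkpoints_timed and checkpoints_req counters in pg_stat_bgwriter above carry the same information cumulatively; segment-triggered checkpoints count as requested ones.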

Another issue that came up in the past, and was solved for a while, was Amazon EBS throttling our block-device I/O. We had to upgrade the volumes housing our tablespaces to ones with faster IOPS ratings. We pose some questions about the correct IOPS rating on our page about Amazon EC2 adjustments; but, as mentioned there, even if upgrading to a faster volume would help in the short term, we believe there is some more fundamental issue in the way processes become backed up when a second mapping or enrichment runs, as described below.

The most drastic performance killer: running any second mapping or enrichment

In a scenario that has become typical, we have a mapping running at less than 1% I/O wait for a number of hours. Then we start a second mapping or enrichment; the number of records in it does not matter (100 records, for example). The I/O wait on the server immediately jumps to 30-50%, and the disk partition that holds the triples table starts receiving at least a 50-fold increase in read requests. The original mapping slows to processing about one record per minute, and we have to halt the second, 100-record activity after an hour of mostly I/O wait because it is grinding the first mapping to a halt. After stopping the second activity, things return to normal within a few minutes: I/O wait drops back below 1% and the first mapping goes back to processing about one record per second.

Here is a look, using sar, at the volume holding the triples table being hammered with read requests: the first mapping is running throughout, and the second one comes in at about 11:30:

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
[...]
10:35:01 AM      xvdi     60.09   1160.18    353.46     25.19      0.10      1.59      0.29      1.77
10:45:01 AM      xvdi     47.12   1014.41    216.13     26.11      0.06      1.19      0.31      1.44
10:55:01 AM      xvdi     64.99   1245.05    372.67     24.89      0.10      1.49      0.29      1.89
11:05:01 AM      xvdi     52.32   1037.49    257.72     24.76      0.06      1.18      0.29      1.53
11:15:01 AM      xvdi     66.56   1306.14    338.74     24.71      0.09      1.42      0.30      2.02
11:25:01 AM      xvdi     52.42   1018.86    282.67     24.83      0.08      1.49      0.28      1.46
11:35:01 AM      xvdi    697.10  12266.85    655.80     18.54      0.55      0.78      0.40     27.71
11:45:01 AM      xvdi   1869.90  47128.66    470.97     25.46      1.10      0.59      0.39     72.71
11:55:01 AM      xvdi   3778.66  89101.81    322.44     23.67      2.56      0.68      0.26     98.56
12:05:01 PM      xvdi   3890.28  90455.00     28.44     23.26      2.98      0.77      0.25     98.57
12:15:01 PM      xvdi   3669.23  78907.41     30.44     21.51      2.07      0.56      0.27     97.61
12:25:01 PM      xvdi   2950.15  96475.01     12.17     32.71      4.11      1.39      0.34     99.14

Note that /dev/xvdi is the volume with the triples table. The table's indices are on another volume, which sees only a fraction of the I/O activity (usually fewer than 100 transfers per second when xvdi's tps is in the multiple thousands).

We've encountered another related problem that is hard to diagnose after the fact; we'll fill this section in if we can reproduce it. Once, after a second mapping activity was halted, I/O wait stayed in the 40% range even though we expected it to drop. Looking at top, we noticed an "idle" database client for the marmotta database that was using 1% CPU, and pg_stat_activity confirmed that the backend was marked "IDLE." Out of curiosity, we stopped it with pg_cancel_backend(), and I/O wait immediately dropped back down to near zero. In the future, we'll trace the process's system calls before killing it.
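
For next time, here is a sketch of the sequence we would follow; the column names assume a pre-9.2 release, where an idle backend shows current_query = '<IDLE>' (newer releases use pid, state, and query instead of procpid and current_query):

-- find backends connected to the marmotta database and what they are doing
SELECT procpid, usename, waiting, query_start, current_query
FROM pg_stat_activity
WHERE datname = 'marmotta';

-- cancel the suspect backend's current query; pg_terminate_backend() would
-- close its connection entirely.  12345 is a placeholder for the pid found above.
SELECT pg_cancel_backend(12345);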