Index performance with high context counts
Background
We use Marmotta's KiWi triplestore with a PostgreSQL database. Marmotta stores RDF triples in the triples table, and each triple is joined to a context, where triples.context equals nodes.id in the nodes table.
At DPLA, our triples span many contexts, and the distribution is heavily skewed: a few contexts account for a majority of the rows in the triples table, while many contexts appear only once or a handful of times.
Every relational database with indexes has to decide when it is more efficient to use an index and when it is best to read rows directly from the table. The latter is known as a full table scan, or sequential scan. If the database server thinks that a column has low uniqueness (low cardinality), it often decides that the index will provide no benefit in finding the selected rows: reading a large number of index entries and then looking up the matching table row for each one can take more time than simply scanning the whole table. A table like our triples table can mislead the database into thinking the index is not useful when, in reality, our particular query would benefit from it. If the query selects rows whose context is not one of those very common values, the index is useful after all.
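The planner's view of a column's distribution can be inspected in the pg_stats view; a query along these lines shows, for triples.context, the estimated number of distinct values and the values PostgreSQL considers most common (the exact output depends on the installation):

-- What the planner believes about triples.context: estimated distinct count
-- and the most common values with their frequencies.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'triples' AND attname = 'context';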
PostgreSQL provides tunable parameters that control how it analyzes particular tables with unusual characteristics. It also allows a WHERE clause in the creation of an index that governs which rows of the table will be indexed, producing what is known as a "partial index."
When Marmotta is first run, it creates the triples table with a partial index on triples.context, qualified with "WHERE deleted = false".
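In DDL terms, a partial index of that sort takes this form (a sketch of the syntax, not the exact statement Marmotta runs):

-- Only rows where deleted = false are entered into the index, so only queries
-- whose conditions imply deleted = false can use it.
CREATE INDEX ON triples (context) WHERE deleted = false;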
DPLA usage pattern and problem
In the course of one of our harvests, there are very frequent queries that set deleted to true for all triples belonging to a certain context:
UPDATE triples SET deleted = true, deletedAt = now() WHERE context = '<the context id>';
When we looked at the execution plan of these queries, we found that they were doing full table scans. Keep in mind that an UPDATE first has to find the rows that need updating.
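The plan for one of these statements can be checked with EXPLAIN (this just shows how we looked; the context id is a placeholder):

-- Ask the planner how it will execute the UPDATE. Before the changes described
-- below, the plan showed a sequential scan on triples rather than an index scan.
EXPLAIN
UPDATE triples SET deleted = true, deletedAt = now()
WHERE context = '<the context id>';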
Looking at the indexes on the table, we noted a multicolumn index with context as its first column, which would normally allow the index to be used for queries keyed on context like the one above:
"idx_triples_cspo" btree (context, subject, predicate, object) WHERE deleted = false
The only issue is that idx_triples_cspo is a partial index, covering only rows where deleted = false. The planner can use a partial index only when a query's conditions imply the index's condition, and the UPDATE above does not specify "deleted = false", so the index cannot be used.
As the triples table grew to many millions of rows, these full table scans started to make each UPDATE query take four seconds or more to complete.
It is likely that this issue has not been obvious in other installations, because nobody has attempted to load the number of LDP records that we have here at the DPLA.
Solutions
We created a new full index on triples.context (without the "WHERE" clause). This allows the UPDATE statement above, which does not restrict on deleted, to use the index.
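The statement was along these lines (the index name here is our own choice; any name, or none, works):

-- Full (non-partial) index on context, usable whether or not the query
-- restricts on the deleted flag.
CREATE INDEX idx_triples_context ON triples (context);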
We also increased the planner's cost estimate for a sequential table scan in the tablespace that contains the triples table, which nudges the planner toward using indexes:
alter tablespace marmotta_1 set (seq_page_cost = 2);
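One quick way to confirm the setting is in place (assuming the tablespace name above):

-- Per-tablespace options such as seq_page_cost are stored in pg_tablespace.
SELECT spcname, spcoptions FROM pg_tablespace WHERE spcname = 'marmotta_1';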
Finally, we altered the column definition for triples.context to change the way PostgreSQL analyzes it. We raised the column's statistics target, so ANALYZE records a larger number of the column's most common values, which leads to a more accurate estimate of the value distribution being available to the query planner. This requires more storage for the table's statistics and makes ANALYZE take longer, which is why PostgreSQL's default is lower; you have to ask it to do extra work for you if you have a table with unusual characteristics.
alter table triples alter column context set statistics 5000;
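The larger target only takes effect the next time the table's statistics are gathered, whether by autovacuum or by hand, for example:

-- Recollect planner statistics for triples using the new, larger target.
ANALYZE triples;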
Following these changes, queries that used to take four seconds or more to complete (sometimes up in the 20-second range) began taking milliseconds.