A record in our system is harvested, mapped and enriched, and is then ready to be checked. It exists at this point in a repository, an RDF triplestore, but is still unavailable to our website, our API, and even our internal quality-assurance application. The next step is indexing – reading the record into a search index for either the API or the QA interface so that it can be queried and displayed.

Currently, an indexing activity is run against an earlier activity, feeding that activity's records into a search index that is read by our QA application. Once the records have passed QA, another indexing activity reads them into our production Elasticsearch search index, where they can be found through our API or frontend website.

This document describes the way we currently index our records, and suggests some alternatives that require changes to our technology stack.

Current method

With the current method, there is one Solr search index for all providers and for all QA purposes. Provider A's records and Provider B's records are both sent into that Solr index for QA every time they are run. Exceptions are possible, but this is how it normally works and how it was designed to work by default.

Although Solr shares with Elasticsearch its use of the Lucene library for low-level search index internals, it is sufficiently different from Elasticsearch to cause inconsistencies and inefficiencies in our QA work. For example, it has been difficult to get facets to work the same way in Solr and Elasticsearch, which causes surprises when data reach production.
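One small illustration of the kind of difference involved is the shape of facet results. The sketch below assumes Solr's default JSON output for field facets (a flat list alternating value and count) and the bucket list returned by an Elasticsearch terms aggregation. The helper names and sample values are invented for illustration and are not taken from Krikri.

```ruby
# Solr's default JSON field-facet output is a flat array that alternates
# value, count, value, count. Convert it to a { value => count } Hash.
def solr_facet_counts(flat)
  flat.each_slice(2).to_h
end

# An Elasticsearch terms aggregation returns an array of buckets, each with
# a 'key' and a 'doc_count'. Convert it to the same { value => count } Hash.
def es_facet_counts(buckets)
  buckets.map { |bucket| [bucket['key'], bucket['doc_count']] }.to_h
end

# Invented sample responses for the same hypothetical facet.
solr_sample = ['image', 120, 'text', 45]
es_sample = [
  { 'key' => 'image', 'doc_count' => 120 },
  { 'key' => 'text', 'doc_count' => 45 }
]

puts solr_facet_counts(solr_sample) == es_facet_counts(es_sample)
```

Even once the shapes are normalized like this, the two engines can still disagree on the counts themselves if fields are analyzed differently, which is the sort of surprise described above.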

Alternate method

Some preliminaries are established:

With those prerequisites satisfied, it will be possible to:

  1. Have the Indexer activity read the provider's profile and either create a new private search index if one doesn't already exist, use the index indicated in the profile, or use the default index.
  2. Index records into a provider-specific search index, or into a combined search index.
  3. Copy records from the QA index into the production index with the Elasticsearch Reindex API, which the elasticsearch-ruby gem that Krikri already uses supports through its reindex extension.
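
The index selection in step 1 could look roughly like the following. This is a minimal sketch of one possible reading, not Krikri's implementation: the `qa_index` profile key, the `qa_combined` default name, and the in-memory `FakeIndexCatalogue` (standing in for the search cluster) are all assumptions made for illustration.

```ruby
require 'set'

# Hypothetical name of the combined index used when a profile names none.
DEFAULT_QA_INDEX = 'qa_combined'

# Hypothetical in-memory stand-in for the cluster's list of indices, so the
# sketch runs without a search server.
class FakeIndexCatalogue
  def initialize(names = [])
    @names = Set.new(names)
  end

  def exists?(name)
    @names.include?(name)
  end

  def create(name)
    @names.add(name)
  end
end

# Return the QA index for a provider, creating it if it is missing.
# `profile` is a Hash; 'qa_index' is an assumed field name.
def resolve_qa_index(profile, catalogue)
  name = profile['qa_index'] || DEFAULT_QA_INDEX
  catalogue.create(name) unless catalogue.exists?(name)
  name
end

catalogue = FakeIndexCatalogue.new([DEFAULT_QA_INDEX])
puts resolve_qa_index({ 'qa_index' => 'qa_provider_a' }, catalogue)
puts resolve_qa_index({}, catalogue)
```

In a real deployment the catalogue calls would be replaced by whatever index-management calls the search client provides.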

This new method allows us to see providers' data either on their own, or mixed in with others, as appropriate for the case at hand. It also allows us to set up test instances of our frontend website, pointed at these solo or combined QA indices, to see how the data will look on the frontend.

Outcomes

Enriched records appear either in their own index, if isolation is necessary, or mixed in with other providers' records.

QA'd records in the QA search index are copied internally within Elasticsearch to the production search index.
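
As a rough sketch of what that internal copy involves, the request body for Elasticsearch's server-side `_reindex` endpoint names a source index and a destination index, with an optional query to limit which documents are copied. The helper name and index names below are hypothetical, and the reindex extension in the Ruby gem may take differently shaped arguments.

```ruby
require 'json'

# Build a request body for Elasticsearch's _reindex endpoint.
# `query` is an optional Elasticsearch query Hash restricting the copy.
def reindex_body(source_index, dest_index, query: nil)
  source = { 'index' => source_index }
  source['query'] = query if query
  { 'source' => source, 'dest' => { 'index' => dest_index } }
end

# Hypothetical index names, for illustration only.
body = reindex_body('qa_provider_a', 'production')
puts JSON.pretty_generate(body)
```

Because the copy runs inside Elasticsearch, the records do not need to be read back out of the repository a second time.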

We can view the same search results and facets in the QA index as we would in the API or frontend site, because we're using the same search technology for both.

We only have to learn and keep current with one search engine product, instead of two.