Post-Enrichment Outcomes (Indexing)

A record in our system is harvested, mapped and enriched, and is then ready to be checked. It exists at this point in a repository, an RDF triplestore, but is still unavailable to our website, our API, and even our internal quality-assurance application. The next step is indexing – reading the record into a search index for either the API or the QA interface so that it can be queried and displayed.

At present, an indexing activity is run against an earlier activity, feeding the records from that activity into a search index that our QA application reads. Once the records have passed QA, another indexing activity reads them into our production Elasticsearch search index, where they can be found through our API or frontend website.

This document describes the way we currently index our records, and suggests some alternatives that require changes to our technology stack.

Current method

  • An Activity is run, which itself runs an Indexer to insert a representation of the record into a QASearchIndex. This QASearchIndex is always a Solr search index, at present.
  • People use the QA web application to check the data, issuing queries, generating reports, and so on. Solr is behind this, powering the searches and producing facets for reporting and navigation of the data.
  • When all's well, another Activity is run, which starts up an Indexer to read the record into a ProdSearchIndex. This is always an Elasticsearch search index, at present. At this point, the representation of the record is in the final search index that's behind our main website and our API. The record is live and searchable according to the functionality and configuration of our Elasticsearch installation.

With the current method, there is one Solr search index for all providers and for all QA purposes. Provider A's records and Provider B's records are both sent into the same Solr index for QA every time they are run. Exceptions are possible, but this is the default design and the way it normally works.

Although Solr and Elasticsearch both use the Lucene library for low-level search index internals, Solr differs enough from Elasticsearch to cause inconsistencies and inefficiencies in our QA work. For example, it has been difficult to get facets to work the same way in Solr and Elasticsearch, which causes surprises when data reach production.

Alternate method

The alternate method depends on the following prerequisites:

  • Elasticsearch indices are used for both QA and production purposes, and Solr is abandoned.
  • Krikri's QA interface is either modified to use a new version or fork of Blacklight that can talk with Elasticsearch, or we remove its dependency on Blacklight and write our own search and reporting pages.
  • Index-management functionality is added to Krikri to allow for creating new Elasticsearch indices, as needed. (Something that can't be done as conveniently with Solr.)
  • Data structures are maintained and persisted that associate data providers with particular QA indices. A data provider can have its own private QA index, or can be pointed at a shared index, if that makes more sense for the case at hand.
  • Elasticsearch is upgraded to version 2.3. We are currently running on 0.90, which lacks some functionality needed below.
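
The association between data providers and QA indices could be as small as a lookup with a shared default. The following is a minimal Ruby sketch, not Krikri code: the QAIndexRegistry class, the "qa_" index-name prefix, the "qa_shared" default and the provider names are all hypothetical, and the sketch keeps its state in memory, whereas the real data structure would need to be persisted.

```ruby
# Hypothetical registry that associates data providers with QA indices.
# Nothing here is part of Krikri; every name is illustrative only.
class QAIndexRegistry
  # Assumed name for the shared index used when a provider has no entry.
  DEFAULT_INDEX = 'qa_shared'.freeze

  def initialize
    @assignments = {}
  end

  # Give a provider its own private QA index and return the index name.
  def assign_private(provider_id)
    @assignments[provider_id] = "qa_#{provider_id}"
  end

  # Point a provider at a named shared index and return the index name.
  def assign_shared(provider_id, index_name)
    @assignments[provider_id] = index_name
  end

  # Look up the QA index for a provider, falling back to the default.
  def index_for(provider_id)
    @assignments.fetch(provider_id, DEFAULT_INDEX)
  end
end

registry = QAIndexRegistry.new
registry.assign_private('provider_a')
registry.assign_shared('provider_b', 'qa_regional')
```

Here registry.index_for('provider_a') gives the provider's private index name, while a provider with no entry resolves to the shared default.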

With those prerequisites satisfied, it will be possible to:

  1. Have the Indexer activity read the provider's profile and then either create the provider's private search index if it doesn't already exist, use the index the profile indicates, or fall back to the default index.
  2. Index records into a provider-specific search index, or into a combined search index.
  3. Copy records from the QA index into the production index with the Elasticsearch Reindex API, which the elasticsearch-ruby gem that Krikri already uses supports through its reindex extension.
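
The three steps above can be sketched in Ruby as follows. This is an illustration under stated assumptions rather than Krikri's implementation: FakeClient and FakeIndices are in-memory stand-ins so that the example runs without a cluster, and the helper names (ensure_index, index_records, promote) and index names are hypothetical. The keyword arguments are modeled loosely on the elasticsearch-ruby client and should be checked against the gem version in use; a real Elasticsearch 2.x indexing call would also need a document type, for example.

```ruby
# In-memory stand-in for the index-management part of a search client.
class FakeIndices
  def initialize(store)
    @store = store
  end

  def exists(index:)
    @store.key?(index)
  end

  def create(index:, body: {})
    @store[index] = {}
  end
end

# In-memory stand-in for the small part of a client this sketch uses.
class FakeClient
  attr_reader :store

  def initialize
    @store = {}
  end

  def indices
    FakeIndices.new(@store)
  end

  def index(index:, id:, body:)
    @store.fetch(index)[id] = body
  end

  # Mimics a server-side reindex: copy every document from source to dest.
  # The return value is simplified and is not the real response format.
  def reindex(body:)
    source = @store.fetch(body[:source][:index])
    @store.fetch(body[:dest][:index]).merge!(source)
    { 'total' => source.size }
  end
end

# Step 1: create the index only if it does not already exist.
def ensure_index(client, name)
  client.indices.create(index: name) unless client.indices.exists(index: name)
  name
end

# Step 2: index records (a hash of id => document) into the chosen index.
def index_records(client, index_name, records)
  records.each { |id, doc| client.index(index: index_name, id: id, body: doc) }
end

# Step 3: copy everything from the QA index into the production index.
def promote(client, qa_index, prod_index)
  client.reindex(body: { source: { index: qa_index },
                         dest: { index: prod_index } })
end

client = FakeClient.new
ensure_index(client, 'prod')
qa = ensure_index(client, 'qa_provider_a')
index_records(client, qa, { 'rec1' => { 'title' => 'A map' } })
promote(client, qa, 'prod')
```

Checking for the index before creating it keeps step 1 safe to repeat each time a provider's records are run, and the records stay out of the production index until promote is called.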

This new method allows us to see providers' data either on their own, or mixed in with others, as appropriate for the case at hand. It also allows us to set up test instances of our frontend website, pointed at these solo or combined QA indices, to see how the data will look on the frontend.

Outcomes

Enriched records appear either in their own index, if isolation is necessary, or mixed in with other providers' records.

QA'd records in the QA search index are copied internally within Elasticsearch to the production search index.

We can view the same search results and facets in the QA index as we would in the API or frontend site, because we're using the same search technology for both.

We only have to learn and keep current with one search engine product, instead of two.