This page documents where exceptions are raised in Krikri and recommends how they might be changed to better represent the kinds of errors the application encounters. The intention is to represent activities' statuses to users more accurately and to improve the diagnostics available to administrators and developers.

This information has been collected to fulfill Krikri tickets #8426 ("Define Error Classes") and #8427 ("Develop Error Message Requirements ..."), and is intended to be an aid for follow-up implementation work.

Documented Behavior

The documentation comment at the head of the file for Krikri::SoftwareAgent says:

Software Agents should handle internal errors that do not result in full activity failure, and raise a RuntimeError when the job fails. Activity handles logging of activity start/stop, and failure status.

It appears that the statement about RuntimeError was based on what had been observed at the time, rather than an attempt to prescribe a particular behavior. Considering the cases documented in the Notes, and the "Recommendations for Exceptions" below, we may want to acknowledge that our SoftwareAgents raise a range of exceptions.

Recommendations

Recommendations for Exceptions

Standardize exceptions that should be RuntimeError but are not. Alternatively, we could update the documented behavior above to specify the wider range of exceptions that one should expect. See the Notes below on the following cases; some of them carry additional recommendations as well.

We should catch the exceptions thrown by our harvester record enumerators. Where they don't already do so by default, some of these enumerators could retry HTTP requests a limited number of times after a delay. (We need to audit our use of Faraday's retry, delay, and backoff settings.) This should probably be done in the harvesters' #enumerate_records methods; a sketch follows.
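
A minimal sketch of such a retry configuration, using Faraday's built-in retry middleware. The URL and option values here are illustrative, not Krikri's actual settings:

    require 'faraday'

    # Build an HTTP client that retries failed requests a limited number
    # of times, waiting between attempts with exponential backoff.
    connection = Faraday.new(url: 'http://example.org/oai') do |conn|
      conn.request :retry,
                   max: 3,            # give up after three retries
                   interval: 1,       # wait one second before the first retry
                   backoff_factor: 2  # then 2s, then 4s
      conn.adapter Faraday.default_adapter
    end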

Catching exceptions from the ProvenanceQueryClient enumerator, which underlies mappings, enrichments, and indexings, is a little more involved. Because the enumerator enumerates the result of a single SPARQL query, we could try to recover from an exception raised by the enumerator some number of times, and then reraise the exception if it is still being raised. Once the query has succeeded, each_solution and map will likely loop through to the end of the result: SPARQL::Client uses Net::HTTP::Persistent, and Net::HTTP buffers responses by default rather than streaming them, so we are probably safe to assume that we won't encounter errors partway through the enumeration.
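
A hypothetical sketch of that recover-then-reraise pattern; the method name and retry limit are illustrative, not Krikri's actual API:

    # Enumerate the solutions of a SPARQL query, re-running the query a
    # limited number of times if it raises, then reraising the exception
    # so the Activity can record the failure.
    MAX_QUERY_ATTEMPTS = 3

    def each_solution_with_retry(query)
      attempts = 0
      begin
        attempts += 1
        query.each_solution { |solution| yield solution }
      rescue StandardError
        retry if attempts < MAX_QUERY_ATTEMPTS
        raise
      end
    end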

When an Activity catches various kinds of exceptions, it should save an indication of its "stopped" state (e.g. in a new field, state) and the reason why it stopped. I suggest that the following classes be added to Krikri:

With these classes, the Activity record could persist the JobStatus along with, where applicable, more detailed information from the exception.
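
A hypothetical sketch of how that persistence could look; the state and state_reason fields, and the shape of Activity#run shown here, are assumptions rather than existing Krikri code:

    module Krikri
      class Activity < ActiveRecord::Base
        def run
          # ... existing start/stop bookkeeping and agent invocation ...
          yield if block_given?
        rescue StandardError => e
          # Persist why the Activity stopped before letting the
          # exception propagate.
          update(state: 'stopped', state_reason: "#{e.class}: #{e.message}")
          raise
        end
      end
    end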

Additionally:

Messaging Requirements and Recommendations

Information that should be reported by an Activity:

Information that should be logged by each SoftwareAgent (Harvesters, Mapper, Enricher, Indexer):

Actual error messages per Activity should be reported through a centralized logging tool, where messages are tagged with Activity IDs, and can be queried. Whichever logging aggregation tool is chosen should have an API that allows us to integrate it with the Krikri dashboard.

Expected record counts can be reported for mappings, enrichments, and indexings by calling #entity_uris on the related generator Activity. We may consider adding a column to Activity to cache this statistic so that we don't have to run a SPARQL query every time we need it.
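
A hypothetical sketch of such a cache, assuming a new integer column expected_record_count on Activity:

    # Return the cached expected record count, computing and persisting
    # it on first use so that the SPARQL query behind #entity_uris only
    # runs once.
    def expected_record_count
      cached = self[:expected_record_count]
      return cached unless cached.nil?
      update(expected_record_count: entity_uris.count)
      self[:expected_record_count]
    end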

Counting records that failed to process involves catching and logging the errors, and then using a counter implementation that is durable and efficient. Catching and logging the errors is discussed below in the Notes. In most cases the correct place is inside the #run methods of the SoftwareAgents, as in the sketch after the next paragraph. Counters could be implemented as new numeric fields on Activity.

Logging URIs of records that failed to process should be fairly straightforward. Logging this in the Harvesters (with Activity IDs for searching in a centralized logging system) shouldn't be much of an issue, though the details may differ between classes. In mappings, enrichments, and indexings, the #rdf_subject of each record should be available to the code making the log call. In all cases the assumed solution is logging, plus integration of the Krikri dashboard with the centralized logging tool through the tool's API.
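
A hypothetical sketch of both the counting and the URI logging inside a SoftwareAgent's #run. The records and process_record methods, the activity accessor, and the failed_record_count column are illustrative assumptions, not existing Krikri code:

    def run(activity_uri = nil)
      records.each do |record|
        begin
          process_record(record)
        rescue StandardError => e
          # Log the failing record's URI, tagged for the centralized
          # logging system, and bump the durable failure counter.
          Rails.logger.error("Krikri: Activity #{activity_uri}: " \
                             "#{e.class}: failed to process " \
                             "#{record.rdf_subject}: #{e.message}")
          activity.increment!(:failed_record_count)
        end
      end
    end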

TODO: determine how best to log parser and DSL failures. (Consult with Tom Johnson.) Parser and DSL failures are where we should catch, and possibly count, failures on individual fields.

An ErrorAccumulator behavior should be created and added to Activity. This behavior should be accessible to the SoftwareAgent that the Activity has run, and should have a method that updates the Activity's error statistics.
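
A hypothetical sketch of the proposed behavior; the module shape, method name, and error_count column are all assumptions:

    module Krikri
      # Mixed into Activity; lets the running SoftwareAgent report
      # failures back to the Activity's persisted error statistics.
      module ErrorAccumulator
        def accumulate_error(error, record_uri = nil)
          increment!(:error_count)
          Rails.logger.error(['Krikri', "Activity #{id}", error.class.to_s,
                              record_uri, error.message].compact.join(' | '))
        end
      end
    end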

FailureGate behavior could also be added to Activity to help with deciding whether to retry a failed operation, like a harvester request or a provenance query. Activities for different SoftwareAgents could have different subclasses of FailureGate.
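
A hypothetical sketch; agent-specific subclasses could override #recoverable? with their own policies:

    module Krikri
      # Decides whether a failed operation (e.g. a harvester request or
      # a provenance query) is worth retrying.
      class FailureGate
        def initialize(max_attempts: 3)
          @max_attempts = max_attempts
          @attempts = 0
        end

        # Record a failure; answer whether another attempt is allowed.
        def retry?(error)
          @attempts += 1
          @attempts < @max_attempts && recoverable?(error)
        end

        private

        # Override in subclasses to narrow which errors count as
        # transient for a given SoftwareAgent.
        def recoverable?(error)
          error.is_a?(IOError) || error.is_a?(Timeout::Error)
        end
      end
    end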

Recommendations for Logging

All Krikri log messages should be tagged with "Krikri" so that Krikri messages can be filtered from the logfile, which will contain messages from other parts of the Rails application, since Krikri is only an engine inside a larger application.
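
Rails loggers are wrapped in ActiveSupport::TaggedLogging by default, so this can be as simple as:

    Rails.logger.tagged('Krikri') do
      Rails.logger.info('harvest started')
      # logged as: [Krikri] harvest started
    end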

All messages of a higher priority than "info" or maybe "notice" should be tagged with the location of the line of code that called the logging method, which can be determined with caller_locations(1, 1)[0]. The basename of the file and line number should be sufficient, assuming that a backtrace will follow with more detailed information. This will make it faster to identify where to look for errors within Krikri.
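
For example, a sketch of building that tag (the example output is illustrative):

    # The caller's file and line, one frame up from the logging helper.
    location = caller_locations(1, 1).first
    location_tag = "#{File.basename(location.path)}:#{location.lineno}"
    # e.g. "oai_harvester.rb:87"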

In addition to the file basename and line of code, messages should be tagged with a short Git commit ID or other indicator of the file's version. There can be multiple versions of the ingestion application running at the same moment across our ingestion workers, because application deployments leave existing workers running.
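
A one-line sketch, assuming the revision is resolved once at application boot and that git is available in the deployed checkout (an initializer reading a REVISION file written at deploy time would work equally well):

    # Short commit ID of the running code, for tagging log messages.
    KRIKRI_REVISION = `git rev-parse --short HEAD`.strip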

Wherever possible, messages should be tagged with the ID of the relevant data provider.

All messages related to exceptions should be tagged with the name of the exception's class (e.g. e.class.to_s), after the code location.

All messages for exceptions caught in a SoftwareAgent's #run method should be tagged with the Activity ID.
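
Putting the preceding tagging recommendations together, a hypothetical error call might look like the following, where location_tag and KRIKRI_REVISION are as sketched above, and provider_id and activity_id are assumed to be in scope:

    tags = ['Krikri',
            location_tag,              # file:line of the logging call
            KRIKRI_REVISION,           # version of the running code
            "provider:#{provider_id}", # relevant data provider
            e.class.to_s,              # exception class, after the location
            "activity:#{activity_id}"]
    Rails.logger.tagged(*tags) { Rails.logger.error(e.message) }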

Instead of writing backtraces with string expansion, as in (for example) "#{e.backtrace}", backtraces should be formatted for better readability. Ideally we would also drop the stack frames below Krikri's own, such as those from Rake and Resque, to clear up some visual clutter.
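
A hypothetical formatting helper along those lines:

    # Render a backtrace one frame per line, dropping everything below
    # the last Krikri frame (e.g. Rake and Resque internals).
    def format_backtrace(error)
      frames = error.backtrace || []
      last_krikri = frames.rindex { |frame| frame.include?('krikri') }
      frames = frames[0..last_krikri] unless last_krikri.nil?
      frames.map { |frame| "    #{frame}" }.join("\n")
    end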

Configuration files should be adjusted, and dependencies updated, to allow Heidrun to log to syslog. Krikri should be examined to ascertain that it plays as nicely as possible with container applications that have their own logging requirements: it should tag and otherwise format messages in the fashion most convenient for its maintainers, but should not place any undue burden on its host application. Heidrun should be examined to ensure that its initializers and settings files provide good defaults for development and production, while still allowing flexible configuration for development of logging and messaging functionality.
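
A minimal sketch for a Heidrun production configuration, assuming Ruby's standard-library Syslog::Logger is acceptable:

    # config/environments/production.rb (or an initializer)
    require 'syslog/logger'

    Rails.application.configure do
      config.logger = ActiveSupport::TaggedLogging.new(
        Syslog::Logger.new('heidrun')
      )
    end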

Notes

Where Exceptions are Raised

These are the locations in Krikri where exceptions are raised explicitly in the current code, as of July 2016 (Krikri version 0.13.2).

This does not include where exceptions bubble up from dependencies.

The class of exception that is raised is indicated in parentheses.

Where Exceptions Are Caught

These are the locations in the current Krikri code, as of July 2016 (version 0.13.2), where exceptions are caught, excluding some locations that are not relevant to this analysis.

Where Marmotta And Other HTTP Exceptions Are Or Can Be Caught

In the cases above, exceptions that go uncaught and propagate up to Activity#run can be of a variety of exception classes in addition to RuntimeError (and may not extend RuntimeError, as promised in the documentation of Activity#run).

Where Log Messages Are Written

These are the locations in the current Krikri code, as of July 2016 (version 0.13.2), where logging statements are executed.

Logging is never performed in controllers, and Activity is the only model that logs.

Every class that logs, apart from Activity and SearchIndex, is a SoftwareAgent.