Error Classes and Messaging Requirements

This page documents where exceptions are raised in Krikri and recommends how they might be changed to better represent the kinds of errors the application encounters, with the intention of presenting activities' statuses to users more accurately and improving the diagnostics available to administrators and developers.

This information has been collected to fulfill Krikri tickets #8426 ("Define Error Classes") and #8427 ("Develop Error Message Requirements ..."), and is intended to be an aid for follow-up implementation work.

Documented Behavior

The documentation comment at the head of the file for Krikri::SoftwareAgent says:

Software Agents should handle internal errors that do not result in full activity failure, and raise a RuntimeError when the job fails. Activity handles logging of activity start/stop, and failure status.

It appears that the statement about RuntimeError was based on an assumption about what had been observed at the time, rather than an attempt to prescribe a particular behavior. Considering the cases documented in the Notes below and the "Recommendations for Exceptions" that follow, we may want to acknowledge that our SoftwareAgents raise a range of exceptions.

Recommendations

Recommendations for Exceptions

Standardize on RuntimeError for exceptions that should be RuntimeError but are not. As an alternative, we could update the documented behavior above to specify the wider range of exceptions that one should expect. See the Notes below for details on the following classes; some of them have other recommendations as well.

  • Krikri::Harvesters::PrimoHarvester
  • Krikri::LDP::Resource
  • Krikri::MapCrosswalk::CrosswalkHashBuilder
  • Krikri::ProvenanceQueryClient

We should catch the exceptions thrown by our harvester record enumerators. We could have some of these enumerators retry HTTP requests a limited number of times after a delay, where they don't already do so by default. (We need to audit the use of Faraday settings for retry, delay, and backoff.) This should probably be done in the #enumerate_records methods of the harvesters.
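
As a sketch of the kind of Faraday configuration we would be auditing for, the built-in retry middleware can be enabled per connection; the endpoint URL and option values below are illustrative only, and in newer Faraday versions the middleware lives in the separate faraday-retry gem.

```ruby
require 'faraday'

connection = Faraday.new(url: 'http://example.org/oai') do |conn|
  # Retry up to three times, waiting one second and doubling the delay on
  # each attempt. The middleware's default exception list covers timeouts.
  conn.request :retry, max: 3, interval: 1, backoff_factor: 2
  conn.adapter Faraday.default_adapter
end
```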

Catching exceptions from the ProvenanceQueryClient enumerator, which underlies mappings, enrichments, and indexings, is a little more involved. Because the enumerator enumerates the result of a single SPARQL query, we could retry a limited number of times when it raises, and re-raise the exception if it still fails. Once the query has succeeded, the each_solution and map calls are likely to loop through to the end of the result. SPARQL::Client uses Net::HTTP::Persistent, and Net::HTTP buffers responses rather than streaming them by default, so we are probably safe to assume that we won't encounter errors in the enumerator more than once.
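
A minimal sketch of that retry-then-reraise idea, assuming a small wrapper around the query returned by ProvenanceQueryClient.find_by_activity; the method name and attempt count are illustrative, not existing Krikri API.

```ruby
# Wrap the SPARQL query's solutions in an enumerator that retries the query a
# few times before giving up. Note that a retry restarts enumeration from the
# beginning, which should be acceptable because errors are expected only when
# the query itself is first executed.
def enumerate_with_retry(query, attempts: 3)
  Enumerator.new do |yielder|
    begin
      query.each_solution { |solution| yielder << solution }
    rescue StandardError => e
      attempts -= 1
      retry if attempts > 0
      raise e
    end
  end
end
```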

When an Activity catches any of the various kinds of exceptions described below, it should save an indication of its "stopped" state (e.g. in a new field, state) along with the reason it stopped. I would like to suggest that the following classes be added to Krikri (a sketch follows the list):

  • Krikri::JobStatus::New
    • The Activity's job has not been started yet.
  • Krikri::JobStatus::Running
    • The Activity's job is assumed to be running.
  • Krikri::JobStatus::StopRequested
    • The Activity's job is assumed to be running.
    • The Activity's job should `exit` at the next opportunity.
  • Krikri::JobStatus::Stopped
    • Krikri::JobStatus::Interrupted < Krikri::JobStatus::Stopped
      • The job was interrupted on purpose, e.g. with SIGINT
    • Krikri::JobStatus::Failed < Krikri::JobStatus::Stopped
      • The job stopped due to error
    • Krikri::JobStatus::Complete < Krikri::JobStatus::Stopped
      • The job completed successfully
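
A sketch of the proposed hierarchy follows; none of these classes exist in Krikri yet, and the exact shape (classes versus an enum-style field) is open to discussion.

```ruby
module Krikri
  module JobStatus
    class New; end            # the Activity's job has not been started yet
    class Running; end        # the Activity's job is assumed to be running
    class StopRequested; end  # the job should exit at the next opportunity
    class Stopped; end

    class Interrupted < Stopped; end  # interrupted on purpose, e.g. with SIGINT
    class Failed      < Stopped; end  # stopped due to error
    class Complete    < Stopped; end  # completed successfully
  end
end
```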

With these classes, the Activity record could persist the JobStatus along with, where applicable, more detailed information from the exception.

Additionally:

  • We could have Activity#run register a signal handler to catch SIGINT (e.g. Signal.trap('INT')). It could log the interrupt and exit.
  • Each SoftwareAgent's #run is passed the Activity's URI (via Activity#rdf_subject), so it should be able to periodically poll the Activity and see whether its status is Krikri::JobStatus::StopRequested, raising an Interrupt in that case (see the sketch following this list).
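
A sketch combining both items above; the state field and JobStatus classes are the proposed additions, and the records collection, activity lookup, flag name, and polling interval are illustrative only.

```ruby
# Trap handlers run in a restricted context, so the handler only sets a flag
# that the main record loop checks; heavier work (logging, database writes)
# happens outside the handler.
interrupted = false
Signal.trap('INT') { interrupted = true }

records.each_with_index do |record, index|
  raise Interrupt, 'SIGINT received' if interrupted

  # Every 100 records, poll the Activity to see whether a stop was requested.
  if (index % 100).zero? &&
     activity.state == Krikri::JobStatus::StopRequested.to_s
    raise Interrupt, "stop requested for #{activity.rdf_subject}"
  end

  process_record(record)
end
```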

Messaging Requirements and Recommendations

Information that should be reported by an Activity:

  • Counts of records expected and the counts processed
    • See below in this section
  • Counts of records that failed to process
    • See below in this section
  • Actual error messages per Activity
    • As noted below, this should actually be delegated to a centralized logging system.
  • Fields that failed to map or enrich, and error counts by field
    • See below in this section. Requires more planning.
    • Logging messages by field is related to catching parser and DSL failures, below.

Information that should be logged by each SoftwareAgent (Harvesters, Mapper, Enricher, Indexer):

  • The OriginalRecord URIs of records that failed to map, enrich, or index. The URLs of harvester requests that produced failures.
  • HTTP errors, including those arising from Marmotta requests.
  • Empty metadata or root parser nodes
  • Parser or DSL failures
  • Interrupt signals. (SIGINT – see above)

Actual error messages per Activity should be reported through a centralized logging tool, where messages are tagged with Activity IDs, and can be queried. Whichever logging aggregation tool is chosen should have an API that allows us to integrate it with the Krikri dashboard.

Expected record counts can be reported for mappings, enrichments, and indexings by calling #entity_uris on the related generator Activity. We may consider adding a column to Activity to aggregate this statistic so that we don't have to run a SPARQL query every time we need it.
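
A hypothetical sketch of such a column; the column name and the way the generator Activity is looked up are assumptions, with #entity_uris being the existing call mentioned above.

```ruby
class Krikri::Activity < ActiveRecord::Base
  # Return the cached expected record count, computing and persisting it on
  # first use so the underlying SPARQL query only runs once.
  def expected_record_count
    cached = read_attribute(:expected_record_count)
    return cached if cached

    count = generator_activity.entity_uris.count
    update_column(:expected_record_count, count)
    count
  end
end
```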

Counting records that failed to process involves catching and logging the errors, and then using a suitable counter implementation that is durable and efficient. Catching and logging the errors is discussed below in the Notes. The correct place in most cases is inside the #run methods of the SoftwareAgents. Counters could be implemented as new numeric fields on Activity.

Logging the URIs of records that failed to process should be fairly straightforward. It shouldn't be much of an issue to log this in the Harvesters (with Activity IDs for searching in a centralized logging system), though the details could differ between classes. In the mappings, enrichments, and indexings, the #rdf_subject of each record should be available to the code making the log call. The solution in all cases is assumed, at the moment, to be logging plus integration of the Krikri dashboard with the centralized logging tool through the tool's API.

TODO: determine how best to log parser and DSL failures (consult with Tom Johnson). Parser and DSL failures are where we should catch and possibly count failures by individual field.

An ErrorAccumulator behavior should be created and added to Activity. This behavior should be accessible to the SoftwareAgent that the Activity runs, and should have a method that updates the Activity's error statistics.
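
A hypothetical sketch of that behavior; the module name follows the text above, but the error_count column and the method name are assumptions for illustration.

```ruby
module Krikri
  module ErrorAccumulator
    # Record one failure against this Activity, tagged with the exception
    # class, and keep a running count in a (proposed) numeric column.
    def accumulate_error(exception)
      with_lock do
        self.error_count = (error_count || 0) + 1
        save!
      end
      Rails.logger.error("[Krikri] [#{exception.class}] activity #{id}: " \
                         "#{exception.message}")
    end
  end
end
```

Krikri::Activity could then include Krikri::ErrorAccumulator, and a SoftwareAgent holding a reference to its Activity could call accumulate_error from its rescue blocks.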

A FailureGate behavior could also be added to Activity to help decide whether to retry a failed operation, such as a harvester request or a provenance query. Activities for different SoftwareAgents could have different subclasses of FailureGate.
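
A hypothetical sketch of a FailureGate; the class, its API, and the per-agent subclass are illustrative only.

```ruby
module Krikri
  # A small policy object that decides whether a failed operation should be
  # retried.
  class FailureGate
    def initialize(max_attempts: 3)
      @max_attempts = max_attempts
      @attempts = 0
    end

    # Returns true if the caller should retry after this failure.
    def retry?(_exception)
      (@attempts += 1) < @max_attempts
    end
  end

  # A harvester-specific gate might also consult the exception class, e.g.
  # declining to retry client errors that will not succeed on a second try.
  class HarvesterFailureGate < FailureGate
    def retry?(exception)
      exception.is_a?(Faraday::ClientError) ? false : super
    end
  end
end
```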

Recommendations for Logging

All Krikri log messages should be tagged with "Krikri" so that Krikri messages can be filtered from the logfile, which will contain messages from other parts of the Rails application, Krikri being only an engine inside a larger application.

All messages of a higher priority than "info" or maybe "notice" should be tagged with the location of the line of code that called the logging method, which can be determined with caller_locations(1, 1)[0]. The basename of the file and line number should be sufficient, assuming that a backtrace will follow with more detailed information. This will make it faster to identify where to look for errors within Krikri.
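
A small sketch of how that tag might be built inside a logging method; the method itself is illustrative (not the existing Krikri::Logger), it tags every message rather than only higher-priority ones for simplicity, and the depth argument to caller_locations would need adjusting if the call were wrapped in further helpers.

```ruby
def log(priority, message)
  # caller_locations(1, 1) returns the immediate caller's frame; #path and
  # #lineno give the file and line that invoked this logging method.
  location = caller_locations(1, 1).first
  tag = "#{File.basename(location.path)}:#{location.lineno}"
  Rails.logger.send(priority, "[Krikri] [#{tag}] #{message}")
end
```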

In addition to the file basename and line of code, messages should be tagged with a short Git commit ID or another indicator of the file's version. Multiple versions of the ingestion application can be running at the same moment across our ingestion workers, because application deployments leave existing workers running.
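
One hypothetical way to capture the running code's version once at boot (e.g. in an initializer); the constant name is illustrative, and deployments that do not ship the .git directory would need another source, such as a REVISION file written by the deploy tooling.

```ruby
# Resolve a short commit ID once at boot and fall back to 'unknown'.
KRIKRI_REVISION =
  begin
    sha = `git rev-parse --short HEAD 2>/dev/null`.strip
    sha.empty? ? 'unknown' : sha
  rescue StandardError
    'unknown'
  end
```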

Wherever possible, messages should be tagged with the ID of the relevant data provider.

All messages related to exceptions should be tagged with the name of the exception's class (e.g. e.class.to_s), after the code location.

All messages for exceptions caught by a SoftwareAgent's #run method should be tagged with the Activity ID.

Instead of writing backtraces with plain string interpolation, for example "#{e.backtrace}", backtraces should be formatted for better readability. We would ideally remove stack frames near the bottom that precede Krikri's, such as those from Rake and Resque, to clear up some visual clutter.
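
A minimal sketch of that kind of trimming; the method name and the regular expression for framework frames are illustrative and would need tuning against real backtraces.

```ruby
def format_backtrace(exception)
  frames = exception.backtrace || []
  # Keep frames up to the first Rake or Resque frame; those outer frames sit
  # at the bottom of the trace and rarely help diagnose Krikri errors.
  frames = frames.take_while { |frame| frame !~ %r{/(rake|resque)(-[\d.]+)?/} }
  frames.join("\n  ")
end
```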

Configuration files should be adjusted, and dependencies updated, to allow Heidrun to log to syslog. Krikri should be examined to ascertain that it plays as nicely as possible with container applications that have their own logging requirements: it should tag and otherwise format messages in the fashion most convenient for its maintainers, but should not place any undue burden on its host application. Heidrun should be examined to ensure that its initializers and settings files provide sensible defaults for development and production, while still allowing it to be configured flexibly for development of logging and messaging functionality.
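
A hypothetical snippet for Heidrun's config/environments/production.rb; Syslog::Logger ships with the Ruby standard library, and the program name 'heidrun' is illustrative.

```ruby
require 'syslog/logger'

Rails.application.configure do
  # Send Rails (and therefore Krikri engine) log output to the local syslog
  # daemon under the 'heidrun' program name. Wrapping this logger in
  # ActiveSupport::TaggedLogging would preserve the tagging recommendations
  # above.
  config.logger = Syslog::Logger.new('heidrun')
end
```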

Notes

Where Exceptions are Raised

These are the locations in Krikri where exceptions are raised explicitly in the current code, as of July 2016 (Krikri version 0.13.2).

This does not include where exceptions bubble up from dependencies.

The class of exception that is raised is indicated in parentheses.

  • in Krikri::AsyncUriGetter.Request (Unused, so of limited concern)
    • in #join in the proc given to Struct.new
      • when a thread for a new asynchronous http request can't be joined (re-raises derived class of StandardError)
    • in #with_response in the proc given to Struct.new
      • when an HTTP request fails (re-raises Faraday exception)
  • in DPLA::MAP::Aggregation, monkeypatched in Krikri::Engine
    • in #original_record
      • when an OriginalRecord is empty or is a blank node. (NameError)
    • in #dpla_id
      • when trying to get DPLA ID for non-DPLA URI (Krikri::Engine::NamespaceError)
      • only called by #no_origrec_message
    • in #mint_id_fragment
      • when asked to mint an ID for an empty or blank node OriginalRecord (NameError)
      • only called by #mint_id!, which is never called within Krikri
    • NOTE: These are of limited concern, because only #original_record ever gets invoked within Krikri, via the call to agg.original_record in Krikri::SearchResultsHelperBehavior.
  • in Krikri::Harvesters::OAIHarvester
    • in #get_identifier
      • when encountering an XML syntax error trying to find the record ID (Krikri::Harvester::IdentifierError)
      • when the ID element parsed from the document is empty (Krikri::Harvester::IdentifierError)
      • caught and handled in #records.
      • NOTE: Of limited concern. Krikri::Harvester::IdentifierError is defined in Krikri::Harvester, but #get_identifier is the only place it's ever thrown, and it's handled within the same class in which it's thrown.
  • in Krikri::Harvesters::PrimoHarvester
    • in #count
      • when receiving a non-200 HTTP status from the endpoint for initial request to retrieve record count (PrimoHarvesterError)
    • in #records
      • when receiving a non-200 HTTP response from request for a batch of records (PrimoHarvesterError)
    • in #get_record
      • when receiving a non-200 HTTP response from request for one record (PrimoHarvesterError)
    • NOTE: The exceptions in #get_record and #count should probably be RuntimeError. An exception should not be raised in #records. Processing should attempt to proceed with the next batch – the logic should be the same as if a single record fails to fetch: we should continue and assess the damage at the end of the harvest.
  • in Krikri::LDP::Invalidatable
    • in #invalidate!
      • when #exists? returns false (RuntimeError)
      • when #invalidated? returns true and invalidity errors are not being ignored (RuntimeError)
  • in Krikri::LDP::Resource
    • in #exists?
      • when encountering a Faraday::ClientError that's not an HTTP 410, re-raises it (Faraday::ClientError < StandardError)
      • This appears to be something that would get passed upwards to Activity#run and be re-raised.
    • in #delete!
      • when #exists? returns false (RuntimeError)
    • in #validate_not_nil
      • when a Resource has a nil RDF subject (RuntimeError)
    • in #validate_not_node
      • when a Resource has a blank node RDF subject (RuntimeError)
    • in #validate_namespace
      • when a Resource has an RDF subject in the wrong namespace (RuntimeError)
    • NOTE: We should have #exists? re-raise the Faraday::ClientError as a RuntimeError if we want to be true to our promise to raise RuntimeError from Activity#run.
  • in Krikri::MapCrosswalk::CrosswalkHashBuilder
    • in #build_aggregation
      • when the parent aggregation is a blank node (NameError)
      • If this is thrown, it will be passed up from a Harvester's #records enumerator and up past Harvester#run, and will be re-raised by Activity#run as a NameError instead of the documented RuntimeError.
    • NOTE: Is NameError the ideal class for this exception?  See http://ruby-doc.org/core-2.3.1/NameError.html
  • in Krikri::Mapping
    • in #process_record
      • when an error is thrown handling any of the properties of the record (Krikri::Mapper::Error < RuntimeError)
  • in Krikri::MappingDSL::ParserMethods
    • in lambda returned by #record_uri
      • when record to be parsed has not been saved (RuntimeError)
  • in Krikri::MappingDSL::ParserMethods::RecordProxy
    • in #method_missing
      • when the arity of the call does not match the method on #value_class (ArgumentError)
      • This should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that is then raised, and which will be caught by Activity#run.
  • in Krikri::MappingDSL::PropertyDeclaration
    • in #initialize
      • when a block with arity other than 1 is passed (ArgumentError)
      • As with Krikri::MappingDSL::ParserMethods::RecordProxy#method_missing, this should be added to a Krikri::Mapper::Error and caught by Activity#run.
  • in Krikri::MappingDSL::RdfSubjects::SubjectDeclaration
    • in #to_proc
      • when multiple URIs are returned from [what is the object?] (RuntimeError)
  • in Krikri::Parser::Value::ValueArray
    • in #[]=
      • when value is not a Value (InvalidParserValueError)
    • in #<<
      • when value is not a Value (InvalidParserValueError)
    • in #and
      • when trying to return unbound variable (ArgumentError)
    • in #else
      • when no block given for else (ArgumentError)
    • NOTE: As with the MappingDSL cases above, these should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that it then raises.
  • in Krikri::XmlParser
    • in #initialize
      • when Nokogiri::XML#at_xpath returns a nil root node (EmptyRootNodeError)
      • As with the MappingDSL cases above, these should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that it then raises.
  • in Krikri::ProvenanceQueryClient
    • in #find_by_activity
      • when activity_uri is not an RDF::URI (ArgumentError)
      • This will be caught and re-raised by Activity#run as ArgumentError instead of the promised RuntimeError.
  • in Krikri::QAQueryClient
    • in #values_for_predicate
      • when provider_uri is not an RDF::URI (ArgumentError)
    • in #counts_for_predicate
      • when provider_uri is not an RDF::URI (ArgumentError)
  • in Krikri::Registry
    • in .get
      • when name is not registered (RuntimeError)
    • in .register
      • when name is already registered (RuntimeError)

Where Exceptions Are Caught

These are the locations in the current Krikri code, as of July 2016 (version 0.13.2), where exceptions are caught, excluding some locations that are not relevant to this analysis.

  • in Krikri::Activity
    • in #run
      • when encountering any exception that has bubbled up
      • re-raise it after logging
  • in Krikri::AsyncUriGetter.Request (Unused)
    • in #join in the proc given to Struct.new
      • when encountering any exception that has bubbled up from joining to the HTTP request thread
      • re-raise unless "inline exceptions" are enabled, in which case the exception is swallowed.
  • in Krikri::Enricher
    • in #run
      • when encountering any exception bubbled up from Audumbla or a record save
      • log and continue
  • in Krikri::Harvester
    • in #run
      • when encountering any exception bubbled up from a derived harvester's #process_record.
      • log and continue
  • in Krikri::Harvesters::OAIHarvester
    • in #records
      • when encountering an exception from @record_class.build
      • log and continue
    • in #request_with_sets, in the proc that handles each set
      • when encountering an OAI::Exception
      • log and return [] for the affected set
    • in #get_identifier
      • with a Nokogiri::XML::XPath::SyntaxError generated by Nokogiri.XML.xpath
      • re-raise as Krikri::Harvester::IdentifierError
  • in Krikri::LDP::Resource
    • in #exists?
      • with Faraday::ResourceNotFound
        • return false
      • with Faraday::ClientError
        • return false if it's an HTTP 410
        • otherwise re-raise
  • in Krikri::Mapper
    • in #map
      • with an exception bubbled up from Registry.get or #process_record
      • log error and return nil
    • in Agent#run, in block passed to Krikri::Mapper.map
      • with any exception encountered minting ID or saving record
      • log error for the record and continue
  • in Krikri::SearchIndex
    • in #index_with_error_handling
      • when encountering any exception from yielding the given block (which happens for each record being indexed)
      • log error

Where Marmotta And Other HTTP Exceptions Are Or Can Be Caught

  • The lambda returned from Krikri::MappingDSL::ParserMethods#record_uri
    • The lambda could pass a Faraday::ClientError or other exception from Krikri::LDP::Resource#exists?
    • Although it doesn't appear that #record_uri is ever called in Krikri
  • Krikri::Enricher#run
    • The record save can pass an exception, but it will be caught.
    • Entity enumerators can pass exceptions:
      • DPLA::MAP::Aggregation#get can throw an exception in AggregationEntityBehavior via Activity.entities
      • Repository.query_client can throw exceptions via ProvenanceQueryClient.find_by_activity
  • Krikri::Mapper::Agent#run
    • The record save can pass an exception, but it will be caught.
    • Krikri::Mapper.map(name, entities).each can pass an exception from OriginalRecordEntityBehavior#entities
  • Krikri::Indexer#run
    • Exceptions are caught in Krikri::SearchIndex#index_with_error_handling, but only for the current batch of records. The Activity#entities enumerator could throw an exception that Indexer#run will fail to catch.
  • Harvesters
    • Classes that extend Krikri::Harvester catch exceptions in records.each in #run, but #records itself tends to pass exceptions, which can be raised by HTTP clients.
    • Harvesters that don't catch any exceptions in their record enumerators:
      • ApiHarvester
      • CouchdbHarvester
      • MarcXMLHarvester
      • PrimoHarvester
        • Additionally, raises its own PrimoHarvesterError exception, which extends StandardError.
    • OAIHarvester doesn't catch any exceptions in its record enumerator methods, but catches record_class.build exceptions in the block for an individual record in #records.

In the cases above, exceptions that go uncaught and are passed up to Activity#run can be of a variety of exception classes, in addition to RuntimeError (and may not extend RuntimeError as promised in the documentation of Activity#run).

Where Log Messages Are Written

These are the locations in the current Krikri code, as of July 2016 (version 0.13.2), where logging statements are executed.

  • in Krikri::Activity
    • in #run
      • informationally
      • when starting and ending a run, and when catching an exception (before re-raising it).
  • in Krikri::Enricher
    • in #run
      • log caught exception
      • when taking an individual record
      • generated by anything in the Audumbla enrichment or by record save methods
  • in Krikri::Harvester
    • in #run
      • log caught exception
      • when taking an individual record
      • generated by a record save
  • in Krikri::Harvesters::OAIHarvester
    • in #records
      • log caught exception
      • when taking an individual record
      • generated by building the record and minting the ID (which parses the XML)
    • in #request_with_sets
      • log caught exception
      • when running a proc for an individual set with its options
      • generated by the given block, which runs the harvest action
  • in Krikri::SearchIndex
    • in #index_with_error_handling
      • log caught exception
      • when running the given block
      • generated by anything within the block, e.g. calling #bulk_add (solr.add or Elasticsearch::Client#bulk)
  • in Krikri::Mapper
    • in #map
      • log caught exception
      • calls Rails.logger#error, not Krikri::Logger#log
      • when processing each record
      • generated by anything within Registry.get or #process_record within the proc for each record
    • in Agent#run
      • log caught exception
      • calls Rails.logger#error, not Krikri::Logger#log
      • when processing each original record (provenance entity)
      • generated by anything within minting an ID or saving the record

Logging is never performed in controllers, and Activity is the only model that logs.

Every class that logs, apart from Activity and SearchIndex, is a SoftwareAgent.