Error Classes and Messaging Requirements
This page documents where exceptions are raised in Krikri and makes recommendations about how they might be changed to better represent the kinds of errors the application encounters. The intent is to represent activities' statuses to users more accurately and to improve the diagnostics available to administrators and developers.
This information has been collected to fulfill Krikri tickets #8426 ("Define Error Classes") and #8427 ("Develop Error Message Requirements ..."), and is intended as an aid for follow-up implementation work.
Documented Behavior
The documentation comment at the head of the file for Krikri::SoftwareAgent says:
Software Agents should handle internal errors that do not result in full activity failure, and raise a RuntimeError when the job fails. Activity handles logging of activity start/stop, and failure status.
It appears that the statement about RuntimeError was based on an assumption about what had been observed at the time, and not an attempt to prescribe a particular behavior. Considering the cases documented below, and the "Recommendations for Exceptions" below, we may want to acknowledge that our SoftwareAgents throw a range of exceptions.
Recommendations
Recommendations for Exceptions
Standardize exceptions that should be RuntimeError but are not. As an alternative, we could update the documented behavior above to specify a wider range of exceptions that one should expect. See the notes below on the following classes; some of them have other recommendations as well.
- Krikri::Harvesters::PrimoHarvester
- Krikri::LDP::Resource
- Krikri::MapCrosswalk::CrosswalkHashBuilder
- Krikri::ProvenanceQueryClient
We should catch the exceptions thrown by our harvester record enumerators. We could have some of these enumerators re-try HTTP requests a limited number of times after a delay, where they don't already do so by default. (We need to audit the use of Faraday settings for retry, delay, and backoff.) This should probably be done in the #enumerate_records methods of the harvesters.
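For example, Faraday ships with a retry middleware that could be enabled on a harvester's HTTP connection. This is a minimal sketch; the settings and the endpoint variable are illustrative, not current defaults.

```ruby
# A sketch of enabling Faraday's built-in retry middleware with a delay and
# exponential backoff; the specific values here are illustrative only.
connection = Faraday.new(url: endpoint) do |conn|
  conn.request :retry, max: 3, interval: 1, backoff_factor: 2,
               exceptions: [Faraday::ConnectionFailed, Faraday::TimeoutError]
  conn.adapter Faraday.default_adapter
end
```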
Catching exceptions from the ProvenanceQueryClient enumerator, which is what is behind mappings, enrichments, and indexings, is a little more involved. We could try to recover from an exception raised by its enumerator some limited number of times, and then re-raise the exception if it is still being caught, because the enumerator enumerates the result of a single SPARQL query. Once that query has succeeded, each_solution and map are likely to loop through to the end of the result. SPARQL::Client uses Net::HTTP::Persistent, and Net::HTTP buffers responses rather than streaming them by default, so we are probably safe to assume that we won't encounter errors in the enumerator more than once.
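A minimal sketch of that approach, assuming the retry is wrapped around the query itself rather than each solution; the method name and retry count are arbitrary, and forcing the enumerator with #to_a is an assumption about when the query actually executes.

```ruby
# Retry the SPARQL query a limited number of times before re-raising.
def solutions_with_retry(activity_uri, attempts = 3)
  query = Krikri::ProvenanceQueryClient.find_by_activity(activity_uri)
  query.each_solution.to_a
rescue StandardError => e
  attempts -= 1
  retry if attempts > 0
  raise e
end
```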
When an Activity catches various kinds of exceptions, it should save an indication of its "stopped" state (e.g. in a new field, state), and indicate the reason why it stopped. I suggest that the following classes be added to Krikri:
- Krikri::JobStatus::New
  - The Activity's job has not been started yet.
- Krikri::JobStatus::Running
  - The Activity's job is assumed to be running.
- Krikri::JobStatus::StopRequested
  - The Activity's job is assumed to be running.
  - The Activity's job should `exit` at the next opportunity.
- Krikri::JobStatus::Stopped
- Krikri::JobStatus::Interrupted < Krikri::JobStatus::Stopped
  - The job was interrupted on purpose, e.g. with SIGINT.
- Krikri::JobStatus::Failed < Krikri::JobStatus::Stopped
  - The job stopped due to error.
- Krikri::JobStatus::Complete < Krikri::JobStatus::Stopped
  - The job completed successfully.
With these classes, the Activity record could persist the JobStatus, along with, perhaps, more detailed information from the exception, if applicable.
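A minimal sketch of what those classes might look like; the optional message is an assumption about how the reason for stopping could be carried, not part of the proposal above.

```ruby
module Krikri
  module JobStatus
    class New;           end # the Activity's job has not been started yet
    class Running;       end # the Activity's job is assumed to be running
    class StopRequested; end # the job should exit at the next opportunity

    # Base class for terminal states; may carry a human-readable reason.
    class Stopped
      attr_reader :message
      def initialize(message = nil)
        @message = message
      end
    end

    class Interrupted < Stopped; end # stopped on purpose, e.g. SIGINT
    class Failed      < Stopped; end # stopped due to an error
    class Complete    < Stopped; end # completed successfully
  end
end
```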
Additionally:
- We could have Activity#run register a signal handler to catch SIGINT (e.g. Signal.trap('INT')). It could log the interrupt and exit. (See the sketch after this list.)
- Each SoftwareAgent's #run is passed the Activity's URI (via Activity#rdf_subject), so it should be able to periodically poll the Activity and see if its status is Krikri::JobStatus::StopRequested. It can raise an Interrupt in this case.
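A minimal sketch of both ideas, assuming a new state field on Activity as described above. The lookup of an Activity record from its URI is illustrative, not existing Krikri API, and logging from a trap handler may need to be deferred in practice.

```ruby
# 1. In Activity#run: trap SIGINT, log it, and exit.
Signal.trap('INT') do
  Rails.logger.error("Krikri: Activity #{id} received SIGINT; exiting")
  exit
end

# 2. In a SoftwareAgent's #run loop: poll the Activity named by activity_uri
#    and raise Interrupt if a stop has been requested.
activity_id = activity_uri.to_s.split('/').last
activity    = Krikri::Activity.find(activity_id)
raise Interrupt if activity.state == 'Krikri::JobStatus::StopRequested'
```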
Messaging Requirements and Recommendations
Information that should be reported by an Activity:
- Counts of records expected and the counts processed
  - See below in this section.
- Counts of records that failed to process
  - See below in this section.
- Actual error messages per Activity
  - As noted below, this should actually be delegated to a centralized logging system.
- Fields that failed to map or enrich, and error counts by field
  - See below in this section. Requires more planning.
  - Logging messages by field is related to catching parser and DSL failures, below.
Information that should be logged by each SoftwareAgent (Harvesters, Mapper, Enricher, Indexer):
- The OriginalRecord URIs of records that failed to map, enrich, or index. The URLs of harvester requests that produced failures.
- HTTP errors, including those arising from Marmotta requests.
- Empty metadata or root parser nodes
- Parser or DSL failures
- Interrupt signals (SIGINT; see above)
Actual error messages per Activity should be reported through a centralized logging tool, where messages are tagged with Activity IDs, and can be queried. Whichever logging aggregation tool is chosen should have an API that allows us to integrate it with the Krikri dashboard.
Expected record counts can be reported for mappings, enrichments, and indexings by calling #entity_uris on the related generator Activity. We may consider adding a column to Activity to aggregate this statistic so that we don't have to run a SPARQL query every time we need it.
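A minimal sketch of such a column; the generator_activity accessor and the column name are hypothetical, not existing Krikri API.

```ruby
# On Krikri::Activity: run the SPARQL query once and cache the result.
def expected_record_count
  return self[:expected_record_count] if self[:expected_record_count]
  update(expected_record_count: generator_activity.entity_uris.count)
  self[:expected_record_count]
end
```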
Counting records that failed to process involves catching and logging the errors, and then using a counter implementation that is durable and efficient. Catching and logging the errors is discussed below in the Notes. The correct place to do this, in most cases, is inside the #run methods of the SoftwareAgents. Counters could be implemented as new numeric fields of Activity.
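A minimal sketch of such fields; the column names (and the assumed krikri_activities table) are illustrative.

```ruby
class AddFailureCountsToKrikriActivities < ActiveRecord::Migration
  def change
    add_column :krikri_activities, :records_processed, :integer, default: 0
    add_column :krikri_activities, :records_failed,    :integer, default: 0
  end
end

# Inside a SoftwareAgent's #run, after catching and logging an error:
activity.increment!(:records_failed)
```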
Logging the URIs of records that failed to process should be fairly straightforward. It shouldn't be too much of an issue to log this in the Harvesters (with Activity IDs for searching in a centralized logging system), though the details could differ between classes. In the mappings, enrichments, and indexings, the #rdf_subject of each record should be available to the code making the log call. The solution in all cases is assumed, at the moment, to be logging plus integration of the Krikri dashboard with the centralized logging tool through the tool's API.
TODO: determine how best to log parser and DSL failures (consult with Tom Johnson). Parser and DSL failures are where we should catch, and possibly count, failures by individual field.
An ErrorAccumulator behavior should be created and added to Activity. This behavior should be accessible to the SoftwareAgent that the Activity has run, and should have a method that updates the Activity's error statistics.
A FailureGate behavior could also be added to Activity to help with deciding whether to retry a failed operation, like a harvester request or a provenance query. Activities for different SoftwareAgents could have different subclasses of FailureGate.
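A minimal sketch of how these two behaviors might fit together; everything below is a proposal, not existing Krikri API.

```ruby
module Krikri
  # Mixed into Activity; accumulates error statistics for a run.
  module ErrorAccumulator
    def record_failure(error, record_uri = nil)
      error_counts[error.class.to_s] += 1
      failed_record_uris << record_uri.to_s if record_uri
    end

    def error_counts
      @error_counts ||= Hash.new(0)
    end

    def failed_record_uris
      @failed_record_uris ||= []
    end
  end

  # Decides whether a failed operation (harvester request, provenance query)
  # should be retried; different SoftwareAgents could use different subclasses.
  class FailureGate
    def initialize(max_failures: 10)
      @max_failures = max_failures
      @failures     = 0
    end

    def retry?(_error)
      @failures += 1
      @failures <= @max_failures
    end
  end
end
```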
Recommendations for Logging
All Krikri log messages should be tagged with "Krikri" so that Krikri messages can be filtered from the logfile, which will also contain messages from other parts of the Rails application; Krikri is only an engine inside a larger application.
All messages of a higher priority than "info" (or perhaps "notice") should be tagged with the location of the line of code that called the logging method, which can be determined with caller_locations(1, 1)[0]. The basename of the file and the line number should be sufficient, assuming that a backtrace will follow with more detailed information. This will make it faster to identify where to look for errors within Krikri.
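For example, the tag could be built like this; the tag format and the use of the Rails tagged logger are assumptions.

```ruby
loc = caller_locations(1, 1)[0]
location_tag = "#{File.basename(loc.path)}:#{loc.lineno}"
Rails.logger.tagged('Krikri', location_tag) do
  Rails.logger.error(message)
end
```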
In addition to the file basename and line number, messages should be tagged with a short Git commit ID or other indicator of the file's version. Multiple versions of the ingestion application can be running at the same moment across our ingestion workers, because application deployments leave existing workers running.
Wherever possible, messages should be tagged with the ID of the relevant data provider.
All messages related to exceptions should be tagged with the name of the exception's class (e.g. e.class.to_s), after the code location.
All messages for exceptions caught by a SoftwareAgent's #run method should be tagged with the Activity ID.
Instead of writing backtraces with string expansion (for example, "#{e.backtrace}"), backtraces should be formatted for better readability. We would ideally remove stack frames near the bottom that precede Krikri's, such as those from Rake and Resque, to clear up some visual clutter.
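A minimal sketch of that formatting; the frame filter is illustrative.

```ruby
# Drop Rake/Resque frames and put each remaining frame on its own line.
def format_backtrace(exception)
  (exception.backtrace || [])
    .reject { |frame| frame =~ %r{/(rake|resque)[-/]} }
    .join("\n  ")
end
```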
Configuration files should be adjusted, and dependencies updated, to allow Heidrun to log to syslog. Krikri needs to be examined to ascertain that it plays as nicely as possible with container applications that have their own logging requirements. It should tag and otherwise format messages in the most convenient fashion for its maintainers, but should not place any undue burden on its host application. Heidrun should be examined to ensure that it has initializers and settings files that give it good defaults for development and production, but that allow it to be configured flexibly for development of logging and messaging functionality.
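A minimal sketch of one way Heidrun could be pointed at syslog using Ruby's standard library; the program name and placement in config/environments/production.rb are illustrative.

```ruby
# config/environments/production.rb
require 'syslog/logger'

Rails.application.configure do
  config.logger = ActiveSupport::TaggedLogging.new(Syslog::Logger.new('heidrun'))
end
```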
Notes
Where Exceptions are Raised
These are the locations in Krikri where exceptions are raised explicitly in the current code, as of July 2016 (Krikri version 0.13.2). This does not include places where exceptions bubble up from dependencies. The class of exception that is raised is indicated in parentheses.
- in Krikri::AsyncUriGetter::Request (unused, so of limited concern)
  - in #join, in the proc given to Struct.new
    - when a thread for a new asynchronous HTTP request can't be joined (re-raises derived class of StandardError)
  - in #with_response, in the proc given to Struct.new
    - when an HTTP request fails (re-raises Faraday exception)
- in DPLA::MAP::Aggregation, monkeypatched in Krikri::Engine
  - in #original_record
    - when an OriginalRecord is empty or is a blank node (NameError)
  - in #dpla_id
    - when trying to get a DPLA ID for a non-DPLA URI (Krikri::Engine::NamespaceError)
    - only called by #no_origrec_message
  - in #mint_id_fragment
    - when asked to mint an ID for an empty or blank-node OriginalRecord (NameError)
    - only called by #mint_id!, which is never called within Krikri
  - NOTE: These are of limited concern, because only #original_record ever gets invoked within Krikri, via the call to agg.original_record in Krikri::SearchResultsHelperBehavior.
- in Krikri::Harvesters::OAIHarvester
  - in #get_identifier
    - when encountering an XML syntax error while trying to find the record ID (Krikri::Harvester::IdentifierError)
    - when the ID element parsed from the document is empty (Krikri::Harvester::IdentifierError)
    - caught and handled in #records
  - NOTE: Of limited concern. Krikri::Harvester::IdentifierError is defined in Krikri::Harvester, but #get_identifier is the only place it's ever thrown, and it's handled within the same class in which it's thrown.
- in Krikri::Harvesters::PrimoHarvester
  - in #count
    - when receiving a non-200 HTTP status from the endpoint for the initial request to retrieve the record count (PrimoHarvesterError)
  - in #records
    - when receiving a non-200 HTTP response from the request for a batch of records (PrimoHarvesterError)
  - in #get_record
    - when receiving a non-200 HTTP response from the request for one record (PrimoHarvesterError)
  - NOTE: The exceptions in #get_record and #count should probably be RuntimeError. An exception should not be raised in #records; processing should attempt to proceed with the next batch. The logic should be the same as when a single record fails to fetch: continue, and assess the damage at the end of the harvest.
- in Krikri::LDP::Invalidatable
  - in #invalidate!
    - when #exists? returns false (RuntimeError)
    - when #invalidated? returns true and invalidity errors are not being ignored (RuntimeError)
- in Krikri::LDP::Resource
  - in #exists?
    - when encountering a Faraday::ClientError that's not an HTTP 410, re-raises it (Faraday::ClientError < StandardError)
      - This appears to be something that would get passed upwards to Activity#run and be re-raised.
  - in #delete!
    - when #exists? returns false (RuntimeError)
  - in #validate_not_nil
    - when a Resource has a nil RDF subject (RuntimeError)
  - in #validate_not_node
    - when a Resource has a blank-node RDF subject (RuntimeError)
  - in #validate_namespace
    - when a Resource has an RDF subject in the wrong namespace (RuntimeError)
  - NOTE: We should have #exists? re-raise the Faraday::ClientError as a RuntimeError if we want to be true to our promise to raise RuntimeError from Activity#run.
- in Krikri::MapCrosswalk::CrosswalkHashBuilder
  - in #build_aggregation
    - when the parent aggregation is a blank node (NameError)
      - If this is thrown, it will be passed up from a Harvester's #records enumerator and up past Harvester#run, and will be re-raised by Activity#run as a NameError instead of the documented RuntimeError.
  - NOTE: Is NameError the ideal class for this exception? See http://ruby-doc.org/core-2.3.1/NameError.html
- in Krikri::Mapping
  - in #process_record
    - when an error is thrown handling any of the properties of the record (Krikri::Mapper::Error < RuntimeError)
- in Krikri::MappingDSL::ParserMethods
  - in the lambda returned by #record_uri
    - when the record to be parsed has not been saved (RuntimeError)
- in Krikri::MappingDSL::ParserMethods::RecordProxy
  - in #method_missing
    - when the arity of the call does not match the method on #value_class (ArgumentError)
      - This should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that is then raised, and which will be caught by Activity#run.
- in Krikri::MappingDSL::PropertyDeclaration
  - in #initialize
    - when a block with arity other than 1 is passed (ArgumentError)
      - As with Krikri::MappingDSL::ParserMethods::RecordProxy#method_missing, this should be added to a Krikri::Mapper::Error and caught by Activity#run.
- in Krikri::MappingDSL::RdfSubjects::SubjectDeclaration
  - in #to_proc
    - when multiple URIs are returned from [what is the object?] (RuntimeError)
- in Krikri::Parser::Value::ValueArray
  - in #[]=
    - when the value is not a Value (InvalidParserValueError)
  - in #<<
    - when the value is not a Value (InvalidParserValueError)
  - in #and
    - when trying to return an unbound variable (ArgumentError)
  - in #else
    - when no block is given for else (ArgumentError)
  - NOTE: As with the MappingDSL cases above, these should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that it then raises.
- in Krikri::XmlParser
  - in #initialize
    - when Nokogiri::XML#at_xpath returns a nil root node (EmptyRootNodeError)
      - As with the MappingDSL cases above, this should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that it then raises.
- in Krikri::ProvenanceQueryClient
  - in #find_by_activity
    - when activity_uri is not an RDF::URI (ArgumentError)
      - This will be caught and re-raised by Activity#run as ArgumentError instead of the promised RuntimeError.
- in Krikri::QAQueryClient
  - in #values_for_predicate
    - when provider_uri is not an RDF::URI (ArgumentError)
  - in #counts_for_predicate
    - when provider_uri is not an RDF::URI (ArgumentError)
- in Krikri::Registry
  - in .get
    - when name is not registered (RuntimeError)
  - in .register
    - when name is already registered (RuntimeError)
Where Exceptions Are Caught
These are the locations in the current Krikri code, as of July 2016 (version 0.13.2), where exceptions are caught, excluding some locations that are not relevant to this analysis.
- in Krikri::Activity
  - in #run
    - when encountering any exception that has bubbled up
      - re-raise it after logging
- in Krikri::AsyncUriGetter::Request (unused)
  - in #join, in the proc given to Struct.new
    - when encountering any exception that has bubbled up from joining the HTTP request thread
      - re-raise unless "inline exceptions" are enabled; in that case, the exception is swallowed.
- in Krikri::Enricher
  - in #run
    - when encountering any exception bubbled up from Audumbla or a record save
      - log and continue
- in Krikri::Harvester
  - in #run
    - when encountering any exception bubbled up from a derived harvester's #process_record
      - log and continue
- in Krikri::Harvesters::OAIHarvester
  - in #records
    - when encountering an exception from @record_class.build
      - log and continue
  - in #request_with_sets, in the proc that handles each set
    - when encountering an OAI::Exception
      - log and return [] for the affected set
  - in #get_identifier
    - with a Nokogiri::XML::XPath::SyntaxError generated by Nokogiri.XML.xpath
      - re-raise as Krikri::Harvester::IdentifierError
- in Krikri::LDP::Resource
  - in #exists?
    - with Faraday::ResourceNotFound
      - return false
    - with Faraday::ClientError
      - return false if it's an HTTP 410
      - otherwise re-raise
- in Krikri::Mapper
  - in #map
    - with an exception bubbled up from Registry.get or #process_record
      - log the error and return nil
  - in Agent#run, in the block passed to Krikri::Mapper.map
    - with any exception encountered minting an ID or saving a record
      - log the error for the record and continue
- in Krikri::SearchIndex
  - in #index_with_error_handling
    - when encountering any exception from yielding the given block (which happens for each record being indexed)
      - log the error
Where Marmotta And Other HTTP Exceptions Are Or Can Be Caught
- The lambda returned from Krikri::MappingDSL::ParserMethods#record_uri
  - The lambda could pass along a Faraday::ClientError or other exception from Krikri::LDP::Resource#exists?
    - Although it doesn't appear that #record_uri is ever called in Krikri
- Krikri::Enricher#run
  - The record save can pass an exception, but it will be caught.
  - Entity enumerators can pass exceptions:
    - DPLA::MAP::Aggregation#get can throw an exception in AggregationEntityBehavior via Activity.entities
    - Repository.query_client can throw exceptions via ProvenanceQueryClient.find_by_activity
- Krikri::Mapper::Agent#run
  - The record save can pass an exception, but it will be caught.
  - Krikri::Mapper.map(name, entities).each can pass an exception from OriginalRecordEntityBehavior#entities
- Krikri::Indexer#run
  - Exceptions are caught in Krikri::SearchIndex#index_with_error_handling, but only for the current batch of records. The Activity#entities enumerator could throw an exception that Indexer#run will fail to catch.
- Harvesters
  - Classes that extend Krikri::Harvester catch exceptions in records.each in #run, but #records itself tends to pass exceptions, which can be raised by HTTP clients.
  - Harvesters that don't catch any exceptions in their record enumerators:
    - ApiHarvester
    - CouchdbHarvester
    - MarcXMLHarvester
    - PrimoHarvester
      - Additionally, raises its own PrimoHarvestError exception, which extends StandardError.
  - OAIHarvester doesn't catch any exceptions in its record enumerator methods, but it catches record_class.build exceptions in the block for an individual record in #records.
In the cases above, exceptions that go uncaught and are passed up to Activity#run can be of a variety of exception classes, in addition to RuntimeError (and may not extend RuntimeError, as promised in the documentation of Activity#run).
Where Log Messages Are Written
These are the locations in the current Krikri code, as of July 2016 (version 0.13.2), where logging statements are executed.
- in Krikri::Activity
  - in #run
    - informationally
      - when starting and ending a run, and when catching an exception (before re-raising it)
- in Krikri::Enricher
  - in #run
    - log caught exception
      - when taking an individual record
      - generated by anything in the Audumbla enrichment or by record save methods
- in Krikri::Harvester
  - in #run
    - log caught exception
      - when taking an individual record
      - generated by a record save
- in Krikri::Harvesters::OAIHarvester
  - in #records
    - log caught exception
      - when taking an individual record
      - generated by building the record and minting the ID (which parses the XML)
  - in #request_with_sets
    - log caught exception
      - when running a proc for an individual set with its options
      - generated by the given block, which runs the harvest action
- in Krikri::SearchIndex
  - in #index_with_error_handling
    - log caught exception
      - when running the given block
      - generated by anything within the block, e.g. calling #bulk_add (solr.add or Elasticsearch::Client#bulk)
- in Krikri::Mapper
  - in #map
    - log caught exception
      - calls Rails.logger#error, not Krikri::Logger#log
      - when processing each record
      - generated by anything within Registry.get or #process_record within the proc for each record
  - in Agent#run
    - log caught exception
      - calls Rails.logger#error, not Krikri::Logger#log
      - when processing each original record (provenance entity)
      - generated by anything within minting an ID or saving the record
Logging is never performed in controllers, and Activity is the only model that logs. Every class that logs, apart from Activity and SearchIndex, is a SoftwareAgent.