This page documents where exceptions are raised in Krikri and makes recommendations about how they might be changed to better represent the kinds of errors that the application is encountering, with the intention of better representing activities' statuses to users as well as improving the diagnostics available to administrators and developers.
This information has been collected to fulfill Krikri tickets #8426 ("Define Error Classes") and #8427 ("Develop Error Message Requirements ..."), and is intended to be an aid for follow-up implementation work.
The documentation comment at the head of the file for Krikri::SoftwareAgent says:
    Software Agents should handle internal errors that do not result in full activity failure, and raise a RuntimeError when the job fails. Activity handles logging of activity start/stop, and failure status.
It appears that the statement about RuntimeError was based on an assumption about what had been observed at the time, and not an attempt to prescribe a particular behavior. Considering the cases documented below, and the "Recommendations for Exceptions" below, we may want to acknowledge that our SoftwareAgents throw a range of exceptions.
Standardize exceptions that should be RuntimeError but are not. As an alternative, we could update the documented behavior above to specify a wider range of exceptions that one should expect. See below for notes on the following. Some of them have other recommendations, as well.
- Krikri::Harvesters::PrimoHarvester
- Krikri::LDP::Resource
- Krikri::MapCrossWalk::CrosswalkHashBuilder
- Krikri::ProvenanceQueryClient
We should catch the exceptions thrown by our harvester record enumerators. We could have some of these enumerators retry HTTP requests a limited number of times after a delay, where they don't already do so by default. (We need to audit the use of Faraday settings for retry, delay, and backoff.) This should probably be done in the #enumerate_records methods of the harvesters.
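The retry idea above can be sketched as follows. with_retries is a hypothetical helper, not part of Krikri's API; a real implementation might lean on Faraday's retry middleware instead.

```ruby
# Hypothetical retry helper for harvester HTTP requests (not Krikri API).
# Retries the block a limited number of times with exponential backoff,
# then re-raises the last error if it is still failing.
def with_retries(tries: 3, base_delay: 1.0, rescue_from: [StandardError])
  attempt = 0
  begin
    yield
  rescue *rescue_from => e
    attempt += 1
    raise e if attempt >= tries
    sleep(base_delay * (2**(attempt - 1))) # exponential backoff between attempts
    retry
  end
end

# Example: a page fetch that fails twice before succeeding.
attempts_made = 0
result = with_retries(tries: 3, base_delay: 0) do
  attempts_made += 1
  raise IOError, 'connection reset' if attempts_made < 3
  :page_of_records
end
```

An #enumerate_records implementation could wrap each page fetch in such a helper rather than each individual record.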
Catching exceptions from the ProvenanceQueryClient enumerator, which is what is behind mappings, enrichments, and indexings, is a little more involved. It's possible that we could try to recover some number of times from an exception raised by its enumerator, and then re-raise the exception if it's still being thrown, because the enumerator enumerates the result of a single SPARQL query. Once that query has succeeded, it is likely that each_solution and map will loop through to the end of the result. SPARQL::Client uses Net::HTTP::Persistent, and Net::HTTP buffers responses before returning them rather than streaming them by default, so we're probably safe to assume that we won't encounter errors in the enumerator more than once.
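The limited-recovery idea can be sketched like this, with fetch_solutions standing in for the buffered SPARQL query; both names are invented for illustration.

```ruby
# Sketch of limited recovery for an enumerator backed by a single SPARQL
# query. Because the response is buffered, a retry simply re-runs the query
# and restarts the enumeration from the beginning.
def enumerate_with_recovery(max_attempts: 2)
  attempts = 0
  begin
    attempts += 1
    fetch_solutions.each { |solution| yield solution }
  rescue StandardError
    retry if attempts < max_attempts
    raise # still failing after the allowed attempts: re-raise
  end
end

$query_calls = 0
# Stand-in for the SPARQL query: fails once, then returns buffered solutions.
def fetch_solutions
  $query_calls += 1
  raise IOError, 'SPARQL endpoint unavailable' if $query_calls == 1
  %i[solution_a solution_b]
end

seen = []
enumerate_with_recovery { |s| seen << s }
```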
When an Activity catches various kinds of exceptions, it should save an indication of its "stopped" state (e.g. in a new field, state), and indicate the reason why it stopped. I would like to suggest that the following classes be added to Krikri. With these classes, the Activity record could persist the JobStatus, along with, perhaps, more detailed information from the exception, if applicable.
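As a starting point, here is a hedged sketch of what such job-status classes might look like. Apart from StopRequested, which is mentioned elsewhere on this page, every name here is invented for illustration.

```ruby
# Hypothetical job-status classes; an Activity could persist Status.token
# in a `state` field, alongside details taken from the exception.
module Krikri
  module JobStatus
    class Status
      # A short token suitable for persisting on the Activity record.
      def self.token
        name.split('::').last
      end
    end

    class Running       < Status; end
    class Completed     < Status; end
    class StopRequested < Status; end
    class Stopped       < Status; end # e.g. after a SIGINT or a requested stop

    # Failed carries the exception that ended the run, for diagnostics.
    class Failed < Status
      attr_reader :error

      def initialize(error = nil)
        @error = error
      end
    end
  end
end
```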
Additionally:
- Activity#run could register a signal handler to catch SIGINT (e.g. Signal.trap('INT')). It could log the interrupt and exit.
- A SoftwareAgent's #run is passed the Activity's URI (via Activity#rdf_subject), so it should be able to periodically poll the Activity and see if its status is Krikri::JobStatus::StopRequested. It can raise an Interrupt in this case.

Information that should be reported by an Activity:
Information that should be logged by each SoftwareAgent (Harvesters, Mapper, Enricher, Indexer):
Actual error messages per Activity should be reported through a centralized logging tool, where messages are tagged with Activity IDs, and can be queried. Whichever logging aggregation tool is chosen should have an API that allows us to integrate it with the Krikri dashboard.
Expected record counts can be reported for mappings, enrichments, and indexings by calling #entity_uris on the related generator Activity. We may consider adding a column to Activity to cache this statistic so that we don't have to run a SPARQL query every time we need it.
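A minimal sketch of that caching idea, with an invented expected_record_count method and a stand-in Activity; the real #entity_uris runs a SPARQL query.

```ruby
# Memoize the expected record count so the query behind #entity_uris runs
# at most once per Activity instance; `expected_record_count` is invented.
module ExpectedRecordCount
  def expected_record_count
    @expected_record_count ||= entity_uris.count
  end
end

# Minimal stand-in for an Activity whose #entity_uris is expensive.
class FakeActivity
  include ExpectedRecordCount
  attr_reader :queries

  def initialize(uris)
    @uris = uris
    @queries = 0
  end

  def entity_uris
    @queries += 1 # count how often the "SPARQL query" actually runs
    @uris
  end
end

activity = FakeActivity.new(%w[urn:a urn:b urn:c])
count_first  = activity.expected_record_count # runs the query
count_second = activity.expected_record_count # served from the cache
```

A persisted column on Activity would extend the same idea across instances and processes.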
Counting records that failed to process involves catching and logging the errors, and then using a suitable counter implementation that is durable and efficient. Catching and logging the errors is discussed below in the Notes. The correct place in most cases is inside of the #run methods of the SoftwareAgents. Counters could be implemented as new numeric fields of Activity.
Logging URIs of records that failed to process should be fairly straightforward. It shouldn't be too much of an issue to log this in the Harvesters (with Activity IDs for searching in a centralized logging system), but the approach could differ between classes. In the mappings, enrichments, and indexings, the #rdf_subject of each record should be available to the code making the log call. The solution in all cases is assumed at the moment to be logging, plus integration of the Krikri dashboard with the centralized logging tool through the tool's API.
TODO: determine how best to log parser and DSL failures. (Consult with Tom Johnson.) Parser and DSL failures should be where we catch and possibly count failures by individual field.
An ErrorAccumulator behavior should be created and added to Activity. This behavior should be accessible to the SoftwareAgent that the Activity has run, and should have a method that updates the Activity's error statistics.
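A hedged sketch of what an ErrorAccumulator behavior might look like; the method names here are invented, and a real version would persist the tallies on the Activity record.

```ruby
# Hypothetical ErrorAccumulator behavior: tallies failures per exception
# class so a SoftwareAgent can update its Activity's error statistics.
module ErrorAccumulator
  def errors_by_class
    @errors_by_class ||= Hash.new(0)
  end

  def accumulate_error(exception)
    errors_by_class[exception.class.to_s] += 1
  end

  def error_count
    errors_by_class.values.sum
  end
end

# Stand-in Activity mixing in the behavior.
class StubActivity
  include ErrorAccumulator
end

activity = StubActivity.new
activity.accumulate_error(RuntimeError.new('boom'))
activity.accumulate_error(IOError.new('reset'))
activity.accumulate_error(IOError.new('timeout'))
```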
A FailureGate behavior could also be added to Activity to help with deciding whether to retry a failed operation, like a harvester request or a provenance query. Activities for different SoftwareAgents could have different subclasses of FailureGate.
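A hedged sketch of the FailureGate idea; the names and thresholds are invented, and a subclass per SoftwareAgent could tune them.

```ruby
# Hypothetical FailureGate: decides whether a failed operation (such as a
# harvester request or a provenance query) should be retried.
class FailureGate
  def initialize(max_failures: 3)
    @max_failures = max_failures
    @failures = 0
  end

  def record_failure
    @failures += 1
  end

  # True while we are under the failure threshold.
  def retry?
    @failures < @max_failures
  end
end

# A harvester-specific subclass could simply tune the threshold.
class HarvesterFailureGate < FailureGate
  def initialize
    super(max_failures: 5)
  end
end

gate = FailureGate.new(max_failures: 2)
gate.record_failure
first_check = gate.retry?  # one failure so far: keep retrying
gate.record_failure
second_check = gate.retry? # threshold reached: give up
```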
All Krikri log messages should be tagged with "Krikri" so that Krikri messages can be filtered from the logfile, which will contain messages from other parts of the Rails application, Krikri being but an engine inside of a larger application.
All messages of a higher priority than "info" or maybe "notice" should be tagged with the location of the line of code that called the logging method, which can be determined with caller_locations(1, 1)[0].
The basename of the file and line number should be sufficient, assuming that a backtrace will follow with more detailed information. This will make it faster to identify where to look for errors within Krikri.
In addition to the line of code and file basename, messages should be tagged with a short Git commit ID or other indicator of the file's version. There can be multiple versions of the ingestion application running at the same moment across our ingestion workers, due to the fact that application deployments leave existing workers running.
Wherever possible, messages should be tagged with the ID of the relevant data provider.
All messages related to exceptions should be tagged with the name of the exception's class (e.g. e.class.to_s), after the code location.
All messages for exceptions that are caught by a SoftwareAgent's #run method should be tagged with the Activity ID.
Instead of writing backtraces with string expansion as (for example) "#{e.backtrace}", the backtraces should be formatted for better readability. We would ideally remove stack frames closer to the bottom that are earlier than Krikri's, such as those from Rake and Resque, to clear up some visual clutter.
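The tagging and backtrace-trimming recommendations above might combine like this; format_error and krikri_frames are invented for illustration.

```ruby
# Drop stack frames below the last Krikri frame (e.g. Rake/Resque frames),
# keeping the whole backtrace if no Krikri frame is present.
def krikri_frames(backtrace)
  last = backtrace.rindex { |frame| frame.include?('krikri') }
  last ? backtrace[0..last] : backtrace
end

# Tag a message with "Krikri", the caller's file basename and line number
# (via caller_locations), and the exception's class, then append the
# trimmed backtrace one frame per line.
def format_error(exception, message)
  location = caller_locations(1, 1)[0]
  tag = "[Krikri] [#{File.basename(location.path)}:#{location.lineno}] " \
        "[#{exception.class}]"
  "#{tag} #{message}\n" + krikri_frames(exception.backtrace.to_a).join("\n")
end

begin
  raise RuntimeError, 'mapping failed'
rescue => e
  line = format_error(e, e.message)
end
```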
Configuration files should be adjusted, and dependencies should be updated to allow Heidrun to log to syslog. Krikri needs to be examined to ascertain that it can play as nicely as possible with container applications that have their own logging requirements. It should tag and otherwise format messages in the most convenient fashion for its maintainers, but should not place any undue burden on its host application. Heidrun should be examined to ensure that it has good initializers and settings files that give it good defaults for development and production, but that allow it to be configured flexibly for development of logging and messaging functionality.
These are the locations in Krikri where exceptions are raised explicitly, in the current code, as of July, 2016 (Krikri version 0.13.2).
This does not include where exceptions bubble up from dependencies.
The class of exception that is raised is indicated in parentheses.
- #join, in the proc given to Struct.new
- #with_response, in the proc given to Struct.new
- #original_record, #dpla_id, #no_origrec_message, #mint_id_fragment: when the aggregation has no OriginalRecord (NameError)
- #mint_id!, which is never called within Krikri
It appears that only #original_record ever gets invoked within Krikri, via the call to agg.original_record in Krikri::SearchResultsHelperBehavior.
- #get_identifier
- #records
Krikri::Harvester::IdentifierError is defined in Krikri::Harvester, but #get_identifier is the only place it's ever thrown, and it's handled within the same class in which it's thrown.
- #count
- #records
- #get_record
#get_record and #count should probably be RuntimeError. An exception should not be raised in #records. Processing should attempt to proceed with the next batch; the logic should be the same as if a single record fails to fetch: we should continue and assess the damage at the end of the harvest.
- #invalidate!: when #exists? returns false (RuntimeError), or when #invalidated? returns true and invalidity errors are not being ignored (RuntimeError)
- #exists?: on a Faraday::ClientError that's not an HTTP 410, re-raises it (Faraday::ClientError < StandardError). This will pass up to Activity#run and be re-raised.
- #delete!: when #exists? returns false (RuntimeError)
- #validate_not_nil
- #validate_not_node
- #validate_namespace
#exists? could re-raise the Faraday::ClientError as a RuntimeError if we want to be true to our promise to raise RuntimeError from Activity#run.
- #build_aggregation: the exception propagates through the #records enumerator and up past Harvester#run, and will be re-raised by Activity#run as a NameError instead of the documented RuntimeError.
- #process_record
- #record_uri
- #method_missing
- #value_class (ArgumentError)
These should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that is then raised, and which will be caught by Activity#run.
- #initialize: as with Krikri::MappingDSL::ParserMethods::RecordProxy#method_missing, this should be added to a Krikri::Mapper::Error and caught by Activity#run.
- #to_proc
- #[]=
- #<<
- #and
- #else (ArgumentError)
As with the MappingDSL cases above, these should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that it then raises.
- #initialize: when Nokogiri::XML#at_xpath returns a nil root node (EmptyRootNodeError). As with the MappingDSL cases above, this should be caught in Krikri::Mapping#process_record and added to the Krikri::Mapper::Error that it then raises.
- #find_by_activity: when activity_uri is not an RDF::URI (ArgumentError). This will be re-raised by Activity#run as an ArgumentError instead of the promised RuntimeError.
- #values_for_predicate: when provider_uri is not an RDF::URI (ArgumentError)
- #counts_for_predicate: when provider_uri is not an RDF::URI (ArgumentError)
- .get: when name is not registered (RuntimeError)
- .register: when name is already registered (RuntimeError)

These are the locations in the current Krikri code, as of July, 2016 (version 0.13.2), where exceptions are caught, excluding some locations that are not relevant to this analysis.
- #run: exceptions from #join in the proc given to Struct.new
- #run: exceptions from Audumbla or a record save
- #run: exceptions from #process_record
- #records: exceptions from @record_class.build
- #request_with_sets, in the proc that handles each set: OAI::Exception ([] for the affected set)
- #get_identifier: Nokogiri::XML::XPath::SyntaxError generated by Nokogiri.XML.xpath (Krikri::Harvester::IdentifierError)
- #exists?: Faraday::ResourceNotFound, Faraday::ClientError
- #map: exceptions from Registry.get or #process_record (nil)
- Agent#run, in a block passed to Krikri::Mapper.map
- #index_with_error_handling: Faraday::ClientError or other exception from Krikri::LDP::Resource#exists?

Notes:

- It is not clear whether #record_uri is ever called in Krikri.
- DPLA::MAP::Aggregation#get can throw an exception in AggregationEntityBehavior via Activity.entities.
- Repository.query_client can throw exceptions via ProvenanceQueryClient.find_by_activity.
- Krikri::Mapper.map(name, entities).each can pass an exception from OriginalRecordEntityBehavior#entities.
- Krikri::SearchIndex#index_with_error_handling catches errors, but only for the current batch of records. The Activity#entities enumerator could throw an exception that Indexer#run will fail to catch.
- Harvesters inheriting from Krikri::Harvester catch exceptions in records.each in #run, but #records itself tends to pass exceptions, which can be raised by HTTP clients.
- The PrimoHarvester raises a PrimoHarvestError exception, which extends StandardError.
- record_class.build exceptions are caught in the block for an individual record in #records.

In the cases above, exceptions that go uncaught and are passed up to Activity#run can be of a variety of exception classes, in addition to RuntimeError (and may not extend RuntimeError, as promised in the documentation of Activity#run).
These are the locations in the current Krikri code, as of July, 2016 (version 0.13.2) where logging statements are executed.
- #run
- #run
- #run
- #records
- #request_with_sets
- #index_with_error_handling (Elasticsearch::Client#bulk)
- #map: Rails.logger#error, not Krikri::Logger#log; Registry.get or #process_record, within the proc for each record
- Agent#run: Rails.logger#error, not Krikri::Logger#log
Logging is never performed in controllers, and Activity is the only model that logs.
Every class that logs, apart from Activity and SearchIndex, is a SoftwareAgent.