Error Classes and Messaging Requirements
This page documents where exceptions are raised in Krikri and makes recommendations about how they might be changed to better represent the kinds of errors that the application is encountering, with the intention of better representing activities' statuses to users as well as improving the diagnostics available to administrators and developers.
This information has been collected to fulfill Krikri ticket #8426 ("Define Error Classes") and #8427 ("Develop Error Message Requirements ..."), and is intended to be an aid for followup implementation work.
Documented Behavior
The documentation comment at the head of the file for Krikri::SoftwareAgent says:
Software Agents should handle internal errors that do not result in full activity failure, and raise a
RuntimeErrorwhen the job fails.Activityhandles logging of activity start/stop, and failure status.
It appears that the statement about RuntimeError was based on an assumption about what had been observed at the time, and not an attempt to prescribe a particular behavior. Considering the cases documented below, and the "Recommendations for Exceptions" below, we may want to acknowledge that our SoftwareAgents throw a range of exceptions.
Recommendations
Recommendations for Exceptions
Standardize exceptions that should be RuntimeError but are not. As an alternative, we could update the documented behavior above to specify a wider range of exceptions that one should expect. See below for notes on the following. Some of them have other recommendations, as well.
Krikri::Harvesters::PrimoHarvesterKrikri::LDP::ResourceKrikri::MapCrossWalk::CrosswalkHashBuilderKrikri::ProvenanceQueryClient
We should catch the exceptions thrown by our harvester record enumerators. We could have some of these enumerators re-try HTTP requests a limited number of times after a delay, where they don't already by default. (We need to audit the use of Faraday settings for retry, delay, backoff.) This should probably be done in the #enumerate_records methods of the harvesters.
Catching exceptions from the ProvenanceQueryClient enumerator, which is what is behind mappings, enrichments, and indexings, is a little more involved. It's possible that we could try to recover some number of times from an exception raised by its enumerator, and then reraise the exception if it's still being caught, because the enumerator enumerates the result of a single SPARQL query. Once it has succeeded, it is likely that the each_solution and map will loop through to the end of the result. SPARQL::Client uses Net::HTTP::Persistent, and Net::HTTP defaults to buffering responses before returning a response; it does not stream responses by default so we're probably safe to assume that we won't encounter errors in the enumerator more than once.
When an Activity catches various kinds of exceptions, it should save an indication of its "stopped" state (e.g. in a new field, state), and indicate the reason why it stopped. I would like to suggest that the following classes be added to Krikri:
Krikri::JobStatus::New
The Activity's job has not been started yet.
Krikri::JobStatus::Running
The Activity's job is assumed to be running
Krikri::JobStatus::StopRequested
The Activity's job is assumed to be running
The Activity's job should `exit` at the next opportunity
Krikri::JobStatus::Stopped
Krikri::JobStatus::Interrupted < Krikri::JobStatus::Stopped
The job was interrupted on purpose, e.g. with SIGINT
Krikri::JobStatus::Failed < Krikri::JobStatus::Stopped
The job stopped due to error
Krikri::JobStatus::Complete < Krikri::JobStatus::Stopped
The job completed successfully
With these classes, the Activity record could persist the JobStatus, along with, perhaps, more detailed information from the exception, if applicable.
Additionally:
We could have
Activity#runregister a signal handler to catch SIGINT (e.g.Signal.trap('INT')). It could log the interrupt andexit.Each
SoftwareAgent's#runis passed the Activity's uri (viaActivity#rdf_subject), so it should be able to periodically poll the Activity and see if its status isKrikri::JobStatus::StopRequested. It can raise anInterruptin this case.
Messaging Requirements and Recommendations
Information that should be reported by an Activity:
Counts of records expected and the counts processed
See below in this section
Counts of records that failed to process
See below in this section
Actual error messages per Activity
As noted below, this should actually be delegated to a centralized logging system.
Fields failed to map or enrich, and error counts by field
See below in this section. Requires more planning.
Logging messages by field is related to catching parser and DSL failures, below.
Information that should be logged by each SoftwareAgent (Harvesters, Mapper, Enricher, Indexer):
The OriginalRecord URIs of records that failed to map, enrich, or index. The URLs of harvester requests that produced failures.
HTTP errors, including those arising from Marmotta requests.
Empty metadata or root parser nodes
Parser or DSL failures
Interrupt signals. (SIGINT – see above)
Actual error messages per Activity should be reported through a centralized logging tool, where messages are tagged with Activity IDs, and can be queried. Whichever logging aggregation tool is chosen should have an API that allows us to integrate it with the Krikri dashboard.
Expected record counts can be reported for mappings, enrichments, and indexings by calling #entity_uris on the related generator Activity. We may consider adding a column to Activity to aggregate this statistic so that we don't have to run a SPARQL query every time we need it.
Counts of records that failed to process involves catching and logging the errors, and then using a suitable counter implementation that is durable and efficient. Catching and logging the errors is discussed below in the Notes. The correct place in most cases is inside of the #run methods of the SoftwareAgents. Counters could be implemented as new numeric fields of Activity.
Logging URIs of records that failed to process should be fairly straightforward. It shouldn't be too much of an issue to log this in the Harvesters (with Activity IDs for searching in a centralized logging system), but could differ between classes. In the mappings, enrichments, and indexings, the #rdf_subject of each record should be available to the code making the log call. The solution in all cases is assumed at the moment to be logging and integration of the Krikri dashboard with the centralized logging tool through the tool's API.
TODO: determine how best to log parser and DSL failures. (Consult with @Tom Johnson (Unlicensed)) Parser and DSL failures should be where we catch and possibly count failures by individual field.
An ErrorAccumulator behavior should be created and added to Activity. This behavior should be accessible to the SoftwareAgent that the Activity has run, and should have a method that updates the Activity's error statistics.
A FailureGate behavior could also be added to Activity to help with deciding whether to retry a failed operation, like a harvester request or a provenance query. Activities for different SoftwareAgents could have different subclasses of FailureGate.
Recommendations for Logging
All Krikri log messages should be tagged with "Krikri" so that Krikri messages can be filtered from the logfile, which will contain messages from other parts of the Rails application, Krikri being but an engine inside of a larger application.
All messages of a higher priority than "info" or maybe "notice" should be tagged with the location of the line of code that called the logging method, which can be determined with caller_locations(1, 1)[0]. The basename of the file and line number should be sufficient, assuming that a backtrace will follow with more detailed information. This will make it faster to identify where to look for errors within Krikri.
In addition to the line of code and file basename, messages should be tagged with a short Git commit ID or other indicator of the file's version. There can be multiple versions of the ingestion application running at the same moment across our ingestion workers, due to the fact that application deployments leave existing workers running.
Wherever possible, messages should be tagged with the ID of the relevant data provider.
All messages related to exceptions should be tagged with the name of the exception's class (e.g. e.class.to_s), after the code location.
All messages that are caught by a SoftwareAgent's #run method should be tagged with the Activity ID.
Instead of writing backtraces with string expansion as (for example) "#{e.backtrace}", the backtraces should be formatted for better readability. We would ideally remove stack frames closer to the bottom that are earlier than Krikri's, such as those from Rake and Resque, to clear up some visual clutter.
Configuration files should be adjusted, and dependencies should be updated to allow Heidrun to log to syslog. Krikri needs to be examined to ascertain that it can play as nicely as possible with container applications that have their own logging requirements. It should tag and otherwise format messages in the most convenient fashion for its maintainers, but should not place any undue burden on its host appliation. Heidrun should be examined to ensure that it has good initializers and settings files that give it good defaults for development and production, but that allow it to be configured flexibly for development of logging and messaging functionality.
Notes
Where Exceptions are Raised
These are the locations in Krikri where exceptions are raised explicitly, in the current code, as of July, 2016 (Krikri version 0.13.2)
This does not include where exceptions bubble up from dependencies.
The class of exception that is raised is indicated in parentheses.
in Krikri::AsyncUriGetter.Request (Unused, so of limited concern)
in
#joinin the proc given toStruct.newwhen a thread for a new asynchronous http request can't be joined (re-raises derived class of StandardError)
in
#with_responsein the proc given toStruct.newwhen an HTTP request fails (re-raises Faraday exception)
in DPLA::MAP::Aggregation, monkeypatched in Krikri::Engine
in
#original_recordwhen an OriginalRecord is empty or is a blank node. (NameError)
in
#dpla_idwhen trying to get DPLA ID for non-DPLA URI (Krikri::Engine::NamespaceError)
only called by
#no_origrec_message
in
#mint_id_fragmentwhen asked to mint an ID for an empty or blanknode
OriginalRecord(NameError)only called by
#mint_id!,which is never called within Krikri
NOTE: These are of limited concern, because only
#original_recordever gets invoked within Krikri, via the call toagg.original_recordinKrikri::SearchResultsHelperBehavior.
in Krikri::Harvesters::OAIHarvester
in
#get_identifierwhen encountering an XML syntax error trying to find the record ID (Krikri::Harvester::IdentifierError)
when the ID element parsed from the document is empty (Krikri::Harvester::IdentifierError)
caught and handled in
#records.NOTE: Of limited concern.
Krikri::Harvester::IdentifierErroris defined inKrikri::Harvester, but#get_identifieris the only place it's ever thrown, and it's handled within the same class in which it's thrown.
in Krikri::Harvesters::PrimoHarvester
in
#countwhen receiving a non-200 HTTP status from the endpoint for initial request to retrieve record count (PrimoHarvesterError)
in
#recordswhen receiving a non-200 HTTP response from request for a batch of records (PrimoHarvesterError)
in
#get_recordwhen receiving a non-200 HTTP response from request for one record (PrimoHarvesterError)
NOTE: The exceptions in
#get_recordand#countshould probably be RuntimeError. An exception should not be raised in#records. Processing should attempt to proceed with the next batch – the logic should be the same as if a single record fails to fetch: we should continue and assess the damage at the end of the harvest.
in Krikri::LDP::Invalidatable
in
#invalidate!when
#exists?returnsfalse(RuntimeError)when
#invalidated?returnstrueand invalidity errors are not being ignored (RuntimeError)
in Krikri::LDP::Resource
in
#exists?when encountering a
Faraday::ClientErrorthat's not an HTTP 410, re-raises it (Faraday::ClientError < StandardError)This appears to be something that would get passed upwards to
Activity#runand be re-raised.
in
#delete!when
#exists?returnsfalse(RuntimeError)
in
#validate_not_nilwhen a Resource has a nil RDF subject (RuntimeError)
in
#validate_not_nodewhen a Resource has a blanknode RDF subject (RuntimeError)
in
#validate_namespacewhen a Resource has an RDF subject in the wrong namespace (RuntimeError)
NOTE: We should have
#exists?re-raise the Faraday::ClientError as aRuntimeErrorif we want to be true to our promise to raiseRuntimeErrorfromActivity#run.
in Krikri::MapCrosswalk::CrosswalkHashBuilder
in
#build_aggregationwhen the parent aggregation is a blank node (NameError)
If this is thrown, it will be passed up from a Harvester's
#recordsenumerator and up pastHarvester#run, and will be re-raised byActivity#runas aNameErrorinstead of the documentedRuntimeError.
NOTE: Is NameError the ideal class for this exception? See http://ruby-doc.org/core-2.3.1/NameError.html
in Krikri::Mapping
in
#process_recordwhen an error is thrown handling any of the properties of the record (Krikri::Mapper::Error < RuntimeError)
in Krikri::MappingDSL::ParserMethods
in lambda returned by
#record_uriwhen record to be parsed has not been saved (RuntimeError)
in Krikri::MappingDSL::ParserMethods::RecordProxy
in
#method_missingwhen the arity of the call does not match the method on
#value_class(ArgumentError)This should be caught in
Krikri::Mapping#process_recordand added to theKrikri::Mapper::Errorthat is then raised, and which will be caught byActivity#run.
in Krikri::MappingDSL::PropertyDeclaration
in
#initializewhen a block with arity other than 1 is passed (ArgumentError)
As with
Krikri::MappingDSL::ParserMethods::RecordProxy#method_missing, this should be added to aKrikri::Mapper::Errorand caught byActivity#run.
in Krikri::MappingDSL::RdfSubjects::SubjectDeclaration
in
#to_procwhen multiple URIs are returned from [what is the object?] (RuntimeError)
in Krikri::Parser::Value::ValueArray
in
#[]=when value is not a Value (InvalidParserValueError)
in
#<<when value is not a Value (InvalidParserValueError)
in
#andwhen trying to return unbound variable (ArgumentError)
in
#elsewhen no block given for
else(ArgumentError)
NOTE: As with the
MappingDSLcases above, these should be caught inKrikri::Mapping#process_recordand added to theKrikri::Mapper::Errorthat it then raises.
in Krikri::XmlParser
in
#initializewhen
Nokogiri::XML#at_xpathreturns anilroot node (EmptyRootNodeError)As with the
MappingDSLcases above, these should be caught inKrikri::Mapping#process_recordand added to theKrikri::Mapper::Errorthat it then raises.
in Krikri::ProvenanceQueryClient
in
#find_by_activitywhen
activity_uriis not anRDF::URI(ArgumentError)This will be caught and re-raised by
Activity#runasArgumentErrorinstead of the promisedRuntimeError.
in Krikri::QAQueryClient
in
#values_for_predicatewhen
provider_uriis not anRDF::URI(ArgumentError)
in
#counts_for_predicatewhen
provider_uriis not anRDF::URI(ArgumentError)
in Krikri::Registry
in
.getwhen
nameis not registered (RuntimeError)
in
.registerwhen
nameis already registered (RuntimeError)
Where Exceptions Are Caught
These are the locations in the current Krikri code, as of July, 2016 (version 0.13.2) where exceptions are caught; excluding some locations that are not relevant to this analysis.
in Krikri::Activity
in
#runwhen encountering any exception that has bubbled up
re-raise it after logging
in Krikri::AsyncUriGetter.Request (Unused)
in
#joinin the proc given toStruct.newwhen encountering any exception that has bubbled up from joining to the HTTP request thread
re-raise unless "inline exceptions" are being implemented. In the latter case, the exception is swallowed.
in Krikri::Enricher
in
#runwhen encountering any exception bubbled up from
Audumblaor a record savelog and continue
in Krikri::Harvester
in
#runwhen encountering any exception bubbled up from a derived harvester's
#process_record.log and continue
in Krikri::Harvesters::OAIHarvester
in
#recordswhen encountering an exception from
@record_class.buildlog and continue
in
#request_with_sets, in the proc that handles each setwhen encountering an
OAI::Exceptionlog and return
[]for the affected set
in
#get_identifierwith a
Nokogiri::XML::XPath::SyntaxErrorgenerated byNokogiri.XML.xpathre-raise as
Krikri::Harvester::IdentifierError
in Krikri::LDP::Resource