Database-Wide Outage
Incident background
Two separate issues caused the system-wide outage. The first occurred on Monday, May 17, 2021: while recovering from bad data duplication, an erroneous removal necessitated a full restore of the production databases to their state prior to the incident. Recovery was prolonged by the volume of data (hundreds of millions of claims) and the need to reprocess the data into good working order. All issues have been resolved and no data loss has occurred. The outage lasted from Monday, May 17th until Tuesday, May 25th, though some services were restored earlier.
The second issue stemmed from a legacy system requirement that suppressed the automatic creation of disputes for PHS-Duplicate claims identified in Review until they had been invoiced to the customer. Avenue/VRC respected that requirement while newer tools did not, so the different tools displayed different results.
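The legacy rule described above can be sketched as a single shared predicate that every tool calls, so no tool can apply it differently. This is a minimal illustration only: the `Claim` type, its field names, and the function name are hypothetical and do not come from Avenue/VRC or the newer tools, and the sketch encodes the legacy rule as stated here, not the corrected rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; field names are assumptions."""
    claim_id: str
    review_finding: str  # e.g. "PHS-Duplicate"
    invoiced: bool       # True once the claim has been invoiced to the customer


def is_dispute_suppressed(claim: Claim) -> bool:
    """Return True when the legacy rule would suppress automatic dispute creation.

    Legacy rule as described: a PHS-Duplicate claim gets no automatic
    dispute until it has been invoiced to the customer.
    """
    return claim.review_finding == "PHS-Duplicate" and not claim.invoiced
```

If every consumer of the common data source imported one function like this, a change to the rule would reach all tools at once.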
Controls put in place
To prevent these issues from recurring, we have implemented the following controls.
Issue One:
- Enhancing production database access control to require peer review of code before it can be used to directly modify production data.
- Extending the existing change management ticketing system to require a peer witness when executing approved code to directly modify production data.
- Continuing to migrate away from the use of direct SQL queries to modify production data. Moving modification logic to source-controlled tools that reduce or eliminate the risk of manual error will continue to be an engineering theme.
- Reworking the reprocessing procedure to be much faster, decreasing the time to recovery.
- Adding more automatic data integrity checks to identify issues more rapidly as they occur.
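As an illustration of what one automatic integrity check might look like, the sketch below flags claim identifiers that appear more than once in a batch. It is an assumption-laden example: this report does not specify which checks were added, and the function name and the idea of keying on a claim ID are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicate_claim_ids(claim_ids: Iterable[str]) -> Dict[str, int]:
    """Return each claim ID that occurs more than once, with its count.

    Hypothetical check: assumes each claim should appear exactly once.
    """
    counts = Counter(claim_ids)
    return {claim_id: n for claim_id, n in counts.items() if n > 1}
```

A scheduled job could run a check like this after each load and raise an alert when the result is non-empty.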
Issue Two:
- Focused analysis of the discrepancies clarified the intent of the legacy requirement, and the circumstances under which it should and should not be applied were corrected.
- Data downstream from the billing code were reprocessed on 5/20 and 5/21, and a QC effort on 5/21 verified that the conflicts had been resolved.
- Our approach to data product requirements no longer allows the introduction of requirements that are not known to all consumers of common data sources. Impact analysis and clear communication of changes are continuously improving and remain first-order priorities for us.