Bixal Common Contingency Plan#

Applicability#

Note: This Contingency Plan applies only to systems for which Bixal has negotiated and defined Incident Response/Contingency Plan (IRCP) operations. Each IRCP-managed system will have a specific, tailored version of this Contingency Plan or, in some cases, a completely unique Contingency Plan. All Bixal employees are aware of the procedures outlined herein.

Overview#

This Contingency Plan provides baseline guidance for the Bixal Team when managing the disruption, compromise, or failure of any component of a Bixal IRCP-managed system, product, or service ("system"). As a general guideline, we consider "disruption" to mean unexpected downtime or significantly reduced service lasting longer than:

Scenarios where that could happen include unexpected downtime of key services, system data loss, or improper privilege escalation. In the case of a security incident, the team uses the Security Incident Response Plan as well.

Some clients will create and maintain a Contingency Plan defining procedures specific to their system. In such a case, the client-specific Contingency Plan takes precedence.

Recovery objective#

Short-term disruptions lasting less than 30 minutes are outside the scope of this plan.

More than 3 hours of any system being offline during standard U.S. business hours (0900 - 2100 Eastern Time) is considered unacceptable. Our objective is to recover from any significant problem (disruption, compromise, or failure) within that span of time.
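The two thresholds above can be sketched as a small classifier. This is only an illustrative sketch: the function name and the three labels are hypothetical, and for simplicity it ignores the business-hours window (0900 - 2100 Eastern Time) and looks only at the total length of the disruption.

```python
from datetime import timedelta

# Thresholds taken from this plan.
OUT_OF_SCOPE_LIMIT = timedelta(minutes=30)  # shorter disruptions are out of scope
RECOVERY_OBJECTIVE = timedelta(hours=3)     # target for recovering the system


def classify_disruption(duration: timedelta) -> str:
    """Return a hypothetical label for a disruption of the given length.

    Assumption: the labels are illustrative and not defined by the plan.
    """
    if duration < OUT_OF_SCOPE_LIMIT:
        return "out_of_scope"
    if duration <= RECOVERY_OBJECTIVE:
        return "within_objective"
    return "objective_missed"
```

For example, `classify_disruption(timedelta(hours=4))` falls into the last branch, because it exceeds the 3-hour recovery objective.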

Incident Response Team information#

Contact information#

Team contact information is available in the ICPR Group in Active Directory:

Contingency plan outline#

Activation and notification#

The first Incident Response Team member who notices or reports a potential contingency-plan-level problem becomes the Incident Commander (IC) until recovery efforts are complete or the Incident Commander role is explicitly reassigned.

If the problem is identified as part of a security incident response situation (or becomes a security incident response situation), the same Incident Commander (IC) should handle the overall situation, since these response processes must be coordinated.

The IC first notifies and coordinates with the people who are authorized to decide that the system is in a contingency plan situation:

The IC keeps a log of the situation within a client-specific Teams channel, JIRA ticket, or GitHub issue. If this is also a security incident, the IC also follows the security incident communications process. The IC should delegate assistant ICs for aspects of the situation as necessary.
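As an illustration of what such a log might capture, here is a minimal sketch of a timestamped incident log that renders to plain text suitable for pasting into a Teams channel, JIRA ticket, or GitHub issue. The class names, fields, and output format are all assumptions made for this example; the plan does not prescribe a log format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    timestamp: datetime
    author: str
    note: str


@dataclass
class IncidentLog:
    # Hypothetical fields; adapt to the client-specific plan.
    system: str
    incident_commander: str
    is_security_incident: bool = False
    entries: list = field(default_factory=list)

    def add(self, author, note, timestamp=None):
        """Append an entry, defaulting to the current time in UTC."""
        when = timestamp or datetime.now(timezone.utc)
        self.entries.append(LogEntry(when, author, note))

    def render(self):
        """Return the log as plain text, one line per entry."""
        lines = [f"Incident log for {self.system} (IC: {self.incident_commander})"]
        for entry in self.entries:
            lines.append(
                f"{entry.timestamp:%Y-%m-%d %H:%M} UTC - {entry.author}: {entry.note}"
            )
        return "\n".join(lines)
```

Keeping each entry timestamped and attributed makes the later postmortem easier, since the sequence of decisions can be reconstructed from the log alone.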

Recovery#

The Incident Response Team assesses the situation and works to recover the system. See the list of external dependencies for procedures for recovery from problems with external services.

If this is also a security incident, the IC also follows the security incident assessment and remediation processes.

If the IC assesses that the overall response process is likely to last longer than 3 hours, the IC should organize shifts so that each responder works on the response for no longer than 3 hours at a time. This includes the IC, who should hand off the role to a new IC after 3 hours.
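The shift rule can be sketched as a simple round-robin schedule. This is a hypothetical helper, not part of the plan: it assumes at least two responders are available so that nobody works two consecutive shifts, and it splits the expected response window into blocks of at most 3 hours.

```python
from datetime import datetime, timedelta

MAX_SHIFT = timedelta(hours=3)  # per-responder limit stated in this plan


def plan_shifts(start, expected_end, responders):
    """Split [start, expected_end) into shifts of at most MAX_SHIFT.

    Responders are assigned round-robin. Assumption: at least two
    responders are required so that consecutive shifts never go to
    the same person.
    """
    if len(responders) < 2:
        raise ValueError("at least two responders are required to rotate shifts")
    shifts = []
    shift_start = start
    index = 0
    while shift_start < expected_end:
        shift_end = min(shift_start + MAX_SHIFT, expected_end)
        shifts.append((responders[index % len(responders)], shift_start, shift_end))
        shift_start = shift_end
        index += 1
    return shifts
```

For a response expected to last 7 hours with two responders, this yields three shifts (3 hours, 3 hours, 1 hour), alternating between the two people.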

Reconstitution#

The Incident Response Team tests and validates the system as operational.

The Incident Commander declares that recovery efforts are complete and notifies all relevant people. The last step is to schedule a postmortem to discuss the event. This is the same as the security incident retrospective process.

External dependencies#

Bixal Solutions-managed systems often depend on several external services. In the event one or more of these services has a long-term disruption, the team will mitigate impact by following this plan. Zero or more of the following services may be involved:

Bitbucket#

If Bitbucket becomes unavailable, applications that are not hosted on Acquia Cloud will continue to operate in their current state. The disruption would only impact the team's ability to update code on the instances.

JIRA#

There is no direct impact to the platform if a disruption occurs. Primary incident communications will move to the project's Microsoft Teams channel.

Office365#

There is no direct impact to the platform if a disruption occurs. Primary incident communications will move to SMS and phone communications.

AWS#

In case of a significant disruption, after receiving approval from our Authorizing Official, the Bixal Solutions team will deploy a new instance of the entire system to a different region.

Acquia Cloud Enterprise (ACE) Platform as a Service (PaaS)#

Projects in this category are hosted on the Acquia Cloud Enterprise (ACE) PaaS (https://cloud.acquia.com/app/develop), which is layered on top of the Amazon Web Services (AWS) FedRAMP-certified cloud in the us-east region. See ACE Status and AWS status.

Acquia Cloud takes hourly snapshots of EBS volumes and saves them to Amazon S3, which stores the data across geographically distributed data centers.

In case of a significant disruption, after receiving approval from our Authorizing Official, the Bixal Solutions and Acquia teams will deploy a new instance of the entire system to a different region.

How this document works#

This plan is most effective if all Bixal Solutions team members know about it, remember that it exists, have the ongoing opportunity to give input based on their expertise, and keep it up to date.


