Chapter 12: Operations & Maintenance

Ongoing operational procedures, maintenance schedules, KPI monitoring, and incident response runbooks

12.1 Operations Dashboard and KPIs

Effective operation of a log security system requires continuous monitoring of key performance indicators (KPIs) that reflect the health of the evidence chain. The operations dashboard in Figure 12.1 is the standard monitoring view used by the security operations team. It displays real-time events per second (EPS), hash chain integrity, storage utilization, and certificate expiry status alongside the maintenance calendar and backup status.

Figure 12.1: Operations & Maintenance Dashboard — Dual-monitor SOC workstation showing real-time log security health metrics (EPS: 3,240, Hash Chain Integrity: 100%, Storage Utilization: 67%, Collector Health: All Green, NTP Drift: 0.2ms) alongside maintenance schedule calendar, certificate expiry countdown (45 days), and backup status (Complete - Daily). Incident response runbooks are visible on the desk.
| KPI | Value | Target | Alert |
|---|---|---|---|
| Hash Chain Integrity | 99.99% | 100% | <99.9% |
| Log Events Dropped (24h) | 0 events | 0 | >0 |
| NTP Clock Drift (max) | <50ms | <10ms | >50ms |
| System Availability | 99.95% | 99.9% | <99% |
| Collector Failover Time | <30s | <30s | >60s |
| Storage Utilization | <70% | <70% | >85% |
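
The targets and alert thresholds above can be encoded as simple predicates so that a dashboard or cron job classifies each reading consistently. The following is a minimal sketch, not the product's implementation: the key names, the `Kpi` class, and the intermediate `WARN` state (a reading that misses its target but has not reached the alert threshold) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Kpi:
    """One KPI with its target and alert predicates (hypothetical structure)."""
    label: str
    meets_target: Callable[[float], bool]
    breaches_alert: Callable[[float], bool]


# Thresholds copied from the KPI list above; the dictionary keys are illustrative.
KPIS = {
    "hash_chain_integrity_pct": Kpi("Hash Chain Integrity", lambda v: v >= 100.0, lambda v: v < 99.9),
    "events_dropped_24h": Kpi("Log Events Dropped (24h)", lambda v: v == 0, lambda v: v > 0),
    "ntp_drift_ms": Kpi("NTP Clock Drift (max)", lambda v: v < 10, lambda v: v > 50),
    "availability_pct": Kpi("System Availability", lambda v: v >= 99.9, lambda v: v < 99.0),
    "failover_time_s": Kpi("Collector Failover Time", lambda v: v < 30, lambda v: v > 60),
    "storage_utilization_pct": Kpi("Storage Utilization", lambda v: v < 70, lambda v: v > 85),
}


def classify(key: str, value: float) -> str:
    """Return ALERT, OK, or WARN for a single KPI reading."""
    kpi = KPIS[key]
    if kpi.breaches_alert(value):
        return "ALERT"
    if kpi.meets_target(value):
        return "OK"
    # Assumption: readings between target and alert threshold are flagged for review.
    return "WARN"


if __name__ == "__main__":
    readings = {"hash_chain_integrity_pct": 99.99, "storage_utilization_pct": 67, "ntp_drift_ms": 0.2}
    for key, value in readings.items():
        print(f"{KPIS[key].label}: {value} -> {classify(key, value)}")
```

Under this reading of the thresholds, a hash chain integrity value of 99.99% misses the 100% target without triggering the <99.9% alert, which is why a separate review state is useful.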

12.2 Maintenance Schedule

The maintenance schedule defines the recurring operational tasks required to maintain the integrity, availability, and compliance posture of the log security system. Tasks are organized by frequency and must be documented in the maintenance log with the name of the operator, the date and time, and the outcome of each task.

| Task | Frequency | Responsible Role | Documentation Required | Compliance Relevance |
|---|---|---|---|---|
| Review integrity alerts and EPS metrics | Daily | Security Analyst | Alert review log | PCI DSS 10.5, SOX |
| Verify NTP synchronization status | Daily | Operations | NTP status report | PCI DSS 10.6 |
| Check storage utilization and buffer levels | Daily | Operations | Capacity log | All frameworks |
| Verify backup completion status | Daily | Operations | Backup log | All frameworks |
| Review access logs for anomalies | Weekly | Security Analyst | Access review report | SOX, HIPAA, PCI DSS |
| Run hash chain integrity verification scan | Weekly | Security Analyst | Integrity scan report | All frameworks |
| Check certificate expiry dates | Weekly | Operations | Certificate inventory | All frameworks |
| Review and update RBAC role assignments | Monthly | Security Admin | Access review sign-off | SOX, PCI DSS, HIPAA |
| Test collector failover procedure | Monthly | Operations | Failover test report | All frameworks |
| Capacity planning review | Monthly | Operations + Management | Capacity plan update | All frameworks |
| Renew TLS certificates (90-day cycle) | Quarterly | Operations | Certificate renewal log | All frameworks |
| Full penetration test of log security system | Annual | External Security Firm | Penetration test report | PCI DSS, SOX, NIST |
| HSM key rotation ceremony | Annual | Security Admin + Witness | Key ceremony minutes | All frameworks |
| Full disaster recovery test | Annual | Operations + Management | DR test report | SOX, HIPAA, NIST |
| Replace UPS batteries | Every 3–5 years | Facilities | Maintenance record | Availability SLA |
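
The maintenance log requirement (operator name, date and time, outcome) lends itself to a small structured record, which also makes overdue tasks easy to detect. The sketch below is illustrative only: the `MaintenanceLogEntry` class, the `overdue_tasks` helper, and the day counts used to approximate each frequency are assumptions, not part of the system described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Assumption: calendar frequencies approximated as fixed intervals.
INTERVALS = {
    "Daily": timedelta(days=1),
    "Weekly": timedelta(days=7),
    "Monthly": timedelta(days=31),
    "Quarterly": timedelta(days=92),
    "Annual": timedelta(days=366),
}


@dataclass(frozen=True)
class MaintenanceLogEntry:
    """One maintenance log record with the fields the schedule requires."""
    task: str
    frequency: str
    operator: str
    performed_at: datetime
    outcome: str


def overdue_tasks(entries: Iterable[MaintenanceLogEntry], now: datetime) -> list[str]:
    """Return the names of tasks whose most recent entry is older than its interval."""
    latest: dict[str, MaintenanceLogEntry] = {}
    for entry in entries:
        previous = latest.get(entry.task)
        if previous is None or entry.performed_at > previous.performed_at:
            latest[entry.task] = entry
    return sorted(
        task for task, entry in latest.items()
        if now - entry.performed_at > INTERVALS[entry.frequency]
    )


if __name__ == "__main__":
    utc = timezone.utc
    log = [
        MaintenanceLogEntry("Verify NTP synchronization status", "Daily", "operator-a",
                            datetime(2024, 6, 13, 9, 0, tzinfo=utc), "In sync"),
        MaintenanceLogEntry("Check certificate expiry dates", "Weekly", "operator-b",
                            datetime(2024, 6, 10, 9, 0, tzinfo=utc), "No expiry within 30 days"),
    ]
    print(overdue_tasks(log, datetime(2024, 6, 15, 12, 0, tzinfo=utc)))
```

Tasks that have never been logged do not appear in the result; a production check would also compare the log against the full task list in the schedule.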

12.3 Incident Response Runbooks

The following runbooks provide step-by-step guidance for the most critical incident scenarios that can affect the evidence chain. Each runbook must be executed by the designated responsible role and documented in the incident management system. All runbooks are stored in printed form in the operations area for use during network outages.

Runbook 1: Hash Chain Integrity Failure

Trigger: Integrity monitoring alert (hash chain gap or mismatch detected).

Immediate actions:
(1) Do not modify any storage.
(2) Preserve the alert with its timestamp.
(3) Identify the affected segment range from the alert.
(4) Retrieve the segment from the vault and compute its hash manually using sha256sum.
(5) Compare the result with the stored hash in the manifest.
(6) If the mismatch is confirmed, escalate to the CISO and legal counsel immediately.
(7) Preserve all evidence for forensic investigation.
(8) Document all actions with timestamps in the incident log.
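
Steps (4) and (5) of Runbook 1 can also be scripted so that the comparison is repeatable and leaves no room for transcription errors. The sketch below uses Python's standard `hashlib` and produces the same hex digest as `sha256sum`; the function names are illustrative, and it assumes the manifest stores the hash as a hexadecimal string. The file is opened read-only, in keeping with step (1).

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks (read-only)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def segment_matches_manifest(path: str, manifest_hash: str) -> bool:
    """Compare the computed digest with the manifest value, ignoring case and whitespace."""
    return sha256_file(path) == manifest_hash.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify_segment.py <segment file> <hash from manifest>
    segment_path, expected = sys.argv[1], sys.argv[2]
    ok = segment_matches_manifest(segment_path, expected)
    print("MATCH" if ok else "MISMATCH: escalate per Runbook 1")
```

Run the check against a retrieved copy of the segment rather than the vault original, and record the computed digest in the incident log regardless of the outcome.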

Runbook 2: Collector Failure — Log Gap Risk

Trigger: Collector health alert (primary collector offline).

Immediate actions:
(1) Verify that the secondary collector has taken over (check EPS on the secondary).
(2) Check the primary collector's buffer disk for unsent events.
(3) If the buffer is intact, restore the primary collector and trigger buffer replay.
(4) If the buffer is lost, document the gap with start and end timestamps.
(5) Notify the compliance team of the gap.
(6) Investigate the root cause of the collector failure.
(7) Implement corrective action to prevent recurrence.
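
For step (4) of Runbook 2, the start and end of a gap can be derived from the timestamps of the events that were actually stored. The following sketch assumes a simple heuristic that is not taken from this chapter: any silence longer than a chosen `max_silence` between consecutive events counts as a gap. The `LogGap` class and `find_gaps` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable


@dataclass(frozen=True)
class LogGap:
    """A documented gap: last event before the silence and first event after it."""
    collector: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def find_gaps(timestamps: Iterable[datetime], max_silence: timedelta, collector: str) -> list[LogGap]:
    """Return every interval between consecutive events that exceeds max_silence."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_silence:
            gaps.append(LogGap(collector, previous, current))
    return gaps


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
    stored = [t0, t0 + timedelta(seconds=10), t0 + timedelta(minutes=5)]
    for gap in find_gaps(stored, timedelta(seconds=60), "primary"):
        print(f"{gap.collector}: {gap.start.isoformat()} to {gap.end.isoformat()} ({gap.duration})")
```

A suitable `max_silence` depends on the normal event rate of the source; a quiet source can produce long silences that are not gaps, so the output is a starting point for the incident record, not proof of loss.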

Runbook 3: HSM Failure — Signing Unavailable

Trigger: HSM health alert (signing operations failing).

Immediate actions:
(1) Verify HSM physical connectivity and power.
(2) Restart the HSM service.
(3) If the HSM is unresponsive, activate the backup HSM token using the approved procedure.
(4) Notify the security officer and initiate the HSM replacement procedure.
(5) Document all events in the incident log.

Note: Log ingestion continues during an HSM failure, but verification manifests are not signed until the HSM is restored. Document this gap for compliance purposes.

Runbook 4: Storage Capacity Alert — >85% Utilization

Trigger: Storage capacity alert (vault utilization exceeds 85%).

Immediate actions:
(1) Confirm that WORM-locked segments are excluded from any deletion; they cannot and must not be removed.
(2) Identify segments that have passed their retention period and can be archived or deleted.
(3) If no segments are eligible for deletion, immediately provision additional storage capacity.
(4) Update the capacity plan with the new storage requirements.
(5) Review the storage sizing calculation (Chapter 9 Calculator 1) to identify the root cause of the capacity shortfall.
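
Steps (1) to (3) of Runbook 4 amount to a filter over the segment inventory followed by a utilization projection. The sketch below is a hedged illustration: the `Segment` fields, the `worm_locked` flag, and both helper functions are assumptions about how such an inventory might be represented, not the vault's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass(frozen=True)
class Segment:
    """Hypothetical inventory record for one vault segment."""
    segment_id: str
    size_bytes: int
    retention_until: datetime
    worm_locked: bool


def eligible_for_removal(segments: Iterable[Segment], now: datetime) -> list[Segment]:
    """Segments past their retention period that are not under an active WORM lock."""
    return [s for s in segments if not s.worm_locked and s.retention_until <= now]


def projected_utilization_pct(used_bytes: int, capacity_bytes: int, removable: Iterable[Segment]) -> float:
    """Utilization after archiving or deleting the given segments."""
    freed = sum(s.size_bytes for s in removable)
    return 100.0 * (used_bytes - freed) / capacity_bytes


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2024, 6, 1, tzinfo=utc)
    inventory = [
        Segment("seg-001", 100, datetime(2024, 1, 1, tzinfo=utc), worm_locked=False),
        Segment("seg-002", 200, datetime(2024, 1, 1, tzinfo=utc), worm_locked=True),
        Segment("seg-003", 300, datetime(2025, 1, 1, tzinfo=utc), worm_locked=False),
    ]
    removable = eligible_for_removal(inventory, now)
    print([s.segment_id for s in removable])
    print(projected_utilization_pct(900, 1000, removable))
```

If the projected utilization is still above the 85% alert threshold, step (3) applies and additional capacity must be provisioned.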

Runbook 5: Unauthorized Access Attempt to Vault

Trigger: Access control alert (unauthorized access attempt to vault storage or admin functions).

Immediate actions:
(1) Preserve the access log entry with full details.
(2) Identify the source account and IP address.
(3) Immediately disable the source account pending investigation.
(4) Notify the CISO and initiate the security incident response procedure.
(5) Review all access events from the source account for the past 30 days.
(6) Verify that no vault data was accessed or modified.
(7) Document all findings in the incident report.
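
Steps (5) and (6) of Runbook 5 can be supported by a simple filter over exported access events. This is a minimal sketch under stated assumptions: the `AccessEvent` fields and the action names (`read`, `write`, `delete`) are hypothetical and would need to be mapped to the actual access log schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Assumption: these action names mark events that touched vault data.
DATA_ACTIONS = {"read", "write", "delete"}


@dataclass(frozen=True)
class AccessEvent:
    """Hypothetical shape of one exported access log entry."""
    timestamp: datetime
    account: str
    source_ip: str
    action: str
    allowed: bool


def events_for_review(events: Iterable[AccessEvent], account: str, now: datetime,
                      lookback: timedelta = timedelta(days=30)) -> list[AccessEvent]:
    """All events from one account within the lookback window, oldest first."""
    cutoff = now - lookback
    selected = [e for e in events if e.account == account and cutoff <= e.timestamp <= now]
    return sorted(selected, key=lambda e: e.timestamp)


def successful_data_access(events: Iterable[AccessEvent]) -> list[AccessEvent]:
    """Events where access to vault data was actually granted."""
    return [e for e in events if e.allowed and e.action in DATA_ACTIONS]


if __name__ == "__main__":
    now = datetime(2024, 3, 31, tzinfo=timezone.utc)
    log = [
        AccessEvent(now - timedelta(days=1), "svc-backup", "10.0.0.5", "read", False),
        AccessEvent(now - timedelta(days=40), "svc-backup", "10.0.0.5", "read", True),
    ]
    recent = events_for_review(log, "svc-backup", now)
    print(len(recent), len(successful_data_access(recent)))
```

An empty result from `successful_data_access` supports, but does not by itself prove, the conclusion in step (6); the vault's own integrity verification should confirm that nothing was modified.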
