The security operations center (SOC) plays a critical role in protecting an organization’s assets and reputation by identifying, analyzing, and responding to cyberthreats in a timely and effective manner. SOCs also help to improve the overall security posture by providing add-on services such as vulnerability identification, inventory tracking, threat intelligence, threat hunting, and log management. With all these services running under its umbrella, the SOC bears much of the burden of making the organization resilient against cyberattacks, which is why organizations need to evaluate how effective their SOC is. An effective and successful SOC should be able to justify and demonstrate its value to stakeholders.
Principles of success
Apart from revenue and profits, there are two key principles that drive business success:
- Maintaining business operations to achieve the desired outcomes
- Continually improving by bringing in new ideas or initiatives that support the overall goals of the business
The same principles apply to any organization or entity that is running a SOC, acting as a CERT, or providing managed security services to customers. So, how do we ensure that the services provided by security operations centers are meeting expectations? How do we know that continuous improvement is being incorporated into daily operations? The answer lies in measuring the SOC’s internal processes and services. By measuring the effectiveness of processes and services, organizations can assess the value of their efforts, identify underlying issues that impact service outcomes, and give SOC leadership the opportunity to make informed decisions about how to enhance performance.
Measuring routine operations
Let’s take a closer look at how we can make sure routine security operations are providing services within normal parameters to a business or subscribed customers. This is where metrics, service-level indicators (SLIs), and key performance indicators (KPIs) come into play. Metrics provide a quantitative value to measure something. KPIs set an acceptable value on a key metric to evaluate the performance of a particular internal process. SLIs provide a value for measuring service outcomes, which are eventually linked to service-level agreements (SLAs). With regard to KPIs, if the metric value falls within the range defined by the KPI, the process is deemed to be working normally; otherwise, it indicates reduced performance or possibly a problem.
The following figure clarifies the metric types generally used in SOCs and their objectives:
One important thing to understand here is that not all metrics need a KPI value. Some metrics, such as monitoring metrics, are measured for informational purposes only. They provide valuable support in tracking functional pieces of SOC operations, and their main objective is to help SOC teams forecast problems that could decrease operational performance.
The following section provides some concrete examples to reinforce understanding:
Example 1: Measuring analysts’ wrong verdicts
| Process | Metric Name | Type | Metric Description | Target |
|---|---|---|---|---|
| Security monitoring process | Wrong verdict | KPI (internal) | % of alerts wrongly triaged by the SOC analyst | 5% |
This example involves the evaluation of a specific aspect of the security monitoring process, namely the accuracy of SOC analyst verdicts. Measuring this metric can help identify critical areas that may affect the outcome of the security monitoring process. Note that this metric is an internal KPI, and the SOC manager has set a target of 5% (the target value is often set based on the existing level of maturity). If the value of this metric exceeds the established target, it suggests that the SOC analysts’ triage skills may require improvement, which is valuable insight for the SOC manager.
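The calculation behind this KPI can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `TriagedAlert` record, the idea of a separate quality review that produces a `reviewed_verdict`, and the sample data are all assumptions made for the example. The 5% target is the value from the table above.

```python
from dataclasses import dataclass


@dataclass
class TriagedAlert:
    alert_id: str
    analyst_verdict: str   # verdict the analyst assigned during triage
    reviewed_verdict: str  # verdict after a (hypothetical) quality review


def wrong_verdict_rate(alerts):
    """Percentage of alerts whose analyst verdict differs from the reviewed one."""
    if not alerts:
        return 0.0
    wrong = sum(1 for a in alerts if a.analyst_verdict != a.reviewed_verdict)
    return 100.0 * wrong / len(alerts)


def kpi_met(value_pct, target_pct):
    """The KPI is met when the measured value does not exceed the target."""
    return value_pct <= target_pct


# Made-up sample: 20 reviewed alerts, 2 of which were triaged incorrectly.
sample = [TriagedAlert(f"A-{i}", "false_positive", "false_positive") for i in range(18)]
sample += [TriagedAlert(f"A-{i}", "false_positive", "true_positive") for i in range(18, 20)]

rate = wrong_verdict_rate(sample)
print(f"Wrong verdict rate: {rate:.1f}%, KPI met: {kpi_met(rate, target_pct=5.0)}")
```

Keeping the target as a parameter rather than a constant reflects the point that target values are adjusted as the SOC matures.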
Example 2: Measuring alert triage queue
| Process | Metric Name | Type | Metric Description | Target |
|---|---|---|---|---|
| Security monitoring process | Alert triage queue | Monitoring metric | Number of alerts waiting to be triaged | Dynamic |
This case involves the assessment of a different element of the security monitoring process: the alert triage queue. Evaluating this metric can provide insights into the workload of SOC analysts. Note that this is a monitoring metric with no assigned target value; its value is treated as dynamic. If the queue of incoming alerts grows, it indicates that the analysts’ workload is increasing, and SOC management can use this information to make the necessary arrangements.
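Because a monitoring metric has no target, what matters is its trend. The following sketch assumes the queue depth is sampled at regular intervals (hourly here) and uses the average change between samples as a simple growth signal. The function names, the `tolerance` parameter, and the sample values are hypothetical.

```python
def average_change(queue_depths):
    """Average change in queue depth between consecutive samples."""
    if len(queue_depths) < 2:
        return 0.0
    return (queue_depths[-1] - queue_depths[0]) / (len(queue_depths) - 1)


def queue_is_growing(queue_depths, tolerance=0.0):
    """True when the queue grows faster than the tolerated rate per interval."""
    return average_change(queue_depths) > tolerance


# Made-up hourly samples of the number of alerts waiting to be triaged.
hourly_depths = [12, 15, 14, 20, 26]

print(f"Average change per hour: {average_change(hourly_depths):+.1f} alerts")
print(f"Queue is growing: {queue_is_growing(hourly_depths)}")
```

A real deployment would more likely read these values from the SIEM or ticketing system and smooth them over a longer window, since this sketch only looks at the first and last sample.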
Example 3: Measuring time to detect incidents
| Service | Metric Name | Type | Metric Description | Target |
|---|---|---|---|---|
| Security monitoring service | Time to detect | SLI | Time required to detect a critical incident | 30 minutes |
In this example, the effectiveness of the security monitoring service is evaluated by assessing the time required to detect a critical incident. Measuring this metric can provide insights into the efficiency of the security monitoring service for both internal and external stakeholders. It’s important to note that this metric is categorized as a service-level indicator (SLI), and the target value is set at 30 minutes. This target value represents a service-level agreement (SLA) established by the service consumer. If the time required for detection exceeds the target value, it signifies an SLA breach.
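As a rough illustration, the SLI and the SLA check can be expressed as follows. The sketch assumes that two timestamps are available for each critical incident: when the malicious activity started and when the SOC detected it. The names `time_to_detect` and `breaches_sla` and the sample timestamps are made up for the example, and the 30-minute value is the target from the table above.

```python
from datetime import datetime, timedelta

# SLA target for detecting a critical incident (from the example table).
SLA_TIME_TO_DETECT = timedelta(minutes=30)


def time_to_detect(started_at, detected_at):
    """Elapsed time between the start of the activity and its detection."""
    if detected_at < started_at:
        raise ValueError("detection cannot precede the start of the incident")
    return detected_at - started_at


def breaches_sla(ttd, sla=SLA_TIME_TO_DETECT):
    """An SLA breach occurs when detection takes longer than the agreed target."""
    return ttd > sla


# Made-up incident: activity started at 09:00 and was detected at 09:42.
started = datetime(2024, 1, 10, 9, 0)
detected = datetime(2024, 1, 10, 9, 42)

ttd = time_to_detect(started, detected)
print(f"Time to detect: {ttd}, SLA breached: {breaches_sla(ttd)}")
```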
Evaluating the everyday operations of a real-world SOC can be challenging because data is often unavailable or inadequate, and gathering metrics can be a time-consuming process. Therefore, it is essential to select suitable metrics (discussed later in the article) and to have the appropriate tools and technologies in place for automating the collection, visualization, and reporting of metric data.
The other essential element in the overall success of security operations is ‘continuous improvement’. SOC leadership should devise a program where management and SOC employees get an opportunity to create and pitch ideas for improvement. Once ideas are collected from different units of security operations, they are typically evaluated by management and team leads to determine their feasibility and potential impact on the SOC goals. The selected ideas are then converted into initiatives along with the corresponding metrics and desired state, and lastly their progress is tracked and evaluated to measure their results over a period of time. The goal of creating initiatives through ideas management is to encourage employee engagement and continuously improve SOC processes and operations. Typically, it is the SOC manager and lead roles who undertake initiatives to fix technical and performance-related matters within the SOC.
A high-level flow is depicted in the figure below:
Whether it is for routine security operations or ongoing improvement efforts, metrics remain a common parameter for measuring performance and tracking progress.
Three common problems that we often observe in real-world scenarios are:
- In the IT world, the principle of “if it’s not broken, don’t fix it” is well known, and this mentality extends to operational units as well. Many SOCs prioritize current operations and only implement changes in response to issues rather than adopting a continuous improvement approach. This reluctance to change is a bottleneck for continuous improvement.
- Absence of a structured process to gather ideas for potential improvements results in only a fraction of these ideas being presented to the SOC management, and thus, only a fraction of them being implemented.
- Absence of progress tracking for improvements – it’s not sufficient to simply generate and discuss ideas. Implementing ideas requires diligent monitoring of their progress and measuring their actual impact.
Example: Initiative to improve analyst triage verdicts
Revisiting ‘example 1’ presented in the ‘Measuring routine operations’ section, let us assume that the percentage of incorrect verdicts detected over the past month was 12%, indicating an issue that requires attention. Management has opted to provide additional training to the analysts with the goal of reducing this percentage to 5%. Consequently, the effectiveness of this initiative must be monitored for the specified duration to determine if the target value has been attained. It’s important to note that the metric, ‘Wrong verdicts’, remains unchanged, but the current value is now being used to evaluate progress towards the desired value of 5%. Once significant improvements are implemented, the target value can be adjusted to further enhance the analysts’ triage skills.
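One simple way to track such an initiative is to express each new reading as the share of the gap between the baseline and the desired value that has been closed. The sketch below uses the 12% baseline and 5% desired value from this example. The monthly readings and the `initiative_progress` helper are hypothetical.

```python
def initiative_progress(baseline, current, target):
    """Fraction of the baseline-to-target gap closed so far, clamped to [0, 1].

    Assumes a lower-is-better metric such as the wrong verdict rate.
    """
    gap = baseline - target
    if gap <= 0:
        raise ValueError("target must be below the baseline for a lower-is-better metric")
    return min(1.0, max(0.0, (baseline - current) / gap))


BASELINE_PCT = 12.0  # wrong verdicts measured over the past month
TARGET_PCT = 5.0     # desired value after additional training

# Made-up monthly readings collected after the training started.
readings = {"month 1": 10.0, "month 2": 8.5, "month 3": 5.0}

for period, value in readings.items():
    progress = initiative_progress(BASELINE_PCT, value, TARGET_PCT)
    print(f"{period}: wrong verdicts {value:.1f}%, progress {progress:.0%}")
```

Once the desired value is reached, the same function can be reused with a new baseline and a tighter target.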
Metric identification and prioritization
SOCs generally do measure their routine operations and improvements using ‘metrics’. However, they often struggle to recognize if these metrics are supporting the decision-making process or showing any value to the stakeholders. Hunting for meaningful metrics is a daunting task. The common approach we have followed in SOC consulting services to derive meaningful metrics is to understand the specific goals and operational objectives of security operations. Another proven approach is the GQM (Goal-Question-Metric) system that involves a systematic, top-down methodology for creating metrics that are aligned with an organization’s goals. By starting with specific, measurable goals and working backwards to identify the questions and metrics needed to measure progress towards those goals, the GQM approach ensures that the resulting metrics are directly relevant to the SOC’s objectives.
Let’s illustrate our approach with an example. If a SOC is acting as a financial CERT, it is likely to focus on responding to incidents related to the financial industry, tracking and recording financial threats, providing advisory services, etc. Once the principal goals of the CERT are identified, the next step is to identify metrics that directly influence the outcomes of the CERT services.
Example: Metric identification
| Goal | Question | Metric |
|---|---|---|
| Ensure participating financial institutions are informed about the latest threats | How can we determine the amount of time the CERT is taking to notify other financial institutions? | Time it takes to notify participant banks after threat discovery |
Similarly, for operational objectives, metrics are identified to track and measure the processes that support financial CERT operations. This raises the issue of prioritization, as not all metrics are equally important. When selecting metrics, it is crucial to prioritize quality over quantity, so it is recommended to limit the number of metrics collected to sharpen focus and increase efficiency. Metrics that directly support CERT goals take precedence over metrics supporting operational objectives, because ultimately it is consumers and stakeholders who evaluate the services rendered.
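The goal-question-metric chain and the prioritization rule described above can be captured in a small data structure. This is only a sketch: the `MetricDefinition` record, the `supports` field, and the `prioritize` helper are assumptions for illustration. The first catalog entry comes from the example above, while the operational entry is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    goal: str
    question: str
    metric: str
    supports: str  # "goal" or "operational_objective"


# Metrics supporting CERT goals rank ahead of those supporting operations.
PRIORITY = {"goal": 0, "operational_objective": 1}


def prioritize(metrics, limit=None):
    """Order metrics by what they support and optionally cap how many are kept."""
    ordered = sorted(metrics, key=lambda m: PRIORITY[m.supports])
    return ordered if limit is None else ordered[:limit]


catalog = [
    # Hypothetical operational metric, added only for illustration.
    MetricDefinition(
        goal="Keep threat tracking up to date",
        question="How many recorded threats are still awaiting analysis?",
        metric="Number of threat records awaiting analysis",
        supports="operational_objective",
    ),
    # Entry taken from the metric identification example.
    MetricDefinition(
        goal="Ensure participating financial institutions are informed about the latest threats",
        question="How long does the CERT take to notify other financial institutions?",
        metric="Time it takes to notify participant banks after threat discovery",
        supports="goal",
    ),
]

for m in prioritize(catalog):
    print(f"[{m.supports}] {m.metric}")
```

The optional `limit` mirrors the recommendation to keep the number of collected metrics small.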
To determine the appropriate metrics, several factors should be taken into account:
- Metrics must be aligned with the primary goals and operational objectives.
- Metrics should assist in the decision-making process.
- Metrics must demonstrate their purpose and value to both internal operations and external stakeholders.
- Metrics should be realistically achievable in terms of data collection, data accuracy, and reporting.
- Metrics must also meet the criteria of the SMART (Specific, Measurable, Actionable, Realistic, Time-based) model.
- Ideally, metrics should be automated to receive and analyze current values in order to visualize them as quickly as possible.