Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated nor sponsored by any companies.

The cybersecurity and risk management industry has always been viewed as a cost center, the kind of thing growth hackers despise most. At the same time, security professionals must ensure the overall security of their systems and operations in order to maintain the company’s long-term business value. From the standpoint of cybersecurity’s importance to long-term business value, we can discuss the third major reason machine learning fails to solve cybersecurity problems: nonsensical evaluation metrics. Simply put, when establishing evaluation metrics for cybersecurity data models, we sometimes lose sight of the long-term business value mindset.

The design of evaluation metrics is rarely discussed in the academic community, most likely because the problems studied in papers are general and independent of the details of specific commercial products, so relatively generic evaluation metrics suffice. Specific problems in industry, however, are more closely tied to their business value, and there is a greater need for data scientists to connect these generic metrics to product and market value.

A special note to those who have read the Chinese version https://toooold.com/2021/11/13/why_ml_fails_security_evaluation_cn.html: the two versions share the same core content but present it in different ways. At the request of my Chinese-speaking friends, I paid special attention in the Chinese version to data modeling research in the domestic market, where the research sometimes has to compromise to fit a revenue-growth mindset. Security should be considered an asset, and having a growth mindset is never a problem, but the metrics should go beyond just revenue, right?

Why good metrics are essential

Properly defined metrics give data and security models a direction toward their objectives. Metric improvements can be mapped directly to business value, driving targeted improvements to the models, while the business value they bring ensures continued investment in them. For example, a 1% increase in malware detection rate can prevent the infection of thousands of cloud hosts, and a 0.01 second reduction in WAF detection time raises the network throughput threshold of customer hosts, protecting against attacks more effectively.

Better detection, or better defense: that is the question. The cybersecurity industry must solve problems in dynamic and highly adversarial environments, and the uncertainty introduced by the attacker or the environment can make setting evaluation metrics difficult. In the evaluation of intrusion detection models, for example, if the company is not effectively attacked, the CISO is left with a list of questions: is it because my detection model is doing a good job, or because the other side has not been able to break through the first few layers of defense, or simply does not bother to attack me, or even that it has already broken through and I am unaware of it? These adversarial and dynamic environments frequently leave data science teams in the dark when developing models: on one hand, they want to detect more attacks; on the other, they want to ensure a better defense, but a better defense means fewer attacks to detect, so how do they evaluate defense metrics? The same dilemma confronts risk management teams, security and remediation teams, and so on. “Never in the field of human conflict was so much owed by so many to so few” (Winston Churchill) *. How can we improve the design and evaluation metrics of detection and defense systems, like the Royal Air Force and allied fighter crews who fought the Battle of Britain?

We have found numerous difficulties in establishing reasonable evaluation metrics: metrics that do not accurately reflect long-term business value frequently mislead the direction of data and security model research, and such metrics frequently cause conflicts between business growth and security protection. Overall, good evaluation metrics serve as an important link between good modeling efforts and their business value, effectively guiding the direction of modeling work, whereas bad metrics can send good models in a totally wrong direction and expose modeling efforts to unnecessary pressure.

The #1 mistake: metrics without objectives

Objectives first, then metrics. This is part of the foundation of data science. Unfortunately, “metrics without objectives” has contributed the most to the pool of mistakes.

In machine learning class, we have all seen a homework problem like this: why can’t we replace the loss function with precision/recall metrics during optimization? * Beyond the statistical reason of a priori vs. posteriori decision making, we can understand it intuitively: a smart agent needs the loss function to guide the direction of optimization (the objective), while precision/recall can only be used by humans to judge its decision making (the metrics); otherwise the agent could cheat by maximizing outputs in a local minimum instead of optimizing for outcomes.
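A minimal numeric sketch of this point, with made-up data: accuracy (and likewise precision/recall) is a step function of the model parameters, flat almost everywhere, so it gives an optimizer no gradient to follow, while a proper loss such as log loss still changes smoothly.

```python
import numpy as np

# A toy 1-D classifier: predict positive when w * x > 0.
# The data and weights below are invented purely for illustration.
x = np.array([0.5, 1.2, -0.7, -1.5])
y = np.array([1, 1, 0, 0])  # ground-truth labels

def accuracy(w):
    # Step-function metric: small changes in w usually flip no predictions.
    preds = (w * x > 0).astype(int)
    return (preds == y).mean()

def log_loss(w):
    # Smooth loss: every change in w moves it, giving a usable gradient.
    p = 1.0 / (1.0 + np.exp(-w * x))  # sigmoid probabilities
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

for w in (0.5, 0.6, 0.7):
    print(f"w={w}: accuracy={accuracy(w):.2f}, log_loss={log_loss(w):.4f}")
```

On this toy data, accuracy is already 1.00 for all three weights and offers no direction, while log loss keeps decreasing as w grows, which is exactly the signal gradient descent needs.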

However, smart human beings sometimes get lost in decision making and confuse objectives with metrics when they see profit. We have seen students who are proud of not being caught cheating on exams, and we have seen major apps that hand out free perks to boost new-user numbers but lack the product features to retain those users. Only with the goal of “solid knowledge” for students and “building a good product” for apps do the metrics become meaningful.

The goal of cybersecurity teams and cybersecurity products is to protect their own and their customers’ assets from attacks; different areas have different sub-goals and metrics to measure how well those goals are met. We have seen many metrics in the industry that do not accurately reflect the goal. For example, a SIEM product claims to deploy 200 detection models and generate thousands of alerts for customers, rather than reporting metrics that better reflect its business goals, such as alert fatigue reduction, ease of use for threat hunting, and fast response. A metric like “how many alerts we generate for customers” may appear easy to quantify, but its absurdity is analogous to evaluating a fire station by how many fires it puts out: a metric that loses its goal is meaningless for business value. A few other examples: a DDoS product blocks 2 Tbps attacks without mentioning the cost, until the customer receives a huge bill; a paid threat intelligence feed delivers 500 thousand new IoCs each day without mentioning the use cases or intelligence context (and it turns out to be a precomputed DGA list plus a few OSINT entries).

Metrics without objectives can appear legitimate, which makes them dangerous for modeling work, because a wrong goal can waste many resources and drive smart people further astray, sometimes letting them gamify the system. If coverage is used as the metric for a threat intelligence model, the model can assume all activity is malicious and generate a flood of events to drown the SOC team; if detection accuracy is the metric, the model is better off reporting nothing, because no prediction means no mistake; and if alert volume is the metric for a SIEM product, the model will generate a large number of alerts without screening or triaging, just enough to slow the customer down. In practice, these seemingly irrational behaviors take various forms.

The #2 mistake: metrics that look like ‘common sense’

Statistical learning models are intended to learn the target distribution’s statistical expectations, which means the algorithm is always incentivized to predict the behavior of the majority group *, because the majority group dominates the statistical expectations of the target distribution. Applying conventional precision-recall metrics as-is, rather than understanding that the algorithm prefers to look for majority-group behavior and designing metrics to fit the specific problem, not only fails to solve the problem but also raises doubts about the algorithm’s effectiveness. Most of the time the algorithm is good but used in the wrong way because of ‘common-sense’ metrics.

In cybersecurity, the frequency distribution of attacks can be extremely unbalanced, with intrusions often occurring at a rate of one in ten million or less, while the difficulty of discovering each attack varies significantly. Even a classifier that catches 90% of attacks can score worse than one that detects nothing: negative samples outnumber positive samples by several orders of magnitude, so even a tiny rate of misjudged samples is enough to drive precision to near zero. Such issues cannot be solved by unbalanced sampling methods alone; they require thinking outside the box.
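A back-of-the-envelope sketch of this base-rate problem, with invented numbers (10 intrusions among 10 million events): a detector with 90% recall and a mere 0.1% false-positive rate still drowns its true positives in false alarms, while a do-nothing "detector" posts near-perfect accuracy.

```python
# All numbers below are hypothetical, chosen only to show the scale effect.
total = 10_000_000
positives = 10                      # true intrusions: 1 in a million
negatives = total - positives

# Detector A: 90% recall, 0.1% false-positive rate on benign traffic.
true_pos = 9
false_pos = int(negatives * 0.001)  # roughly 10,000 false alarms
precision = true_pos / (true_pos + false_pos)

# Detector B: never alerts on anything.
accuracy_silent = negatives / total

print(f"precision of the 90%-recall detector: {precision:.4%}")
print(f"accuracy of the do-nothing detector:  {accuracy_silent:.4%}")
```

Detector A's precision lands below 0.1% (9 true hits among ~10,000 alerts), while detector B's accuracy exceeds 99.999%; judged by these 'common-sense' metrics alone, silence wins.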

In many cybersecurity cases, such as the discovery of zero-day vulnerabilities, APT attacks, and so on, the ground truth is missing. Demanding a recall rate from a data model in such cases, let alone a “recall rate of unknown threats,” produces a metric that can be described as “not even wrong.” “There are only two types of companies: those that have been hacked and those that will be hacked.” * We, too, cannot wait to be hacked in order to calculate a recall rate. The impact of an intrusion on the business can surface long after the fact: a data breach discovered years later, or compromised data sold on the dark web while the security team is unaware. Relying solely on certain attack benchmarks, such as in-house red team exercises, to obtain ground truth can also skew the model’s apparent effectiveness because of the limited attack scenarios covered. If a data science team commits to such a metric for any reason, the team will expend a significant amount of resources on it and ultimately fail to solve the problem.

An interesting measurement, the “unique detection rate,” is commonly used in evaluating machine-learning-based detection: ML-driven malware detection is expected to find more malicious samples than existing methods, and the same goes for ML-driven spam URL detection, etc. “Unique detection rate,” as a comparison of sample counts between machine learning methods and the current methods or intelligence feeds, seems like ‘common sense’ but makes practically no sense:

  • The business value of detected samples lies in the business assets they can affect, not in the number of samples, and the evaluation ignores the impact of detection timing.
  • Results from rule models or third-party threat feeds whose quality has not been assessed are insufficient as denominators for calculating unique detection rates.
  • The two types of models use different features, yet the results of rule models and machine learning models frequently overlap significantly; evaluating only the machine learning models while ignoring the unique detection metrics of the rule models frequently leads to debates about the fairness of the evaluation.
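A toy illustration of the last point, with made-up sample sets: the “unique rate” depends entirely on which side is treated as the baseline, which is why the evaluation so easily turns into a fairness debate.

```python
# Hypothetical detection results; the sample names are invented.
ml_detections   = {"a", "b", "c", "d", "e"}        # from the ML model
rule_detections = {"c", "d", "e", "f", "g", "h"}   # from rules / a threat feed

unique_to_ml    = ml_detections - rule_detections  # what only ML found
unique_to_rules = rule_detections - ml_detections  # what only rules found
overlap         = ml_detections & rule_detections  # found by both

print(f"ML 'unique rate' vs rules:  {len(unique_to_ml) / len(ml_detections):.0%}")
print(f"rules 'unique rate' vs ML:  {len(unique_to_rules) / len(rule_detections):.0%}")
print(f"overlap of the two systems: {len(overlap)} samples")
```

Here the ML model is 40% “unique” against the rules, but the rules are 50% “unique” against the ML model; the number says nothing about which detections actually mattered to business assets, or which came first.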

Honorable mention: good metrics, wrong problem

We have also observed many cases where cybersecurity problems simply cannot be approached with machine learning or artificial intelligence as framed, such as using third-party intelligence to detect unknown APT attacks; developing log-based threat discovery while ignoring the data collection and data warehousing efforts required; or problems that require massive investments unsupported by existing resources, most often at companies eager to build their own in-house malware detection. All of these problems may have clear metric definitions, but the objective itself leads to the wrong problem, setting the data science team up for failure.

A few tips for metrics design

From the bloody lessons we have learned, here are a few tips.

All metrics must be designed in tandem with objectives. The objective defines the problem’s scope, and only within that scope can a reasonable metric be proposed. We must always ensure that the objective comes first and that metrics are only a means to that end. Instead of rushing to a metric that appears to make sense, we must understand the business need and set objectives from it, and data scientists must clearly recognize objective-less metrics when they appear and provide timely feedback to say “NO!”

When framing a problem and setting its objectives, it is important to consider whether the goal is too big or too small, whether the scenario is appropriate for the solution, and whether the solution’s goals fit within a reasonable resource budget. It helps to refer to common industry practice in the field and allocate resources based on the current situation.

“Unique detection” is in general a very bad metric: it pits data models and rule models against each other, or leans on external sourcing, while ignoring the impact of detected samples on business assets, detection latency, and other factors. We do not recommend using unique detection rate as the primary metric for a detection model. Instead, look at the intersection and union of detection results to see overall coverage and the impact on assets; if models must be compared, compare their detection latency. We should also keep in mind that the machine learning model exists to improve on the rule model’s results, so the target should be “greater than zero” rather than 10% more or 50% more, with the cost of iterative updates taken into account.
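The alternatives above can be sketched in a few lines. This is a minimal, hypothetical example: the IoC names, asset counts, and timestamps are all invented, and a real evaluation would pull them from detection and asset-inventory pipelines.

```python
from datetime import datetime, timedelta

# Hypothetical records: IoC -> (affected asset count, detection time).
first_seen = datetime(2021, 11, 1, 0, 0)
ml_hits   = {"ioc1": (50, first_seen + timedelta(hours=2)),
             "ioc2": (3,  first_seen + timedelta(hours=1))}
rule_hits = {"ioc1": (50, first_seen + timedelta(hours=6)),
             "ioc3": (12, first_seen + timedelta(hours=3))}

# Overall coverage: the union of both systems, weighted by affected assets
# (for IoCs seen by both, the asset count is the same either way).
all_hits = {**rule_hits, **ml_hits}
assets_covered = sum(n for n, _ in all_hits.values())

# Compare models on detection latency over the overlap, not raw uniqueness.
for ioc in ml_hits.keys() & rule_hits.keys():
    lead = rule_hits[ioc][1] - ml_hits[ioc][1]
    print(f"{ioc}: ML detected it {lead} earlier than the rules")

# "Greater than zero": the incremental assets ML protects beyond the rules.
incremental = sum(n for ioc, (n, _) in ml_hits.items() if ioc not in rule_hits)
print(f"union covers {assets_covered} assets; ML adds {incremental} beyond rules")
```

In this toy data the ML model’s “unique detection” is a single IoC, yet it detected the shared IoC four hours earlier and adds a nonzero 3 assets of coverage, which is exactly the kind of evidence the “greater than zero” framing asks for.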

What if there is no ground truth or red team test at all? In the absence of ground truth, which is often the case, we should deploy as many anomaly detections as possible and evaluate recall through the work of explaining why those anomalies occur: “how many anomalies can be explained?” can be a better metric. In the absence of a red team test, attack detection metrics can be evaluated by the attack surface the defender covers for the business assets. We must also proactively and promptly adjust our evaluation strategies in the dynamic adversarial environment of cybersecurity.
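A sketch of the “how many anomalies can be explained?” metric. The anomaly records and explanation strings below are invented; in practice they would come from an anomaly detector and the SOC’s triage notes.

```python
# Hypothetical triage output: each anomaly carries its explanation, or None
# if the team could not account for it yet.
anomalies = [
    {"id": 1, "explanation": "new admin account created off-hours"},
    {"id": 2, "explanation": None},                     # still unexplained
    {"id": 3, "explanation": "beaconing to a known DGA domain"},
    {"id": 4, "explanation": "scheduled backup job (benign)"},
]

explained = [a for a in anomalies if a["explanation"] is not None]
explained_rate = len(explained) / len(anomalies)
print(f"explained-anomaly rate: {explained_rate:.0%}")
```

Note that a benign explanation still counts as explained: the metric rewards the team for accounting for anomalies, not for declaring them all attacks, which avoids the gamification problem discussed earlier.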


Good metrics can effectively enable data and security models to demonstrate business value in their respective business domains, and we must establish reasonable metrics that meet our objectives. Data science teams must also understand that algorithms are always motivated to predict the behavior of the majority group, and that evaluation metrics must be rationally designed to capitalize on the strengths of algorithmic models.

Reasonable metrics can also help avoid unnecessary or incorrect model optimization. Whether or not the model’s optimization goal makes sense, a smart data scientist can do a fantastic job of optimizing for it; optimization against unreasonable metrics only compounds the errors, and the eventual business losses and team frustration are far more expensive to smooth out.