Why your machine learning models fail to solve cybersecurity problems: fragility second

Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated with nor sponsored by any company.

Solving cybersecurity problems with machine learning and other algorithms frequently involves a tug-of-war between data models and rule-based models, a balance between coverage and false positive rates, and a gap between a model’s novel discoveries and its integration into a product, all of which trouble data scientists in their daily work. One day, a good friend of mine in cybersecurity said to me, from the bottom of his heart:

“Only algorithms care about right or wrong, security is about cost.”

The conflict between beautiful models and limited use arises because many data science teams simply throw a half-baked product to the security operations team or to security product customers, which results in “I can’t use this model in production”, “Why should I trust your prediction?” and much other negative feedback. Applying algorithms such as machine learning to cybersecurity problems is not only the work of the algorithm itself but also a systems engineering task, with fragility arising at every stage of the process. Following the previous blog’s discussion of feature space and sample labeling, this post discusses how to understand algorithmic fragility, engineering fragility, and operational fragility within a systems engineering framework, and how to avoid their negative impact on solving cybersecurity problems. The discussion is not limited to the cybersecurity industry; most of it also applies to other industrial scenarios that employ algorithmic models.

Fragility from algorithm

In addition to common prediction quality measurements such as precision and recall, algorithms designed for the cybersecurity domain have several special requirements for robust prediction results, most notably identifying erroneous results and disposing of them appropriately. This is similar to a machine-vision-based autopilot, which must ensure that even if the model misjudges a situation, the car will not hit a wall. However, the industry has discussed this primary cause even less than other causes of algorithm fragility, such as unbalanced data and small labeled data sets.

“Wrong prediction” results exist objectively, and data scientists cannot ignore or fear them since we, the data scientists, are the only providers of these models. The seemingly perfect AUC that concludes most academic papers is only the beginning of the industry’s work. A 0.1 percent false positive rate, common in machine learning papers, can be amplified by billions of records into millions of alerts in the security operations queue, waiting to be picked up by a poor SOC guy (sorry about that), and each prediction costs operation time and human resources. Sometimes, because of negative feedback from operations or product teams about the cost of false positives, data scientists come to fear false positives and limit their thinking when building algorithms: for example, by giving up 30% or more of recall for a marginal precision improvement of 0.1% or less, by manually adding large numbers of list rules to filter model predictions and then suffering from rule maintenance, or even by giving up on a new model because it only reaches 90 percent accuracy. The most common cause of algorithm fragility is ignoring or fearing wrong prediction results.
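The amplification described above is simple arithmetic, and a quick sketch makes the operational cost concrete. The record volumes, the share of benign records, and the five minutes of analyst time per alert are illustrative assumptions, not measurements from any real SOC.

```python
def expected_false_positives(records: int, negative_fraction: float, fpr: float) -> int:
    """Expected number of benign records that the model flags as malicious."""
    return round(records * negative_fraction * fpr)


def analyst_hours(alerts: int, minutes_per_alert: float) -> float:
    """Total review time if a human looks at every alert."""
    return alerts * minutes_per_alert / 60


# Assumed numbers: 99.9% of records are benign, the false positive rate is
# 0.1%, and each alert takes five minutes to review.
for volume in (100_000_000, 1_000_000_000):
    alerts = expected_false_positives(volume, negative_fraction=0.999, fpr=0.001)
    hours = analyst_hours(alerts, minutes_per_alert=5)
    print(f"{volume:,} records -> {alerts:,} false alerts, {hours:,.0f} analyst hours")
```

Even at the smaller volume, reviewing every alert by hand adds up to thousands of analyst hours, which is why the rest of this post treats wrong predictions as something to be managed rather than eliminated.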

Instead of ignoring or fearing it, we should confront the root cause of “wrong prediction” and start thinking about how to deal with it properly. From an algorithmic point of view, the most common cause of false positives and false negatives is incomplete information about the observed security event, which is an objective disadvantage of the defensive side in cybersecurity. The attacker can design and invest heavily in attack resources from multiple perspectives, whereas the defenders can only model what they see. Even if the defender’s assets were completely covered with detection points, it would be difficult to fully analyze all of the attacker’s actions, not to mention that complete coverage is impossible. It is just like “The Blind Men and the Elephant”*: “humans have a tendency to claim absolute truth based on their limited, subjective experience as they ignore other people’s limited, subjective experiences which may be equally true”*. Modeling with incomplete information carries uncertainty, so the system as a whole must tolerate that uncertainty and must identify and deal with erroneous results.

Of course, algorithmic fragility can be caused by a variety of other factors. These include selection bias in the dataset, such as a crime prediction model that by default learns an extremely high correlation of geography and skin color with crime rates; statistical errors in sample collection and labeling, such as an insurance risk model that predicts a low marriage rate for people over 100 years old; and the unrealistic expectation that “unknown unknown”* threats can be easily solved by modeling the known, as when some commercial products claim that their AI models can detect and handle all types of APT attacks. Let us begin with the most important fragilities in the market today and work our way forward.

Fragility from engineering

The vast majority of papers do not mention the algorithm’s engineering implementation, yet the engineering needed to keep model results available is another major source of fragility. It includes the availability of upstream and downstream data, computing capacity, monitoring and recovery systems, and so on; data accessibility is also an important factor.

The majority of algorithmic models in the cybersecurity industry are based on data sources such as logs collected in-house or by customer platforms. These are produced by different dedicated teams and saved in mutually incompatible formats, from pcap to csv to a graph database to binary files on AWS S3, with different latency and reliability, and delivered through different data platforms. On top of that come various threat intelligence feeds and third-party formats, as well as differing and conflicting field definitions in the data. These data collection tasks, which are not mentioned in academic papers, are among the tricky problems that data scientists must face before modeling.

Even if a usable dataset is successfully collected and compiled, stable and timely computation is another source of engineering fragility. When we implemented ‘domain2vec’*, the parent model of DGA detection that transforms domain name sequence co-occurrence probabilities into a geometric vector space, the computing power required to process approximately 1 billion DNS records per hour posed a significant engineering challenge. Because DNS traffic can have completely different patterns in each time period, we must complete data collection and model computation within a fixed time window to avoid delayed results and a clogged computing platform.

Data quality assurance and the monitoring and recovery of the model building pipeline are also frequently overlooked. Their negative impact is not realized until a severe incident or intrusion occurs that the model could have detected and prevented, but the model was ineffective because of lost or delayed upstream data, long job queues on shared computing platforms, an incorrect version of the whitelisting policy, an out-of-memory (OOM) failure at model runtime, and so on. Some security teams and companies do not consider monitoring and recovery to be core tasks, so they invest too few resources and give them too little priority, resulting in a classic case of “A small leak will sink a great ship.”*
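As a minimal sketch of the kind of monitoring discussed above, the check below flags an upstream feed whose newest batch is late or suspiciously small before the model runs on it. The feed name, the thresholds, and the `BatchStatus` structure are all hypothetical; a real pipeline would read these values from its scheduler or data catalog.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BatchStatus:
    source: str             # hypothetical upstream feed name
    last_arrival: datetime  # when the newest batch landed
    row_count: int          # number of rows in that batch


def check_batch(status: BatchStatus, now: datetime,
                max_delay: timedelta, min_rows: int) -> list[str]:
    """Return human-readable problems; an empty list means the feed looks healthy."""
    problems = []
    if now - status.last_arrival > max_delay:
        problems.append(f"{status.source}: newest batch is older than {max_delay}")
    if status.row_count < min_rows:
        problems.append(
            f"{status.source}: only {status.row_count} rows, expected at least {min_rows}"
        )
    return problems


# Illustrative values only: a DNS feed that is both late and nearly empty.
now = datetime(2021, 8, 1, 12, 0)
dns_feed = BatchStatus("dns_logs", datetime(2021, 8, 1, 9, 30), 1_200)
for problem in check_batch(dns_feed, now, timedelta(hours=1), 1_000_000):
    print("ALERT:", problem)
```

Running such a check before each model job lets the pipeline fail loudly, or fall back to a recovery path, instead of silently producing results from incomplete data.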

Meanwhile, data barriers add further fragility. Under ideal conditions, data science teams have unrestricted access to the data they require, but compliance and regulatory requirements, as well as competing corporate interests, keep data holders and model teams from integrating perfectly, and such data barriers are an important factor in engineering fragility.

Fragility from operation

Most security operations teams lack the necessary mechanisms to process algorithmic model prediction results, which invariably raises the operational cost per case. This is the primary cause of operational fragility, and the cybersecurity domain knowledge involved makes it difficult for data science teams to provide further support. Operational fragility is determined by two major factors: whether the algorithmic model provides the information that operations require, and whether operations understand the model’s predictions.

Data scientists frequently believe that the modeling task is limited to providing prediction results: if the prediction is correct, everything is fine, and if it is not, the only thing that matters is the recall-precision trade-off. However, even correct results must be operational. Consider a model that detects a video with violent content, after which the operations team has to search the video frame by frame for more than an hour, or a model that detects a new APT C&C, after which the operations team must check hundreds of processes and files on dozens of hosts one by one, manually! Rumor has it that Amazon’s package dispatch algorithm team had to ride along in courier trucks for a month of package delivery work; likewise, data science teams cannot do effective data modeling if they keep themselves out of the operational business.

Mechanisms, toolkits, and training for operations teams have also fallen behind the data science era. Most operations teams do not triage but instead devote all of their human resources to processing detection results, so cases are assigned at random to security researchers with varying levels of experience, regardless of each case’s complexity. At the same time, the contextual information required for incident investigation and subsequent action is dispersed throughout the data system and must be queried manually, on demand, using several different tools. Security researchers also face challenges in understanding and applying the algorithm’s prediction results, such as analyzing causality, applying the predictions in products such as firewalls, and reasoning about the impact of uncertainty in the prediction. The operational anxiety caused by these challenges makes it even more difficult for security researchers to use the results of data models, and conversations between data science and security operations teams frequently end with “you just tell me should I block it or not.”

In addition to the preceding two factors, communication barriers between the security operations team and the data science team, caused by their different domain knowledge, also make the traditional “feedback loop” fail. The security team concentrates on investigation descriptions with the unique features of each event, whereas the data science team, lacking background knowledge, struggles to extract and abstract what these specific descriptions have in common for the model. The two sides always feel like they are talking about two orthogonal views of the same thing, and the discussion becomes fruitless. The accumulation of these large and small factors leads to operational fragility.

A few tips for less fragility

Addressing fragility and its consequences requires all of the following efforts, viewed from a systems engineering perspective:

Recognize incorrect results and provide explanations for both correct and incorrect ones.
Create a modern and mature data warehouse, as well as a related engineering framework, to ensure the model’s usability.
Create relevant mechanisms, tools, and training for the operation of model prediction results.

Each step in this process complements the others, and there are many real-world examples of improving the algorithm to make downstream engineering problems easier, redesigning the security architecture to reduce the difficulty of the algorithm, and so on. Please consider the following tips in the context of systems engineering in general.

Algorithms require a channel to collect error results. A common mistake is to rely solely on user feedback for error results. This does little more than annoy users: a massive number of alerts clogs the operations queue, forcing the team using the security product, or the operations team, to discard most alerts with a list of in-house rules just to keep operations bandwidth available. In this case, not only does the security team lack the time to provide feedback, but the silence may also lead the model provider to believe that the model performs perfectly, when in fact the user is simply too exhausted to care about model feedback. An appropriate feedback channel may consist of several stages:

Model feature based feedback: This is usually a heuristic rule or a machine learning model based on other features. For example, if the algorithm flags the Google home page as a spear phishing page, the result can be cross-validated with a traffic ranking model and excluded on the grounds that “high traffic sites have a low correlation with spear phishing” in most cases (disclaimer: not always!). This type of feedback uses a number of other features to supplement the detection model’s limited view of attack patterns, providing a theoretical foundation for feedback and flagging the vast majority of errors.
Association knowledge based feedback: If a prediction is correct, the results associated with it should be correct as well, unless the association is extended over so many steps that it produces an incorrect result. For example, if an algorithm predicts that a domain name is a malware C&C, the prediction can be extended by associating the IP records from its DNS lookups with sandbox runs that accessed those IPs, and then with the analysis results for those binaries from VirusTotal, other detection engines, or the security team, until the entire chain is completed. This type of feedback relies on third-party knowledge outside of the feature space for independent verification; it is slightly more expensive than model feature feedback, but it is a useful supplement to it.
User feedback: After the previous stages, only a very small number of results reach the user who needs to act on them. Users can judge the results by combining their own experience and other intelligence with the contextual information provided by the algorithm. In this step, the user feedback includes not only whether the results are correct or incorrect, but also which relevant information the users based their judgment on. In short, users must provide ‘what’ and ‘why’.
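The staged channel above can be sketched as a small router that tries the cheap automated checks first and only hands the remainder to an analyst. Everything here is a hypothetical stand-in: the lookup tables replace real traffic-ranking, DNS, and threat intelligence services, the domains and IP are documentation placeholders, and the queue names and `top_n` threshold are invented for illustration.

```python
from typing import Optional

# Hypothetical lookup tables standing in for real services.
TRAFFIC_RANK = {"google.com": 1, "login-update.example": 2_500_000}
DNS_RESOLUTION = {"login-update.example": "203.0.113.7"}
KNOWN_BAD_IPS = {"203.0.113.7"}


def feature_feedback(domain: str, top_n: int = 10_000) -> Optional[str]:
    """Stage 1: cross-check the prediction with an independent feature (traffic rank)."""
    rank = TRAFFIC_RANK.get(domain)
    if rank is not None and rank <= top_n:
        return "likely false positive: high-traffic site (not always!)"
    return None


def association_feedback(domain: str) -> Optional[str]:
    """Stage 2: follow one association step, domain -> IP -> known-bad list."""
    ip = DNS_RESOLUTION.get(domain)
    if ip in KNOWN_BAD_IPS:
        return f"supported: resolves to known-bad IP {ip}"
    return None


def route(domain: str) -> tuple[str, str]:
    """Return (queue, reason) for a domain that the detector flagged."""
    reason = feature_feedback(domain)
    if reason:
        return "auto_review", reason
    reason = association_feedback(domain)
    if reason:
        return "confirmed", reason
    # Stage 3: nothing automated applies, so a human supplies 'what' and 'why'.
    return "analyst", "no automated evidence"


for flagged in ("google.com", "login-update.example", "unseen.example"):
    print(flagged, "->", route(flagged))
```

The design choice worth noting is that every stage returns a reason alongside its decision, so the feedback that flows back to the model team carries the ‘why’ as well as the ‘what’.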

The algorithm should also make its prediction results as interpretable as possible, not just for incorrect results but also for correct ones. This includes explaining the algorithm’s features (common in deep learning models), marking and locating the basis of the judgment (which specific line of a malicious script snippet, for example), and providing contextual information about the predicted result (like the association knowledge mentioned above, such as the URL that distributed the binary, other known malicious behaviors under that URL, etc.). Here is a good example of why interpreting results matters: when it comes to the effectiveness of data models versus rule models, security operations teams clearly still prefer rule models, even though data models have nice metrics on paper in most cases. This is because operations staff can understand the model’s basis by reading the rules themselves, combine it with their own security experience and with additional research and follow-up work based on the model’s information, and eventually make a reasonable judgment. Based on this concept, the Sophos AI team open-sourced a good repo that translates the results of a machine learning model into the relevant YARA rules*, which is a very interesting use of the explain-with-features method. Please follow the reference for further reading. It is worth noting that improving the explainability of model results is more than just translating them into YARA rules; the rule model does not provide perfect interpretation, and there is no silver bullet.

Algorithms must also provide a quick way to deal with erroneous results and support partial automation, for example through appropriate triage algorithms and enough contextual information to assist operations. Triage algorithms are frequently overlooked in data science and security research, owing to a lack of discussion in academia and industry. A common scenario is that a good anomaly detection model is abandoned in production because of the large number of predicted events that must be handled, which is a huge loss for both the data science and security operations teams. Triage algorithms can instead rank the predicted results by operational priority and allocate operational resources sensibly. One example is my colleague’s presentation at Botconf 2017, Asiaee et al. “Augmented Intelligence to Scale Humans Fighting Botnets”*, which involved billions of DNS logs per hour. The system uses an anomaly detection model to output all unseen domains, ‘domain2vec’ to construct access correlations between domains, and strong correlation patterns as operational importance metrics for triage ranking. Ten million anomalous events per hour were reduced to a dozen valid clusters, and the system was successfully applied to detect malware using DGA. Triage algorithms have a number of metrics and methods, such as clustering and sorting, and they are a data science research direction closely tied to security domain knowledge, which can be discussed in later posts if readers are interested.
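To make the idea of triage by clustering concrete, here is a minimal sketch, and explicitly not the system from the Botconf talk. It assumes pairwise correlation scores between anomalous domains are already available (for instance from an embedding model), groups domains connected by strong correlations, and ranks the groups by size. The domain names, scores, and the 0.8 threshold are invented for illustration.

```python
from collections import defaultdict


def cluster_by_correlation(domains, edges, threshold=0.8):
    """Group domains linked by correlation >= threshold (connected components)."""
    graph = defaultdict(set)
    for a, b, score in edges:
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    seen, clusters = set(), []
    for start in domains:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over strongly correlated neighbors
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


def triage(domains, edges, threshold=0.8):
    """Rank clusters so the largest strongly correlated groups come first."""
    clusters = cluster_by_correlation(domains, edges, threshold)
    return sorted(clusters, key=lambda c: (-len(c), c))


# Hypothetical anomalous domains and assumed correlation scores.
domains = ["a1.example", "a2.example", "a3.example", "b1.example", "c1.example"]
edges = [
    ("a1.example", "a2.example", 0.95),
    ("a2.example", "a3.example", 0.90),
    ("b1.example", "c1.example", 0.30),  # weak link, ignored
]
for rank, cluster in enumerate(triage(domains, edges), start=1):
    print(rank, cluster)
```

An analyst then reviews one cluster at a time from the top of the list instead of one event at a time. Cluster size is only one possible importance metric; in practice the ranking would likely combine several signals chosen with security domain knowledge.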

Engineering fragility has a broader impact on the industry, and we can build a Data Quality Assurance (DQA) system along the lines of general solutions from other fields. Please see the references for DQA implementation*; there is no need to elaborate on this mature direction here.

Another cause of engineering fragility is the data barrier issue, which is becoming more prevalent in the cybersecurity industry. In addition to some open data organizations and alliances, engineering implementations of privacy-preserving data models are worth mentioning: put simply, the model can learn and predict without direct access to the raw, decrypted data. Federated Learning is a popular approach that uses a server-client architecture to protect the privacy of both the model and the data parties, allowing the model to obtain the desired features while the raw data stays with its owner. These Federated Learning approaches are common in some NDR and XDR startup products, but are currently used only in simple scenarios. In terms of federated learning implementations, FATE* has established a strong foothold through a series of open source efforts; please check out the reference for further reading. To some extent, privacy-preserving computing trades higher computational cost for a mitigation of the data barrier problem, but there is still a long way to go before the data barrier issue is fundamentally solved.
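The server-client idea can be illustrated with a toy version of federated averaging: each client trains on its own data and sends back only a model weight, and the server averages the weights. This is a sketch under strong assumptions, using a one-parameter linear model and made-up data, and it omits the encryption and secure aggregation that production frameworks typically add. It does not use or represent the FATE API.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """One client's gradient descent on its private (x, y) pairs; only w leaves the client."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, clients):
    """Server step: average the client weights, weighted by local sample counts."""
    total = sum(len(data) for data in clients)
    return sum(local_update(w_global, data) * len(data) for data in clients) / total


# Made-up private datasets, both following y = 2x; the server never sees them.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(f"learned weight: {w:.4f}")
```

The global weight approaches the shared slope even though no raw record ever leaves a client, which is the property that helps with data barriers. The cost is the extra rounds of communication and computation mentioned above.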

To resolve operational fragility, the data science team and the security operations team must work together. Once the model’s results are interpretable and the detection results are ranked by importance by the triage algorithm, the security operations team can make quick decisions and choose the next steps based on the context provided. Convenient data tools provided by the engineering team, such as an easy-to-use graph database system, can also speed up operations. We also observe that security researchers who are interested in data science can be excellent teachers, quickly and effectively giving data scientists the background knowledge they need to better understand security problems and propose data models. These efforts across the knowledge gap address operational fragility step by step.

Summary

Resolving fragility issues is a high priority in cybersecurity. As experienced cybersecurity researchers may have noticed, the above discussion of fragility also applies to other areas, such as the use of third-party threat intelligence, where, for example, threat intelligence can be further validated automatically using cloud or telecom traffic. Similarly, rule models based on security researchers’ domain knowledge and experience face fragility issues of their own: smart detection rules require sufficient explanation, a large number of outdated whitelist rules must be maintained and updated to counter the trials and guesses of the red team, model detection results lack triage and ranking, and so on. We raise the issue here in the hope of shedding light on the fragility of rule models and on the solutions devised by security researchers.

For data scientists, addressing fragility issues in a systems engineering way matters more than merely proposing effective detection models. While we debate whether rule models or data models are more useful in production, we have seen many imperfect rule or data models produce good results given good engineering implementation and reasonable operational support, with rule models and data models validating and triaging each other rather than competing. This also serves as a reminder to think beyond limitations when developing data models and to approach problems from a broader systems engineering perspective.

We have discussed the fragility of models from a systems engineering perspective in response to the needs of the cybersecurity profession, but these discussions and their general solutions can also be applied to other industries that rely on data models, such as computer vision for image and video, voice recognition, risk and fraud control, and automation. Overall, industry data models are always a systems engineering topic, and we must approach their design and solution from a systems engineering perspective.

Reference

The Blind Men and the Elephant https://en.wikisource.org/wiki/The_poems_of_John_Godfrey_Saxe/The_Blind_Men_and_the_Elephant
Blind men and an elephant https://en.wikipedia.org/wiki/Blind_men_and_an_elephant
Uncovering The “Unknown Unknowns”: Why Threat Hunting is a Security Must-Have https://www.crowdstrike.com/blog/uncovering-the-unknown-unknowns-why-threat-hunting-is-a-security-must-have/
domain2vec was mentioned in this DGA detection blog https://toooold.com/2021/07/16/lstm_dga_is_wrong.html
A small leak will sink a great ship https://idioms.thefreedictionary.com/a+small+leak+will+sink+a+great+ship
Sophos AI YaraML Rules Repository https://github.com/sophos-ai/yaraml_rules
Augmented Intelligence to Scale Humans Fighting Botnets https://www.botconf.eu/2017/augmented-intelligence-to-scale-humans-fighting-botnets/
7 Steps to Ensure and Sustain Data Quality https://towardsdatascience.com/7-steps-to-ensure-and-sustain-data-quality-3c0040591366
FATE (Federated AI Technology Enabler) https://github.com/FederatedAI/FATE