Why your machine learning models fail to solve cybersecurity problems: being narrow-minded (part four)

Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated nor sponsored by any companies.

Link to the Chinese version https://toooold.com/2021/11/28/why_ml_fails_security_ml_is_not_everything_cn.html

After covering the first three reasons (wrong features, fragility, and nonsense metrics), we arrive at the last post in this blog series. Being narrow-minded about problem solving is yet another reason why machine learning models fail to solve cybersecurity problems. Sometimes a machine learning method is applied incorrectly, and sometimes the method and the problem simply do not fit.

Deep learning doesn’t work everywhere
Machine learning is the tip of the iceberg in AI
Have you thought of any simpler approaches?

Nope, deep learning is not your solution

We see many cases where people assume machine learning means only deep learning. Deep neural networks have shown a powerful advantage in representation learning for areas such as images and text, and transfer learning lets features learned on one problem be reused on others, which sparks many ideas for cybersecurity. We all wonder whether this can help: can I uncover the hidden features in encrypted traffic? Can I transfer features from Windows PE to Linux ELF? However, “there is no such thing as a free lunch,” and deep learning has gained its problem-solving magic at the expense of its applicability. Failures that have emerged in recent years have taught us some lessons.

The choice of a neural network structure for feature representation must be based on a thorough understanding of both the problem and the model. Sequential models appear to be the most popular in the cybersecurity domain, and RNN/LSTM has received a lot of attention because simple open source implementations are available, so we see it applied to a variety of problems. One example is the previously mentioned body of work on LSTM prediction of DGAs, where the goal is for the model to fit the pseudo-random number generator (PRNG) behind the DGA. DGAs may combine PRNG operations such as XOR, shifts, and prime number twists, which makes it difficult for a shallow network such as an LSTM to fit and interpret them completely. In addition, the LSTM’s memory and its dependency on the initial state hurt PRNG fitting. In Mostafa Hassan’s “Cracking Random Number Generators using Machine Learning – Part 1: xorshift128” *, we see that a PRNG based on xorshift128 can be fitted well using only a dense network and no LSTM. The article also compares the experimental results of LSTM + dense network and analyzes the fitting results; please feel free to dig further in the reference list.
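To make the xorshift128 point concrete, here is a minimal Python sketch of Marsaglia's xorshift128 generator. It is my own illustration, not code from the referenced article. The helper `predict_next` is a hypothetical name; it shows that the next output is a fixed XOR-and-shift function of the previous four outputs, which may be one reason a plain dense network over a short window of outputs can be enough and no long-term memory is needed.

```python
MASK32 = 0xFFFFFFFF  # keep every value inside 32 bits


def xorshift128(x, y, z, w):
    """Marsaglia's xorshift128: yields an endless stream of 32-bit outputs."""
    while True:
        t = (x ^ (x << 11)) & MASK32
        x, y, z = y, z, w
        w = (w ^ (w >> 19) ^ t ^ (t >> 8)) & MASK32
        yield w


def predict_next(last_four):
    """Hypothetical helper: compute the next output from the last four outputs.

    After four outputs the internal state is exactly those four values,
    so only the oldest and the newest of them are needed.
    """
    oldest, _, _, newest = last_four
    t = (oldest ^ (oldest << 11)) & MASK32
    return (newest ^ (newest >> 19) ^ t ^ (t >> 8)) & MASK32


if __name__ == "__main__":
    gen = xorshift128(123456789, 362436069, 521288629, 88675123)
    outputs = [next(gen) for _ in range(8)]
    # The prediction from outputs[0:4] should equal outputs[4].
    print(predict_next(outputs[0:4]) == outputs[4])
```

The generator carries no hidden state beyond the four most recent outputs, so a model only has to learn a bitwise function over a fixed-size input.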

“Garbage in, garbage out,” as the saying goes in machine learning. A deep neural network is neither a general AI nor an oracle for feature engineering; it needs inputs that its structure can digest before representation learning can produce useful features. A typical example is the previously mentioned MalConv, which feeds the raw bytes of a binary file to a simple convolutional layer, as in most computer vision problems, and expects it to extract and generalize the underlying features. But a simple convolution is insufficient to perceive how the compiler combines those bytes, so the network learns only file header signatures rather than the function call features associated with malicious binary code. The book “Malware Data Science: Attack Detection and Attribution” * by Joshua Saxe and Hillary Sanders analyzes opcodes and related opcode-based modeling work, where instruction jumps, function output tables, and similar structures are used as model input to better support malware detection.

Cybersecurity problems are highly adversarial and dynamic, so a model must make some assumptions in order to handle unknown situations and to justify its predictions. However, because deep neural networks lack such inductive bias *, their predictions on unknown situations are highly uncertain and poorly explained, which is the “black box” problem of deep models. In linear regression, we can read Y as a linear function of the X vector, and in logistic regression, we can see the hyperplane that separates positive and negative samples; these inductive biases can justify the model’s predictions. A deep neural network, by contrast, only tells us that Y is some nonlinear function of the X vector, one shaped by data augmentation, network structure, activation functions, normalization, and various other constraints imposed during training, which makes its predictions difficult to justify in practice. Cybersecurity problems also frequently require strong domain knowledge and expensive validation and operation, and recent work on model interpretability such as LIME provides only limited mitigation.

SpamAssassin, an open source spam detection project, is an interesting example because of a fun historical bug that identified all post-2010 emails as spam. Does this mean the model makes no sense? In a strongly adversarial scenario like spam detection, attackers keep updating their tricks, and the Bayesian classifier adjusts the weight of each feature by year, which is a reasonable approach. But there was no training data for dates after 2010, so the classifier judged all such unknown emails as spam, with an inductive bias that it would rather block than pass. Because SpamAssassin’s model bias offers an easy and understandable reason for the prediction, the issue was quickly identified and resolved.

Meanwhile, the individual characteristics of each cybersecurity problem, together with the strong requirement for domain knowledge, limit the usefulness of transfer learning with deep neural networks, in contrast to common scenarios such as images and text, where pre-trained models can be easily reused. Overall, deep learning, as a subclass of machine learning, is far from letting people hit a rabbit with a blind shot, and its technical advantages come with application limitations. In short, “no free lunch” means deep learning must be applied thoughtfully, and we still need humans for now.

Machine Learning << Artificial Intelligence

If we open the textbook “Artificial Intelligence: A Modern Approach”, we find that “Learning from Examples”, the so-called “machine learning” of media coverage, does not appear until chapter 19. Before it come chapters on search, knowledge representation, planning, reasoning, and first-order logic, and many more chapters follow it. When we apply AI, we should combine multiple techniques rather than rely on machine learning alone, because it is only one topic within AI. For example, the success of AlphaGo, a benchmark AI application, comes from combining deep neural networks with Monte Carlo tree search (MCTS), an algorithm covered in AI textbooks on state search. AlphaGo adds deep network feature extraction and adversarial training, extending the textbook MCTS algorithm far enough to win at Go.

Beyond machine learning, there are numerous examples of AI methods in the cybersecurity domain. Here is an interesting one: an attacker tests K intrusion points of a target using N vulnerabilities and their combinations. Each attempt must use exactly K of the N vulnerabilities, and the order in which they are exploited affects the result. After several rounds of testing, the attacker has only some failed combinations and the reasons for their failure: perhaps some of the selected K vulnerabilities failed (only the number of failures is known, not which ones), or perhaps the order of the combination was wrong. Can we design a new attack strategy, based on the known test results, that finds effective vulnerability combinations more efficiently? A harder version asks whether we can design an automated strategy that adjusts itself based on the previous round’s results. Discrete search can solve this problem. In simplified form, it resembles the 3-digit combination lock riddle *, where one picks three digits from 0-9 to form a password, infers the pattern from the wrong attempts, and derives the correct password. The three-digit lock (N=10, K=3) can be solved by brute force: enumerate all combinations from 000 to 999 and keep those consistent with the known failures. If N and K are both large, however, we must use a search such as MCTS and design reasonable pruning conditions (e.g., combinations of vulnerabilities that may partially invalidate one another) to reduce the search space. We can also use an active learning approach to fine-tune the search direction and branches based on each proposed test and its feedback. These are known as Mastermind * problems; please feel free to dig further in the reference links.
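The brute-force version of the three-digit lock can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the clues in `CLUES` come from a commonly circulated version of the riddle, not from this article. The feedback is modeled Mastermind-style as a pair of counts: digits in the right place, and digits that are right but misplaced.

```python
from collections import Counter
from itertools import product


def feedback(secret, guess):
    """Return (right digit in right place, right digit in wrong place)."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Multiset intersection counts every digit the two codes share.
    common = sum((Counter(secret) & Counter(guess)).values())
    return exact, common - exact


def consistent_codes(clues, digits="0123456789", length=3):
    """Enumerate every code and keep those that agree with all failed tries."""
    survivors = []
    for combo in product(digits, repeat=length):
        code = "".join(combo)
        if all(feedback(code, guess) == result for guess, result in clues):
            survivors.append(code)
    return survivors


# Assumed example clues: (failed guess, (well placed, misplaced)).
CLUES = [
    ("682", (1, 0)),
    ("614", (0, 1)),
    ("206", (0, 2)),
    ("738", (0, 0)),
    ("780", (0, 1)),
]

if __name__ == "__main__":
    print(consistent_codes(CLUES))  # prints ['042']
```

Enumeration costs 10^3 checks here, but it grows exponentially with the code length. For large N and K, the same consistency test could instead serve as a pruning condition inside a tree search.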

Machine learning and non-machine-learning methods should not be mutually exclusive in problem solving; they should collaborate! Learning from examples is always constrained by the sample set, and it needs other models to help it “look elsewhere”. Entity disambiguation is a common example in NLP: when an intelligent agent tries to understand the word “apple”, it needs to determine from context whether it is the fruit or the electronics company. There are many similar approaches in cybersecurity that combine graph models and knowledge graphs, such as my team’s presentation at Botconf 2020, “Honeypot + graph learning + reasoning = scale up your emerging threat analysis” *. It begins by discovering the sequential association between two different URLs in network traffic, then builds a knowledge graph that connects the URLs, binary hashes, corresponding detection results, and other contextual information, and then asks the graph to prove the connection by finding a semantic path with a link prediction algorithm. Where the conditions are sufficient but not necessary, or necessary but not sufficient, first-order logic inference is used to ensure that the semantic path is sound, resulting in a prediction of the download path of unknown malware.

However, this article cannot cover all AI methods and their combinations for solving cybersecurity problems. Please keep this mindset and look for more approaches.

“Have you tried a simpler approach?”

Joshua Saxe asked a question in his tweet thread: “why they didn’t take a simpler approach than the approach they took” *. Simpler approaches can come from understanding domain knowledge and representing it in general terms, pre-processing the data, carefully understanding and splitting the target problem, and a variety of other sources.

A good friend once raised an interesting question from his research: if an attacker enumerates subdomains during reconnaissance by guessing combinations of dictionary words, can we collect the resulting DNS traffic and use machine learning to recover the attacker’s enumeration dictionary for threat hunting? Before he brought in a few GPUs and jumped into a deep model like BERT, I suggested that he sort the data and guess a noisy dictionary from the longest common substring of consecutive sorted records. This noisy dictionary is a superset of the true one; applying it to segment the subdomains and cleaning it up iteratively transforms the dictionary problem into a string slicing problem. Follow-up experiments showed that this simpler algorithm not only recovers the vast majority of the dictionary but is also resistant to noise.
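A rough Python sketch of one round of this idea follows. It is not the code from the actual experiment. The function names, the `min_len` threshold, and the sample labels are all assumptions made for illustration. It sorts the observed subdomain labels, takes the longest common substring of each adjacent pair as a candidate word, and then cuts the candidates out of the labels so that the leftovers can feed the next round.

```python
from collections import Counter
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Longest contiguous block shared by two strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def guess_noisy_dictionary(labels, min_len=3):
    """Sort labels and collect common substrings of adjacent records."""
    ordered = sorted(set(labels))
    counts = Counter()
    for left, right in zip(ordered, ordered[1:]):
        token = longest_common_substring(left, right)
        if len(token) >= min_len:  # assumed threshold to drop 1-2 char noise
            counts[token] += 1
    return counts


def residues(labels, tokens):
    """Cut known tokens out of every label and count what is left over."""
    leftovers = Counter()
    for label in set(labels):
        pieces = [label]
        for token in sorted(tokens, key=len, reverse=True):
            pieces = [part for piece in pieces for part in piece.split(token)]
        leftovers.update(piece for piece in pieces if piece)
    return leftovers


# Hypothetical subdomain labels seen in DNS queries during enumeration.
LABELS = ["devmail", "devvpn", "testmail", "testvpn", "adminmail", "adminvpn"]

if __name__ == "__main__":
    tokens = guess_noisy_dictionary(LABELS)
    print(sorted(tokens))            # candidate words from sorted neighbours
    print(residues(LABELS, tokens))  # leftovers to examine in the next round
```

Sorting groups labels by shared prefix, so the first pass mostly recovers leading words. Slicing those out exposes the remaining words as residues, which is where the iterative clean-up would continue.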

The advantage of machine learning is that it learns a statistical representation from data; intuitively, it fits the rules. Problem solving, however, does not exclude rules that come directly from domain knowledge, even if those rules solve only part of the problem. For example, Alexa Rank, a global website ranking frequently used as a reference when validating malware C&C domain detection results, encodes the domain knowledge that “malware is unlikely to use high-ranking domains as C&C”. Of course, times change: with new business models and escalating adversaries, Alexa Rank has itself been exploited by attackers. My colleagues have built our own domain reputation ranking from DNS traffic *; please feel free to read more in the references.

Sifting through the data can also lead to a more straightforward approach. Just as a good ingredient needs only simple preparation to bring out its flavor, good data needs only simple models to produce clear results. An interesting example comes from a discussion with my former colleague about his paper, Asaf Nadler et al., “Detection of Malicious and Low Throughput Data Exfiltration Over the DNS Protocol” *, which detects low-throughput tunnels in DNS data streams, a data exfiltration method used in APT attacks. Because the signal of low-throughput DNS tunnels is weak and rare, anomaly detection with Isolation Forest requires good feature filtering; otherwise it is hard to demonstrate its detection power on large-scale noisy data, and limited computing power restricts the scale at which it can be applied. We discovered that if the DNS data stream is first filtered down to previously unseen domains and only those are used as input to the Isolation Forest model, its prediction performance and runtime can meet the requirements of a large-scale data stream. With a deeper understanding of the target problem, we can supply more appropriate input data and take an existing model to the next level.
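The pre-filtering step can be as simple as a set lookup. The sketch below is an assumption-laden illustration, not the pipeline from the paper: `unseen_domain_traffic` is a hypothetical helper, the records are assumed to be already reduced to `(registered domain, full query name)` pairs, and the baseline is assumed to be a list of domains seen in some earlier window. Feature extraction and the Isolation Forest itself are left out.

```python
from collections import defaultdict


def normalize(domain):
    """Lowercase and drop the trailing root dot so lookups are consistent."""
    return domain.strip().lower().rstrip(".")


def unseen_domain_traffic(records, baseline):
    """Keep only queries whose domain is absent from the historical baseline.

    Returns a dict mapping each unseen domain to its list of query names,
    which is the much smaller input a downstream anomaly model would score.
    """
    known = {normalize(domain) for domain in baseline}
    grouped = defaultdict(list)
    for domain, query in records:
        key = normalize(domain)
        if key not in known:
            grouped[key].append(query)
    return dict(grouped)


# Hypothetical data using reserved example domains.
BASELINE = ["example.com", "example.org."]
RECORDS = [
    ("example.com", "www.example.com"),
    ("Example.ORG", "mail.example.org"),
    ("tunnel.example", "a1b2.tunnel.example"),
    ("tunnel.example.", "c3d4.tunnel.example"),
]

if __name__ == "__main__":
    print(unseen_domain_traffic(RECORDS, BASELINE))
```

In a real stream the baseline would be large and continuously updated, so it would need a more compact store than an in-memory set. The filtering idea stays the same.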

A simpler approach can also be reached by splitting the target problem, for example into a subgoal that represents a portion of it, or by reducing it to another problem. All of these follow the general problem-solving approach, so please dig further into this topic. An interesting example is my team’s work presented at Botconf, “Math + GPU + DNS = Cracking Locky Seeds in Real Time without Analyzing Samples” *, which detects Locky ransomware in DNS data streams, brute-force cracks its DGA seeds using GPUs, and successfully predicts all future domain names. We reduced this difficult problem to multiple steps and reused the anomaly and correlation models from related work:

Because Locky DGA domains are all new, we detect and filter never-before-seen domains in DNS anomalies.
Because Locky DGA generates multiple domains, we use domain2vec to calculate the sequential correlation among the anomalous domains and test the DGA properties of only the most strongly correlated groups.
Locky generates a single long integer using a pseudo-random number generator (PRNG) and encodes it as the domain string, so we can invert each candidate domain name to recover its long integer, and then use the GPU to batch-crack the seeds that could produce that integer on the current date.
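The last step, inverting a domain to an integer and then searching for seeds, can be shown with a toy example. The sketch below is deliberately NOT Locky’s algorithm: `toy_prng`, the base-26 encoding, the `.example` suffix, and the 16-bit seed space are all made up for illustration, and a CPU loop stands in for the GPU batch.

```python
from datetime import date

MASK32 = 0xFFFFFFFF
LABEL_LEN = 7  # 26**7 > 2**32, so seven letters can hold any 32-bit value


def toy_prng(seed, day):
    """Made-up generator: mixes a seed with the date into a 32-bit integer."""
    value = (seed * 1103515245 + day.toordinal() * 12345) & MASK32
    value ^= (value << 13) & MASK32
    value ^= value >> 17
    value ^= (value << 5) & MASK32
    return value


def encode_domain(value):
    """Write the integer in base 26, least significant letter first."""
    letters = []
    for _ in range(LABEL_LEN):
        value, digit = divmod(value, 26)
        letters.append(chr(ord("a") + digit))
    return "".join(letters) + ".example"


def decode_domain(domain):
    """Invert encode_domain: recover the integer from the first label."""
    label = domain.split(".")[0]
    value = 0
    for ch in reversed(label):
        value = value * 26 + (ord(ch) - ord("a"))
    return value


def crack_seeds(domain, day, seed_space):
    """Brute-force every seed that yields this domain's integer on `day`."""
    target = decode_domain(domain)
    return [seed for seed in seed_space if toy_prng(seed, day) == target]


if __name__ == "__main__":
    day = date(2017, 1, 1)
    observed = encode_domain(toy_prng(4242, day))  # pretend DNS gave us this
    print(observed, crack_seeds(observed, day, range(1 << 16)))
```

Each seed is tested independently of the others, so the search parallelizes trivially, which is what makes a GPU a natural fit for the real seed space.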

As a result, we successfully cracked dozens of Locky DGA random number seeds and submitted them to the cybersecurity research community.

Data science teams can approach each problem by asking themselves:

Is there another way to solve this problem, in whole or in part?

Summary and Epilogue

In the problem-solving process, we must hold on to the main goal of solving the problem; it always comes first. Technology selection is the means to support that goal, and collaboration among methods should be preferred over competition. This also requires data science teams to broaden their horizons, pay attention to methods proven useful in other fields and the underlying reasons they work there, and try to apply them in cybersecurity. At the same time, we have observed many data science teams actively learning cybersecurity domain knowledge, which is the only way to become more effective at finding techniques appropriate for the problems in that domain.

We have received a lot of useful feedback and suggestions for this blog series, and we hope that all of you will contribute to the discussion by connecting your own work and research to the topic of “why machine learning fails”. Data models have only recently been used at large scale in cybersecurity, which brings many issues and difficulties to the industry and inevitably entails numerous failures, all of them predictable under modern scientific research methods. I also believe that by learning from the frustration of many failures and the surprise of a few successes, we can together draw enough lessons to create a general methodological framework for data modeling in cybersecurity.

References

Joshua Saxe with Hillary Sanders, Malware Data Science: Attack Detection and Attribution https://nostarch.com/malwaredatascience
Mostafa Hassan, “Cracking Random Number Generators using Machine Learning – Part 1: xorshift128” https://research.nccgroup.com/2021/10/15/cracking-random-number-generators-using-machine-learning-part-1-xorshift128/
Inductive Bias https://en.wikipedia.org/wiki/Inductive_bias
Monte Carlo tree search https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
A step-by-step look at Alpha Zero and Monte Carlo Tree Search https://joshvarty.github.io/AlphaZero/
3 digit lock riddle: Using Prolog to solve a brain teaser (Master Mind) https://stackoverflow.com/questions/61276283/using-prolog-to-solve-a-brain-teaser-master-mind
Mastermind https://en.wikipedia.org/wiki/Mastermind_(board_game)
Joshua Saxe, Twitter thread https://twitter.com/joshua_saxe/status/1328834273214861314
“System for Domain Reputation Scoring”, Patent US 14/937699
Asaf Nadler et al “Detection of Malicious and Low Throughput Data Exfiltration Over the DNS Protocol” https://arxiv.org/pdf/1709.08395.pdf
“Math + GPU + DNS = Cracking Locky Seeds in Real Time without Analyzing Samples” https://www.botconf.eu/2017/math-gpu-dns-cracking-locky-seeds-in-real-time-without-analyzing-samples/
“Honeypot + graph learning + reasoning = scale up your emerging threat analysis” https://www.youtube.com/watch?v=r7KbGJPFkxQ&ab_channel=botconfeu