Why your machine learning models fail to solve cybersecurity problems: feature space first

Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated with nor sponsored by any company.

The cybersecurity industry has been fascinated by the magic of machine learning in recent years, and with the success of deep learning in areas such as computer vision and natural language processing, we all want to find a silver bullet. However, what we’ve seen in industry is more disappointing than promising: “the model generates too many false positives, the security operations team can’t handle it,” “the model only generates two unique detection samples, why don’t I just add two more yara rules,” “your slides say it’s deep learning and self-evolution, but the code is a set of ‘if-else’,” and so on. Why do so many machine learning models fail when applied to cybersecurity problems?

The reasons behind such poor effectiveness include:

Incorrect feature space and sample labeling
Algorithmic fragility, engineering fragility, and security operational fragility
Incorrect benchmark metrics and optimization
The misconception that machine learning is the only type of AI algorithm

Let’s get started with feature space and sample labeling.

What is an appropriate feature space

Does it sound right that the feature space of an image recognition problem is the pixels in a picture, the feature space of NLP is the words in a text, and the feature space of cybersecurity is the characters of every WAF attack log or the raw bytes of each malware binary file?

Using the appropriate feature space to describe the true nature of a problem can make it easier to solve, whereas using the wrong feature space can make the problem exponentially more difficult. Consider the following example: anyone who has learned basic arithmetic will have no trouble answering “what is 5765760 plus 2441880” with 8207640, no calculator needed. But what about “5765760 times 2441880”? Most people find that difficult. This is because decimal is not a multiplication-friendly representation, so the problem is made artificially difficult by the choice of feature space. Factorization is a more multiplication-friendly representation. If the problem is replaced by the equivalent one, “what is 11*12*13*14*15*16 multiplied by 17*18*19*20*21,” it becomes even easier than addition: the answer is simply 11*12*13*14*15*16*17*18*19*20*21.
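The arithmetic example above can be sketched in a few lines of Python. This is only an illustration of the idea, with helper names (`factorize`, `multiply`, `to_int`) chosen for this sketch: once a number is represented by its prime exponents, multiplication reduces to adding exponents.

```python
from collections import Counter

def factorize(n: int) -> Counter:
    """Return the prime factorization of n as a {prime: exponent} map."""
    factors = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] += 1
            n //= p
        p += 1
    if n > 1:
        factors[n] += 1  # whatever remains is itself prime
    return factors

def multiply(a: Counter, b: Counter) -> Counter:
    """Multiply two factorized numbers by adding their exponents."""
    return a + b

def to_int(factors: Counter) -> int:
    """Map a factorization back to the decimal representation."""
    result = 1
    for prime, exponent in factors.items():
        result *= prime ** exponent
    return result

a = factorize(5765760)  # 11*12*13*14*15*16
b = factorize(2441880)  # 17*18*19*20*21
product = multiply(a, b)
print(dict(product))
print(to_int(product) == 5765760 * 2441880)
```

In the factorized representation the hard part of the problem disappears, which is the same effect a well-chosen feature space has on a learning problem.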

The deep learning algorithm MalConv for malware detection is a similar example, where a convolutional approach based on raw binary code does not learn the appropriate feature space. In the paper “Malware Detection by Eating a Whole EXE” by Raff et al.*, the binary EXE file is used as input, and MalConv attempts to build an end-to-end static detection classifier from a raw byte feature space like 010101 using convolutional networks, achieving over 90% AUC on the paper’s test set. Leaving aside its unstable performance on new and adversarial samples, the paper “DeepMalNet: Evaluating shallow and deep networks for static PE malware detection”* introduces a new test set, compares several deep models, including MalConv, with a random forest model built by the paper’s authors, and finds that a random forest model with manual feature engineering can nearly match and even outperform MalConv. The reason is that the convolutional network in MalConv doesn’t learn the appropriate feature space from the byte blocks, so the effectiveness demonstrated in the MalConv paper is more or less just luck. According to the paper by Coull et al. from FireEye, “Activation Analysis of a Byte-Based Deep Neural Network for Malware Classification”*, the convolutions in MalConv actually treat the header information of static binary files as the dominant feature, and the combinations of instruction jumps contribute only limited weight to the classification. For a detailed analysis and explanation, see Bose et al., “Explaining AI for Malware Detection: Analysis of Mechanisms of MalConv”*.

If we leverage domain knowledge and tools, such as function export tables, dynamic analysis features such as function call sequences from a sandbox, or instruction sequences obtained by static disassembly, and convert the original binary into feature spaces that better characterize the malware’s runtime behavior, the machine learning model can be much more stable and effective than raw binary models like MalConv.
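As a minimal sketch of one such conversion, the snippet below turns a function call sequence into n-gram counts that a conventional classifier could consume. The trace is hypothetical and hard-coded for illustration; in practice it would come from whatever sandbox is in use, and `call_ngrams` is a name made up for this sketch.

```python
from collections import Counter
from typing import Iterable

def call_ngrams(calls: Iterable[str], n: int = 2) -> Counter:
    """Count n-grams of consecutive API calls in a single trace."""
    seq = list(calls)
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

# Hypothetical sandbox trace, hard-coded for illustration only.
trace = [
    "VirtualAlloc",
    "WriteProcessMemory",
    "CreateRemoteThread",
    "VirtualAlloc",
    "WriteProcessMemory",
]

features = call_ngrams(trace, n=2)
for ngram, count in features.most_common():
    print(" -> ".join(ngram), count)
```

Each n-gram describes a small piece of runtime behavior, so the resulting features are closer to what the sample does than to how its bytes happen to be laid out.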

Similarly, we cannot expect an accurate end-to-end model to predict attacks based solely on raw WAF logs without slicing and filtering tokens, nor can we expect a model to learn the complex character combinations of DGAs and accurately classify or even generate new DGA domains, nor can we imagine a deep learning model reading any HTTPS stream and accurately predicting the corresponding website. The failure of these industrial machine learning models to solve cybersecurity problems highlights the importance of the appropriate feature space: a model on the wrong feature space may produce so-called “good results” because it happens to fit a specific dataset, but these results are not stable or reliable enough to support a production environment or product quality.
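To make “slicing and filtering tokens” concrete, here is a rough sketch of how a raw request payload from a WAF log might be decoded and tokenized before any model sees it. The token rules, the `<num>` placeholder, and the sample payload are assumptions made for illustration, not a recommended production tokenizer.

```python
import re
from urllib.parse import unquote_plus

# Identifiers, runs of digits, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")

def tokenize_payload(raw: str) -> list[str]:
    """URL-decode a payload, lowercase it, and split it into tokens."""
    decoded = unquote_plus(raw).lower()
    tokens = TOKEN_RE.findall(decoded)
    # Collapse numeric literals so the model sees structure, not values.
    return ["<num>" if t.isdigit() else t for t in tokens]

# Hypothetical payload resembling a simple SQL injection attempt.
tokens = tokenize_payload("id=1%27+OR+%271%27%3D%271")
print(tokens)
```

A model trained on token sequences like this works with keywords, quotes, and operators rather than arbitrary characters, which is closer to how an attack payload is actually structured.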

Of course, this discussion does not diminish the importance of the MalConv paper to the industry. It was one of the early papers to apply deep learning to cybersecurity problems, and it creatively applied a convolutional network to binary code, suggesting that more work would follow, most likely with a better network structure or a completely new feature layer. More pioneering ideas and papers like this are needed in modern scientific research to find the path to a solution.

The term “machine learning” is frequently used to refer to statistical learning by example, in which models learn the statistical expectation of the distribution of samples in their feature space. To understand the impact of feature space on model effectiveness, imagine looking at a mountain range. If the feature space does not describe the essential causes of the sample distribution, the numerical distribution of the features does not provide enough discriminative power. Intuitively, the model can only “look across” at a series of “ridges” and cannot “look sideways” at the independent “peaks”. It can only do its best to divide the “ridges” by fitting the existing data set, so it never learns the “true face of the mountain” and loses the essence of what the “peaks” are supposed to represent.
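A toy illustration of this point, unrelated to any real security dataset: two classes of points lie on concentric circles. A single-threshold rule on the raw x coordinate cannot separate them, while the same rule on the radius, a feature that describes how the data was generated, separates them perfectly. All names and numbers below are assumptions chosen for this sketch.

```python
import math

def best_threshold_accuracy(values, labels):
    """Accuracy of the best single-threshold rule on a 1-D feature."""
    best = 0.0
    for t in sorted(set(values)):
        for sign in (1, -1):  # try "above threshold" and "below threshold"
            preds = [1 if sign * v >= sign * t else 0 for v in values]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            best = max(best, acc)
    return best

# Class 0 sits on a circle of radius 1, class 1 on a circle of radius 3.
angles = [2 * math.pi * k / 12 for k in range(12)]
points = [
    (r * math.cos(a), r * math.sin(a), label)
    for r, label in ((1.0, 0), (3.0, 1))
    for a in angles
]
xs = [p[0] for p in points]
radii = [math.hypot(p[0], p[1]) for p in points]
labels = [p[2] for p in points]

acc_raw_x = best_threshold_accuracy(xs, labels)
acc_radius = best_threshold_accuracy(radii, labels)
print(f"raw x: {acc_raw_x:.2f}, radius: {acc_radius:.2f}")
```

The model and the data are identical in both runs; only the feature space changes.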

The sample labeling method is also related to the feature space. A real-world lesson in statistical learning is that even the best models can only learn as much as the manual labels give them. “The model can only learn the samples labeled by my yara rules, so what is the use of your bloody machine learning model?” many cybersecurity researchers ask. In our observation, many modeling efforts rest on the misconception that “labeled samples can only be labeled raw samples,” whereas good data scientists label samples in the feature representation space. That space can be a mapping of the original samples into a new space, such as the vector representations learned by various association graph models (a.k.a. “graph embedding” or “something2vec”), in which case the vectors are labeled. It can also be a sequence or a graph of tokens, in which case individual tokens are labeled to overcome the limitations of the original samples. Labeling samples beyond their raw format in these ways allows the problem to be solved with simpler and more reliable machine learning models.

How to find my feature space

Some feature space choices are self-evident, while others require extensive brainstorming. Telling a “ridge” from a “peak” is not limited to feature selection within the same dataset or to transforming the feature hypersurface; it is also critical to set aside the “obvious” and “taken-for-granted” features and look instead for features that describe the essential causes of the sample distribution. A typical example is the family of LSTM algorithms for detecting DGAs discussed in the previous blog “Why using LSTM to detect DGA is a wrong idea”*, where the feature space is the combination of adjacent characters in each domain name. The LSTM model puts in twice the effort for half the result trying to fit the unknown DGA function behind the combination patterns of these adjacent characters, a task that far exceeds the LSTM’s learning capability. In fact, no network structure yet is intelligent enough to learn functions containing complex operations such as xor, bit shift, and so on from small data sets. Most DGAs are essentially characterized by algorithms that generate sequences of domain names, and their behavior can be better modeled by mapping domain name sequences into an embedding vector space or by using their co-occurrence probabilities in a graph. Good DGA detection can then be achieved using simple graph embedding models or graph adjacency matrix calculations.
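As a rough sketch of the co-occurrence idea, the snippet below counts how many hosts queried each pair of domains. It only builds the pair counts that a graph embedding model or an adjacency matrix calculation could take as input; it is not the detection method itself, and the DNS log, host names, and domain names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence(queries):
    """Count, for each pair of domains, how many hosts queried both.

    `queries` is an iterable of (host, domain) pairs, e.g. from DNS logs.
    """
    domains_by_host = defaultdict(set)
    for host, domain in queries:
        domains_by_host[host].add(domain)

    pair_counts = Counter()
    for domains in domains_by_host.values():
        # Sort so each unordered pair has a single canonical key.
        for a, b in combinations(sorted(domains), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical DNS log: two infected hosts share the same generated domains.
queries = [
    ("host1", "xkqzv.example"), ("host1", "pqwje.example"),
    ("host1", "news.example"),
    ("host2", "xkqzv.example"), ("host2", "pqwje.example"),
    ("host3", "news.example"),
]

pairs = cooccurrence(queries)
for pair, count in pairs.most_common():
    print(pair, count)
```

Domains produced by the same generator tend to be queried together by the same infected hosts, so they end up strongly connected in this graph regardless of what their characters look like.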

For the time being, finding the right feature space has no one-size-fits-all solution, nor can it be automated by an advanced artificial intelligence. What it takes is a good data scientist who understands the principles of existing models as well as the fundamentals of the specific security domain from a data modeling perspective. Because we have seen far too much work based on faulty assumptions and feature spaces, we recommend that data scientists keep this question in mind and ask it regularly while solving problems:

“Can this feature space describe the essence of the problem?”

Summary

Unlike computer vision, algorithms and modeling in cybersecurity lack intuitive definitions of samples. Cybersecurity is more similar to speech recognition and spatial control problems in that data scientists must have a deeper understanding of domain knowledge, explore the feature space that can characterize the essential causes of the problem, and intelligently map the problem from its original shape to the problem-solving feature space. Rather than praying for deep learning to bring power from nowhere, we can find a more appropriate way to solve the problem by adding an appropriate sample labeling method to describe the distribution of these features.

This post examines the primary reasons why machine learning models fail to solve cybersecurity problems. Other topics, such as understanding and dealing with model fragility in a system-wide context, will be covered in future posts. Not only in cybersecurity but also in many other fields, researchers and engineers place too much emphasis on model work*, ignoring the fact that algorithms and modeling are systems engineering problems, and that finding more and better samples, finding more descriptive features, and dealing with prediction errors are all important steps in the system. We hope that by focusing on systems and engineering, machine learning and other artificial intelligence algorithms will be able to play a meaningful role in the field of cybersecurity.

Reference

Raff et al., Malware Detection by Eating a Whole EXE https://arxiv.org/abs/1710.09435
Vinayakumar R., Soman K.P., DeepMalNet: Evaluating shallow and deep networks for static PE malware detection https://doi.org/10.1016/j.icte.2018.10.006
Coull et al., Activation Analysis of a Byte-Based Deep Neural Network for Malware Classification https://arxiv.org/abs/1903.04717
Bose et al., Explaining AI for Malware Detection: Analysis of Mechanisms of MalConv http://vigir.missouri.edu/~gdesouza/Research/Conference_CDs/IEEE_WCCI_2020/IJCNN/Papers/N-21218.pdf
Why using LSTM to detect DGA is a wrong idea https://toooold.com/2021/07/16/lstm_dga_is_wrong.html
“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI https://research.google/pubs/pub49953/
