Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated with nor sponsored by any company.

The Chinese version is here https://toooold.com/2022/01/04/useful_useless_model_cn.html

When developing machine intelligence models to solve cybersecurity problems, we rarely find a case where a single end-to-end model solves the problem. Take phishing detection as an example: besides data engineering tasks such as preparing and cleaning sufficient input data, the detector starts by identifying key image elements on the page such as logos, combines them with HTML code analysis features and classification, in some cases needs to run JavaScript rendering and fight the page’s phantomjs detector (well, the phishing guys are smart too), adds network features like DNS or URL patterns, applies filters like Alexa rank, provides sufficient evidence to support further operational policy, and repeats all of this every week to recover the precision loss. Such examples frequently frustrate data science teams as well as company executives: building a phishing detector shouldn’t be rocket science, and every security company is building one, so why do we see more and more problems with our detector every day? When can we finish it? Under the pressure of time and resources, the team and management gradually burn out and accumulate regrets, so we have seen many teams forced to cop out of great ideas and rush the deployment, propagating the tech debt down to the security operations teams or the customers to repay.

The complexity of cybersecurity problems comes from their higher-than-normal requirement for trustworthiness, due to the dynamic environment of high-intensity adversaries, the high cost of validation, and the consequences of incidents. Do we have to assign an SOC expert to respond to every alert predicted by a model? Is the model capable of providing IoCs that I can trust enough to block on my company firewall? Traditional cybersecurity methods insert well-trained humans in the loop to ensure trust at every step by leveraging human experience; however, attackers know this too, and in addition to bypass methods, designing attacks that exhaust the SOC is a hot topic in offensive research. “How to reduce the human cost in the trust chain” has become a research direction for modern cybersecurity solutions, and cross-validation with multiple independent models is one of the effective frameworks: simply put, use multiple models over independent data sources with independent feature vectors as jigsaw pieces, and put the pieces together to solve a problem. This approach also reduces the quality requirement on individual models, allowing many weak models to be included to widen the scope of problem solving, which deserves a separate post. The question here is: how do we combine multiple independent models to support trust effectively?
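As a toy sketch of that jigsaw idea (the detector names, fields, and thresholds below are all invented for illustration, not taken from any real product), trust can come from a quorum of independent weak detectors instead of a human reviewing every alert:

```python
# Each weak detector looks at evidence from its own independent data source.
# All names, fields, and thresholds here are invented for illustration.
def dns_detector(evidence):
    return evidence.get("dga_score", 0.0) > 0.8      # DNS log features

def http_detector(evidence):
    return evidence.get("phish_kit_match", False)    # HTML/HTTP features

def ip_detector(evidence):
    return evidence.get("bad_asn", False)            # network reputation

DETECTORS = (dns_detector, http_detector, ip_detector)

def trusted_verdict(evidence, quorum=2):
    """Trust an alert only when independent jigsaw pieces agree."""
    votes = sum(d(evidence) for d in DETECTORS)
    return votes >= quorum

print(trusted_verdict({"dga_score": 0.92, "phish_kit_match": True}))  # True
```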

Combining the above two cases, we can look at this engineering problem from a system design point of view: when it is too expensive to design a set of models and tools from scratch for every problem, it is better to build some generic models that solve the common parts of a group of problems in the field, whether as functional modules or as abstractions of the problems. We can then solve multiple complex problems effectively and at scale by reusing these generic models, building problem-specific models on top, and combining them appropriately. This approach has proven very effective in cybersecurity model building, and similar ideas have appeared in other fields too. These generic models are what I refer to as “useful useless models.”

The name “useful useless models” comes from a conversation with my colleagues, who said: many teams build one model per problem and keep shipping multiple models every week, while the models your team is working on can’t solve any real problem on their own; after years of work you have only a few of them, and you can’t even tell their false positive rates. So, aren’t they “useless”? I appreciate their brilliant minds in naming these models.

What “useful useless models” mean

“Useful useless models” are a type of model or tool designed to solve the shared common part of a set of domain-specific problems.

They can sometimes be called foundation models. The paper “On the Opportunities and Risks of Foundation Models”* by Bommasani et al. discusses foundation models of deep neural networks in computer vision, speech recognition, natural language processing, and other fields, as well as how they are applied and their impacts. Deep neural networks can learn feature representations and transfer their pre-trained models across multiple similar problems, allowing a pre-trained network to solve multiple different problems within the same domain at the small cost of fine-tuning. In the same way that a ResNet model trained on ImageNet photos of cats and dogs can detect tuberculosis after fine-tuning on lung X-ray images, a webshell detection network built on TextCNN and trained on a PHP script dataset can, after fine-tuning, detect malicious behavior in PowerShell, a Windows-only scripting language. Both transfer tasks work because deep neural networks effectively characterize local features, a property shared across this type of problem, so the pre-train-then-fine-tune approach performs well in two seemingly disparate tasks. Of course, fine-tuning is only one way to use pre-trained models; others include zero-shot, few-shot, and a recent hot topic: prompting. Furthermore, fine-tuning network layer weights is only one of several fine-tuning methods; Transformer/BERT-like models, for example, are fine-tuned through their attention mechanism. Please feel free to dig further into fine-tuning methods and pre-trained models.
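To make the transfer idea concrete, here is a minimal PyTorch sketch under my own assumptions (the TextCNN layout, checkpoint name, and shapes are illustrative, not the architecture from any cited work): freeze the feature layers trained on PHP webshells and fine-tune a fresh head on PowerShell samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal TextCNN over token ids, in the spirit of Kim (2014)."""
    def __init__(self, vocab_size, embed_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, 64, k) for k in (3, 4, 5))
        self.classifier = nn.Linear(64 * 3, num_classes)

    def forward(self, x):                        # x: (batch, seq_len)
        e = self.embed(x).transpose(1, 2)        # (batch, embed_dim, seq_len)
        pooled = [F.relu(c(e)).max(dim=2).values for c in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))

model = TextCNN(vocab_size=50_000)
# Hypothetical checkpoint trained on a PHP webshell dataset.
model.load_state_dict(torch.load("php_webshell_cnn.pt"))

# Freeze the embedding and convolution layers: the local script features
# they learned are the part that transfers across scripting languages.
for p in model.parameters():
    p.requires_grad = False

# Replace the head and fine-tune only it on labeled PowerShell samples.
model.classifier = nn.Linear(64 * 3, 2)          # trainable by default
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
```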

When we extend the approach of pre-trained neural network models to more general cases, we can look for general properties required for problem solving in a specific domain, for example, attack patterns in cybersecurity. When analyzing patterns of attack behavior, we can ask a few questions: what actions did the attacker take after action A, what actions did they take before action A, what else did they do within the time window or in the same network, and so on. The sequence properties of these atomic actions are general properties of this domain. Furthermore, we find that the sequence properties are independent of any specific problem and depend only on the data sources that generate the sequences, such as parent-child processes in endpoint command line logs, IP connection logs in netstats, DNS logs, and so on. Each log source, like the sequence of DNS queries in DNS logs, enables us to build a behavioral sequence model and apply it to a specific analysis problem. This is how the domain2vec model mentioned in the previous article is built: a “useless model” that leverages a neural network to learn a dense vector representation of each domain name from query sequences and computes a statistical correlation between any two domains as cosine similarity. This correlation function can be used in a variety of scenarios:

  • Discovering potential DGA groups: DGA domains show strong sequential correlation in DNS queries; spectral clustering on the dense vectors of long-tail domain names quickly finds such groups.
  • Revealing CDN domains hidden behind websites: homepage domains have strong sequence correlation with their CDN sites, so one can search around the vector of each homepage domain for other domains with high cosine similarity.
  • XcodeGhost* and supply chain attacks: because the group cosine similarity between an app’s primary and secondary API service domains is relatively stable, keeping a history of this set and applying anomaly detection can surface C&C domains, such as XcodeGhost’s init.icloud-analysis.com, and other supply chain attacks.

Surely the list of use cases is much longer than this. Because it provides a quantitative function for behavioral correlation as a task-independent representation, which is the key to solving this class of problems, the domain2vec model serves as the foundation of several scenarios and transforms seemingly difficult problems into a simple cosine computation. Sequence-based behavioral models, of course, are not limited to domain name vector representations; they can be subgraph models or cover other entities such as IPs and URLs, and behavioral models are not limited to sequence models either.
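As a sketch of how such a correlation function might be trained (my assumed setup using gensim’s Word2Vec; the real domain2vec tunes its own hyperparameters on DNS data), feed per-client, time-ordered DNS query sequences in as “sentences”:

```python
from gensim.models import Word2Vec

# query_sequences is an assumed iterable: for each client and short time
# window, the ordered list of domains it queried, e.g.
#   [["example.com", "static-cdn.example.net", "api.example.com"], ...]
model = Word2Vec(
    sentences=query_sequences,
    vector_size=100,   # dense vector per domain
    window=5,          # co-query context window
    min_count=5,       # drop domains seen too rarely
    sg=1,              # skip-gram handles the long-tail vocabulary well
    workers=4,
)

# The correlation function: cosine similarity between two domains.
print(model.wv.similarity("example.com", "static-cdn.example.net"))

# E.g. the CDN scenario above: search near a homepage domain's vector.
print(model.wv.most_similar("example.com", topn=10))
```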

Tools, and models built on top of such tools, can also provide general properties for problem solving. Good examples include graph databases and knowledge graphs, as well as the algorithmic models built on top of them, such as ranking in graphs, link prediction, and subgraph pattern search and matching, some of which have been mentioned in previous blog posts. Building these tools takes time and engineering resources, so we had better start early.
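As a toy illustration of two of these graph primitives (made-up entities, networkx built-ins rather than any production toolkit):

```python
import networkx as nx

# Toy infrastructure graph: domains linked to the IPs they resolved to.
G = nx.Graph()
G.add_edges_from([
    ("evil-a.com", "1.2.3.4"),
    ("evil-b.com", "1.2.3.4"),
    ("evil-b.com", "5.6.7.8"),
    ("evil-c.com", "5.6.7.8"),
])

# Ranking in graph: PageRank surfaces hub infrastructure worth triage first.
rank = nx.pagerank(G)
print(sorted(rank.items(), key=lambda kv: -kv[1])[:2])

# Link prediction: Adamic-Adar scores unobserved pairs; a high score hints
# that two domains quietly share infrastructure.
pairs = [("evil-a.com", "evil-b.com"), ("evil-a.com", "evil-c.com")]
for u, v, score in nx.adamic_adar_index(G, pairs):
    print(u, v, round(score, 3))
```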

In general, solving complex cybersecurity problems needs building blocks, which can be domain-specific pre-trained models, more general computational functions with specific properties like task-independent representations, or useful tools and algorithms that partially solve a list of problems. The more building blocks we have, the lower the threshold for solving complex problems becomes; meanwhile, they encourage data scientists to bravely explore new ways of solving problems at a much lower cost of failure. From a systems engineering perspective, each “useless model” can be given proper ETL and engineering optimization resources and developed by a good engineering team to sustain its quality in a production environment. Here is a fun story: domain2vec, a model created in 2014, has been maintained and developed further, and six or seven years later it still keeps discovering new threat intelligence and gaining new application scenarios built on top of it.

How to build and use them

Although “Genius is one percent inspiration and 99 percent perspiration” (Thomas Edison), inspiration doesn’t always come after much perspiration. Here are some examples of scientific methods to inspire more of it. The methods described below are common in modern research; let’s rediscover them through a few cybersecurity problem-solving cases.

Inductive reasoning and deductive reasoning

Inductive reasoning (also known as induction) and deductive reasoning (also known as deduction) are at the heart of the scientific method. Inductive reasoning summarizes general concepts from observations of the world: for example, most intrusions require steps such as reconnaissance, intrusion, exploitation, privilege escalation, lateral movement, obfuscation, denial of service, and exfiltration, and some steps follow others, so people generalized the concept of the cyberattack kill chain*. Deduction expands step by step from logical judgments toward a problem-solving model: for example, exfiltration behavior occurs after intrusion. A real-world example that leverages both reasoning methods: exfiltration must appear after intrusion and exploitation, so an exfiltration detection model can look back along the timeline for earlier steps carrying possible intrusion and exploitation features. Because exfiltration has stronger features and good detection accuracy, it makes intrusion and exploitation detection easier through a simple “look back” for anomalies. It also connects multiple steps, which we can automate by building a knowledge graph of events with temporal attributes and applying link prediction methods.
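Here is a minimal sketch of that “look back” deduction, with invented event records and field names standing in for real SIEM logs:

```python
from datetime import datetime, timedelta

# Invented events; in practice these come from SIEM or endpoint logs.
events = [
    {"ts": datetime(2022, 1, 3, 2, 10), "host": "db01", "type": "port_scan"},
    {"ts": datetime(2022, 1, 3, 2, 45), "host": "db01", "type": "new_admin_user"},
    {"ts": datetime(2022, 1, 3, 4, 30), "host": "db01", "type": "exfiltration"},
]

def look_back(events, anchor, window=timedelta(hours=6)):
    """From a high-confidence anchor (e.g. exfiltration), return earlier
    events on the same host within the window: candidate intrusion and
    exploitation steps implied by the kill chain ordering."""
    return [e for e in events
            if e["host"] == anchor["host"]
            and anchor["ts"] - window <= e["ts"] < anchor["ts"]]

anchor = next(e for e in events if e["type"] == "exfiltration")
for e in look_back(events, anchor):
    print(e["ts"], e["type"])   # port_scan and new_admin_user surface
```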

Induction (and some luck) can help us effectively come up with the shared commonalities in a field, after a broad dive into the domain and a deeper understanding of its problems; sometimes it can even find the root similarities between multiple problems. The domain2vec domain sequence correlation model, for example, is based on the observations that:

  • Most websites use hard-coded CDN domains, and as the home page loads, requests for those CDN domains are sent to DNS servers.
  • Malware that uses a DGA will send multiple DGA domain requests in bulk.
  • Modern browsers such as Chrome prefetch some links on a page in advance, which also initiates multiple DNS requests (not always true, but it holds most of the time).

and so on. We discovered that these cases are related in the sense that DNS request sequences carry a certain correlation, yet we had no function to measure and compute such correlations, so such a computational function could solve a whole list of problems in the field. This is the induction procedure that led to the domain2vec model. My dear reader, you’ve probably noticed that IPs and URLs follow a similar pattern, right? Yes, this is yet another example of deduction, and as you generalize to ip2vec, url2vec, and whatever2vec models, please keep in mind the characteristics of each specific entity’s behavior; IP connections in netstats are different from DNS queries. Of course, deduction can be applied in a variety of ways: are there other problems where the correlation function can be used? Will the computational function’s spectral properties lead to other applications? Is there another correlation function that brings different spectral properties? And so on. In the thinking process, the two methods of induction and deduction can be iterated repeatedly. Please feel free to give it a try.

It is worth noting that induction yields only concepts, not ground truth or theorems, which makes it potentially very subjective and vague; for example, people lacking scientific training often like to reduce everything to a few basic principles they have heard of, and, well, the combination of black-body radiation, Maxwell’s demon, and the second law of thermodynamics is not a good model for solving cybersecurity problems (no kidding, I have actually seen people do this!). Attempting to deduce during the induction process keeps the induction at a proper scale and direction, because “more is different”* (problems of different scales and sizes require different solutions). Only a concept that can be generalized and lead to a solution is inducted at a reasonable scale, as Feynman put it about discovering a new physical theory: first you need to discover it, then generalize it as much as possible and falsify it with experimental facts. Similarly, the result of induction may be empirical and apply only to a narrow scope of problems, and the deduction can be tested by generalizing the inducted concept until it fails.

ab initio and thought experiments

Even when we work on problem solving daily, there is no guarantee that we can easily generalize approaches to these problems; after all, modern technology and society created these complex problems in the first place. Here’s a useful suggestion: “ab initio” (“from the beginning”). The term usually refers to a problem-solving method in quantum physics, and you are probably more familiar with its buzzword form: “first principles”. Regardless of what it is called, the approach is basically removing unnecessary assumptions.

Having enough necessary assumptions can effectively limit the scope of problem solving, but some assumptions are not necessary; they may come from historical experience or be confused with other assumptions, and they often become obstacles. When Elon Musk applied first principles to the business model of launching rockets, he discovered that recovering the rocket was actually feasible, and that only historical cost considerations from the NASA launch model had left recovery under-researched, so he began to recover rockets. Most cybersecurity vendors have made the detection of malicious samples their primary function, and the debate over malware detection has raged for decades. In the cloud era, many attacks use methods such as fileless attacks, forcing traditional vendors to add more detection targets as quickly as possible. While successfully capturing malware samples remains important, cloud platforms can collect all sorts of logs and build behavior pattern detection models with machine intelligence, so “reliance on malware samples” is becoming an unnecessary assumption, making behavior modeling the path to success for cybersecurity in the cloud computing era. Another interesting example comes from the domain2vec model, which is derived from the word2vec NLP model. While reading the word2vec paper, we noticed that “natural language as input” was an unnecessary assumption, so we replaced it with “domain name sequences in DNS queries” and optimized the hyperparameters accordingly. These are successful examples of having the courage to remove unnecessary assumptions. Of course, if removing certain assumptions makes the problem more difficult, one may have discovered a necessary assumption.

We can simplify most complex problems by removing unnecessary assumptions, and comparing the simplified versions of these problems makes it easier to discover their commonalities. The work mentioned in the previous article, “Honeypot + graph learning + reasoning = scale up your emerging threat analysis”, uses a link prediction approach to prove that two URLs have exactly the same malicious behavior, in a way that is also consistent with the timeline-based approach to finding exfiltration and exploitation features mentioned above, because both problems infer associative features by looking for links in a knowledge graph over the same kinds of entity identifiers. As a result, we conclude that link prediction is the commonality of this set of problems, and it is feasible to develop and optimize a general toolkit of link prediction algorithms.

With a set of “useless models” for solving common sub-problems, data science teams will not only be confident enough to solve the current problem but will also be willing to try new ones. Not only do we get brilliant ideas from inspiration when we use these models, we can also leverage thought experiments to find ideas that are brilliant AND feasible. The thought experiment is one of the most common methods in physics: simply put, run something abstract and simplified in the imagination and observe how it operates; “Schrödinger’s cat” is one of the most famous examples. There are many thought experiments in cybersecurity too. For example, as mentioned in the previous article, an attacker needs to experiment with N vulnerabilities against a target with K exploit methods, learn from a number of failed experiments, and arrive at the correct exploitation sequence. The thought experiment can be: balls numbered 1 to N are thrown into K pits in order, and the way each ball bounces tells us the type of error. How should we throw them: with active learning, a Mastermind-style strategy, or a multi-armed bandit strategy? Which bounces will we encounter, and which strategy would be better? Thought experiments let us play out all of these scenarios in the mind and conclude the next step afterward.
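The multi-armed bandit variant of that thought experiment can even be run as a tiny simulation; here is an epsilon-greedy sketch with invented success probabilities (the K pits are the exploit methods, a “win” is the bounce we hoped for):

```python
import random

K = 3
true_success = [0.05, 0.30, 0.10]   # hidden per-pit success rates (invented)
counts, wins = [0] * K, [0] * K
epsilon = 0.1

for _ in range(1000):
    if sum(counts) == 0 or random.random() < epsilon:
        k = random.randrange(K)   # explore: throw into a random pit
    else:                         # exploit: pit with best observed rate
        k = max(range(K), key=lambda i: wins[i] / max(counts[i], 1))
    counts[k] += 1
    wins[k] += random.random() < true_success[k]   # observe the bounce

print("throws per pit:", counts)  # most throws converge on pit 1
```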

Using the “useless models” framework, we can run thought experiments that imagine combinations of models A, B, C, and so on, introduce some hypotheses about a domain-specific problem, and thus plan a solution. “Honeypot + graph learning + reasoning = scale up your emerging threat analysis” expresses such a thought experiment through its two plus signs: “useless model A”, a graph embedding model, obtains correlation features between two URLs in HTTP traffic, such as sequence and host, which is equivalent to drawing a dashed line between two nodes; “useless model B”, a graph reasoning model, finds possible reasoning links between entity nodes, producing a detour path in the graph as a solid-line connection; furthermore, filtering with honeypot data limits the types of entities in the graph, reducing search costs and the risk of ambiguity in inference. The solution is a combination of models that finds solid links to prove dashed links. It is important to note that these combinations of foundational models only provide the shoulders of giants; we must still build the input data and subsequent models that fit the problem on top of those shoulders to refine the solution for the problem’s specifics.
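A schematic of that “solid links prove dashed links” combination, with entirely made-up entities and stand-in vectors in place of the real embedding model:

```python
import networkx as nx
import numpy as np

# "Useless model A" (graph embedding) proposes a dashed line: two URLs are
# hypothesized to behave alike when their vectors are close. The vectors
# here are stand-ins for real learned embeddings.
emb = {"url_x": np.array([0.9, 0.1]), "url_y": np.array([0.8, 0.2])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Useless model B" (graph reasoning) searches a honeypot-filtered relation
# graph; a concrete path between the nodes is the solid-line proof.
G = nx.Graph()
G.add_edges_from([("url_x", "payload_md5"), ("payload_md5", "c2_ip"),
                  ("c2_ip", "url_y")])

if cosine(emb["url_x"], emb["url_y"]) > 0.95:        # dashed hypothesis
    try:
        proof = nx.shortest_path(G, "url_x", "url_y")
        print("hypothesis supported by solid links:", proof)
    except nx.NetworkXNoPath:
        print("no supporting path; keep as a weak hypothesis")
```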

Of course, ab initio and thought experiments are not limited to the methods listed above, and there is no secret recipe: we can all become proficient with practice. Meanwhile, my dear readers may have noticed that ab initio and thought experiments, along with other related approaches, are themselves good practice of induction and deduction, and that by repeatedly practicing and exploring them, we will each form a method set suitable for our own field.

Tips beyond technology

“Useless models” are never created top-down; they require data scientists and security researchers to thoroughly understand the domain problems and generalize while solving them. There is a time cost to obtaining these models, as well as a higher engineering cost to building them, so I have some tips here for management.

It may be difficult to find your first “useless model”, and learning from existing examples in the industry may be a good way to start. Almost no “useless model” is perfect at the start, so there is no need to rush into engineering optimization too early; it is already a big win to find more problems in the domain and attempt to solve a small portion of them, no rush at all. Every “useless model” should have some useful sample problems to solve at each milestone. This will not only let us delve deeper, it will also help the model secure sustainable resource investment from senior management.

Summary and small talk

Cybersecurity teams frequently face the dilemma of having a slew of unsolvable problems on one hand and racking their brains for new solutions on the other. I hope this post provides some inspiration. The complexity of these problems is caused by a variety of factors, and this post attempts to address the complexity caused by scale and diversity by developing generic models that partially solve the common parts.

A few “useful useless models” can support a large portion of the problem-solving work, and people should not discount their contribution by counting models, since “jewels are not weighed on a grocery scale”*. Such models can significantly reduce the difficulty of problem solving and scale up model production.

The idea of developing a domain-general model that partially solves a problem came from a conversation a few years ago between me and the 360netlab research team about the domain name correlation model domain2vec and other graph models. My good friend Yiming Gong described applying this idea to the cybersecurity field, without the trappings of NLP, as “cross-border”, which inspired me to investigate the extension of this “cross-border” idea. I’d like to thank the 360netlab team; they have a twitter account: https://twitter.com/360netlab .

Some additional comments on why I have recently written a few “problem solving” blog posts about ideas rather than technology. Besides the “strong minds discuss ideas”* quote, real-world problem solving is a superset of solving a given problem in a contest like Kaggle, where the data has been cleaned up, the evaluation metric is given, and, most importantly, the target is set so people can just run for it. Real-world problem solving needs to dig for the approach, find the data and clean it (and repeat for more data), and understand the business impact to determine a good metric before running off to optimize the model. Data scientists in more senior positions also need to identify new problems in the field and open the door to new directions. All of this needs “problem solving”, which motivated me to post a few ideas and thoughts.

Reference