Toooold

Those Magnificent Underdogs Competing with ChatGPT

2023-04-08T00:00:00+00:00

Disclaimer: the open source community and the AI community evolve so fast. This blog post can only include content up to early April 2023. The cover image is generated using Midjourney.

Disclaimer #2: I root for underdogs because only underdogs can democratize AI.

Those Magnificent Men in their Flying Machines; Or, How I Flew from London to Paris in 25 Hours and 11 Minutes is a 1965 British period comedy film that satirizes the early years of aviation. (Wikipedia)

The open source community has been seeking independence from OpenAI and ChatGPT, just as the flying machines of the early 1900s sought freedom from gravity. In early March, Stanford HAI shared a successful approach, “Alpaca: A Strong, Replicable Instruction-Following Model”, and showed that instruction tuning was a promising path. The underdogs’ race began!

Rethink “Large” in LLM

LLM as “Large language model” always implied “Yes, you need a large model”. Stanford’s Alpaca brought us an important message: a smaller model with limited instruction tuning can perform well on major tasks. Let’s break it down into two pieces: smaller model and major tasks.

Before Alpaca applied instruction tuning to Llama’s 7B model, people believed that size was critical for GPT-equivalent performance and that we would need a 175B model to be comparable with GPT-3. Alpaca showed this was not quite true once a sufficiently powerful language model had good instruction tuning data. Alpaca started from Llama’s pretrained model, used a high quality but very small tuning dataset of 52k samples generated by a GPT model, and built an LLM with the conversation ability that Llama lacked.

Alpaca and Llama also showed that an LLM doesn’t have to perform well on every task, so skills and knowledge in a model can be largely independent. For example, Alpaca and Llama 7B didn’t do well on programming tasks because programming depends heavily on domain knowledge, but that didn’t prevent Alpaca from being good at conversation and common tasks. Instruction tuning provides a step-by-step way to add more knowledge to Alpaca while reusing its learned conversation ability. With an additional 20k programming-specific samples, codealpaca performs well on many programming tasks, and we can ask it to “write a function to flip a binary tree”.

On the other hand, OpenAI kept showing engineering debt on its very large models: limited availability, a cap of 25 GPT-4 queries per 3 hours for ChatGPT Plus customers, and so on. Such observations make us wonder: could a smaller LLM be the right way to go?

By the way, Llama and Alpaca 7B have become the new ‘Doom’ of the AI era. We keep seeing them run on the cheapest MacBook Air, a Raspberry Pi 4, or a Google Pixel 6 phone.

Does it run LLaMA 7B? is the new Does it run Doom? – @ylecun

More underdogs join the race

Llama and Alpaca started the race, and more small LLM underdogs joined in. They brought more data to improve Alpaca, faster tuning methods, or other network structures to replace Llama.

Alpaca needs more tuning data. Guanaco, from “Guanaco: A Multilingual Instruction-Following Language Model Based on LLaMA 7B”, introduced about 530k additional samples in multiple languages by rewriting the Alpaca instructions in different languages and adding new instructions for cross-language alignment, content understanding, etc. Language-specific models like “Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model” and Chinese-LLaMA-Alpaca provided optimizations as well. Vicuna, from “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality”, focused on improving Alpaca’s chat function.

Low-Rank Adaptation from Microsoft, called “LoRA”, helped a lot in speeding up the tuning. The idea is elegant: the approach is to freeze the pretrained weights and “inject trainable rank decomposition matrices into each layer of the Transformer architecture”, so far fewer parameters are trained and the tuning could be 3x faster. The LoRA technique is also useful beyond language models; for example, it enables faster tuning of Stable Diffusion for text-to-image tasks. Please feel free to read further here.
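To make the low-rank idea concrete, here is a minimal pure-Python sketch, not the actual LoRA implementation. It assumes a single frozen weight matrix `W` of shape d x k and two small trainable matrices `A` (r x k) and `B` (d x r); the function names and the toy sizes are made up for illustration. Starting with `B` at zero, as the LoRA paper does, leaves the output of the frozen model unchanged before any tuning step.

```python
import random

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """h = W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d, k, r):
    """Trainable parameter counts for (full fine tuning, low-rank update)."""
    return d * k, r * (d + k)

random.seed(0)
d, k, r = 6, 4, 2  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init: the update starts as a no-op
x = [1.0, -2.0, 0.5, 3.0]

print(lora_forward(W, A, B, x) == matvec(W, x))  # True before any training
print(trainable_params(4096, 4096, 8))           # (16777216, 65536)
```

For a hypothetical 4096 x 4096 layer with rank 8, the low-rank update trains 256 times fewer parameters than the full matrix, which is where the savings come from.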

Meanwhile, we learned that Llama was not critical to this framework and could be replaced. Meta’s Llama didn’t allow any commercial use of its code or weights. Lit-llama rewrote the Llama inference code for more independence, but it still had to use the Llama weights. The open source community provided us with a few alternatives, of which GLM and RWKV were the two most promising.

GLM, from “GLM: General Language Model Pretraining with Autoregressive Blank Infilling”, is a family of models of different sizes. It takes a different approach from Meta’s Llama, and its 6B model with chat function is available as ChatGLM. RWKV, meanwhile, is unique: instead of following the stacked decoder transformer structure of GPT, it uses a recurrent network like an RNN, so its context length is theoretically unlimited and its inference is much faster with less memory cost. RWKV can reach transformer quality, and its conversation version is available as ChatRWKV.

Of course, we didn’t forget about the old GPT folks. Databricks open sourced Dolly, which uses a GPT-NeoX network structure with instruction tuning applied. The results were not bad!

We can compare LLM performance using the Language Model Evaluation Harness framework; the current benchmark can be found here: https://bellard.org/ts_server/. So far, Llama has performed the best in the race.

More community support

Inspired by Alpaca, instruction tuning with self-instruct became very popular, and frameworks made fine tuning easier. xtuning is a nice and easy-to-use framework, and it recently announced INT4 tuning with Alpaca-Lora. Tuning with knowledge from GPT-4 was also a good idea, so “Instruction Tuning with GPT-4” pushed data acquisition to the next level. The GLM team also brought in more efficient tuning methods like P-tuning-v2.

The community also pursued independence from GPUs. Starting in early March, projects like llama.cpp and alpaca.cpp provided engineering optimizations to run quantized models on CPU. We must remember there is “no free lunch”: quantization can lose precision and degrade output quality. Please refer to the LLM benchmark mentioned above for more details.
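The precision cost of quantization can be seen in a small round-trip experiment. The sketch below uses a deliberately simple symmetric, per-tensor scheme with made-up weights; it is an assumption for illustration and not the block-wise format that llama.cpp itself uses.

```python
def quantize(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]

def max_error(weights, bits):
    """Largest absolute round-trip error at the given bit width."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

weights = [0.12, -0.53, 0.98, -0.07, 0.33]  # hypothetical weights
for bits in (8, 4):
    print(bits, "bits, max round-trip error:", max_error(weights, bits))
```

Fewer bits mean a coarser grid, so the 4-bit error comes out clearly larger than the 8-bit error. That is the lunch we pay for.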

Downstream tools like llama-index and LangChain support these open source GPT competitors as alternative backends. Please refer to the llama-index and LangChain documentation for more details on using a custom LLM.

Is Alpaca a perfect approach?

Alpaca brought great attention to the underdog race, but we have to admit a few drawbacks: legal issues, data bias, and weakness on coding and math questions.

Alpaca used Llama as its source structure, but Llama didn’t allow commercial use, and its weights were not public unless access was approved through an application form.

The 52k instruction tuning samples in alpaca_data.json had nice diversity, but a follow-up study revealed quality issues; a fix can be found here: https://github.com/gururise/AlpacaDataCleaned

GPT-4 now becomes more powerful in math and reasoning, but Alpaca still can’t acquire enough tuning data for such tasks.

Summary for now

When a heavier-than-air flying machine finished a trip from London to Paris in 25 hours 11 minutes in 1910, no one knew that we would send humans to the moon about 60 years later. It could only happen because people worked together to explore all the possibilities. I believe Alpaca is one of the first flying machines of the AIGC era, and we will soon have open source implementations that outperform GPT models.

Understand Twitter’s Recommendation system with a diagram using GPT-4

2023-03-31T00:00:00+00:00

Disclaimer: the following blog post is mostly generated by GPT-4. The image is generated by Midjourney. I used the following prompt to produce a diagram and a short blog post for highlights:

the following text is about a recommendation system, can you generate a diagram of the design by understanding the blog and describe the diagram in Mermaid syntax:

please depict the relationships in Mermaid for the diagram and use a modern color theme for light background:

can you add more details into each modules into the diagram?

please summarize the process in a blog post

Today Twitter open sourced its recommendation system (link). I took this opportunity to experiment with GPT-4’s ability to summarize with a diagram. Let’s start with a mindmap diagram:

which was rendered by the following code:

graph LR
  A[Home Mixer] --> B[Candidate Sources]
  B --> C[In-Network Source]
  B --> D[Out-of-Network Sources]
  D --> E[Social Graph]
  D --> F[Embedding Spaces]
  B --> G[Ranking]
  G --> H[Heuristics, Filters, and Product Features]
  H --> I[Mixing and Serving]

  subgraph In-Network Source
    C --> RealGraph[Real Graph]
    C --> LR1[Logistic Regression Model]
  end

  subgraph Out-of-Network Sources
    E --> GraphJet[GraphJet]
    E --> LR2[Logistic Regression Model]
    F --> SimClusters[SimClusters]
  end

  subgraph Ranking
    G --> NN[Neural Network]
  end

  subgraph Heuristics, Filters, and Product Features
    H --> VisibilityFilter[Visibility Filtering]
    H --> AuthorDiversity[Author Diversity]
    H --> ContentBalance[Content Balance]
    H --> FeedbackFatigue[Feedback-based Fatigue]
    H --> SocialProof[Social Proof]
    H --> Conversations[Conversations]
    H --> EditedTweets[Edited Tweets]
  end

  class A,B,C,D,E,F,G,H,I modern
  classDef modern fill:#ffffff,stroke:#4c4c4c,stroke-width:2px,font-family:Arial,font-size:14px,font-weight:bold
  class A color1
  class B color2
  class C color3
  class D color4
  class E color5
  class F color6
  class G color7
  class H color8
  class I color9

  classDef color1 fill:#f9d5e5,stroke:#e06377
  classDef color2 fill:#eeac99,stroke:#c83349
  classDef color3 fill:#e06377,stroke:#c83349
  classDef color4 fill:#c83349,stroke:#801336
  classDef color5 fill:#5b9aa0,stroke:#3b6064
  classDef color6 fill:#d6e4aa,stroke:#b3cc74
  classDef color7 fill:#e3eaa7,stroke:#c0d461
  classDef color8 fill:#96e6b3,stroke:#5ab39a
  classDef color9 fill:#d5e1dd,stroke:#9db3b5

Introduction

Twitter serves as a vital platform to deliver the best of what’s happening in the world right now. With approximately 500 million tweets posted daily, it becomes essential to have a recommendation algorithm that brings the most relevant tweets to each user’s For You timeline. This blog post will provide an overview of Twitter’s recommendation system and its components to help you understand how it curates the best content for your timeline.

Overview of the Recommendation System

Twitter’s recommendation system is made up of interconnected services and jobs that work together to create a personalized timeline. The process consists of three main stages:

Candidate Sourcing: Fetching the best tweets from different recommendation sources.
Ranking: Scoring each tweet using a machine learning model.
Applying Heuristics and Filters: Filtering out unwanted content and applying additional product features to create a diverse feed.

Key Components of the System:

Home Mixer: The service responsible for constructing and serving the For You timeline. It connects various candidate sources, scoring functions, heuristics, and filters.
Candidate Sources: Several sources retrieve recent and relevant tweets for users, both from people they follow (In-Network) and those they don’t (Out-of-Network). The In-Network source ranks tweets from people you follow using a logistic regression model, while the Out-of-Network source uses the Social Graph and Embedding Spaces to find relevant content from people you don’t follow.
Ranking: A neural network with millions of parameters is used to rank the candidate tweets based on their relevance. It is continuously trained on tweet interactions to optimize for positive engagement.
Heuristics, Filters, and Product Features: After ranking, various heuristics and filters are applied to implement product features and create a balanced and diverse feed. Examples include visibility filtering, author diversity, content balance, feedback-based fatigue, social proof, conversations, and handling edited tweets.
Mixing and Serving: Finally, the Home Mixer blends together tweets with other non-tweet content like ads, follow recommendations, and onboarding prompts. These are sent to your device to be displayed.

The recommendation pipeline runs around 5 billion times per day, with each execution taking less than 1.5 seconds on average.

Future Developments

Twitter is constantly working on improving its recommendation systems and providing more transparency to users. Some planned features include a better analytics platform for creators, greater transparency into safety labels, and increased visibility into why tweets appear on your timeline.

Conclusion

Twitter’s recommendation system is a complex and efficient process that serves over 150 billion tweets to users’ devices every day. By understanding the key components and stages involved, you can better appreciate the technology behind the personalized content that appears on your Twitter feed.

Collaborating with an AI Assistant to Discover a JPEG2000 Decoder Vulnerability

2023-03-20T00:00:00+00:00

How I teamed up with ChatGPT, an AI language model, to identify and analyze a critical security vulnerability in a JPEG2000 decoder.

Disclaimer: I used ChatGPT to write this blog post too. The title image is generated using Midjourney.

Introduction

In today’s digital world, security vulnerabilities in software can have far-reaching consequences. The discovery and mitigation of these vulnerabilities are crucial to ensuring the safety and reliability of the systems we depend on. This is a story of how I worked together with ChatGPT, an AI language model developed by OpenAI, to discover a known vulnerability (CVE-2016-2213) in a JPEG2000 decoder and explore possible solutions to address the issue, without initially revealing its CVE number.

The test sample

The code sample I shared with ChatGPT was taken from the SySeVR GitHub repository (https://github.com/SySeVR/SySeVR), which accompanies the 2021 research paper by Li et al., “SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities” (https://arxiv.org/abs/1807.06756). The goal was to evaluate ChatGPT’s capability to independently discover and analyze a known vulnerability without disclosing its CVE number.

I provided ChatGPT with the content of specific C code CVE-2016-2213_VULN_jpeg2000_decode_tile.c from the .7z package under SySeVR/Program data/NVD/, without mentioning the file name or any CVE number. The C code source file has 175 lines, starting with:

static int jpeg2000_decode_tile(Jpeg2000DecoderContext *s, Jpeg2000Tile *tile,
                                AVFrame *picture)
{
    const AVPixFmtDescriptor *pixdesc = av_pix_fmt_desc_get(s->avctx->pix_fmt);
    int compno, reslevelno, bandno;
    int x, y;
    int planar    = !!(pixdesc->flags & AV_PIX_FMT_FLAG_PLANAR);
    int pixelsize = planar ? 1 : pixdesc->nb_components;

    uint8_t *line;
    Jpeg2000T1Context t1;

    /* Loop on tile components */
    for (compno = 0; compno < s->ncomponents; compno++) {
        Jpeg2000Component *comp     = tile->comp + compno;
        Jpeg2000CodingStyle *codsty = tile->codsty + compno;

        t1.stride = (1<<codsty->log2_cblk_width) + 2;

Discovering the vulnerability

After I shared the code, ChatGPT analyzed it and identified an integer overflow vulnerability in the jpeg2000_decode_tile() function. This vulnerability could potentially lead to memory corruption, application crashes, or even arbitrary code execution. Importantly, the analysis by ChatGPT aligned with the known vulnerability, CVE-2016-2213. Here is what ChatGPT produced:

Here is the line that might cause an issue:

t1.stride = (1<<codsty->log2_cblk_width) + 2;

In this line, it’s possible for codsty->log2_cblk_width to have a large value, which might cause the left shift operation to result in an integer overflow. This can subsequently lead to an incorrect stride value being computed, which might cause memory access issues or other unintended behavior.

To address this issue, you could add checks to ensure that the log2_cblk_width value falls within an acceptable range before performing the shift operation.

Assessing the impact

ChatGPT provided a summary of the vulnerability in the format of an NVD (National Vulnerability Database) report. The report outlined the potential impact of this vulnerability, including the risk of memory corruption, application crashes, and the possibility of arbitrary code execution.

NVD Report

Title: Integer Overflow in JPEG2000 Decoder

CVE ID: [To be assigned by CVE Numbering Authority]

Description: An integer overflow vulnerability has been identified in the jpeg2000_decode_tile() function of a JPEG2000 decoder. The issue arises from an unchecked left shift operation, which could result in an incorrect stride value being computed. This may lead to memory access issues or other unintended behavior.

Affected Software/OS: [Affected software or operating system details]

Impact: An attacker could potentially exploit this vulnerability to cause memory corruption or application crashes, potentially leading to a denial of service or the execution of arbitrary code.

CVSS v3.1 Base Score: [To be calculated based on the specific impact, attack vector, and other factors]

Vector String: [To be determined based on the specific impact, attack vector, and other factors]

Solution: To mitigate this vulnerability, developers should add checks to ensure that the log2_cblk_width value falls within an acceptable range before performing the left shift operation. Additionally, employing secure coding practices, code reviews, and thorough testing can help minimize the risk of similar vulnerabilities in the future.

References: [Relevant references or documentation]

The true NVD page for this vulnerability can be found at https://nvd.nist.gov/vuln/detail/CVE-2016-2213. Although the description and impact are impressive, we still need to provide “References to Advisories, Solutions, and Tools” as NVD requires, as well as the CVSS v3.1 Base Score and Vector String.

Exploring potential solutions

In the “References to Advisories, Solutions, and Tools” section of the NVD report, ChatGPT provided several possible solutions and best practices, including bounds checking, code review, static code analysis, fuzz testing, secure coding practices, patch management, and vulnerability management. These recommendations not only provided a starting point for addressing the vulnerability at hand but also served as guidelines for improving the overall security and robustness of the code.

ChatGPT suggested a few actions with details, for example,

Bounds checking: Ensure that the log2_cblk_width value falls within a safe range before performing the left shift operation. For example:

if (codsty->log2_cblk_width >= 0 && codsty->log2_cblk_width < 32) {
    t1.stride = (1 << codsty->log2_cblk_width) + 2;
} else {
    // Handle the error or set a default stride value
}

Determining the severity level

As noted above, ChatGPT suggested using the Common Vulnerability Scoring System (CVSS) v3.1 to calculate a base score. The actual CVSS 3.x Severity and Metrics for CVE-2016-2213 are: Base Score: 6.5 MEDIUM, Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:N/I:N/A:H. ChatGPT’s assessment of the vulnerability aligned well with the true severity level, demonstrating its effectiveness in vulnerability discovery and analysis.
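For readers curious how the 6.5 follows from the vector string, here is a minimal sketch of the CVSS v3 base score formula. It assumes Scope: Unchanged (which is what S:U in this vector means) and does not handle Scope: Changed; the metric weights are the published CVSS v3 values, while the function names are my own.

```python
import math

# Metric weights from the CVSS v3 specification (Scope: Unchanged only).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}

def roundup(value):
    """Round up to one decimal place, working on integers to avoid float drift."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0

def base_score(vector):
    """Compute the base score for a vector such as 'CVSS:3.0/AV:N/...'."""
    parts = dict(item.split(":") for item in vector.split("/")[1:])
    if parts["S"] != "U":
        raise ValueError("this sketch only handles Scope: Unchanged")
    w = {metric: WEIGHTS[metric][parts[metric]] for metric in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))

print(base_score("CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:N/I:N/A:H"))  # 6.5
```

The only impact here is on availability (A:H), and the required user interaction (UI:R) lowers exploitability, which is why the score lands in the MEDIUM band.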

Conclusion

Working with ChatGPT to independently discover and analyze the known JPEG2000 decoder vulnerability, CVE-2016-2213, was a valuable and enlightening experience. The collaboration demonstrated the power of AI language models in assisting with complex tasks, such as vulnerability discovery and analysis. By combining human expertise and AI capabilities, we can tackle security challenges more effectively, making our digital world a safer place. The blind test showed that ChatGPT can provide accurate and insightful analysis even when critical information, such as CVE numbers, is withheld.

Make a CatGPT out of ChatGPT

2023-02-02T00:00:00+00:00

Since ChatGPT is good at hallucination, why not make a CatGPT out of it? Let’s engineer the prompt step by step:

(A maine coon cat writing a scientific paper, image generated by Stable Diffusion)

Step #1: do you know `cat`?

Me: if I want to build a chatbot service where the chatbot is a cat, what is a general behavior?

ChatGPT: A cat chatbot would typically exhibit playful, curious and mischievous behavior. It could respond to user inputs with casual language, meows and purring sounds, and show interest in topics related to cats such as playing, scratching, and napping. Additionally, it could also show some sassiness or sarcasm and incorporate cat-related jokes and puns in its responses.

Step #2: can you think like a cat?

Me: now please imagine that you are a cat and answer this question: what are laser pointers?

ChatGPT: Meow! Laser pointers are fascinating toys for us cats. They are little red dots that we can chase and play with for hours on end. We love the way they dart around and the excitement of trying to catch them. It’s the perfect mix of exercise and entertainment. Purrfect!

Step #3: can you write a scientific paper like a cat?

Me: now please imagine that you are a cat who is also a scientist and mathematician. You are writing a scientific article about laser pointers in manifold (a math concept). Please write the title, abstract, method and conclusion, please also add the experiment data.

ChatGPT:

Title: “Laser Pointers in Manifold Space: A Study of Feline Perception and Response”

Abstract:

Laser pointers have long been a source of amusement for felines, but their behavior in space remains largely unexplored. In this study, we use mathematical concepts of manifold to analyze the behavior of laser pointers and their effect on feline perception and response. Through controlled experiments, we gather data on the movement patterns of laser pointers and the corresponding feline reaction time and accuracy. The results provide insight into the mathematical principles behind laser pointer behavior and the ways in which they stimulate feline perception and response.

Methods:

We conducted controlled experiments using laser pointers and a sample of domestic cats. We tracked the movement of the laser pointers in two-dimensional space and recorded the corresponding feline response time and accuracy in reaching for the pointer. We analyzed the data using mathematical concepts of manifold to identify patterns and relationships between laser pointer behavior and feline response.

Data:

Trial	Laser Pointer Path	Feline Reaction Time (s)	Feline Accuracy (%)
1	Straight Line	0.3	80
2	Zigzag	0.4	70
3	Spiral	0.5	60
4	Random	0.6	50
5	Sine Wave	0.7	40

Conclusion:

Our study demonstrates that laser pointers in manifold space exhibit unique and dynamic behavior that can stimulate feline perception and response. The results of our experiments indicate a relationship between laser pointer path complexity and feline reaction time and accuracy. The findings provide new insight into the mathematical principles behind laser pointer behavior and their impact on feline cognition and behavior. Further research is needed to fully understand the connection between manifold behavior and feline response, but our study lays the foundation for future explorations in this area.

Guess the size of an atomic bomb and an iOS supply chain attack

2022-08-24T00:00:00+00:00

Disclaimer: Nothing in this blog is related to the author’s current day-to-day work. The content is neither affiliated nor sponsored by any companies. The story in this post is based on a true event that happened in two parts, six years apart, and is full of nostalgia.

In 1945, the famous physicist Enrico Fermi devised an experiment to measure the energy of an atomic bomb*:

“About 40 seconds after the explosion the air blast reached me. I tried to estimate its strength by dropping from about six feet small pieces of paper before, during, and after the passage of the blast wave.” …… “The shift was about 2 1/2 meters, which, at the time, I estimated to correspond to the blast that would be produced by ten thousand tons of T.N.T.” – Wikipedia

How did it relate to the iOS supply chain attack? Let’s jump ahead to 2015.

Part I: 2015

Back in 2015, I had a chance to join a talented team, and we worked on DNS traffic analysis and machine learning models for cybersecurity. One day, our experimental anomaly detection system discovered a domain name we had never seen before that was experiencing a sudden traffic spike. It was init.icloud-analysis.com, which was later identified as the primary command-and-control domain for the XcodeGhost malware *, a large scale iOS supply chain attack.

We knew it would go viral because it was an anomaly that caused a spike of 220k unique IPs in a matter of hours, so we had to figure out the answer for our customers so they could take action as soon as possible:

What was the XcodeGhost impact, measured in number of infected devices?

Fortunately, we had collected DNS traffic data from our global collaborative customers and discovered that it was trending globally. However, DNS traffic data was the only source of information. It was sampled at random, with a strong regional bias, and it lacked any device information, whether iOS or Android, PC or mobile. With such limited data, how could we calculate the number of infected devices?

Answering the question, “Customers won’t need the exact number to make decisions, right?” led to the breakthrough. No, they won’t need it because they’d take actions whether there were 9 million or 20 million. As a result, Fermi estimation * provided a clever solution.

“An example is Enrico Fermi’s estimate of the strength of the atomic bomb that detonated at the Trinity test, based on the distance traveled by pieces of paper he dropped from his hand during the blast. Fermi’s estimate of 10 kilotons of TNT was well within an order of magnitude of the now-accepted value of 21 kilotons.” Wikipedia *

The exact procedure of Fermi’s example included some physics details* but long story short, we would need to find all scaling factors to connect our observations in the sample data to the target measurement. We needed these three factors to guess how the entire dataset would look:

DNS data sampling rate
sampling bias in different regions
number of iOS devices of each client IP

The first could tell the global scale of data sampling; the second could correct for bias; and the third was required because DNS data only records its client IP, which could be a router, phone station, school network, etc., but we wanted the number of iOS devices.

The first was simple, or rather we had little choice: we had to accept the 1% sampling rate agreed with our customers. This 1% also accounted for data collection efficiency.

The sampling bias by region was tricky: in 2015, we only had data from the EU, Latin America, Australia, and South Asia, with no visibility into North America or China, the world’s two largest telecom regions. Fortunately, we had some snapshot samples from a pilot program about a year earlier with a list of customers that included North America and China. Although the data from North America and China had been discontinued, some other customers remained with the program, so we used them as anchor points: we compared the ratio of North America and China vs. the anchor customers a year earlier to the ratio of global data vs. the anchor customers in 2015, and concluded that the sampling bias rate was about 0.5, which means scaling up by a factor of 2.

The number of iOS devices behind each client IP was solved with probability. We found that each IP had about 2 mobile devices on average, each of which could be either Android or iOS, and that the Android vs. iOS market share was roughly 1:1. So, given that we saw the XcodeGhost domain, which indicates at least one iOS device behind this IP, what was the expected number of iOS devices? The answer was 1.33. Please consider it a good exercise and have some fun with it.
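For readers who want to check the exercise, one way to get 1.33 is to enumerate the possibilities. The sketch below assumes exactly 2 devices per IP, each independently iOS with probability 1/2, and conditions on at least one of them being iOS; the function name is made up for illustration.

```python
from fractions import Fraction
from itertools import product

def expected_ios_given_at_least_one(devices=2, p_ios=Fraction(1, 2)):
    """E[number of iOS devices | at least one iOS device], by enumeration."""
    weighted_sum = Fraction(0)
    total_prob = Fraction(0)
    for combo in product([0, 1], repeat=devices):  # 1 = iOS, 0 = Android
        k = sum(combo)
        prob = p_ios ** k * (1 - p_ios) ** (devices - k)
        if k >= 1:
            weighted_sum += k * prob
            total_prob += prob
    return weighted_sum / total_prob

answer = expected_ios_given_at_least_one()
print(answer, float(answer))  # 4/3, about 1.333
```

Of the four equally likely combinations, three contain at least one iOS device, and together they hold four iOS devices, hence 4/3.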

So the estimated number of XcodeGhost infected devices became

\[220k \times 100 \times 2 \times 1.333 = 58.7 M\]

Why do we need the Fermi estimation method if these appear to be just common sense factors? Because the method also answers the follow-up question: what are the upper and lower limits? In a Fermi estimate, each factor is assumed to be over- or underestimated by a factor of about 2, and since independent errors add in quadrature on a logarithmic scale, $n$ factors give an overall uncertainty factor of $2^{\sqrt{n}}$. For 3 scaling factors, the upper limit is $58.7 M \times 2^{\sqrt{3}} = 195.0 M$ and the lower limit is $58.7 M \div 2^{\sqrt{3}} = 17.7 M$, so we could report to our customers:

The global number of XcodeGhost infected devices was approximately 59 million, with a range of 18 million to 195 million, so please take action.
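The arithmetic behind that report can be reproduced in a few lines. This sketch uses the factors from the text (220k unique IPs, a 1% sampling rate, a regional bias rate of 0.5, and 4/3 iOS devices per IP) and the rule of a factor of 2 per scaling factor; the variable names are mine.

```python
import math

observed_ips = 220_000    # unique client IPs seen in the sampled DNS data
sampling_factor = 100     # 1 / 0.01 sampling rate
bias_factor = 2           # 1 / 0.5 regional sampling bias rate
ios_per_ip = 4 / 3        # expected iOS devices behind an infected IP

estimate = observed_ips * sampling_factor * bias_factor * ios_per_ip

# Each of the 3 scaling factors may be off by about 2x; independent errors
# combine as sqrt(3) on a log scale, giving an overall factor of 2**sqrt(3).
spread = 2 ** math.sqrt(3)
lower, upper = estimate / spread, estimate * spread

print(f"estimate: {estimate / 1e6:.1f} M")
print(f"range: {lower / 1e6:.1f} M to {upper / 1e6:.1f} M")
```

Using 4/3 instead of the rounded 1.333 shifts the last digit slightly, but the rounded figures stay at about 59 million, with a range of roughly 18 to 195 million.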

Yes, there could be up to 200 million infected devices! Without knowing each step above, it was impossible to believe.

Part II: 2021

How time flies! In what felt like a snap of the fingers, the company was acquired in 2017 and the team moved on. Let’s fast forward to 2021, when the lawsuit between Epic Games and Apple revealed some internal Apple emails mentioning XcodeGhost as evidence in court:

“In total, 128M customers have downloaded the 2500+ apps that were affected LTD. Those customers drove 203M downloads of the 2500+ affected apps LTD,”

“China represents 55% of customers and 66% of downloads. As you can see, a significant number (18M customers) are affected in the US.”

The mystery of the number of XcodeGhost infected devices was solved. Our estimate from a tiny, biased sample of DNS traffic turned out to be quite accurate: 59 million against the true 128 million is off by about a factor of 2, the same margin as Fermi’s 10 kilotons of TNT against the true value of 21 kilotons!

Summary

The population distribution could also be used to estimate regional sampling bias because we could assume one device per person, yielding a factor of about 0.3. We chose the anchor method so that we could better justify it to customers. Actually, a factor of 0.3 or 0.5 was sufficient for this task.

Surely, some details were changed to simplify the story. We were also looking for the source of this infection. The correlation model, later known as domain2vec, provided an answer. The anomaly domain init.icloud-analysis.com had a numerical dense vector as its distributed representation, and its learned vector had small open angles with a few other vectors for domains of popular Chinese apps. This observation also reminded us of the regional sampling bias.
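The “small open angle” comparison is just the angle between two vectors, computed from their cosine similarity. The sketch below is not domain2vec itself; the three-dimensional vectors are made up purely for illustration, whereas real learned vectors would have many more dimensions.

```python
import math

def angle_degrees(u, v):
    """Angle between two vectors, derived from their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

# Hypothetical domain vectors, not real data.
anomaly_domain = [0.9, 0.1, 0.3]
popular_app_domain = [0.8, 0.2, 0.35]
unrelated_domain = [-0.2, 0.9, -0.4]

print(angle_degrees(anomaly_domain, popular_app_domain))  # small angle
print(angle_degrees(anomaly_domain, unrelated_domain))    # large angle
```

Domains whose vectors sit at a small angle to the anomaly tend to appear in similar traffic contexts, which is the kind of hint that pointed toward the affected apps.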

Estimation with limited information, or “Guesstimation,” is widely used in many fields, including quantitative trading (such as market-sizing), astronomy (such as the number of planets on which aliens can live), and computer science (like the number of rows in a large table to optimize SQL queries). In each field, an estimate within an order of magnitude is considered accurate for making strategic decisions.

Reference

Trinity Test https://en.wikipedia.org/wiki/Trinity_(nuclear_test)
XcodeGhost https://en.wikipedia.org/wiki/XcodeGhost
Fermi Problem https://en.wikipedia.org/wiki/Fermi_problem
How exactly Fermi estimated the atomic bomb https://physics.stackexchange.com/questions/19567/how-did-enrico-fermi-calculate-the-classical-fermi-problem
Security firm publishes list of some of the iOS apps infected by XcodeGhost https://9to5mac.com/2015/09/21/xcodeghost-infected-apps/
Emails reveal 128 million iOS users were affected by ‘XcodeGhost’ malware https://9to5mac.com/2021/05/07/emails-reveal-128-million-ios-users-were-affected-by-xcodeghost-malware/
Archive post of XcodeGhost discovery in wooyun (in Chinese) https://paper.seebug.org/papers/Archive/drops2/Xcode%E7%BC%96%E8%AF%91%E5%99%A8%E9%87%8C%E6%9C%89%E9%AC%BC%20%E2%80%93%20XcodeGhost%E6%A0%B7%E6%9C%AC%E5%88%86%E6%9E%90.html

Zero hacking problem: do we really protect the customers?

2022-08-17T00:00:00+00:00

Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated nor sponsored by any companies.

A few weeks ago, Jason came to me with a question from his product manager buddy:

How can we tell the effectiveness of our product if a customer used our cloud security solution to protect 100 cloud machines and saw no hacking?

If my dear readers don’t know Jason, please refer to the previous blog post.

“So, if we don’t respond, the customer will stop renewing the protection subscription?” I said. Jason was virtually nodding on the screen.

Surely, business is business, so we must respond to this question. We could respond in this way: we knew the malware infection or hacking rate on the cloud was 5% before we implemented our security solution, and we reduced it to 1% after we implemented it.

“But the customer saw zero hacking events, and I mean ZERO in every way, even before using our product. So, how do we persuade them?” Jason said.

Let’s toss coins to understand the observation dilemma. If I toss a fair coin 100 times and see all heads and no tails, the probability is $\frac{1}{2^{100}}=7.9\times10^{-31}$. If I toss a double-headed coin 100 times and see all heads, the probability is, surely, $1$. So, if I didn’t know which coin I picked and I just tossed it 100 times and saw all heads, which coin did I choose, the fair one or the double-headed one?

“It sounds like the double-headed coin.” Jason said. “However, it’s still possible that you chose the fair one; it’s just very unlikely.”

Yes, and Jason’s intuition can be quantified in the probability way.

If we know the malware or hacking infection rate is 5% and no hacking occurred on 100 independent machines, the probability was $0.95^{100}=0.6\%$. With our security solution, the malware or hacking rate was reduced to $1\%$, and we saw no hacking, so the probability became $0.99^{100}=36.6\%$. We increased the likelihood of none of these 100 machines being hacked from $0.6\%$ to $36.6\%$!
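The arithmetic above can be checked in a few lines. This is only a minimal sketch: the function name `prob_no_hack` is illustrative, and it assumes, as the text does, that the machines are independent.

```python
def prob_no_hack(rate: float, machines: int) -> float:
    """Probability that none of `machines` independent machines is hacked."""
    return (1.0 - rate) ** machines


# Coin analogy: 100 heads in a row from a fair coin.
fair_coin_all_heads = 0.5 ** 100

without_protection = prob_no_hack(0.05, 100)  # 5% infection rate
with_protection = prob_no_hack(0.01, 100)     # 1% infection rate

print(f"fair coin, 100 heads: {fair_coin_all_heads:.1e}")
print(f"no hacking at a 5% rate: {without_protection:.1%}")
print(f"no hacking at a 1% rate: {with_protection:.1%}")
```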

“Please wait; I understand probability as well. We can’t convince the customer that $36.6\%$ is significantly better than $0.6\%$ because we saw no hacking events in the first place.” Jason stated.

Jason, you were correct. We had only the zero-event observation, so we had to state how confident we were: the $95\%$ confidence interval is the range of assumed adverse event rates under which that observation still has a probability of $5\%$ or more.

“Adverse event” meant that even after we applied the protection, the machine was still hacked. What a stroke of bad luck.

When referring to confidence interval, like $95\%$ confidence interval for most statistical problems, we use this equation for binomial distribution:

\[\widehat{p}\pm Z_{1-\alpha/2}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]

But the zero-observation case doesn’t satisfy the underlying assumption of a normal distribution. Fortunately, we can go back to the original definition of a confidence interval. To get the $95\%$ confidence interval, we need:

\[(1-p)^n=0.05\]

where $p$ is the assumed probability of “bad luck” adverse events. Thus, we have

\[n\ln(1-p)=\ln(0.05)\]

Since $\ln(1-p)\approx-p$ by Taylor expansion for small $p$ (“Yes, I remember it from my information theory class!” Jason said.) and $-\ln(0.05)\approx3$, the upper bound of the $95\%$ confidence interval is

\[p\approx\frac{3}{n}\]

For $n=100$ in our case, the $95\%$ confidence interval for $p$ ran from $0$ to $0.03$, which meant that observing zero hacking events on 100 independent machines was compatible with a malware or hacking rate of $0\%$ to $3\%$. If the customer didn’t have our security solution, $p$ was $5\%$, which was outside the boundary and very dangerous; our security solution reduced it to $1\%$, which was within the $95\%$ confidence boundary.
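As a rough check of how good the $3/n$ shortcut is, the sketch below solves $(1-p)^n=0.05$ exactly and compares the result with the approximation. The function names are illustrative, and the machines are assumed to be independent.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact solution p of (1 - p)^n = 1 - confidence."""
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


def rule_of_three(n: int) -> float:
    """Approximate upper bound 3 / n, valid for small p."""
    return 3.0 / n


exact = zero_event_upper_bound(100)
approx = rule_of_three(100)
print(f"exact upper bound: {exact:.4f}, rule of three: {approx:.4f}")
```

For 100 machines the two bounds differ only in the third decimal place, so the shortcut is good enough for a conversation with a customer.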

“So the customer should renew the product!” Jason said.

“But we have to clarify that the 100 machines must be independent because …… wait, get back to the meeting! Jason, can you hear me?” I said.

Reference

Confidence interval with zero events http://www.pmean.com/01/zeroevents.html
Binomial confidence intervals for rare events: importance of defining margin of error relative to magnitude of proportion https://arxiv.org/pdf/2109.02516.pdf

Linkedin spam: a case study of robust feature engineering

2022-08-03T00:00:00+00:00

Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated nor sponsored by any companies. I am not employed by Linkedin.

On Linkedin, I believe many of us have received such “please add me to your professional network” requests:

job titles like “Technical Director” or “Technical Advisor”
profile pictures are all Asian ladies, most of them are Photoshop retouched
they work in tech industry; instead of MAANG companies, their companies are generally well known in each industrial category.
they graduated from some top rank universities, mostly in China
not much recent activity on Linkedin

and a few other features that I hope everyone notices. I believe Linkedin’s fraud detection team has worked very hard, but such account fraud and spam attacks must be the work of a sophisticated attacker, because each feature is carefully crafted to be confusing enough as a single feature or in combination. Can we, as Linkedin users, help Linkedin in combating such fraud? We can use it as an example of robust feature engineering.

Features that appear obvious may not be good features. This is a hard lesson learned in cybersecurity, because such features are not robust for classification in a dynamic environment. For example, one can deploy a feature such as “if the job title contains Technical Director” to detect such fraud accounts. Setting aside the model’s false positive rate, the attackers can simply monitor the fraud effectiveness, determine that “Technical Director” is no longer a good title, and quickly change to another one. Such feature drift caused by a change in attack policy can drag the defense team into a mud pit fight.

The same is true for the other features listed above. Do we want to create a deep learning model to detect if a profile picture has been Photoshopped? No, GAN can be used by attackers to generate new portrait images that no one can detect. Do we want to gather all of the good companies in each industrial category? No, the attackers can steal company names from legitimate resumes. Do we intend to shut down all Chinese universities? Universities are not guilty. Do we want to validate the recent Linkedin activity? No, the attackers can join numerous Linkedin groups and spam them, posing a larger problem than account fraud. So, what are our options?

In my previous post about feature engineering, I cited a comment:

“Only algorithms care about right or wrong, security is about cost.”

One approach to robust feature engineering for adversarial environments is to increase the cost of attack so that the features are either difficult to change or impossible to change. Let’s think like an attacker and figure out “what makes my job difficult”: more engineering work, a longer payback loop, data I can’t access, and so on.

The profile URL pattern, for example, provides a small bar-raising feature. The profile ID ‘liu-y-44b114233’ is shown in the sample screenshot as a default ID generated by the onboard system, so “If using default ID” as a feature can force attackers to code their own profile IDs, which adds to the attacking cost.
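A minimal sketch of such a feature is shown below. The real format of system-generated profile IDs is not documented in this post, so the pattern is purely an assumption inferred from the single sample ‘liu-y-44b114233’: a trailing hyphen followed by 8 to 10 alphanumeric characters that include at least one digit. The function name is hypothetical.

```python
import re

# ASSUMPTION: a default ID ends with "-" plus 8 to 10 lowercase alphanumeric
# characters containing at least one digit (inferred from one sample only).
_DEFAULT_SUFFIX = re.compile(r"-(?=[a-z0-9]*\d)[a-z0-9]{8,10}$")


def uses_default_profile_id(profile_id: str) -> bool:
    """Return True if the profile ID looks system-generated rather than customized."""
    return _DEFAULT_SUFFIX.search(profile_id.lower()) is not None


print(uses_default_profile_id("liu-y-44b114233"))
print(uses_default_profile_id("jane-doe"))
```

On its own this is a weak signal, since many legitimate users never customize their ID; its value lies in the extra work it forces on the attacker.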

Another example is recent connection vs lifetime connection, which suggests a recent connection spike to be more fraudulent. Because the account fraud and spam business prefers quick turnover, attackers focus on short-term gain. Features such as “new user pattern,” “connection request rate,” “connection approval rate,” and “connection patterns over time” can slow down an attack and discourage attackers.

Leveraging data that attackers do not have access to can generate new ideas. For example, attackers cannot see the overall Linkedin user pattern, but Linkedin can via the global connection graph. Attackers seek more connections, so they begin with people who always approve new connections for whatever reason. Similarly, in cybersecurity, machines with unpatched vulnerabilities can easily be infected with new malware. The model can leverage this “user-item” interaction feature from the graph for classification by labeling “easy to approve” users with historical fraud detection data. The same concept can be applied to the user’s fingerprint versus the requesting connection fingerprint: do all requesting connections share the same profile patterns? Trust me, graph data features can significantly raise the bar for attack.

Certainly, robust feature engineering is a broad topic with many aspects to combat “drift” from data, including feature engineering, seasonal effect, black swan, adversarial attack, and so on. Furthermore, fraud detection cannot be accomplished by a single machine learning model. It necessitates a data system, feature, model, and operation. Aside from great feature engineering and models, the system should collect user feedback for the model, such as “ignore because I think it is fraud,” and improve the operational process, such as fraud prediction with reasons.

Measure the unmeasurable: botnet and German tanks

2022-07-25T00:00:00+00:00

Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated nor sponsored by any companies. The story in this post is NOT based on a true event.

What if there is no ground truth, but we still need to produce a number?

A few months ago, a security researcher friend of mine, let’s call him Jason, shared an exciting update: he partially reverse-engineered a peer-to-peer botnet protocol and planted a few spy nodes into the network to track the botnet!

What wonderful news! Just so you know, a peer-to-peer or “p2p” botnet* is a network of compromised and controlled computers, “bots,” that use a p2p structure to be controlled without the need for a centralized command-and-control server. To keep the network active and updated, such botnets usually have their own communication protocol. Jason could forge a few network nodes and monitor the traffic without being noticed by the botnet owner.

However, Jason was unable to determine the size of the botnet or its growth rate, among other things. He couldn’t continue his investigation without such a number. It was because the malware author limited the network connectivity rate so that a node could only discover and connect to a fixed list of randomly assigned nodes, which was later discovered to be caused by a random seed initialization failure in the code. As a result, Jason couldn’t estimate the size of the botnet because he only saw the same list of nodes every day:

1455, 4089, 19234, …., 57899, 69970

“Wait, is it true that each node has a unique numerical ID? I mean, numerical!”, I said.

“Yes, the disassembled code can confirm it, but only the bot itself knows its own ID when it initializes, and we don’t know how fast the botnet can grow. Plus, the node assignment algorithm takes MOD of a prime number …” Jason said.

“Sorry for interrupting, but I might have a brilliant idea. Can you plant a few more of your spy nodes in the botnet and confirm whether new node IDs are always larger than the old ones?” I said.

A few days later.

“Yes, all new nodes can have larger numerical IDs than before, so?” Jason said.

“Have you heard of the German Tank Problem*?” I said. “Let’s go back to 1944.”

Before D-Day in WWII, the US Army needed to estimate the number of Panzer V tanks to be used, but the Allies only had a few serial numbers of captured or destroyed tanks, so mathematicians devised an estimation method. The following formula can be used to estimate the number of tanks:

\[\widehat{N} = \frac{k+1}{k}m -1 = m + \frac{m}{k} -1\]

where $m$ is the largest serial number and $k$ is the sample size. It can be understood in this way:

There must be more tanks produced after the one with the largest serial number, but how many more? If we knew all serial numbers ($m=k$), $m$ would be the number of tanks, with no more tanks after number $m$. Since we sample the serial numbers randomly, we could have missed a gap of serial numbers because $k$ is smaller than $m$; $m/k$ is a good (unbiased) guess of that gap’s expectation, so we need to add it back. Why minus one? Well, it can be as simple as the reduction when $m=k$, so that $\widehat{N} = m$, but the true reason comes from the argument between frequentists and Bayesians; please read further and find out. This method, called the Minimum-Variance Unbiased Estimator (MVUE), could accurately estimate the population size with very limited sampling. Most importantly, the sampling in the sequence didn’t have to be uniform.
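A small simulation can make the “add the gap back” intuition concrete. In this sketch the population size, sample size, number of trials, and function names are all made up for illustration, and the IDs are drawn uniformly at random without replacement, which is my own simplifying assumption for the experiment.

```python
import random


def german_tank_estimate(max_id: int, sample_size: int) -> float:
    """N_hat = m + m/k - 1, where m is the largest observed ID and k the sample size."""
    return max_id + max_id / sample_size - 1


def simulate(true_size=1000, sample_size=20, trials=2000, seed=0):
    """Average the estimator and the raw maximum over many random samples."""
    rng = random.Random(seed)
    total_estimate = 0.0
    total_max = 0.0
    for _ in range(trials):
        sample = rng.sample(range(1, true_size + 1), sample_size)
        largest = max(sample)
        total_estimate += german_tank_estimate(largest, sample_size)
        total_max += largest
    return total_estimate / trials, total_max / trials


avg_estimate, avg_max = simulate()
print(f"average estimate: {avg_estimate:.1f}, average raw maximum: {avg_max:.1f}")
```

Under these assumptions the averaged estimate lands close to the true size of 1000, while the raw maximum alone stays systematically below it.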

We were now in the year 2022. Jason had the largest botnet ID of $223779$ with $200$ sampling points when all his forged nodes and communicated nodes were added together, so the approximate size of this p2p botnet was:

\[\widehat{N} = m + \frac{m}{k} -1 = 223779 + \frac{223779}{200}-1 \approx 224897\]

Two days later, Jason found his newly deployed nodes reached ID $557303$ with $250$ sampling points, so the new size was:

\[\widehat{N} = m + \frac{m}{k} -1 = 557303 + \frac{557303}{250}-1 \approx 559531\]

Jason had both the size and the growth rate of the p2p botnet by simply deploying a few new nodes and collecting the ID numbers, allowing him to understand the exploited vulnerabilities and other factors like geolocation behind the growth, as well as a daily updated dashboard for his senior management.
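The two estimates and the implied growth rate can be reproduced with the sketch below. The function name is illustrative, and the growth rate simply assumes linear growth over the two days between the measurements.

```python
def botnet_size(max_id: int, sample_size: int) -> float:
    """German tank estimate: m + m/k - 1."""
    return max_id + max_id / sample_size - 1


first = botnet_size(223779, 200)   # first measurement
second = botnet_size(557303, 250)  # two days later
days_between = 2

# Assumes linear growth between the two measurements.
growth_per_day = (second - first) / days_between

print(round(first), round(second), round(growth_per_day))
```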

Why not just use the largest ID $m$ as the population size? The value of $m$ came from random sampling, and we did not know whether that sampling was uniform, so $m$ alone could be quite arbitrary; it can never exceed the true size, only fall short of it.

Six months later, I received a follow-up message from Jason. He had some ground truth after successfully taking down this botnet and validated each number. They were as precise as the Allies’ estimation of German tanks.

Still a mystery: the attacker used MOD of a large prime number to calculate the neighbor ID, but we were able to estimate the size accurately. “The prime number was too large, and the attacker had no idea the botnet would be taken down so quickly, before the IDs grew that large.” Jason said.

“Do you still want to learn Fermat’s little theorem?” I said.

“Thanks but no need for now.” Jason said.

Certainly, the real-life scenario was not simple. We now know that the sequential pattern was the secret key, but it was hidden in depth with many details, and we couldn’t guarantee that such luck would exist in other botnets, for example if the attacker were good at math and used a hashing function for the bot ID.

My dear readers may want to ask about any other amusing examples of estimation problems in real life. Yes, such measurement for an unmeasurable problem can exist in a coffee shop too. The coffee shop owner can skip a few numbers between two orders to make the revenue look good to investors; however, such a trick can be discovered by the similar method described above, and the coffee shop later on receives a large fine from the SEC*.

Reference

Peer-to-peer botnet https://www.cs.ucf.edu/~czou/research/P2PBotnets-bookChapter.pdf
German Tank Problem https://en.wikipedia.org/wiki/German_tank_problem
Luckin coffee materially overstated its reported revenue https://www.sec.gov/news/press-release/2020-319

Nine cars, twenty-five horses and beyond

2022-06-11T00:00:00+00:00

TL;DR

“25 horses problem” and “9 cars problem” can have a general solution and a simple formula when we construct a max heap. The approach can be extended into higher dimension tensors and solved with a more general formula.

The problem

My favorite educational influencer, Li Yongle, has posted a puzzle of 9 race cars on Weibo (Link):

A second-grader posed the following question to him: There are nine cars in total, and the fastest two must be identified. We can run a race between no more than three cars to determine the relative speed without recording the actual speed. What is the minimum number of races required to obtain the top two cars?

The answer is 5, which isn’t a big deal for a bright second grader. So, without loss of generality, what is the minimum number of races for N cars at track size T to get the top k cars? What happens if N isn’t a square number? What if k exceeds track size T?

Dig into a special case

The “25 horses problem” (Link) is a popular variant of this problem in which 25 horses compete on a track size of 5 for the top three fastest horses. The solution to this problem is 7. The matrix approach is one of many smart solutions here.

Horses are divided into five groups and labeled A to E and 1 to 5 within each group. Following that, we organize the races in two batches:

batch 1 to conclude top 3 within each group (colored in green) where A1 > A2 > A3 etc;
batch 2 for top 1s of all groups to conclude the overall top 1 (colored in red), plus an extra round for top 2 and 3 (colored in orange).

Batch 1 is simple to understand because the overall top three must be at least the top three in each group. The first step in batch 2 is also simple, because the overall top 1 must be the fastest among all top 1. The trick is to use only one extra round for the top two and three horses: why do we need to compare A2 A3 B1 B2 and C1? Why do they appear diagonal in the matrix?

Consider the following scenario: we know A1 > B1 from step 1 in batch 2, do we have to check B3 if B1 > B2 is known in batch 1? It is obviously not necessary because we only look for the top three and have three elements, A1, B1, and B2, from two previous race results.

This leads to a widely used data structure, the heap:

“a heap is a specialized tree-based data structure which is essentially an almost complete[1] tree that satisfies the heap property: in a max heap, for any given node C, if P is a parent node of C, then the key (the value) of P is greater than or equal to the key of C.”

We want a max heap to conclude the top three horses in this problem, so we can build the heap as:

insert the first 5 races for top 3s;
insert one more race for all top 1s;
pop the root in the heap since it is the overall top 1;
and one extra race to pop the top 2 and 3.

That explains why they appear diagonal in the matrix: the heap requires up to three levels of cross-group comparison. Surprisingly, it is the most efficient method for locating the top k elements in a sorted matrix. Test the code in leetcode #378.
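For readers who want to try the heap idea, here is a minimal sketch of finding the top k values in a sorted matrix with Python’s `heapq`. It assumes every row and every column is sorted in descending order, a mirror image of LeetCode #378, which sorts ascending and asks for the k-th smallest. The function name is illustrative.

```python
import heapq


def top_k_sorted_matrix(matrix, k):
    """Return the k largest values of a matrix whose rows and columns are sorted descending."""
    if not matrix or not matrix[0]:
        return []
    n_rows, n_cols = len(matrix), len(matrix[0])
    # heapq is a min heap, so values are negated to simulate a max heap.
    heap = [(-matrix[0][0], 0, 0)]
    seen = {(0, 0)}
    result = []
    while heap and len(result) < k:
        negated, row, col = heapq.heappop(heap)
        result.append(-negated)
        # Only the neighbor below and the neighbor to the right can come next.
        for next_row, next_col in ((row + 1, col), (row, col + 1)):
            if next_row < n_rows and next_col < n_cols and (next_row, next_col) not in seen:
                seen.add((next_row, next_col))
                heapq.heappush(heap, (-matrix[next_row][next_col], next_row, next_col))
    return result


speeds = [
    [9, 6, 3],
    [8, 5, 2],
    [7, 4, 1],
]
print(top_k_sorted_matrix(speeds, 3))
```

The heap only ever holds the frontier of candidates, which is the same reason the final horse race only needs the diagonal entries.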

Meanwhile, we can see that the 9 cars problem can be solved by holding three races in batch one, one for first place and one for second place.
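The seven-race schedule for 25 horses described earlier can be checked with a short simulation. This is a sketch under simple assumptions: every horse has a distinct, fixed speed, a race returns the exact finishing order of its entrants, and all names are illustrative.

```python
import random


def top3_of_25(speeds):
    """Find the 3 fastest of 25 horses with 5-horse races; return (top 3 indices, races used)."""
    races = 0

    def race(entrants):
        nonlocal races
        races += 1
        return sorted(entrants, key=lambda horse: speeds[horse], reverse=True)

    # Batch 1: five group races, each group ranked fastest to slowest.
    groups = [race(list(range(g * 5, g * 5 + 5))) for g in range(5)]
    # Batch 2, step 1: race the five group winners.
    winners = race([group[0] for group in groups])
    group_of = {group[0]: group for group in groups}
    a, b, c = (group_of[winner] for winner in winners[:3])
    # Batch 2, step 2: A2, A3, B1, B2 and C1 compete for overall 2nd and 3rd place.
    final = race([a[1], a[2], b[0], b[1], c[0]])
    return [a[0], final[0], final[1]], races


rng = random.Random(1)
speeds = rng.sample(range(1000), 25)  # 25 distinct speeds
top3, races_used = top3_of_25(speeds)
expected = sorted(range(25), key=lambda horse: speeds[horse], reverse=True)[:3]
print(top3, expected, races_used)
```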

Go beyond the 2nd grade

We would like to extend this problem to the minimum number of races for N cars at track size T to get the top k cars/horses. The approach above implies two constraints:

The number of cars/horses must be the square of the track size T (N=T^2)
The track size T must be (k-1)*(k+2)/2

Constraint #2 of T=(k-1)*(k+2)/2 can be derived by induction: when we have the overall top 1 and want to determine the next best k-1 horses/cars, we select T other horses/cars which have fewer than k horses/cars faster than them, so T must be constructed as (k-1)+(k-1)+(k-2)+...+2+1 which is (k-1)*(k+2)/2.

With these two constraints, the approach above always gives the number of races as T+2. In short, when 9 cars race (N=9, T=3, k=2), the number of rounds is T+2 = 5; when 25 horses race (N=25, T=5, k=3), the number of rounds is T+2 = 7; when 58140625 dragonflies race for the top 123 ranks (T=7625), the number of rounds must be 7627.
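The numbers in this section can be verified with a few lines of arithmetic. The sketch below only encodes the two formulas stated above; the function names are illustrative.

```python
def required_track_size(k: int) -> int:
    """(k-1) + (k-1) + (k-2) + ... + 2 + 1 = (k-1)(k+2)/2."""
    return (k - 1) * (k + 2) // 2


def races_for_square_field(track_size: int) -> int:
    """Races needed when N = T^2: T group races, one winners' race, one final race."""
    return track_size + 2


for k in (2, 3, 123):
    t = required_track_size(k)
    print(f"k={k}: T={t}, N=T^2={t * t}, races={races_for_square_field(t)}")
```

Note that for k=2 the formula gives T=2, so the 9 cars puzzle with T=3 has one spare lane in its final race.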

Vector, matrix and tensor

Can we get rid of constraint #1? Yes, in two ways. If N is not a square number, such as 24 horses instead of 25, using the next larger square number as the new N also works, so 24 horses still require 7 races to find the top three fastest. What is the other way?

A more interesting extension is: does this approach still work if N = T^j where j can be 3,4,5 or more instead of 2? Let’s stick with the heap construction method.

A vector can be considered a special type of matrix (a 1xN matrix), while a matrix is a special type of tensor (NxM, with 2 tensor dimensions). When j=1, which means 5 horses finding the top 1, we obviously only need to race once, but let’s keep in mind that the problem is finding the max (top 1) in a sorted array (vector). When j=2, the problem is top 3 in a sorted matrix, and the solution above, T+2, can be rewritten as T+j, because the matrix form introduces T races for the batch 1 comparison, one race among the top 1s, and one extra race for the diagonal comparison, so 25 horses need 5+2 races. If j=3 or above, the problem becomes finding the top k in a sorted tensor (yes, it is the same tensor as in deep learning), and here is how to deal with the tensor scenario:

Following the heap construction approach, when j=3 (a 3-D tensor), we need to run T^2 races to reduce the problem to j=2, because each dimension in the tensor needs the ranks within each individual group of T-sized vectors, and we also need one extra race for the top 1s of each higher dimension. After that, the problem is reduced to j_new=2, for which we know the answer is T+j_new, so the total number of races becomes T^2+1+T+j_new, which is T^2+T+j. So my dear readers might make a wild guess: for a tensor of dimension j, should the number of races be T^(j-1)+T^(j-2)+...+T^2+T+j, which is (T^j-1)/(T-1)+(j-1)?

The guess is correct and easy to understand: for each higher tensor dimension, we need T^(j-1) races to reduce to a lower tensor dimension, plus one extra race for this tensor dimension’s top 1, until we get to j=2, so the sum of races becomes (T^j-1)/(T-1)+(j-1). I don’t have a good tool to visualize it, but I believe my dear readers can use their imaginations to solve this high-dimension tensor case.
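The term-by-term sum and the closed form can be compared numerically. This sketch only checks that the two expressions in the text agree; it does not prove that the count is minimal, and the function names are illustrative.

```python
def races_by_sum(track_size: int, j: int) -> int:
    """T^(j-1) + T^(j-2) + ... + T + j for N = T^j."""
    return sum(track_size ** i for i in range(1, j)) + j


def races_closed_form(track_size: int, j: int) -> int:
    """(T^j - 1) / (T - 1) + (j - 1); requires T >= 2."""
    return (track_size ** j - 1) // (track_size - 1) + (j - 1)


for j in (1, 2, 3):
    print(f"T=5, j={j}: {races_by_sum(5, j)} races, closed form {races_closed_form(5, j)}")
```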

What about track size and top ranks

The above approach also implies T > k because of the way T is constructed for the final step, as (k-1)+(k-1)+(k-2)+...+2+1 or (k-1)*(k+2)/2. We can loosen the condition to T>=(k-1)*(k+2)/2. When T is large enough that T>=(k-1)*(k+2)/2 holds, the above approach still works. The remaining situations can be broken down into several scenarios, so I’ll leave this complicated case for my readers to investigate further.

Summary

“25 horses problem” and “9 race cars problem” are both very nice puzzles. We leverage the heap data structure to solve the quadratic case for a general equation of number of races as track size plus 2 (T+2). We further extend to the higher dimension tensor case for a more general equation as (T^j-1)/(T-1)+(j-1) where j is the power index of track size to the number of horses/cars.

If we would like to loosen all constraints for a much more general problem, we can take a look at this difficult discrete math topic, “Sorting by k-ary Comparisons”*; however, an exact solution is not guaranteed.

Reference

“Sorting by k-ary Comparisons” https://faculty.math.illinois.edu/~west/regs/ksetsort.html

Cybersecurity problem solving with explainable machine learning

2022-02-15T00:00:00+00:00

Disclaimer: Nothing in this blog is related to the author’s day-to-day work. The content is neither affiliated nor sponsored by any companies.

Have you ever wondered why the model’s performance degrades after two months of initial deployment? Why does the model continue to make errors on certain samples? Why do bad cases from the operations team take so long to resolve? So many ‘whys’ remind us that explainable machine learning, which provides a way for humans to understand what the model is doing on a given sample, may be useful for cybersecurity problem solving.

The explainable model prediction is closely related to operations when solving cybersecurity problems. For example, firewall products must have sufficient confidence in the prediction results to block threats, and SIEM and SOAR systems on the cloud must introduce specific reasons and contextual associations of prediction results to support operational judgments. We can provide partial explainability of the prediction results to support operations based on the model’s explainability.

Explainable artificial intelligence (XAI) and its subfield, explainable machine learning, have been studied from a variety of perspectives in academia. This post introduces the black-box explainable model and some of its methods and applications in industry, as well as its design, usage, and limitations, among other things. My dear readers can delve deeper into this research topic by using the example of encrypted traffic detection, which causes the most headaches in cybersecurity.

Feel the vibe of explainable machine learning

In the cybersecurity industry, there are many machine learning models based on statistical features for network traffic detection and classification, such as the packet-feature-based detection model for encrypted DNS (DoH) mentioned in my previous post (and the Jupyter Notebook). These “counting” features are sliced to branch out the decision trees in a boosting learner like xgboost, but the slicing strategy is decided by the learner, so the finer the slicing, the greater the risk of poor generalization to unknown data. The degradation of model performance and the instability of features caused by data drift need continuous maintenance investment from the data science team, and can even cause the team to avoid traffic detection problems altogether. Here, explainable machine learning can effectively relieve such headaches.

It is worth mentioning an existing best practice, borrowed from computational advertising, for such encrypted traffic classification problems: GBDT cross features. He et al., “Practical Lessons from Predicting Clicks on Ads at Facebook”*, divide the counting features into sparse vectors and solve them with a simple linear model, which controls the slicing of statistical features to some extent. There are also cases where raw network traffic data is fed directly to a deep convolutional neural network to extract features; however, the neural network does not guarantee a reasonable extraction of encrypted traffic features or sufficient generalization, and the same design of fully connected layers for classification as in computer vision problems doesn’t always outperform the tree learner.

To solve this problem, we employ the SHAP (SHapley Additive exPlanations) algorithm*. SHAP addresses the economics of cooperative competition from a game theory point of view: how do we fairly measure A, B, and C’s contributions to the $1,000 prize they receive by cooperating with one another? In this context, “fair measurement” takes into account three people’s independent contributions, two people’s cooperation (including the sequence), three people’s cooperation, and the relationship between two people’s cooperation and three people’s cooperation. The SHAP algorithm treats the model’s input features as the cooperating individuals, and its interpretable contribution values are derived from the Nobel Prize-winning economics method, the Shapley value*. Because of the various combinations of contributors (feature vectors) that must be considered in the actual computation, the computational complexity of the exact solution is NP-hard. The SHAP algorithm approximates the results in effective time using linear constraints; please feel free to find this interesting paper in the references.
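To make the A, B, C example concrete, the sketch below computes exact Shapley values by averaging each player’s marginal contribution over every joining order. The payoff table is entirely made up for illustration, and this brute-force enumeration is not how the SHAP library computes its approximations.

```python
from itertools import permutations


def shapley_values(players, payoff):
    """Average each player's marginal contribution over all joining orders."""
    totals = {player: 0.0 for player in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for player in order:
            joined = coalition | {player}
            totals[player] += payoff[joined] - payoff[coalition]
            coalition = joined
    return {player: total / len(orders) for player, total in totals.items()}


# HYPOTHETICAL payoffs: what each coalition would win on its own.
payoff = {
    frozenset(): 0,
    frozenset("A"): 100, frozenset("B"): 200, frozenset("C"): 300,
    frozenset("AB"): 400, frozenset("AC"): 500, frozenset("BC"): 600,
    frozenset("ABC"): 1000,
}

values = shapley_values("ABC", payoff)
print(values, sum(values.values()))
```

The three values always add up to the full prize, which is the “fair” part. The number of joining orders grows factorially with the number of players, which is why approximation is needed for models with many features.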

In practice, the open source implementation of SHAP has been optimized for various types of prediction models, such as Tree SHAP for decision tree models. By simply adding these lines of code, the xgboost model for detecting encrypted DNS can use SHAP to interpret its prediction results:

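import shap

# clf: the fitted search object from the notebook; best_estimator_ is its tuned xgboost model
explainer = shap.Explainer(clf.best_estimator_)
# compute per-sample SHAP values for the feature columns
shap_values = explainer(train[features])

# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[0])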

The figure of waterfall of SHAP values above shows that the expected value for all samples in the dataset is -0.725, while the model predicts -0.835 for sample 0. It’s because features like var_pkt_size_out and time push it from -0.725 to -0.835, while median_out feature pulls it back to the expectation of all samples. With back-n-forth pulling forces, the model eventually predicts -0.835 for sample #0 and correctly predicts it as a negative sample (non-encrypted DNS).

For positive sample #33 (encrypted DNS), median_out and median_in contribute significantly to its deviation from the expectation of all samples, lowering its prediction to -0.187 in a single kick and correctly predicting it as a positive sample.

The two features median_out and median_in are also important in distinguishing the majority of positive samples in this dataset, and var_pkt_size_out is dominating in distinguishing negative samples. This explanation serves as the foundation for the decision to use the model approach: the model must be deployed with strict monitoring of the adversarial attack against these three features, for example:

Whether there is a new encrypted DNS service that obfuscates the median_out and median_in statistical features to escape from detection.
Whether the designed data collection scenario includes a selection bias for features like var_pkt_size_out.
Whether these determining features have a consistent statistical distribution across data collected at different time periods.

where “change in the statistical distribution of the determining features” is one of the primary causes of performance degradation after deployment and go-online. It is a type of data drift, and finding these features and monitoring their distribution can help detect and deal with it in real time. You can continue to investigate other SHAP functions, such as force_plot to better understand the basis of model classification. Of course, one can view the SHAP waterfall plot for each sample in the Jupyter Notebook, and one can also change the parameters of the prediction model to see how the SHAP waterfall plot changes with the new parameters.
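One simple way to monitor the distribution of a determining feature is to compare a reference window against a recent window with the two-sample Kolmogorov-Smirnov statistic. This is my own choice of technique for illustration, not something the original notebook uses; the threshold, the sample values, and the function names below are all hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, recent):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = fully separated)."""
    ref, rec = sorted(reference), sorted(recent)
    if not ref or not rec:
        raise ValueError("both samples must be non-empty")
    # The largest gap between two step functions occurs at an observed value.
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(rec, x) / len(rec))
        for x in ref + rec
    )


def has_drifted(reference, recent, threshold=0.2):
    """Flag drift when the gap exceeds an illustrative threshold (not a recommendation)."""
    return ks_statistic(reference, recent) > threshold


# HYPOTHETICAL values of one feature in a training window and a later window.
training_window = list(range(100))
later_window = [value + 50 for value in range(100)]

print(ks_statistic(training_window, training_window))
print(ks_statistic(training_window, later_window))
print(has_drifted(training_window, later_window))
```

In practice such a check would run per feature on a schedule, with the features that SHAP ranks highest watched most closely.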

The SHAP example demonstrates a general approach to the use of black-box explainable machine learning models, which essentially have the following properties:

The modeling process requires developing a predictive model and then using another model to explain it, known as “post hoc” (explanation after the fact);
The explainable model always provides the explanation but is unconcerned about the prediction method or whether the results are correct or not;
The explainable model generates a waterfall plot of SHAP values for each sample rather than an overall explanation of the data set.

Similar to SHAP, there are other methods, such as LIME*, that explain the classifier model in a post hoc, black-box manner. One can refer to the book “Interpretable Machine Learning”* for further study.

With feature importance, do we still need explainable models?

What is the difference between the SHAP value and the permutation feature importance? The SHAP value of each feature is tied to each sample: it describes the marginal effect of the combination of features in classifying that specific sample, and the combination of features and their marginal effects may differ for positive and negative samples. Feature importance, in contrast, describes the model’s overall understanding of the whole dataset and its classification performance on that dataset, independent of any specific sample. We can use the distribution of SHAP values on different samples to improve model characteristics, and Kaggle competitors can also tune their competition models by understanding the distribution of SHAP values*. Please refer to the SHAP Summary Plot and Dependence Plot.

Of course, SHAP is not the only approach. There are numerous alternatives on the path to explainable machine learning and XAI.

Some models are inherently interpretable, for example symbol-based AI models* like logical reasoning, whose results can be self-explanatory. My previous post mentioned my team’s work “Honeypot + graph learning + reasoning = scale up your emerging threat analysis”, in which a knowledge graph model uses logical reasoning to explain the results of the distributed embedding model. The results of logistic regression and linear regression are also represented by a single weight matrix that is self-explanatory; the results of kNN models are also self-explanatory.

Explainable models are further subdivided into generic models such as SHAP and LIME, and scenario-specific ones such as the attention mechanism, which can provide some explanation for the results of transformer families such as BERT/GPT, as recently demonstrated by DeepMind’s AlphaCode in its demo*. Some generative adversarial networks (GANs) generate adversarial samples in order to confuse model classification and, to some extent, reveal potential flaws or implicit assumptions in models.

Please read and explore “Explainable AI Cheat Sheet”* listed in the references for a good overview of existing explainable AI (XAI) models.

Keep calm and we still need other work

I kind of agree with “Please Stop Doing ‘Explainable’ ML”*, because…

Explainable models have a natural limitation: their scope is limited to the current problem and cannot extend beyond the defined problem and the given dataset. In a cybersecurity scenario, data collection and detection coverage are limited and attackers have many tricky ways to conceal their behavior, so it is just as important to look for external factors as it is to look for internal ones. Explainable models will not tell us that C&C behavior was caused by an unknown zero-day vulnerability or an administrator account that was quietly created without logging. Similarly, explainable models such as SHAP are not the same as explainable results: explaining the predicted results in different scenarios may require explainable models together with third-party knowledge outside the scope of the model or the logs.

Explainable models are only “explanatory”, not “causal”, and causal inference is still required. More importantly, the output of an explainable model is only an “explanation”, not a logical argument: it does not guarantee that the model is correct, nor that its explanation is logical; it simply explains why model A judges sample X the way it does. A common example is the tumor vs ruler joke*: when the model picks up the ruler, which is only present in the positive sample images, as a key feature of the tumor, its explanatory model can only state that “it recognizes the ruler” and cannot express that “the ruler is not a reasonable basis for the judgment”. Data science teams still need to identify problems, make logical arguments, and improve models on their own.

In the business world, we must also consider the cost of explainable models. The explanation provided by post hoc explanatory models comes with additional engineering costs such as development and deployment, as well as usage scenario limitations; many cases can be done well without explainable machine learning. Please consider the ROI of the operational cost versus the cost of explainable models, and use your own judgment to determine whether explainable models are required.

Summary

To some extent, explainable machine learning can help solve several cybersecurity problems in dynamic environments, not only by providing explanatory results to aid operations, but also by helping data science teams keep model performance stable over time. This post introduces the use of explainable machine learning from an engineering case perspective and omits the underlying theory because of length constraints, but this does not imply that these theoretical foundations can be ignored.

Reference

He et al, Practical Lessons from Predicting Clicks on Ads at Facebook https://quinonero.net/Publications/predicting-clicks-facebook.pdf
SHAP paper, Lundberg and Lee “A Unified Approach to Interpreting Model Predictions” https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
Shapley value https://en.wikipedia.org/wiki/Shapley_value
SHAP in “Interpretable Machine Learning” https://christophm.github.io/interpretable-ml-book/shap.html
SHAP github https://github.com/slundberg/shap
LIME https://github.com/marcotcr/lime
Kaggle post “Advanced Uses of SHAP Values” https://www.kaggle.com/dansbecker/advanced-uses-of-shap-values
Symbolic artificial intelligence https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence
AlphaCode demo https://alphacode.deepmind.com/
Explainable AI Cheat Sheet https://ex.pegg.io/
Please Stop Doing “Explainable” ML - Cynthia Rudin https://www.youtube.com/watch?v=I0yrJz8uc5Q
When AI flags the ruler, not the tumor https://venturebeat.com/2021/03/25/when-ai-flags-the-ruler-not-the-tumor-and-other-arguments-for-abolishing-the-black-box-vb-live/
