Connecting the dots by clustering malicious code snippets

Cyber threat intelligence tasks are often difficult and complex. One of the major challenges analysts face is so-called “attribution”. Establishing attribution for cyber operations is certainly very difficult but, in my opinion, not impossible. Furthermore, correctly identifying the threat actor responsible for a cyber attack can be useful in formulating a technical response to such attacks.

Attributing an attack to a particular actor obviously requires collecting as much data as possible about the specific incident under examination and all the entities involved. Beyond this, what is often underestimated is the collection of data and information over time (I could say, day by day). This later gives greater visibility into what is being observed in the present, and provides the analyst with a wider range of resources and a broader knowledge base to rely on during the analysis.

However, despite a daily commitment to observing what is happening in the world of “cyber operations”, sometimes we find ourselves with very few overlaps with our knowledge base. Fortunately or not, the most frequent condition is the one where we find many factors to consider, even if they contradict each other (so-called “conflicting indicators”).

To address this specific condition, security researchers usually identify “key indicators” to guide accurate attribution, such as:

Malicious Infrastructures: Generally speaking, the communication structures used to deliver a cyber capability or maintain command and control of capabilities. Note that threat actors can buy, lease, compromise and share network infrastructures. Their degree of reliability may vary according to their type (think, for example, that a Virtual Private Server can be destroyed and reassigned within a few hours).

External Sources: The game is easy because someone else has already done our job (hoping it’s right). This case includes reports and hypotheses from the security industry, media, think tanks and security researchers that can help us to quickly correlate an incident.

Malware: Malicious software designed to enable unauthorized functions on a compromised computer system, such as key logging, screen capture, audio recording and persistence. Obviously, a threat actor can modify some malware indicators within minutes, and some actors completely change their malware set between operations to make attribution more difficult. However, many advanced and persistent threat actors write proprietary malware and treat its development cycle just like that of any legitimate software. Like all development groups, they do not like performing repetitive tasks or writing new code for actions they have already implemented in the past. This means we can find reused code across their malware.

Security researchers weigh a circumstance of this kind very heavily when assessing a possible attribution.

// Playing with malware commonly associated with DPRK

As mentioned at the beginning, having good resources for a correct attribution requires constant commitment. As for malicious code (excluding other technical activities focused on network infrastructure, which are outside the scope of this post), this means continuously generating detection rules that describe portions of code shared between different malware samples related to the same family or belonging to the same actor.

These portions of code generally belong to the know-how of the individual researcher or company, and normally make up the tools often referred to as “attribution engines”.

For example, the following is the result of an analysis made with my own attribution engine for the DLL identified by SHA256 26a2fa7b45a455c311fd57875d8231c853ea4399be7b9344f2136030b2edc4aa


Expanding one of the results found, I can find correlations with another executable, identified by SHA256 bdff852398f174e9eef1db1c2d3fefdda25fe0ea90a40a2e06e51b5c0ebd69eb


If we focus our attention on these two, a rapid comparative analysis suggests they are fundamentally different.


However, when the files are compared more generally against all the code fragments extracted from the entire dataset of malware commonly related to NK, the samples in question can be correlated, even if only in part, by some shared pieces of reused code.
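The idea of two otherwise different files sharing fragments can be sketched in Python as follows. This is only a minimal illustration, not the attribution engine used in this post: it compares raw byte windows, while a real engine would more likely work on disassembled and normalized code. The function names and the default window size of 8 bytes are assumptions chosen for the example.

```python
def byte_ngrams(data: bytes, n: int) -> set[bytes]:
    """Return the set of all n-byte windows found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def shared_fragments(sample_a: bytes, sample_b: bytes, n: int = 8) -> set[bytes]:
    """Return the n-byte sequences that occur in both samples."""
    return byte_ngrams(sample_a, n) & byte_ngrams(sample_b, n)


def similarity(sample_a: bytes, sample_b: bytes, n: int = 8) -> float:
    """Jaccard similarity of the two n-gram sets (0.0 means nothing shared)."""
    grams_a = byte_ngrams(sample_a, n)
    grams_b = byte_ngrams(sample_b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0
```

Two files can score very low on overall similarity and still return a non-empty set from `shared_fragments`, which is the situation described above: different at first glance, yet linked by a reused piece of code.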

To go into more detail, one of the common code fragments extracted from the two samples is reported below:






Following this principle, for every sample at our disposal it’s possible to extract some interesting portions of code that, especially thanks to the analyst’s experience, can be considered useful to identify something that can be associated, with a good degree of confidence, with a specific threat or group.

For example, considering the sample 26a2fa7b45a455c311fd57875d8231c853ea4399be7b9344f2136030b2edc4aa, we could extract the following further code snippets to be inserted in our “test” queue:


In addition to code fragments that match exactly between samples, it’s useful to take into consideration pieces of code that are slightly modified between samples, and adjust our rules to handle the difference.

For example:

From bdff852398f174e9eef1db1c2d3fefdda25fe0ea90a40a2e06e51b5c0ebd69eb


From d3e6e396c6bc1bbbcec4991d16ceff9603c41a9c6d5b1da39ef29ad193fd8077
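As a minimal sketch of how a rule can be adjusted for such differences, the Python function below takes two variants of the same fragment and emits a YARA-style hex string in which every differing byte becomes a `??` wildcard. The function name is hypothetical, and the sketch assumes the two variants have the same length; fragments with inserted or removed instructions would need an alignment step or YARA jumps instead.

```python
def wildcard_pattern(variant_a: bytes, variant_b: bytes) -> str:
    """Build a YARA-style hex string, replacing differing bytes with ??."""
    if len(variant_a) != len(variant_b):
        raise ValueError("this sketch only handles equal-length fragments")
    tokens = [
        f"{x:02X}" if x == y else "??"
        for x, y in zip(variant_a, variant_b)
    ]
    return "{ " + " ".join(tokens) + " }"


if __name__ == "__main__":
    # Invented byte values for illustration; not taken from either sample.
    first = bytes.fromhex("558BEC68AABBCCDD")
    second = bytes.fromhex("558BEC6811223344")
    print(wildcard_pattern(first, second))
```

The differing bytes are typically addresses, offsets or immediate values that change at every build, so masking them keeps the rule matching across variants while the surrounding opcodes still anchor it.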



// Connecting the dots

This is what I’ve obtained through a quick extraction of a few code snippets from a very limited set of samples commonly associated with NK malware.

Some of the relations obtained between them are shown graphically below:



This quickly shows how extracting code snippets from a large dataset of samples belonging to the same threat actor can be really useful to recognize not only variants of the same malware family but also other malicious code belonging to the same threat actor.
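The way such relations turn into groups of samples can be sketched as a small graph problem: samples are nodes, a shared snippet is an edge, and each connected component is a candidate cluster. The Python code below is an illustrative sketch under that assumption, not the tooling used for the graph in this post; the function name and the input format (sample label mapped to the set of snippet identifiers matched in it) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_shared_snippets(matches: dict[str, set[str]]) -> list[set[str]]:
    """Group samples linked, directly or transitively, by shared snippets."""
    # Invert the mapping: snippet identifier -> samples that contain it.
    samples_by_snippet = defaultdict(set)
    for sample, snippets in matches.items():
        for snippet in snippets:
            samples_by_snippet[snippet].add(sample)

    # Two samples are neighbours if at least one snippet occurs in both.
    neighbours = {sample: set() for sample in matches}
    for samples in samples_by_snippet.values():
        for a, b in combinations(samples, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components via an iterative depth-first search.
    clusters, seen = [], set()
    for start in matches:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Because the linking is transitive, a single snippet that comes from a common library rather than from the actor’s own code can merge unrelated samples into one cluster, so the quality of the extracted snippets matters more than the clustering step itself.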

Depending on our rule generation algorithm, in some cases it may not be immediately possible to correlate an artifact with a specific malware family, but we will have at least a suggestion that the artifact may belong to a specific threat actor (something like… oh man! it looks like Lazarus stuff!).

I continued playing this game for a while by expanding the set of samples I originally started from and extracting new snippets.

Some of the other extracted code snippets are reported below (truncated for space reasons):

# Code Snippet #



# Code Snippet #



Finally, by converting and correctly morphing these code snippets, it is possible to obtain dozens of correlation rules from different blocks of samples.

It is interesting to note the relation we can obtain between one recent sample spotted here and some older samples.

Furthermore, using the first detection rule obtained (as is) in a retro hunt task, I obtained (among others) a file identified by SHA256 adfb60104a6399c0b1a6b4e0544cca34df6ecee5339f08f42b52cdfe51e75dc3. This hash was reported by the Cisco Talos Threat Intelligence Team on January 30, 2019 in their research.

// Conclusion

Over time, adopting practices aimed at extracting code fragments from very large datasets helps above all in the classification and attribution of samples that are not yet publicly shared and for which we can’t rely on publicly available or free evaluation tools. It must be said that there are many reasons an actor could reuse its own code. Taking a common cyber crime campaign as an example, once the malware used becomes more detectable, malware writers usually change only some basics of their code, or implement or change packers, to improve resistance to detection mechanisms. In targeted attacks, an adversary must keep its tools undetected for as long as possible. By identifying reused code, we gain valuable insights about the artifacts put in place by threat actors, and we may be in a position to recognize and classify still unattributed pieces of malware with a good degree of confidence.

An example is a variant of a Lazarus RAT, submitted to VirusTotal more than one year ago on 2018-03-19, that I found during this exercise. Link below:

// Indicators of Compromise

Artifact [SHA256]: 5a518bec337c784911f79daba99a502fcbfa9d4fc417b2717e5dfa153be20881

Artifact [SHA256]: 26a2fa7b45a455c311fd57875d8231c853ea4399be7b9344f2136030b2edc4aa

Artifact [SHA256]: adfb60104a6399c0b1a6b4e0544cca34df6ecee5339f08f42b52cdfe51e75dc3

Artifact [SHA256]: bdff852398f174e9eef1db1c2d3fefdda25fe0ea90a40a2e06e51b5c0ebd69eb

Artifact [SHA256]: d0b970e8052a4e3a353e99f8f2f4f6436298e473466ca407c353715ec10c3087

Artifact [SHA256]: e2199fc4e4b31f7e4c61f6d9038577633ed6ad787718ed7c39b36f316f38befd

Artifact [SHA256]: 4d7ac076c4955f745b17bab9ab5b61aa14832b689b3a9e852fbd77938d23bf99

Artifact [SHA256]: f460692ea6c4e5dbb968def9567090335d5d7188167c3e487d05c526d7201108

Artifact [SHA256]: cd2e8957a2e980ffb82c04e428fed699865542767b257eb888b6732811814a97

Artifact [SHA256]: d3e6e396c6bc1bbbcec4991d16ceff9603c41a9c6d5b1da39ef29ad193fd8077

// Detection Rules

Download YARA (I decided to share one of the generated YARA rules on a TLP basis in order to preserve the effectiveness of the extracted rules. Contact me for the password).
