title: Bad Malware Analysis: Character Count
date: 2020-03-06
draft: false

Introduction

I'm thinking of doing a series on bad malware analysis. Hopefully it'll be fun and at least a little informative.

Today's post consists of performing a string analysis on malware. Where most string analysis looks at the big picture, I thought I would take it a step further and look at individual characters. This approach is terrible, as you will soon see.

Why Strings

If you've made it this far, I'm assuming you already have some basic knowledge of computers and maybe even looking at malware. As such, you may already know what string analysis is all about, but here is a quick crash course on strings.

Strings is a tool that pulls the readable characters out of a binary, which makes it low-hanging fruit for a researcher. This is why a lot of researchers start with this step to gain basic insight into a piece of malware.

Some of the strings spit out are function calls, both internal and library calls. These can inform the researcher about expected behaviors. In the case of very specific function calls, signatures (or at least indicators) can be created for firewalls and anti-virus. Strings can also output messages put in by the developer, which range from cryptic to debug statements (in the lucky cases). Again, this can give the researcher information or the ability to create signatures/indicators.

For a signature to be created from strings, it needs to be very specific to avoid false positives. As such, this is rarely done, and string analysis is usually reserved for giving a researcher a leg up before diving deeper into their analysis.

Indicators are a little more practical for use with strings. The more indicators you find, the more confident you can be that this is a piece of malware.

Why Characters

Now for my terrible way of using strings ... character analysis.

How can we make analysing strings terrible? By breaking it down into individual characters. The character counts can then be used to generate an indicator for a confidence value.

Cutting to the chase, if a piece of software has a lot of the following characters, it may be malware:

v j ; , 4 q 5 /

Why Those Characters

How did I come to such a wild conclusion that v's and j's are a problem? Time for some terrible analysis.

Where Are the Samples

To perform my analysis, I pulled down around 500 samples of malware from theZoo (https://thezoo.morirt.com/) and dasMalwerk (https://dasmalwerk.eu/). For samples of benign software, I grabbed all of /bin on Fedora and 200 libraries from the C:/Windows directory.

How Was it Analysed

Next, I wrote a Python program to run strings, loop through each individual character, make it lowercase, then count. This was done for both malware and benign samples, which were then compared in two ways:

  1. Count the total number of each character in the malware samples and the total number in the benign samples. Then subtract the two, sort, and look at the extremes.
  2. Take the ratio of each character count to the file size for the malware and benign samples. Average that across all files, then subtract and compare. (don't worry I'll explain)
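The counting step can be sketched roughly as follows. This is not the original script: instead of calling the strings binary, it approximates it with a regular expression (runs of four or more printable ASCII characters), and the names `extract_strings` and `count_characters` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a "string" is a run of 4+ printable ASCII bytes, which is
# roughly what the strings tool prints by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes) -> list:
    """Return the printable runs found in a blob of bytes."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


def count_characters(data: bytes) -> Counter:
    """Lowercase every character from the extracted strings and tally it."""
    counts = Counter()
    for text in extract_strings(data):
        counts.update(text.lower())
    return counts


# Toy input: two long-enough strings and one run ("ab") that is too short.
sample = b"\x00\x01GetProcAddress\x00\xffab\x00LoadLibraryA\x00"
print(count_characters(sample).most_common(3))
```

On a real file you would pass in the bytes read from disk, for example `count_characters(open(path, "rb").read())`.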

Basic Count

The basic count is fairly self-explanatory: just keep a running tally of characters and subtract. Here are the ten characters most likely to be in malware, followed by the ten least likely:

" 5 j q 4 , 6 2 1 3

r i c o e _ h t $ a

This is terrible for many reasons, but specifically because it is un-weighted. So if there is a single small piece of malware that uses the number '5' very heavily (or a small benign sample that uses the letter 'a'), but no others do, it could show up here.

I wanted to find a way to weight the characters, such that a single sample couldn't skew all of the results.
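To make the skew concrete, here is a small sketch of the basic count on invented toy data. The function names and the numbers are hypothetical; only the "sum everything, then subtract" logic comes from the description in this section.

```python
from collections import Counter


def total_counts(per_file_counts):
    """Add up the per-file character tallies into one grand total."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)
    return total


def count_difference(malware_files, benign_files):
    """Malware total minus benign total, sorted from most to least 'malicious'."""
    malware_total = total_counts(malware_files)
    benign_total = total_counts(benign_files)
    chars = set(malware_total) | set(benign_total)
    diff = {c: malware_total[c] - benign_total[c] for c in chars}
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)


# Invented data: one malware sample is stuffed with '5', the others have none.
malware = [Counter({"5": 900, "a": 10}), Counter({"a": 12}), Counter({"a": 8})]
benign = [Counter({"a": 20, "5": 1}), Counter({"a": 25, "5": 2})]

ranking = count_difference(malware, benign)
print(ranking[0], ranking[-1])
```

In this toy set, '5' lands at the top of the ranking purely because of one sample, which is exactly the weakness of the un-weighted count.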

Ratio Analysis

Here's where it gets more complicated, and I'll try to explain.

  1. Keep a running tally per malware sample (not a total for all samples)
  2. Calculate the ratio of count to file size: Letter Count / File Size
  3. Average the ratios: sum(ratio) / Number of Files
  4. Compare to the benign samples using steps 1-3
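The four steps above could be sketched like this. It is a guess at the shape of the calculation rather than the original code: the `(counts, file_size)` tuple layout, the function names, and the choice to ignore zero-byte files are all assumptions.

```python
from collections import Counter, defaultdict


def average_ratios(samples):
    """samples is a list of (Counter, file_size_in_bytes) pairs.

    Steps 1-3: per-file tally, count / file size, then the mean over files.
    A character missing from a file contributes a ratio of 0 for that file.
    """
    usable = [(counts, size) for counts, size in samples if size > 0]
    if not usable:
        return {}
    sums = defaultdict(float)
    for counts, size in usable:
        for char, n in counts.items():
            sums[char] += n / size
    return {char: total / len(usable) for char, total in sums.items()}


def ratio_difference(malware_samples, benign_samples):
    """Step 4: malware average minus benign average, sorted high to low."""
    malware_avg = average_ratios(malware_samples)
    benign_avg = average_ratios(benign_samples)
    chars = set(malware_avg) | set(benign_avg)
    diff = {c: malware_avg.get(c, 0.0) - benign_avg.get(c, 0.0) for c in chars}
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)


# Invented data: (character tally, file size in bytes)
malware = [(Counter({"a": 10}), 100), (Counter({"a": 30, "b": 5}), 100)]
benign = [(Counter({"a": 40}), 100)]
print(ratio_difference(malware, benign))
```

Because every file contributes one ratio between 0 and 1, a single huge file can only move the average by 1 / Number of Files at most.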

Hopefully that made sense. The point is that even if one file had a ton of one character, it wouldn't skew the results. So what were the top 10 malware characters and top 10 benign:

v j ; d l , 4 q 5 /

_ s g r o $ f i a "

Obviously this gives pretty big differences: the double quote went from being the worst offender to the most benign. Using the ratio gives a much better analysis since it doesn't allow a single sample to skew the results.

So What Now

Using the ratio is probably good on its own, so how did I come up with my character dirty list? I looked at the worst offenders from both ways of analysing the samples and came up with the list:

v j ; , 4 q 5 /

More analysis needs to be done to determine the threshold for confidence. For now, the higher the ratio of those 8 characters to the size of the file, the more confident you can be that it is malware.
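As a rough illustration of that ratio, a scoring helper might look like the sketch below. The name `dirty_ratio` and its signature are hypothetical, and it deliberately returns only the raw ratio since no threshold has been worked out.

```python
# The eight characters from the dirty list above.
DIRTY_CHARS = set("vj;,4q5/")


def dirty_ratio(strings_output: str, file_size: int) -> float:
    """Ratio of dirty-list characters (after lowercasing) to file size in bytes."""
    if file_size <= 0:
        return 0.0
    hits = sum(1 for char in strings_output.lower() if char in DIRTY_CHARS)
    return hits / file_size


# Toy example: 8 dirty characters found in a 24-byte file.
print(dirty_ratio("vj;,4q5/abcd", 24))
```

A higher value means more of the file is made up of dirty-list characters. What counts as "high" is still an open question.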