| title | date | draft |
|---|---|---|
| Bad Malware Analysis: String Count vs File Size | 2021-03-08T20:20:31Z | false |
Introduction
Next up in bad malware analysis is comparing the size of a file to the output of the `strings` command. The idea here is that malware may contain fewer strings per KB than benign binaries. This would make logical sense, as many malware samples are packed, encrypted, and/or stored in the data section of the binary, to be extracted later. This is done to help hide them from hash signatures.
Samples
There are around 500 malware samples, coming from two sources: theZoo and dasMalwerk. For samples of benign software, I grabbed 200 libraries from the C:/Windows directory.
Calculations
Using Python, I wrote a quick script to count the number of strings returned (one per line of output) and compared it to the size of the file (in KB). I did this with minimum string lengths of 2, 3, 4, 5, and 6. Why those numbers? Because that is where I decided to stop. The average strings per KB was then calculated for the benign and malicious sets.
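The original script is not shown, so here is a minimal sketch of what the counting step might look like. It does not call the `strings` command. Instead it approximates it with a regular expression over runs of printable ASCII bytes, which will not match the real tool exactly. The function names, the 1 KB = 1024 bytes convention, and the choice to average per-file ratios (rather than divide total strings by total size) are all assumptions.

```python
import re
from pathlib import Path


def count_strings(data: bytes, min_len: int = 4) -> int:
    """Count runs of at least min_len printable ASCII bytes.

    Approximation of `strings -n min_len`; the real tool may differ slightly.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return sum(1 for _ in pattern.finditer(data))


def strings_per_kb(data: bytes, min_len: int = 4) -> float:
    """Strings per KB, taking 1 KB as 1024 bytes (an assumption)."""
    if not data:
        return 0.0
    return count_strings(data, min_len) / (len(data) / 1024)


def average_strings_per_kb(paths, min_len: int = 4) -> float:
    """Mean of the per-file ratios (assumed; the post does not say how it averaged)."""
    ratios = [strings_per_kb(Path(p).read_bytes(), min_len) for p in paths]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Calling `average_strings_per_kb(paths, n)` once per sample set for each `n` from 2 to 6 would give one table row per minimum length.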
| String Min Len | Benign (Str/KB) | Mal (Str/KB) | % Diff |
|---|---|---|---|
| 2 | 51.54 | 51.70 | -0.31 % |
| 3 | 24.15 | 20.52 | 15.03 % |
| 4 | 12.40 | 9.70 | 21.77 % |
| 5 | 5.59 | 5.58 | 0.18 % |
| 6 | 4.32 | 3.96 | 8.33 % |
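The % Diff column appears to be the difference taken relative to the benign average, i.e. (benign - malicious) / benign. That reading is inferred from the numbers rather than stated in the post. A short sketch that recomputes the column from the averages in the table above (the variable and function names are made up for illustration):

```python
# Averages copied from the table: min length -> (benign, malicious) strings per KB
averages = {
    2: (51.54, 51.70),
    3: (24.15, 20.52),
    4: (12.40, 9.70),
    5: (5.59, 5.58),
    6: (4.32, 3.96),
}


def percent_diff(benign: float, malicious: float) -> float:
    """Difference as a percentage of the benign value (assumed definition)."""
    return (benign - malicious) / benign * 100


for min_len, (benign, malicious) in averages.items():
    print(f"{min_len}: {percent_diff(benign, malicious):.2f} %")
```

A positive value means the malicious set has fewer strings per KB than the benign set. A negative value, as at length 2, means slightly more.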
Results
The results are kinda in line with what I thought. On average, the malicious binaries have fewer strings per KB than the benign ones at most minimum lengths. Surprisingly, at minimum string lengths of two and five, the benign and malicious binaries have about the same number of strings per KB. The length of two makes sense, as a lot of strings that small come down to random bytes in the binary that happen to look like strings.
The result for a length of five (and six is pretty close too) is kinda surprising, though. It might be explained by debug messages or something similar. If this wasn't bad malware analysis I would look into it more. It could also be due to long strings being rare in binaries anyway. Malicious code wants to give as little away as possible, and benign code would probably just use external resources at that point.
It appears the sweet spot for comparing malicious to benign binaries is a minimum length of four. At this length, malicious binaries have around 22% fewer strings per KB than benign ones.
Overall, the results were in line with what I expected; however, the two sets were a lot closer than I thought they would be.
Future Work
If this were not bad malware analysis I would continue to look at the individual strings for patterns ... oh wait, that was in a previous bad malware analysis.
It may be time to combine these bad ways of analyzing the strings and see if we can make meaningful predictions. I've always wanted to play around with neural nets in Python, so maybe there will be a way to use all my bad string analysis together to form a decent confidence value.