Updating bad analysis

2025-08-24 22:50:28 -04:00
parent 7e27faef70
commit 9630a14124
6 changed files with 46 additions and 46 deletions
--- a/content/posts/bad-malware-analysis-character-count.md
+++ b/content/posts/bad-malware-analysis-character-count.md
@@ -4,13 +4,13 @@ date: 2020-03-06
 draft: false
 ---

-# Introduction
+## Introduction
 I'm thinking of doing a series on bad malware analysis. Hopefully it'll be fun and at least a little informative.
 

 Today's post consists of performing a string analysis on malware. Where most string analysis looks at the big picture, I thought I would take it a step further and look at individual characters. This approach is terrible, as you will soon see.

-# Why Strings
+## Why Strings

 If you've made it this far, I'm assuming you already have some basic knowledge of computers and maybe even looking at malware. As such, you may already know what string analysis is all about, but here is a quick crash course on strings.
 
@@ -23,7 +23,7 @@ In order for a signature to be created from strings, it needs to be very very sp
 
 Indicators are a little more practical for use with strings. The more indicators you find, the more confident you can be that this is a piece of malware.

-# Why Characters
+## Why Characters

 Now for my terrible way of using strings ... character analysis.

@@ -33,21 +33,21 @@ Cutting to the chase, if a piece of software has a lot of the following characte

 v j ; , 4 q 5 /

-## Why Those Characters
+### Why Those Characters

 How did I come to such a wild conclusion that v's and j's are a problem ... time for some terrible analysis.

-## Where Are the Samples
+### Where Are the Samples

 To perform my analysis, I pulled down around 500 samples of malware from theZoo (https://thezoo.morirt.com/) and dasMalwerk (https://dasmalwerk.eu/). For samples of benign software, I grabbed all of /bin on Fedora and 200 libraries from the C:/Windows directory.

-# How Was it Analysed
+## How Was it Analysed

 Next I wrote a Python program to run strings, loop through each individual character, make it lowercase, then count. This was done for both malware and benign samples, which were then compared in two ways:
 1. Count the total number of each character in the malware samples and the total number in the benign samples. Then subtract the two, sort, and look at the results.
 2. Take the ratio of each character count to the file size for the malware and benign samples. Average that across all files, then subtract and compare. (Don't worry, I'll explain.)
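
As a rough illustration of the first comparison, here is a minimal sketch of the counting step. It is not the original program: instead of shelling out to `strings`, it assumes a regular expression for runs of four or more printable ASCII bytes, which only approximates what `strings` reports by default. The names `count_characters` and `total_difference` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: runs of 4+ printable ASCII bytes approximate the default
# output of the `strings` utility.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def count_characters(data: bytes) -> Counter:
    """Tally the lowercased characters found in the printable strings of data."""
    counts = Counter()
    for match in PRINTABLE_RUN.finditer(data):
        counts.update(match.group().decode("ascii").lower())
    return counts


def total_difference(malware: Iterable[bytes], benign: Iterable[bytes]) -> list:
    """Return (character, malware_total - benign_total) pairs, largest first."""
    totals = Counter()
    for data in malware:
        totals.update(count_characters(data))
    for data in benign:
        # subtract() keeps negative tallies, unlike the `-` operator on Counters.
        totals.subtract(count_characters(data))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Characters at the front of the returned list show up more in the malware set, and characters at the end show up more in the benign set.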

-## Basic Count
+### Basic Count

 The basic count is fairly self-explanatory: just keep a running tally of characters and subtract. Here are the top ten characters most likely and least likely to be in malware:

@@ -59,7 +59,7 @@ This is terrible for many reasons, but specifically because it is un-weighted. S

 I wanted to find a way to weight the characters, such that a single sample couldn't skew all of the results.

-## Ratio Analysis
+### Ratio Analysis

 Here's where it gets more complicated, and I'll try to explain.
 1. Keep a running tally per malware sample (not a total for all samples)
@@ -75,7 +75,7 @@ _ s g r o $ f i a "

 Obviously this gives pretty big differences: double quote went from being the worst offender to most benign. Using the ratio gives a much better analysis since it doesn't allow a single sample to skew the results.
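
To make the weighting concrete, here is a small sketch of the ratio comparison, assuming each sample has already been reduced to a pair of per-character counts and a file size in bytes. The function names and the input shape are hypothetical, not taken from the original program.

```python
from collections import Counter, defaultdict


def average_ratios(samples):
    """Average count/file_size per character over (counts, size) pairs."""
    usable = [(counts, size) for counts, size in samples if size > 0]
    if not usable:
        return {}
    sums = defaultdict(float)
    for counts, size in usable:
        for ch, n in counts.items():
            sums[ch] += n / size  # each file contributes a ratio, not a raw count
    return {ch: total / len(usable) for ch, total in sums.items()}


def ratio_difference(malware, benign):
    """Return the malware average ratio minus the benign average ratio per character."""
    mal = average_ratios(malware)
    ben = average_ratios(benign)
    return {ch: mal.get(ch, 0.0) - ben.get(ch, 0.0) for ch in set(mal) | set(ben)}


# Usage: two small malware samples and one benign sample.
malware = [(Counter("vv"), 4), (Counter("v"), 4)]
benign = [(Counter("v"), 8)]
scores = ratio_difference(malware, benign)
```

Because every file contributes a ratio between 0 and 1, one huge sample stuffed with a single character carries no more weight than any other file.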

-# So What Now
+## So What Now

 Using the ratio is probably good on its own, so how did I come up with my character dirty list? I looked at the worst offenders from both ways of analysing the samples and came up with the list: