Updating bad analysis

This commit is contained in:
2025-08-24 22:50:28 -04:00
parent 7e27faef70
commit 9630a14124
6 changed files with 46 additions and 46 deletions

@@ -1,10 +1,10 @@
# Summary
## Summary
I'm a Software Engineer with over 11 years of development and 15 years of professional experience, with exposure to the C, Python, PHP, JavaScript, Java, and C++ languages; various SQL databases; the jQuery and Pytest frameworks; Docker containerization; and REST API, JSON, XML, and nginx technologies.
# Work Experience
## Work Experience
## Binary Defense
### Binary Defense
**Sr Software Engineer**: April 2022 - Present
@@ -12,7 +12,7 @@ I'm a Software Engineer with over 11 years development and 15 years professional
- Develop security alarms for Windows, Linux (Debian and RedHat), and MacOS
- Design and build containment for all platforms upon detected compromise
## Kyrus Tech
### Kyrus Tech
**Sr Software Engineer**: Nov 2020 - April 2022
@@ -21,7 +21,7 @@ I'm a Software Engineer with over 11 years development and 15 years professional
- Design compact router fingerprinting and vulnerability analysis: Android, HTTPS, TCP/IP, StreamCypher Encryption
- Modify existing code to suppress logging from inside the Linux Kernel: various Linux Kernel versions, Ghidra
## Parsons
### Parsons
**Cyber Security Software Engineer**: Apr 2018 - Nov 2020
@@ -34,7 +34,7 @@ I'm a Software Engineer with over 11 years development and 15 years professional
- Track and maintain multi-level user access
- Generate metadata for searching
## NSA
### NSA
**Security Software Engineer**: Nov 2011 - Apr 2018
@@ -54,7 +54,7 @@ I'm a Software Engineer with over 11 years development and 15 years professional
- Organize, train, and participate in a team performing a 24x7 call-in rotation
- Responsible for 5+ domestic and foreign system deployments
## Salisbury University
### Salisbury University
**Software Developer**: Nov 2006 - May 2008
@@ -69,7 +69,7 @@ I'm a Software Engineer with over 11 years development and 15 years professional
- Maintain the Linux labs on campus: dual-boot OpenSUSE/Windows XP machines and an OpenSUSE server
- Perform backups, updates, user management (LDAP), disk quotas, and remote access
# Education
## Education
University of Maryland Baltimore County
@@ -87,7 +87,7 @@ Royal Military College (RMC Canada)
: Training in OpenBSD development and administration
# Miscellaneous
## Miscellaneous
RedBlue Conference

@@ -4,13 +4,13 @@ date: 2020-03-06
draft: false
---
# Introduction
## Introduction
I'm thinking of doing a series on bad malware analysis. Hopefully it'll be fun and at least a little informative.
Today's post consists of performing a string analysis on malware. Where most string analysis looks at the big picture, I thought I would take it a step further and look at individual characters. This approach is terrible, as you will soon see.
# Why Strings
## Why Strings
If you've made it this far, I'm assuming you already have some basic knowledge of computers and maybe even of looking at malware. As such, you may already know what string analysis is all about, but here is a quick crash course on strings.
@@ -23,7 +23,7 @@ In order for a signature to be created from strings, it needs to be very very sp
Indicators are a little more practical for use with strings. The more indicators you find, the more confident you can be that a sample is a piece of malware.
# Why Characters
## Why Characters
Now for my terrible way of using strings ... character analysis.
@@ -33,21 +33,21 @@ Cutting to the chase, if a piece of software has a lot of the following characte
v j ; , 4 q 5 /
## Why Those Characters
### Why Those Characters
How did I come to such a wild conclusion that v's and j's are a problem ... time for some terrible analysis.
## Where Are the Samples
### Where Are the Samples
To perform my analysis, I pulled down around 500 samples of malware from theZoo (https://thezoo.morirt.com/) and dasMalwerk (https://dasmalwerk.eu/). For samples of benign software I grabbed all of /bin on Fedora and 200 libraries from the C:/Windows directory.
# How Was it Analysed
## How Was it Analysed
Next I wrote a python program to run strings, loop through each individual character, lowercase it, then count. This was done for both the malware and benign samples, then compared in two ways:
1. Count the total number of characters in the malware samples and the total number in the benign. Then subtract the two. Sort and look
2. Take the ratio of each character count to the file size for the malware and benign samples. Average that across all files, then subtract and compare. (don't worry I'll explain)
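The counting step and method 1 can be sketched roughly like this. The original program isn't shown, so this is a minimal reconstruction that assumes the `strings` output has already been captured per sample (e.g. via `subprocess.run(["strings", path], ...)`); all the names here are mine:

```python
from collections import Counter

def char_counts(strings_output):
    """Tally each character in a sample's `strings` output, lowercased."""
    return Counter(c for c in strings_output.lower() if not c.isspace())

def total_diff(malware_texts, benign_texts):
    """Method 1: total the counts per corpus, then subtract per character."""
    mal = sum((char_counts(t) for t in malware_texts), Counter())
    ben = sum((char_counts(t) for t in benign_texts), Counter())
    # Positive values lean malicious, negative lean benign; sort and look.
    return {c: mal[c] - ben[c] for c in set(mal) | set(ben)}
```
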
## Basic Count
### Basic Count
The basic count is fairly self explanatory, just keep a running tally of characters and subtract. Here are the top ten characters most likely and least likely to be in malware:
@@ -59,7 +59,7 @@ This is terrible for many reasons, but specifically because it is un-weighted. S
I wanted to find a way to weigh the characters, such that a single sample couldn't skew all of the results.
## Ratio Analysis
### Ratio Analysis
Here's where it gets more complicated, and I'll try to explain.
1. Keep a running tally per malware sample (not a total for all samples)
@@ -75,7 +75,7 @@ _ s g r o $ f i a "
Obviously this gives pretty big differences: the double quote went from being the worst offender to the most benign. Using the ratio gives a much better analysis, since it doesn't allow a single sample to skew the results.
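A minimal sketch of the ratio method as I read it (per-sample tallies divided by file size, averaged per corpus, then subtracted); the helper names and the `(strings_output, file_size)` input shape are my assumptions:

```python
from collections import Counter

def char_ratios(strings_output, file_size):
    """Per-sample tally of each character, divided by the file size."""
    counts = Counter(strings_output.lower())
    return {c: n / file_size for c, n in counts.items()}

def avg_ratios(samples):
    """Average each character's ratio across all samples in a corpus.

    `samples` is a list of (strings_output, file_size_in_bytes) pairs.
    """
    totals = Counter()
    for text, size in samples:
        totals.update(char_ratios(text, size))
    return {c: v / len(samples) for c, v in totals.items()}

def ratio_diff(malware_samples, benign_samples):
    """Positive values mean the character is over-represented in malware."""
    mal, ben = avg_ratios(malware_samples), avg_ratios(benign_samples)
    return {c: mal.get(c, 0) - ben.get(c, 0) for c in set(mal) | set(ben)}
```
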
# So What Now
## So What Now
Using the ratio is probably good on its own, so how did I come up with my dirty character list? I looked at the worst offenders from both ways of analysing the code and came up with the list:

@@ -4,13 +4,13 @@ date: 2020-04-12
draft: false
---
# Introduction
## Introduction
For this bad malware analysis, I thought I would continue the theme of counting letters ... that way I could use most of my old code :)
Today, I decided to hash each file using sha512. Hashing is supposed to be completely random, so this is almost a test of that as well. I used around 3000 malicious samples and 1800 benign, so let's get started.
# Why Hash, Why sha512
## Why Hash, Why sha512
Hashing binaries is done all the time to verify downloads, check for changes, provide signatures, provide low hanging fruit for malware signatures, and many more purposes. It is so widely used, I was wondering if it was possible to use the hash itself as a flag to determine if this could be malware (beyond just a hash table).
@@ -20,31 +20,31 @@ The reason I decided to do the letter count on hashes was for two reasons; 1) it
The reason I decided on sha512 is also twofold: 1) it's long, so it'll provide some of the most data, and 2) sha in general is one of the most accepted families of hashing algorithms, so I went with that.
# What Was My Result
## What Was My Result
Surprising! There seems to be a pattern of what characters show up most in hashes for malware.
# What!
## What!
Yep, it appears that if you see around 3% more f's and 1% more 7's and 5's in your sha512 hash, then you might have some malware.
## That Can't be Right!
### That Can't be Right!
Hard to believe, but that is what it seems like: 'f', '7', and '5' show up more, and 'e' and '6' show up about 1% less, in malware.
# Ok, So How Was it Done
## Ok, So How Was it Done
## Where are the Samples
### Where are the Samples
Same as in my string analysis: to perform my hash analysis, I pulled down around 500 samples of malware from [theZoo](https://thezoo.morirt.com/) and [dasMalwerk](https://dasmalwerk.eu/). For samples of benign software I grabbed all of /bin on Fedora and 200 libraries from the C:/Windows directory.
## How was it Analysed
### How was it Analysed
I modified my program from doing string checks to perform the hash analysis. Now, instead of running strings on each of the files, it performs a sha512 hash. I then averaged the number of times each character was seen per file. This means I counted the number of '1's seen across all malicious file hashes, then divided by the total number of files.
This was done for all characters, for both the malicious and benign binaries. After that I subtracted the benign averages from the malicious averages and divided by the original value.
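As a sketch, assuming the description above: average per-character counts over the 128-hex-character sha512 digests, then `(malicious - benign) / benign`, which is my reading of "divided by the original value". The function names are mine:

```python
import hashlib
from collections import Counter

HEX = "0123456789abcdef"

def hash_char_avgs(file_contents):
    """Average count of each hex character over the sha512 digests of files."""
    totals = Counter()
    for data in file_contents:
        totals.update(hashlib.sha512(data).hexdigest())  # 128 hex characters
    return {c: totals[c] / len(file_contents) for c in HEX}

def relative_diff(mal_avgs, ben_avgs):
    """(malicious - benign) / benign, per hex character."""
    return {c: (mal_avgs[c] - ben_avgs[c]) / ben_avgs[c]
            for c in mal_avgs if ben_avgs.get(c)}
```
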
# Why?
## Why?
So a difference of 1-2% is not that much, but 3% seems more significant. This shouldn't happen; all characters should show up about evenly. It can probably be accounted for by just the samples I had chosen. Choose a different set of 1000 binaries and the results could be different.

@@ -4,15 +4,15 @@ date: 2021-03-08T20:20:31Z
draft: false
---
# Introduction
## Introduction
Next up in bad malware analysis is comparing the size of a file to the output of the strings command. The idea here is that malware may contain fewer strings per KB than benign binaries. This would make logical sense, as many malware samples are packed, encrypted, and/or stored in the data section of the binary to be extracted later. This is done to help obfuscate them from hash signatures.
# Samples
## Samples
There are around 500 malware samples, coming from two sources: [theZoo](https://thezoo.morirt.com/) and [dasMalwerk](https://dasmalwerk.eu/). For samples of benign software I grabbed 200 libraries from the C:/Windows directory.
# Calculations
## Calculations
Using python I wrote a quick script to count the number of strings returned (separated by a newline) and compared it to the size (in KB) of the file. I performed this using strings with minimum lengths of 2, 3, 4, 5, and 6. Why those numbers ... because that is where I decided to stop. The average strings per KB was then calculated.
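A rough sketch of the per-KB calculation for a single binary; approximating the strings utility with a regex over printable-ASCII runs is my substitution, not the original script:

```python
import re

def strings_per_kb(data, min_len):
    """Strings-per-KB for one binary: count printable-ASCII runs of at
    least `min_len` bytes (a stand-in for the strings utility), then
    divide by the file size in KB."""
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return len(runs) / (len(data) / 1024)
```
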
@@ -24,7 +24,7 @@ Using python I wrote a quick script to count the number of strings returned (sep
| 5 | 5.59 | 5.58 | 0.18 % |
| 6 | 4.32 | 3.96 | 8.33 % |
# Results
## Results
The results are kinda in line with what I thought. Most of the malicious binaries have fewer strings per KB than the benign ones. Surprisingly, looking at minimum string lengths of two and five, the benign and malicious binaries have about the same number of strings per KB. The string length of two makes sense, as a lot of strings that small come down to random bytes in the binary looking like strings.
@@ -34,7 +34,7 @@ It appears the sweet spot for comparing malicious to benign binaries is four. At
Overall the results were in line with what I expected, however they were a lot closer than I thought they would be.
# Future Work
## Future Work
If this were not bad malware analysis I would continue to look at the individual strings for patterns ... oh wait that was in previous bad malware analysis.

@@ -4,13 +4,13 @@ date: 2020-09-16
draft: false
---
# Introduction
## Introduction
Continuing from my Bad Malware Analysis, we now take a look at Bad Password Analysis. Mostly this is just for the fun of it, but we'll see if we can learn anything along the way.
In this Bad Malware Analysis post, we'll look at consecutive character frequency. I've done analysis on two and three consecutive characters and compared it to a word frequency list generated from subtitle archives.
# Data
## Data
The passwords come from several leaks. These include the honeynet, myspace, rockyou, hotmail, phpbb, and tuscl lists. All of these lists also contain a count of how many times each password was used. In total there are 14,584,438 unique passwords.
@@ -18,7 +18,7 @@ For comparison, I'm using an English word frequency list generated from subtitle
I wrote a quick script to combine these into a single text file, to remove all duplicates and update all counts. This was the list used for all further analysis.
# Algorithm
## Algorithm
Everything is written in python.
A few decisions needed to be made before analyzing the data. The first was to not worry about substitutions, so in my analysis @ does not equal a. This is a limitation, since handling substitutions would provide a more accurate representation of the characters in passwords. Second, all passwords and English words were set to lowercase, so that patterns would be more apparent. If the goal is cracking, it's incredibly easy to just change cases.
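The counting described above might look something like this sketch (the `(password, times_used)` input shape and the helper names are my assumptions):

```python
from collections import Counter

def bigram_counts(entries, use_frequency=True):
    """Tally consecutive two-character pairs across passwords.

    `entries` is a list of (password, times_used) pairs; passwords are
    lowercased first. With use_frequency=False every password counts once.
    """
    tally = Counter()
    for password, count in entries:
        weight = count if use_frequency else 1
        p = password.lower()
        for a, b in zip(p, p[1:]):
            tally[a + b] += weight
    return tally

def top_coverage(tally, n=100):
    """Fraction of all observed pairs covered by the n most common pairs."""
    total = sum(tally.values())
    return sum(c for _, c in tally.most_common(n)) / total
```
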
@@ -37,24 +37,24 @@ There is an option to turn off the use of frequency. I've analyzed this below as
Given a starting character, will this analysis allow us to predict what the next character will be?
# Analysis
## Analysis
For the analysis, I looked at the data both with and without frequency counts. Within that, I did an internal comparison of the frequency of each character set seen, as well as a comparison with the dictionary values.
## With Frequency
### With Frequency
With frequency taken into account, the top 100 two-character password combinations only cover 11% of all combinations. This seems rather low (I know, very technical), so intuition says this is not a good way to predict the password. In addition, the top combination is 's2', which only constitutes 0.15% of combinations.
Let's compare this to the dictionary words. The top 100 combinations cover a staggering 60% of all combinations. This would be a good predictor of the next letter in English. The top combination in the dictionary data, 'th', covers almost 3% of all combinations.
Comparing the two further, we can see 10 character combinations shared between the top 100 password and dictionary lists. I was expecting this to be higher, but it could be due to character substitutions in passwords. For example, 'mo' is in the dictionary top 100 but not the password top 100; however, 'm0' is in the top 100 password list.
## Without Frequency
### Without Frequency
Without taking into account the frequency of words and passwords, things don't change much. The top 100 password combinations now account for 35% of all combinations, which seems like it could be a better predictor, but this weighs good unique passwords the same as common ones. The dictionary gets worse, at only 45% of combinations accounted for in the top 100.
Without taking into account frequency, '08' becomes the top password combination at 0.79% and 'se' becomes the top dictionary combination at 1.13%.
Surprisingly, without taking frequency into account, we see fewer substitutions in the password data. We now see 64 out of 100 duplicates between the data sets. This is closer to what I would have expected. Most people tend to use dictionary words for their passwords, so it would make sense to see duplicates across the data.
# Conclusion
## Conclusion
This is probably not a good way to go about cracking passwords. Mostly this data simply shows that you should use dictionary word lists and substitution lists.
We could have done a few things better. One would be to look at common substitutions and see how that changes things. In many of the passwords, the standard alpha characters are replaced by numbers and symbols, such as @ or 4 for a, 5 or $ for s, and so on.

@@ -4,7 +4,7 @@ date: 2021-03-11T18:55:01Z
draft: false
---
# Introduction
## Introduction
For this episode of bad analysis, we are going to be looking at word frequency in passwords. Overall this isn't terrible analysis, but what makes it bad is that I'm only looking for the first occurrence of a dictionary word in each password. This will miss *a lot* of words in passwords.
@@ -17,7 +17,7 @@ Additionally we will miss words because:
We are missing a lot of words in these passwords, but that is why this is bad analysis.
# Data
## Data
The passwords come from several leaks. These include the honeynet, MySpace, rockyou, hotmail, phpbb, and tuscl lists. All of these lists also contain a count of how many times each password was used. In total there are 14,584,438 unique passwords.
@@ -25,9 +25,9 @@ This took forever to loop through, pulling out the words, then comparing them to
I'm comparing the password list to the American English word list found on Linux. There may be a more complete list somewhere out there, but this worked for me.
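A sketch of the first-occurrence scan described above; whether the original prefers longer matches at a given position is not stated, so that detail is my assumption:

```python
def first_word(password, words, min_len=3):
    """Return the first dictionary word found in `password`, or None.

    Scans left to right; at each position the longest match wins (an
    assumption - the post only says 'first occurrence'). Only one word
    per password is ever extracted, which is what makes this analysis bad.
    """
    p = password.lower()
    for i in range(len(p)):
        for j in range(len(p), i + min_len - 1, -1):
            if p[i:j] in words:
                return p[i:j]
    return None
```
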
# Results
## Results
## Raw Data
### Raw Data
The words were extracted, counted, and sorted. There were 68,402 unique words; the top 10 account for around 5% of total words seen, and 21,191 of the unique words appeared in only a single password.
@@ -48,7 +48,7 @@ All percentages are approximate
| and | 0.2 % |
| ito | 0.2 % |
## Additional Fun Stuff
### Additional Fun Stuff
How positive are people's passwords? Using a list of positive words found at [Positive List](https://gist.github.com/mkulakowski2/4289437) and a list of negative words found at [Negative List](https://gist.github.com/mkulakowski2/4289441), I've compared them against the word frequency from our list.
@@ -64,7 +64,7 @@ Positive words were used 1,172,617 times and negative words were used 1,172,617.
Looking at positive and negative occurrences has its own issues beyond just the word analysis. As you can see, there are certain omissions that I would think would be in the positive list, like "baby." There are also inclusions in the negative list that I would not have made, such as "mar," which could just be March from someone's birthday. Better lists would need to be found or crafted, or entire passwords would need a language processor to determine whether they are negative or positive.
# Conclusion
## Conclusion
Not much to conclude here; mostly this was for fun. Don't use dictionary words in your password, since it doesn't take long to loop through the dictionary, and if you do, try to use longer random words rather than meaningful ones.
@@ -72,7 +72,7 @@ People tend to be more positive in their passwords which is nice to see.
This was a lot of fun to implement and I may come back to this to see if I can improve upon looking at words.
# Future Work
## Future Work
- Thread all the things, maybe it'll run faster.
- Look for more than just the first word in each password