While waiting for approval to release certain statistics, I figured I would release some high-level information I found interesting in my malware dataset. To help put things into perspective, I will list some comparisons to the random dataset I currently possess.
Malware (36 samples) : Random (2690 samples)
- Filesize – 292KB : N/A
- Obj with EndObj – 100% : 99.98218404184205%
- Stream with End Stream – 100% : 98.68555312896987%
- Pages – 1.5 : 12.1
- Entropy – 6.6765 : 7.8569
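For anyone wanting to reproduce the structural numbers above, here is a rough sketch of how the obj/endobj, stream/endstream and page counts could be pulled from a single file. It is not the tooling used to build the table: it just scans the raw bytes with regular expressions instead of parsing the PDF properly, and `structure_stats` and the pattern names are made up for this example. Pages inside compressed object streams would be missed.

```python
import re

# Byte-level patterns. These are rough heuristics, not a real PDF parser.
OBJ_RE = re.compile(rb"\b\d+\s+\d+\s+obj\b")      # e.g. "12 0 obj"
ENDOBJ_RE = re.compile(rb"\bendobj\b")
STREAM_RE = re.compile(rb"(?<!end)stream\b")      # "stream" but not "endstream"
ENDSTREAM_RE = re.compile(rb"\bendstream\b")
PAGE_RE = re.compile(rb"/Type\s*/Page(?![a-zA-Z])")  # "/Page" but not "/Pages"


def structure_stats(data: bytes) -> dict:
    """Count opening/closing keywords and page objects in raw PDF bytes."""
    n_obj = len(OBJ_RE.findall(data))
    n_endobj = len(ENDOBJ_RE.findall(data))
    n_stream = len(STREAM_RE.findall(data))
    n_endstream = len(ENDSTREAM_RE.findall(data))
    return {
        "obj": n_obj,
        "endobj": n_endobj,
        "stream": n_stream,
        "endstream": n_endstream,
        "objects_closed": n_obj == n_endobj,
        "streams_closed": n_stream == n_endstream,
        "pages": len(PAGE_RE.findall(data)),
    }


# A tiny hand-written PDF fragment, for illustration only.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Pages /Count 1 >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page >>\nendobj\n"
    b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)
stats = structure_stats(sample)
```

Averaging `objects_closed` and `streams_closed` over a whole directory would give percentages comparable to the ones listed, assuming that is roughly how they were measured.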
Clearly there is a problem comparing these two datasets given the sheer difference in the number of collected samples, but the comparison does help give some idea of what sticks out (identified above). At the end of the day we are worried about identifying malware or files that look suspicious. With our current malware sample set we can glean and assume the following:
- Filesize is relatively small compared to collected samples from Google
- Malware authors seem to pay attention to ensuring their objects and streams are properly closed with corresponding tags (this could simply be because the documents were generated by a PDF creator)
- Malware files seem to be limited in the number of pages added to a document, whereas files of similar sizes in the random set average out to about 12
- Randomness inside the files tends to be considerably lower than in a normal document (files with 1 page in the random set still had higher entropy even with similar sample sets)
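Since entropy carries a lot of weight in the observations above, here is a minimal sketch of one way to measure it. I am assuming Shannon entropy over byte values, measured in bits per byte (so the maximum is 8), which fits the scale of the numbers listed; the function name `shannon_entropy` is just for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    # Sum p * log2(1/p) over every byte value that actually appears.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


low = shannon_entropy(b"aaaaaaaa")          # a single repeated byte
high = shannon_entropy(bytes(range(256)))   # every byte value exactly once
```

Compressed or encrypted streams push a file toward 8, while plain-text objects and padding pull it down, so a whole-file number is only a coarse signal.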
Based on the above assumptions it looks like we could make the following statement about what a suspicious file could look like:
The problem with the above statement is that a legitimate, known-good PDF file could fall into this same category. What are the chances? That is the next question that must be answered. I will construct a small script that goes through all my current (and future) data to identify files with these characteristics. While the statement is broad, it could turn out to be pretty specific with regard to tracking malware. The problem, of course, lies in the fact that all the data above is easily manipulated by the author of the document. Adding pages, data and randomness are all easy ways to help make the PDF blend in.
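The small script mentioned above could start out as something like the sketch below. To be clear, the cut-off values are placeholders I picked to sit near the malware averages listed earlier, not tested thresholds, and `PdfFeatures` and `matches_profile` are hypothetical names. The point is only to show the shape of the check: a file is flagged when it is small, short, low in entropy and structurally clean all at once.

```python
from dataclasses import dataclass


@dataclass
class PdfFeatures:
    size_bytes: int
    pages: int
    entropy: float          # bits per byte, 0.0 to 8.0
    objects_closed: bool    # every obj has a matching endobj
    streams_closed: bool    # every stream has a matching endstream


# Placeholder cut-offs (assumptions, not measured thresholds).
MAX_SIZE_BYTES = 300 * 1024
MAX_PAGES = 2
MAX_ENTROPY = 7.0


def matches_profile(f: PdfFeatures) -> bool:
    """True when a file shares all of the traits seen in the malware set."""
    return (
        f.size_bytes <= MAX_SIZE_BYTES
        and f.pages <= MAX_PAGES
        and f.entropy <= MAX_ENTROPY
        and f.objects_closed
        and f.streams_closed
    )


# Two made-up feature sets, shaped like the averages in the comparison.
malware_like = PdfFeatures(292 * 1024, 1, 6.68, True, True)
random_like = PdfFeatures(2_000_000, 12, 7.86, True, True)
flagged = [matches_profile(malware_like), matches_profile(random_like)]
```

Running this over the random set and counting how many known-good files get flagged would answer the "what are the chances?" question directly. It would also show how quickly the check falls apart once an author pads the document.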