• Data Mining + Malware = Improved Analysis

    by  • April 30, 2012 • Uncategorized

    Over the past few weeks I have been talking with different analysts, programmers and RE folks about the future of malware analysis and how we combat changes in attacks. Ripping apart binaries and developing signatures based on tactics, techniques, and procedures (TTPs) doesn’t scale as more and more new threats emerge, so we need to start thinking about something new. It goes without saying that signatures do a great job, but they depend on already knowing something about an attack.

    In the next few postings I will take some time to focus on data mining. This field of study can be applied to any discipline, but you often come up short when Googling it coupled with malware. My background is not so much in the math world, so I will steer away from the inner workings of the algorithms, but will highlight the strengths and weaknesses identified. To avoid overkill, I will also leave it to you, the reader, to figure out the data mining process. 


    Using thousands of malicious, known-good, and targeted documents, a classifier was trained and tested using multiple algorithms with a high success rate. Each PDF in the dataset was transformed from its native format to a flat vector of unweighted features that was then fed to a learner. After training the learner, several tests were run using known techniques to evaluate how well the classifier performed.
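
    The post does not say which evaluation techniques were used. As one illustration, per-class recall on held-out files shows whether a small class is being drowned out by the larger ones. The labels below are invented toy data, and the function name is an assumption rather than anything from the project.

```python
from collections import Counter

def per_class_recall(true_labels, predicted_labels):
    """Return {class: fraction of that class's samples labelled correctly}."""
    totals = Counter(true_labels)
    correct = Counter(
        truth for truth, guess in zip(true_labels, predicted_labels)
        if truth == guess
    )
    return {label: correct[label] / totals[label] for label in totals}

# Toy held-out labels, invented for illustration only.
truth = ["malicious", "malicious", "benign", "benign", "targeted", "targeted"]
guess = ["malicious", "malicious", "benign", "malicious", "benign", "targeted"]

recall = per_class_recall(truth, guess)
for label, value in recall.items():
    print(f"{label}: {value:.0%}")
```

    Reporting recall per class, rather than a single accuracy number, keeps a weak result on a rare class from hiding behind strong results on the common ones.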

    The top two algorithms (decision tree and k-nearest neighbor) were implemented within a new method (classify) that is part of the PDF X-RAY API. Users can now get classification results back for each of the respective algorithms by simply submitting their questionable file. 
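
    For readers new to k-nearest neighbor, here is a minimal sketch of the idea: a file is labelled by majority vote among the training vectors closest to it. This is not the PDF X-RAY implementation; the three-element vectors, the labels, and the choice of k=3 are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_classify(sample, training_set, k=3):
    """Label `sample` by majority vote among its k nearest training vectors."""
    # Sort training items by Euclidean distance to the sample.
    by_distance = sorted(training_set, key=lambda item: math.dist(sample, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy (feature_vector, label) pairs, invented for illustration only.
training_set = [
    ((0, 0, 1), "non-malicious"),
    ((0, 1, 1), "non-malicious"),
    ((1, 0, 2), "non-malicious"),
    ((5, 4, 9), "malicious"),
    ((6, 5, 8), "malicious"),
    ((5, 5, 9), "malicious"),
]

label = knn_classify((5, 4, 8), training_set)
print(label)
```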

    Dataset Files

    For any good classifier to work well, one must have a sizable selection of data that can be used to train it. For the PDF example, the following datasets were used:

    • Malicious – 15K
    • Non-malicious – 6K
    • Targeted – 320

    Just based on the supplied sample set, it is easy to see that the targeted class may have issues later on simply because there is so little data for it.
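
    To put the imbalance in numbers, the short calculation below works out each class's share of the dataset. It treats the rounded figures above (15K, 6K, 320) as exact counts, which is an assumption.

```python
# Rounded counts from the list above, treated as exact for this estimate.
counts = {"malicious": 15_000, "non-malicious": 6_000, "targeted": 320}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:>14}: {share:.1%}")
```

    Targeted documents make up roughly 1.5% of the files, so a learner can score well overall while rarely getting that class right.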

    PDF to Vector

    Having worked with the PDF format for quite some time now, I felt this would be a good example to start with. To train the classifier, one must first transform the data from the native PDF format to a vector of features. Features are typically extracted using various techniques, but in this case, I relied on my experience with the documents to pick the features myself. 

    In total, I ended up picking 35 features to represent what a PDF “looked like”. My first test runs included 23 features, but I later found that by adding more, I was able to get more stable results. These 35 features include items like known named dictionaries used in malicious attacks, filters, structural attributes such as object counts, size, etc. 
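
    As a rough sketch of what turning a PDF into a flat vector can look like, the code below counts a handful of name keys, the number of objects, and the file size. The feature list is hypothetical; the actual 35 features are not published in this post, and the function name `pdf_to_vector` is an assumption.

```python
import re

# Hypothetical features for illustration; not the 35 used in the project.
NAME_FEATURES = [
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
    b"/Launch", b"/FlateDecode", b"/ASCIIHexDecode",
]

def pdf_to_vector(data):
    """Turn raw PDF bytes into a flat list of integer features."""
    vector = []
    for name in NAME_FEATURES:
        # Require a non-alphanumeric character (or end of data) after the
        # name so that /JS does not also count longer names starting with it.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        vector.append(len(re.findall(pattern, data)))
    # Structural attributes: indirect object count and total size in bytes.
    vector.append(len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)))
    vector.append(len(data))
    return vector

# A tiny hand-written PDF-like fragment, invented for illustration only.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"%%EOF\n"
)

vector = pdf_to_vector(sample)
print(vector)
```

    A plain byte search like this misses names that are hex-escaped (for example `/J#53`) or hidden inside compressed streams, so a real extractor would need to normalize and decode the document first.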

    In most cases the problem was separating targeted documents from the malicious and non-malicious ones. This was to be expected, given that a lot of targeted documents are nothing more than a good document with malicious code injected into it. It is possible that this problem could be solved with more fine-grained feature selection to account for the subtle differences between a targeted document and the others. It is also possible that weights could be introduced into certain features to better control the end decision.
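
    One way feature weights could work with a distance-based learner is to scale each feature's contribution to the distance. This is only a sketch of that idea, not something the classifier currently does; the feature layout and the weight of 25 are invented for illustration.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance where each feature's squared difference is scaled."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical layout: (javascript_count, page_count, object_count).
clean_doc = (0, 12, 40)
targeted_doc = (1, 12, 41)

unweighted = weighted_distance(clean_doc, targeted_doc, (1, 1, 1))
weighted = weighted_distance(clean_doc, targeted_doc, (25, 1, 1))

print(f"unweighted: {unweighted:.3f}")
print(f"weighted:   {weighted:.3f}")
```

    With equal weights the two documents look nearly identical. Weighting the JavaScript count more heavily pushes them apart, which is the kind of control over the end decision described above.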

    Working Implementation

    Unlike most of my other tools, I am not quite ready to release all the code I have put together for this project, but wanted to provide the public with a way to take advantage of it. Earlier today I added a new API call to PDF X-RAY that lets users submit files to the server and get back a classification response. 


    To submit to the API, users can use the same code present on the API page with a slight tweak to the API call.



    More testing needs to be done, but so far introducing the concept of data mining into malware analysis appears to be a great direction. Now that the classifier has been built, this significantly reduces the amount of time spent looking at files to determine whether or not they are malicious. Furthermore, with a bit of tuning, this tool could soon aid in the detection of targeted attacks. In some of the future postings, other techniques like clustering will be demonstrated to begin grouping similar files of different types.

    As a final thought: is this sort of solution perfect? No, but I have yet to see any tool in this field that’s perfect. Anti-virus companies make millions off products that only work some of the time, whereas these concepts and methods are free and based on solid mathematical foundations. I suspect this sort of technology will find its way into malware analysis more and more as time progresses.