Malware, MongoDB and Map/Reduce: A New Analyst Approach

    March 24, 2011 • Uncategorized

    If you read my blog and follow my research, you know I chose MongoDB as the backend database for my PDFs; if you don’t, now you know. The catch is that I don’t store the file itself in the database, as some people and scripts do. Instead, I take the malicious PDF, convert it to JSON using malpdfobj and insert the result into the database. This lets me query, search, crunch, update and share any piece of data stored in the object, including the full object itself.

    Why not standard SQL? I wanted the data returned in a form I don’t have to parse out of a blob every time (JSON/BSON), PDF files contain a lot of data that is often unique to each file (document-based storage), and Mongo makes it easy to handle dynamic content (no fixed columns).

    I am not going to focus on the inner workings of Mongo or how it differs from SQL. Instead, I want to highlight an interesting way of collecting data and answering questions about my malware using Map/Reduce. In the simplest terms, map/reduce lets me aggregate a bunch of data and get back one summarized result per unique key. Examples tend to speak louder than theory, so I have included a bunch of jobs I created to answer some interesting questions. These jobs were run against 310 malicious PDF documents that were converted to JSON and stored in my malware collection.
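    To make the idea concrete, here is a minimal in-memory sketch of the map/reduce pattern in plain Python. It is not one of the Mongo jobs from this post: the `map_reduce` helper, the sample documents and the `version` field are all hypothetical and exist only to show how emitted key/value pairs are grouped and then reduced.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Run a tiny in-memory map/reduce over an iterable of dict documents."""
    grouped = defaultdict(list)
    # Map phase: each document emits zero or more (key, value) pairs.
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Reduce phase: collapse every value emitted for a key into one result.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Hypothetical documents; the real malpdfobj JSON layout is not shown here.
docs = [
    {"hash": "a", "version": "1.4"},
    {"hash": "b", "version": "1.6"},
    {"hash": "c", "version": "1.4"},
]


def emit_version(doc):
    # Emit a count of 1 for the (assumed) version field of each document.
    yield doc["version"], 1


def sum_values(key, values):
    # Add up all counts emitted for the same key.
    return sum(values)


result = map_reduce(docs, emit_version, sum_values)
print(result)
```

    The map function decides what to count and the reduce function decides how to combine it, so answering a different question usually means swapping only those two functions.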

    Summary Breakdown:

    1. Unique named functions with counts
    2. Unique encoded named functions with counts
    3. VirusTotal aggregation and anti-virus detection rates with counts/signatures
    4. PDF structural component counts and averages


    PDF files contain named functions that often lead to the exploit or payload. Usually you will see /OpenAction, /JS, /JavaScript and a few others sprinkled throughout a document, but there are often hundreds of these named functions and some may serve no real purpose.


    Assuming each PDF contains hundreds of named function calls, how many unique names are there in total, and how many times does each one show up across the dataset?
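    As a rough illustration of how this question can be answered, the sketch below counts names across a handful of documents in plain Python. The `names` field and the sample data are assumptions made for this example; the actual job ran as map/reduce inside Mongo against the malpdfobj JSON, whose layout is not reproduced here.

```python
from collections import Counter

# Hypothetical layout: each document lists the names found in its objects.
docs = [
    {"names": ["/OpenAction", "/JS", "/JavaScript", "/JS"]},
    {"names": ["/OpenAction", "/Type", "/Pages"]},
    {"names": ["/JS", "/Type"]},
]


def count_names(documents):
    """Return a Counter mapping each name to its total number of occurrences."""
    totals = Counter()
    for doc in documents:
        # Documents without a names list simply contribute nothing.
        totals.update(doc.get("names", []))
    return totals


name_counts = count_names(docs)
unique_names = len(name_counts)

print("unique names:", unique_names)
for name, count in name_counts.most_common():
    print(name, count)
```

    Sorting by count makes the common names easy to skim past, so the rare ones, which are often the more interesting, are left at the bottom of the list.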


    Job Statistics:

    In this output we can see that in most cases obj/endobj and stream/endstream are both included, meaning the documents are well-formed. We can also see that startxref, trailer and xref always appear in the documents as well. While this is not enough to draw a definitive conclusion, it is enough to say that malicious documents appear to follow the PDF specification with regard to structural components.
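    The kind of check described above can be sketched in a few lines of Python. The per-document keyword counts below are made-up sample values, and the field names are assumptions rather than the real malpdfobj keys; the point is only to show how averages and obj/endobj and stream/endstream pairing could be computed.

```python
# Hypothetical per-document keyword counts (sample values, not real results).
docs = [
    {"obj": 12, "endobj": 12, "stream": 3, "endstream": 3,
     "xref": 1, "trailer": 1, "startxref": 1},
    {"obj": 8, "endobj": 7, "stream": 2, "endstream": 2,
     "xref": 1, "trailer": 1, "startxref": 1},
    {"obj": 20, "endobj": 20, "stream": 5, "endstream": 5,
     "xref": 2, "trailer": 2, "startxref": 2},
]

# Keywords that should open and close the same number of times.
PAIRS = [("obj", "endobj"), ("stream", "endstream")]


def is_balanced(doc):
    """True when every opening keyword count matches its closing keyword count."""
    return all(doc.get(opening, 0) == doc.get(closing, 0)
               for opening, closing in PAIRS)


def averages(documents):
    """Average count of each keyword across all documents."""
    if not documents:
        return {}
    keys = sorted({key for doc in documents for key in doc})
    total_docs = len(documents)
    return {key: sum(doc.get(key, 0) for doc in documents) / total_docs
            for key in keys}


balanced = sum(1 for doc in docs if is_balanced(doc))
avg = averages(docs)

print("balanced documents:", balanced, "of", len(docs))
for key, value in avg.items():
    print(key, round(value, 2))
```

    Matching counts are only a rough proxy for a well-formed file, since they say nothing about ordering or nesting, but they are cheap to compute across a whole collection.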