If you read my blog and follow my research, you know I chose MongoDB as the backend database for my PDFs; if you don’t, well, now you know. The catch is that I don’t actually store the file itself in the database like some people and scripts do. Instead, I take the malicious PDF, convert it to JSON using malpdfobj, and insert the result into the database. This lets me query, search, crunch, update, and share, among many other things, any piece of data I store in the object, including the full object itself.
Why not standard SQL? Three reasons. I wanted the data returned in a usable form (JSON/BSON) without having to parse a blob every time. PDF files contain a lot of data that is often unique to each file, which suits document-based storage. And Mongo makes it easy to handle dynamic content, since there are no columns to define.
I am not going to focus on the deep inner workings of Mongo or the differences between it and SQL. Instead, I want to highlight an interesting way of collecting data and answering questions about my malware using map/reduce. In the simplest terms, map/reduce lets me aggregate a bunch of data and get back a condensed result, for example a count for each unique value. Examples tend to speak louder than theory, so I have included a bunch of jobs I created to answer some interesting questions. These jobs were run on a collection of 310 malicious PDF documents that were converted to JSON and then stored in my malware collection.
- Unique named functions with counts
- Unique encoded named functions with counts
- VirusTotal aggregation and anti-virus detection rates with counts/signatures
- PDF structural component counts and averages
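Before getting to the real jobs, here is the general shape of a map/reduce job as a minimal sketch. It is written in plain Python rather than as an actual Mongo job, so it runs without a database. The sample documents and the `named_functions` field are made up for illustration and are not the real malpdfobj output; `map_doc`, `reduce_counts`, and `map_reduce` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical stand-ins for PDFs converted to JSON. The field names are
# invented for this sketch and do not reflect the malpdfobj schema.
documents = [
    {"name": "a.pdf", "named_functions": ["/JS", "/OpenAction", "/JS"]},
    {"name": "b.pdf", "named_functions": ["/JS", "/AA"]},
    {"name": "c.pdf", "named_functions": []},
]


def map_doc(doc):
    """Map step: emit a (key, 1) pair for every name found in one document."""
    for fn in doc.get("named_functions", []):
        yield fn, 1


def reduce_counts(key, values):
    """Reduce step: collapse every value emitted for one key into a total."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


counts = map_reduce(documents, map_doc, reduce_counts)

# Print the most frequent names first.
for name, total in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(name, total)
```

The map step only ever looks at one document, and the reduce step only ever looks at one key, which is what lets the same pattern answer very different questions by swapping out those two functions.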
Assuming each PDF has hundreds of named function calls, how many unique names are there in total, and how many times does each one show up across the dataset?