When I mention mapreduce I typically get a blank stare in return, and even more so when I bring it up in the context of malware. I think raw data speaks much louder than theories or whitepapers, so attached are several JSON output files from multiple mapreduce jobs run against 14,993 malicious PDF files. What can you do with this information? It all came from malicious files, so in many cases you should be able to use it to at least classify a document as suspicious or worth a closer look.
I generated this output using a quick Python script I dubbed magic_mappy. All it does is take a definition of my leaf node fields (the ones I want to crunch down), fill in a generic mapreduce job (a super simple one), and write the results to the corresponding MongoDB collection. Below is the code:
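To give a feel for the approach, here is a minimal sketch of the idea. It is not the actual magic_mappy script. It assumes each PDF has already been parsed into a nested dictionary, and it emulates the map and reduce steps in plain Python instead of submitting a job to MongoDB. The field paths, collection names, and sample records are all made up for illustration.

```python
from collections import Counter

# Hypothetical leaf-node definitions: output collection name -> dotted path
# into a parsed PDF record. These names are illustrative assumptions.
LEAF_FIELDS = {
    "producer_counts": "metadata.producer",
    "creator_counts": "metadata.creator",
}


def get_leaf(doc, dotted_path):
    """Walk a dotted path through nested dicts; return None if a key is missing."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def map_reduce_counts(docs, dotted_path):
    """Map: emit (value, 1) for each document. Reduce: sum the ones per value.

    Assumes leaf values are hashable (for example, strings).
    """
    counts = Counter()
    for doc in docs:
        value = get_leaf(doc, dotted_path)
        if value is not None:
            counts[value] += 1
    return counts


def run_jobs(docs, leaf_fields):
    """Run one counting job per leaf field, keyed by its output collection name."""
    return {name: map_reduce_counts(docs, path) for name, path in leaf_fields.items()}


# Made-up sample records, not taken from the real data set.
sample_docs = [
    {"md5": "hash-1", "metadata": {"producer": "ExampleProducer 1.0", "creator": "ToolA"}},
    {"md5": "hash-2", "metadata": {"producer": "ExampleProducer 1.0"}},
    {"md5": "hash-3", "metadata": {"producer": "OtherTool 2.3", "creator": "ToolA"}},
]

results = run_jobs(sample_docs, LEAF_FIELDS)
```

In a real setup, each entry in `results` would be written to a MongoDB collection of the same name rather than kept in memory, and the counting would run server-side as a mapreduce job.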
I still need to prune the results, but expect to see more useful output that associates the top-level MD5 hashes with a given value in something like the metadata or stream contents. This sort of output should help analysts trace back where a file may have come from or what was used to generate it.
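One way that hash-to-value association could look is a simple inverted index: for each distinct value of a field, collect the MD5 hashes of the files that contain it. The sketch below is only an assumption about the shape of that output. The `md5` key, the field path, and the sample records are hypothetical placeholders.

```python
from collections import defaultdict


def get_leaf(doc, dotted_path):
    """Walk a dotted path through nested dicts; return None if a key is missing."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def index_hashes_by_value(docs, dotted_path, hash_key="md5"):
    """Group file hashes by the value found at dotted_path.

    Returns a dict mapping each distinct value to a sorted list of hashes.
    Documents missing the field or the hash are skipped.
    """
    index = defaultdict(set)
    for doc in docs:
        value = get_leaf(doc, dotted_path)
        file_hash = doc.get(hash_key)
        if value is not None and file_hash is not None:
            index[value].add(file_hash)
    return {value: sorted(hashes) for value, hashes in index.items()}


# Made-up records with placeholder hashes, not real samples.
sample_docs = [
    {"md5": "hash-1", "metadata": {"producer": "ExampleProducer 1.0"}},
    {"md5": "hash-2", "metadata": {"producer": "ExampleProducer 1.0"}},
    {"md5": "hash-3", "metadata": {"producer": "OtherTool 2.3"}},
    {"md5": "hash-4", "metadata": {}},
]

hashes_by_producer = index_hashes_by_value(sample_docs, "metadata.producer")
```

With an index like this, an analyst who spots an unusual producer string can pull every file that shares it in one lookup.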