One of the most important elements of a malicious PDF is the content and payload within it. Up until this point, the PDF malware object my tool output contained only details about the structure, hashes and scan data. That was great, but it made it difficult to identify the exploit being used. I just made an update that adds a new section to the object representation: contents. Contents holds every object within a PDF along with its ID, encoded output and decoded output.
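To make the shape of the new section concrete, here is a minimal sketch of how a contents entry could be assembled. The key names (`id`, `encoded`, `decoded`) and the `build_contents` helper are illustrative assumptions, not the tool's exact schema.

```python
import json


def build_contents(objects):
    """Build a hypothetical "contents" section.

    `objects` is an iterable of (object_id, encoded_bytes, decoded_bytes).
    Bytes are mapped to text with latin-1, which is a 1:1 byte-to-character
    mapping, so the raw output survives JSON serialization unchanged.
    """
    contents = []
    for obj_id, encoded, decoded in objects:
        contents.append({
            "id": obj_id,                           # assumed key name
            "encoded": encoded.decode("latin-1"),   # assumed key name
            "decoded": decoded.decode("latin-1"),   # assumed key name
        })
    return {"contents": contents}


if __name__ == "__main__":
    sample = [(5, b"<< /S /JavaScript /JS (app.alert\\(1\\)) >>",
               b"<< /S /JavaScript /JS (app.alert\\(1\\)) >>")]
    print(json.dumps(build_contents(sample), indent=2))
```

Keeping the encoded and decoded forms side by side makes it easy to compare what is stored in the file with what a reader would actually execute.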
Note: the raw output is preserved as much as possible, so be careful when displaying it in browsers.
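If you do need to render the raw output in a page, one cautious approach is to escape it first. The sketch below uses Python's standard `html.escape`; the `safe_for_browser` wrapper is a hypothetical helper, not something the tool ships.

```python
import html


def safe_for_browser(raw: str) -> str:
    # Escape &, <, > and both quote characters so that script or markup
    # embedded in a PDF object is shown as text instead of being run.
    return html.escape(raw, quote=True)
```

Escaping at display time keeps the stored data untouched, so nothing is lost for later analysis.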
While this does work rather well, there are still some issues:
- Large binary streams do not decode well and show up as a giant mess in the object output
- Objects with streams are not included in the “encoded” portion of the contents object because of the large binary stream issue
- Cascading filters applied to an object may produce strange results
I plan on tackling some of these issues soon, but I want to begin work on the next phase of this project. While I find the tool valuable in its current form, I have noticed it is a little annoying to use when working in the field. Since I haven’t written my own object parser and the tool depends on MongoDB, a quick “give me the score of this PDF” is quite difficult if I am on a random workstation. To solve this problem I plan on developing a RESTful API that anyone can use. Running the tool locally will still work, but the API will allow much more control over the data you receive when you submit a sample. The goal here is to take this JSON output and let you, the users, choose what you want back, whether that is everything or just a small portion. I think this will be really exciting and I can’t wait to get it done.
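The API does not exist yet, so the following is only a sketch of the "choose what you want back" idea. It assumes the report is a plain JSON-style dictionary; the top-level key names and the `select_fields` helper are hypothetical.

```python
def select_fields(report, fields=None):
    """Return the whole report, or only the requested top-level sections.

    `fields` is an optional list of section names; unknown names are
    ignored rather than raising, so a typo does not fail the request.
    """
    if not fields:
        return dict(report)
    return {name: report[name] for name in fields if name in report}


if __name__ == "__main__":
    # Hypothetical report layout, for illustration only.
    report = {"structure": {}, "hashes": {}, "scans": {}, "contents": []}
    print(select_fields(report, ["hashes", "scans"]))
```

An endpoint could pass a client-supplied list of section names straight into a helper like this, so someone on a random workstation gets only the score or hashes without pulling every decoded object.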