Throughout this year we have seen a rise in attacks using PDFs as a delivery or exploit mechanism. One thing I feel is lacking is a way to distinguish a malicious PDF from a known-good one. Tools like Virus Total or Wepawet solve part of the problem by providing a place where you can submit a file and check it against virus scanners or other proven tools. The problems with these sites, however, are low detection rates and a lack of sharing of their results.
The sites themselves only provide feedback on individual items, without showing how your PDF may be similar to a set of known-bad PDFs based on shared characteristics. While that information is better than nothing at all, it still leaves much to be desired. I have begun researching better ways to identify malicious PDFs, or at least classify them as suspicious, based on a collected dataset.
The goals of this project are fairly ambitious: the intention is to provide a service similar to Virus Total or Wepawet, with the key difference that this site would focus strictly on PDF files and release all results for free. This means that when you upload your PDF for analysis, not only do you get a report attempting to classify it as good or bad, but you also get to see results from other scans that are similar to your file.
By the completion of this project I would like to have the following tools ready for release:
- A web application that outputs a generic PDF score based on structural and internal characteristics
- Release of all collected results
- An API that can be used to query files and information remotely
- Generic scoring of a file based on its characteristics
- Tools to compile a dataset locally on your own servers (good, bad, and random)
- Command-line tools, run locally on a system, that provide the same functionality as the web application
- A Python class that can be used to analyze the structure and internals of a PDF
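To give a rough sense of the kind of structural analysis such a Python class might perform, here is a minimal sketch in the spirit of Didier Stevens' pdfid: counting PDF name keywords that frequently appear in malicious files. The keyword list, function names, and toy score here are my own illustrative assumptions, not the planned tool itself:

```python
import re

# Keywords often abused in malicious PDFs (similar to those pdfid tracks).
# This list and the scoring below are illustrative assumptions only.
SUSPICIOUS_KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
                       b"/Launch", b"/EmbeddedFile", b"/RichMedia"]

def keyword_counts(pdf_bytes):
    """Count each keyword, rejecting matches that continue as a longer
    PDF name (so /JS does not also match inside /JavaScript)."""
    return {kw.decode(): len(re.findall(re.escape(kw) + rb"(?![A-Za-z])",
                                        pdf_bytes))
            for kw in SUSPICIOUS_KEYWORDS}

def naive_score(counts):
    """Toy score: one point per suspicious keyword present at least once."""
    return sum(1 for c in counts.values() if c > 0)
```

A real classifier would weigh features like these against a labeled corpus rather than using a flat count, which is exactly the gap the good/bad/random datasets above are meant to fill.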
As I go through my research and findings, I will release the work openly and make it freely distributable. I plan on using work from others who have already put a lot of effort into this area (Rodrigo Montoro and Didier Stevens) to help bootstrap the project and get sample results published rather quickly. Any modifications made to their tools or research will be cited and released along with my own.