Using the filter I created in the previous postings, I decided to port it over to something more useful. A SQL statement works fine against data that already sits in a database, but it is not the most practical way to scan when doing ad-hoc queries. I needed to leverage data from Didier Stevens' PDFiD tool, so I added a class that does the scoring from there. In addition to the class, I added a command line argument so that you can get the output without digging through the code. Aside from scoring, I have also added options to dump to a database, output CSV and output raw XML. These options were used as I was testing, but are nowhere near perfect, so the tool should be used with caution.
Here is the general concept I followed when developing the scores:
- Primary Score
  - Filesize = 1
  - Obj/End Obj = 1
  - Stream/End Stream = 1
  - Pages = 1
- Secondary Score
  - j2bigdecode = .5
  - richmedia = 1
  - launch = .5
  - colors = .5
- Primary: 5 (requires attention)
- Secondary: 2.5 (suspicious when combined with a higher primary score)
The primary score is based on the findings highlighted in previous postings, with a score of 5 considered worth looking at. It should be noted that if a PDF does not match one of the filters, the score is decremented by 1. This makes it easy to distinguish a file that is worth looking at from one that is not.
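To make the add-and-decrement idea concrete, here is a minimal sketch in Python. It is not the code from the modified PDFiD.py: the check names and the `primary_score` function are hypothetical, only the four checks listed above are included, and I am assuming the decrement applies once per filter that does not match. The thresholds behind each filter come from the earlier postings and are not reproduced here, so the function simply takes a flag per check.

```python
# Hypothetical names for the four primary checks listed in this post.
PRIMARY_CHECKS = ("filesize", "obj_endobj", "stream_endstream", "pages")


def primary_score(matches):
    """Add 1 for every filter that matched and subtract 1 for every one that did not.

    `matches` maps a check name to True/False; a missing name counts as no match.
    """
    score = 0
    for name in PRIMARY_CHECKS:
        if matches.get(name, False):
            score += 1
        else:
            score -= 1  # assumption: one point is deducted per unmatched filter
    return score
```

Because a miss costs a point instead of just scoring zero, a file that matches every check and a file that matches only half of them end up far apart, which is what makes the two easy to tell apart at a glance.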
The secondary score was created to aid in the identification of suspicious files, but is not weighted as highly as the primary. In a few cases, malicious PDF files used features identified within the secondary score; while these did not appear to be a good basis for filtering, in some cases they did provide reason to look further. Unlike the primary score, the secondary score never deducts points: it only adds value and does not take away from the primary score findings.
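The secondary score can be sketched the same way. Again, this is an illustration and not the code shipped in the modified PDFiD.py: `SECONDARY_WEIGHTS` and `secondary_score` are hypothetical names, the weights are the ones from the list above, and I am assuming the caller passes in the set of secondary features that were seen in the file.

```python
# Weights taken from the secondary score list in this post.
SECONDARY_WEIGHTS = {
    "j2bigdecode": 0.5,
    "richmedia": 1.0,
    "launch": 0.5,
    "colors": 0.5,
}


def secondary_score(found):
    """Sum the weights of the secondary features present in `found`.

    Nothing is ever subtracted, so the result stays between 0 and 2.5.
    Names that are not in SECONDARY_WEIGHTS are ignored.
    """
    return sum(weight for name, weight in SECONDARY_WEIGHTS.items() if name in found)
```

For example, `secondary_score({"launch", "richmedia"})` gives 1.5, and a file with none of the features scores 0 without affecting whatever the primary score found.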
The scoring has been briefly tested on malicious files and a random dataset with good results, but it still needs to be applied to more data before it can be considered truly useful. The next step in improving this will be to scan a completely new dataset from known malicious domains (obtained by dirtyhands) and see if the scoring is able to find any malicious documents.
To download the modified version of PDFiD.py, click here.