• Generic Filter May Be Specific

    December 5, 2010

    In my last post I highlighted what I felt were interesting characteristics of malicious PDF files compared to my random dataset. Towards the end of that post I proposed the following potential filter based on the identified information:

    A file may be suspicious if it is less than 1MB, has fewer than 2 pages, follows the specification of closing all objects/streams 100% of the time, has at least one call to JavaScript and has a lower level of entropy.

    After further investigation it appeared that the entropy levels only looked lower because the malicious PDFs spanned a wide spectrum of entropy values, so for testing I removed that aspect of the generic filter and translated the rest into the following SQL statements (one for the random set and the other for the malicious set):

    select count(*) as total from pdf_data_dump where (page >= 1 and page <= 2) and obj/end_obj = 1.0 and stream/end_stream = 1.0 and filesize < 1887436.8 and (javascript > 0 or js > 0);

    select count(*) as total from pdf_structure where (page >= 1 and page <= 2) and obj/end_obj = 1.0 and stream/end_stream = 1.0 and filesize < 1887436.8 and (javascript > 0 or js > 0);
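    For readers who prefer code to SQL, the same filter can be sketched in Python against a single record of extracted PDF statistics. The dictionary keys below mirror the SQL column names and are my assumptions about the schema, not a published API:

```python
# 1.8 MB threshold used in the SQL above (1.8 * 1024 * 1024 bytes)
MAX_FILESIZE = 1887436.8

def is_suspicious(rec):
    """Apply the generic filter to one record of PDF statistics.

    rec is a dict with the same fields as the SQL columns:
    page, obj, end_obj, stream, end_stream, filesize, javascript, js.
    """
    return (
        1 <= rec["page"] <= 2                          # 1-2 pages
        and rec["end_obj"] > 0
        and rec["obj"] / rec["end_obj"] == 1.0         # every obj is closed
        and rec["end_stream"] > 0
        and rec["stream"] / rec["end_stream"] == 1.0   # every stream is closed
        and rec["filesize"] < MAX_FILESIZE             # under 1.8 MB
        and (rec["javascript"] > 0 or rec["js"] > 0)   # calls JavaScript
    )
```

    A small, well-formed document that calls JavaScript matches the filter, while the same document over the size threshold does not.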

    After running the queries I was quite surprised by the results, given how generic the filter is. Filesize tended to influence the results the most, so I recorded a few variations with different filesize thresholds and ended up with the following:

    Filesize = 1.8MB (1887436.8)

    • mal: 25/34 (73.5%)
    • ran: 58/15011 (0.39%)

    Filesize = 1.5MB (1572864)

    • mal: 24/34 (70.6%)
    • ran: 51/15011 (0.34%)

    Filesize = 1.3MB (1363148.8)

    • mal: 24/34 (70.6%)
    • ran: 49/15011 (0.33%)

    Filesize = 1MB (1048576)

    • mal: 24/34 (70.6%)
    • ran: 44/15011 (0.29%)
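    The percentages follow directly from the raw counts; for the 1.8 MB threshold, for example:

```python
# Detection and false-positive rates for the 1.8 MB threshold,
# computed from the counts quoted above.
mal_hits, mal_total = 25, 34
ran_hits, ran_total = 58, 15011

print(f"mal: {mal_hits}/{mal_total} = {mal_hits / mal_total:.1%}")   # 73.5%
print(f"ran: {ran_hits}/{ran_total} = {ran_hits / ran_total:.4%}")   # 0.3864%
```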

    From the data we can see that the filter detects at least 70% of the malicious files, with a very low potential false-positive rate on the random dataset. I do not believe this filter could be relied on for 100% detection, since attackers could easily modify the characteristics we are using to classify documents. It does, however, remain useful in the meantime, until we see a shift in how attackers construct their documents.

    What is interesting to point out is that when the samples were submitted to VirusTotal, detection rates varied among some of the more popular anti-virus engines (Kaspersky, NOD32, McAfee, Symantec). It is safe to say that while an anti-virus may stop some attacks (0day or not), it will not pick them all up, and it is highly impractical to run multiple anti-virus solutions to cover the gaps. It should also be pointed out that the most popular attack vector was to use one or more of the following buffer overflow attacks:

    So moving forward with these results, I plan on creating a generic scoring system to help classify whether a PDF should be looked at further. Combine that score with VirusTotal and Wepawet results and you should have a pretty decent idea of whether a PDF document is malicious. In the next day or two I should have the scoring system worked out, and I will release my malicious statistics (same format as the random data). The scoring system could also be ported to a Windows application that hooks the necessary Windows functions to scan a PDF before allowing Reader to process it.
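    As a rough illustration of what such a scoring system might look like, here is a minimal sketch in Python. The features come from the filter above, but the weights and threshold are illustrative assumptions, not the final design:

```python
# Illustrative weights for each feature of the generic filter.
# These values are assumptions for the sketch, not tuned results.
WEIGHTS = {
    "few_pages": 2,          # 1-2 pages
    "all_objs_closed": 1,    # obj/end_obj ratio == 1.0
    "all_streams_closed": 1, # stream/end_stream ratio == 1.0
    "small_file": 1,         # under the 1.8 MB threshold
    "has_javascript": 3,     # JavaScript is the strongest single signal
}
THRESHOLD = 5  # minimum score before a PDF warrants a closer look

def score_pdf(rec):
    """Sum feature weights for one record of PDF statistics."""
    score = 0
    if 1 <= rec["page"] <= 2:
        score += WEIGHTS["few_pages"]
    if rec["end_obj"] and rec["obj"] / rec["end_obj"] == 1.0:
        score += WEIGHTS["all_objs_closed"]
    if rec["end_stream"] and rec["stream"] / rec["end_stream"] == 1.0:
        score += WEIGHTS["all_streams_closed"]
    if rec["filesize"] < 1887436.8:
        score += WEIGHTS["small_file"]
    if rec["javascript"] > 0 or rec["js"] > 0:
        score += WEIGHTS["has_javascript"]
    return score

def needs_review(rec):
    """True if the score is high enough to flag for further analysis."""
    return score_pdf(rec) >= THRESHOLD
```

    A short, well-formed JavaScript-bearing sample would score 8 and be flagged, while a large multi-page document with unbalanced object markers would score well under the threshold.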