Before I get to the interesting news, I wanted to point out that I have released the malware database snapshot. It is essentially the same thing as the random dataset, but the content is derived from malware. I am refining some of my data before I release the full database, but expect to see that pop up soon. If you are interested in viewing the malware data, you can see it here.
Now on to the cooler part of the posting: testing results from the score tool. For testing I made a slight modification to the tool and added the ability to print the filename on the same line as the score itself. This made life a lot easier because I could then pipe all results to grep and keep only the high scores.
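For anyone who would rather do that filtering in Python than in grep, here is a rough sketch of the idea. It is not the modified tool itself: the `high_scores` helper, the "filename then score" line layout, and the threshold of 10 are all assumptions made for illustration, so adjust them to whatever your copy of the tool actually prints.

```python
def high_scores(lines, threshold):
    """Yield (filename, score) pairs whose score is at least `threshold`.

    Assumes each line looks like "<filename> <score>" (hypothetical layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so filenames containing spaces stay intact.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        filename, raw_score = parts
        try:
            score = float(raw_score)
        except ValueError:
            continue  # skip lines that do not end in a number
        if score >= threshold:
            yield filename, score


# Made-up sample lines, only to show the call; these are not real results.
sample = ["clean.pdf 2", "my report.pdf 14.5", "evil.pdf 30"]
for filename, score in high_scores(sample, threshold=10):
    print(f"{score:g}\t{filename}")
```

Passing an open file or `sys.stdin` instead of the sample list works the same way, since the function only needs an iterable of lines.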
I decided it was best to use a few different sets of data that I had lying around to ensure I was getting a wider range of results. On my machine I had roughly 1000 PDF files, including 600 downloaded from known MDL sites using dirtyhands, 100 downloaded from Google using bighands, and 100 given to me by other security researchers. Of all the files scanned, I collected 10 new malware samples and had one false positive. The false positive occurred in the MDL data and was verified to be clean using VirusTotal (VT) and Wepawet.
I need to do further analysis on the new malware samples collected, but one appeared to be using an exploit I had not seen within my other collected results. Another blog entry will be posted on the details of the newly collected files before they are added to the rest of the collection. Continuing on with the project, I want to begin collecting and storing more data and put the scoring tool to work (constant scans of files from Google) to identify any issues. I emailed Didier Stevens about the updates I made to his tool and I am currently waiting on a response to see if he would like to include them.