• 15K Random Dataset

    by  • December 1, 2010 • Uncategorized

    To gain an understanding of what PDF files from Google looked like, I needed to gather a fairly large dataset programmatically. Using a quick tool I wrote called Bighands, I was able to use the Google AJAX Search API with random search queries to download PDF files. Once downloaded, I used a modified version of Didier Stevens' PDFiD.py tool to scan each document, generate a hash, collect some additional statistics, and insert them into a flat database table.
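    The per-document scan step can be sketched roughly like this. This is a simplified, hypothetical stand-in for the modified PDFiD.py, not the actual tool: the keyword list is an illustrative subset, and the counting is a naive byte search (real PDFiD parses names more carefully, so e.g. "obj" below also matches inside "endobj").

    ```python
    import hashlib

    # Illustrative subset of the structural keywords PDFiD-style tools count.
    KEYWORDS = [b"obj", b"endobj", b"stream", b"endstream",
                b"/JS", b"/JavaScript", b"/OpenAction", b"/Page"]

    def scan_pdf_bytes(data):
        """Build one flat record (hash, size, keyword counts) for a PDF's raw bytes."""
        record = {
            "md5": hashlib.md5(data).hexdigest(),  # document hash for dedup/lookup
            "filesize": len(data),
        }
        for kw in KEYWORDS:
            # Naive literal count over the raw bytes; good enough for a sketch.
            record[kw.decode()] = data.count(kw)
        return record
    ```

    Each resulting record maps directly onto one row of a flat database table.
    
    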

    The dump file contains two tables with thousands of entries. PDF_DATA_DUMP holds ~15K entries with everything Didier's tool reports, along with file size and original filename. PDF_MEGA_DUMP holds ~6K entries, but without file size. The data itself is not terribly useful on its own: there is no way to guarantee where the files came from (a malicious domain or a clean one) or whether they were infected. The purpose of this collection was to test Bighands and to get a quick view of the relationship between objects and other structural pieces of a PDF across a large dataset.
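    As an example of the kind of structural question the tables support, here is a small sketch of a query over a table like PDF_DATA_DUMP. The column names (filesize, obj, js) are assumptions for illustration, not the actual schema, and sqlite3 stands in for MySQL just to keep the example self-contained:

    ```python
    import sqlite3

    # In-memory stand-in for the real MySQL table; schema is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PDF_DATA_DUMP (md5 TEXT, filesize INTEGER, obj INTEGER, js INTEGER)"
    )
    conn.executemany(
        "INSERT INTO PDF_DATA_DUMP VALUES (?, ?, ?, ?)",
        [("a" * 32, 1024, 12, 0), ("b" * 32, 2048, 30, 1)],
    )

    # E.g.: among documents containing /JS, how do object counts relate to size?
    rows = conn.execute(
        "SELECT md5, filesize, obj FROM PDF_DATA_DUMP WHERE js > 0 ORDER BY obj DESC"
    ).fetchall()
    ```

    The same query runs unchanged against the MySQL dump once the real column names are substituted.
    
    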

    Use the data at your own risk and do not consider it free of malicious content. All results came from Google, so there is some expectation that the files will be clean, but since this cannot be proven the data must always be treated as dirty.

    Click here to download the dataset (MySQL 5.0)