I needed a quick way to get PDF data onto my machine without hunting for documents individually. I decided the best way to get the data (hoping it was mostly clean) was to use the Google AJAX Search API to randomly query for PDF filetypes, using a seed query to help make each search unique. I also needed a way to pull the same sort of PDF files from known malicious links, and thus bighands/dirtyhands were born.
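The core of that approach can be sketched in a few lines. This is not bighands itself, just a minimal illustration of the idea: build a web-search URL against the (now retired) Google AJAX Search API with a random seed word plus a `filetype:pdf` restriction, then pull PDF URLs out of the JSON response. The seed words and function names here are illustrative assumptions, not from the actual tool.

```python
# Hypothetical sketch of the bighands query idea, not the real tool.
# Assumes the old Google AJAX Search API JSON envelope:
#   {"responseData": {"results": [{"url": ...}, ...]}}
import json
import random
import urllib.parse

# Illustrative seed words used to make each search a little different.
SEED_WORDS = ["report", "manual", "invoice", "thesis", "agenda"]

def build_query_url(seed=None):
    """Build an AJAX Search API URL restricted to PDF results."""
    seed = seed or random.choice(SEED_WORDS)
    params = urllib.parse.urlencode({
        "v": "1.0",
        "q": f"{seed} filetype:pdf",
    })
    return "http://ajax.googleapis.com/ajax/services/search/web?" + params

def extract_pdf_links(response_text):
    """Pull result URLs out of the API's JSON envelope, keeping only PDFs."""
    data = json.loads(response_text)
    results = data.get("responseData", {}).get("results", []) or []
    return [r["url"] for r in results if r.get("url", "").lower().endswith(".pdf")]
```

From there it is just a loop: fetch each query URL, feed the body to `extract_pdf_links`, and download whatever comes back, which is how a few instances can pile up thousands of PDFs a day.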
The tools themselves are not especially fancy, but they work. I was able to run multiple instances of bighands and download a couple thousand random PDF documents a day. Using dirtyhands in combination with the MDL host list, I downloaded ~600 PDF files hosted on domains that are, or once were, considered malicious. Dirtyhands ships with the list I used, along with a way to clean the raw host list from malwaredomainlist. The tools have issues (mentioned in the readme), but they were built quickly and meant simply to pull data at will.
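The host-list cleanup step is simple enough to sketch. This is an illustrative stand-in for what dirtyhands does, assuming the published malwaredomainlist hosts-file format: comment lines starting with `#` and entries of the form `127.0.0.1  host`. The function name is mine, not from the tool.

```python
# Minimal sketch of cleaning a malwaredomainlist-style hosts file
# down to bare hostnames; not the actual dirtyhands implementation.
def clean_mdl_hosts(raw_text):
    """Strip comments, blank lines, and the loopback prefix."""
    hosts = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blanks
        parts = line.split()
        # MDL entries look like "127.0.0.1  evil.example.com"
        if parts[0] == "127.0.0.1" and len(parts) > 1:
            hosts.append(parts[1])
        else:
            hosts.append(parts[0])
    return hosts
```

With a clean hostname list in hand, the download side is just iterating over hosts and grabbing any PDFs they serve.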
A friend of mine just passed me a link from Symantec discussing the detection of malicious files, such as Excel files, based on random data. While this overlaps with the PDF work I am already doing, I think it would be interesting to see other filetypes such as Excel, Word, etc. mined for structural statistics. Using bighands and dirtyhands, one could collect a pretty big dataset to query. Just a thought.