Targeted Document Gathering with Dpacker

    September 7, 2013

    As part of creating an incubation environment, I need to download a batch of “interesting material” relevant to the organization I am attempting to mimic. Two years ago I wrote a tool, bighands, that solved this problem nicely using the Google Ajax API. Unfortunately, that API has been dead and gone for quite a while now, but not all hope is lost: Google provides a replacement for programmatic search, simply called the Custom Search API, and it works like a charm.

    [Screenshot: setting up a Custom Search Engine in Google's interface]

    Before you can really make use of the API, you need to set up a CSE (Custom Search Engine) through Google’s interface (see above). This is the standard process for any Google API: activate the service, generate some keys and click a button or two. In my case, I wanted my engine to merely act as a global search repository, so my links looked like “*.com”, “*.net” and so on.
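    Once the engine exists, querying it is a plain HTTPS request against the Custom Search JSON API. Below is a minimal sketch using only the standard library; the key and engine ID placeholders stand in for the values generated during setup, and `fileType` is the API parameter that restricts results to one document format.

```python
# Minimal sketch of querying the Google Custom Search JSON API.
# api_key and cx_id are placeholders for the key and engine ID
# generated during CSE setup.
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key, cx_id, query, file_type=None, start=1):
    """Construct a Custom Search API request URL for one query,
    optionally restricted to a single file type (e.g. 'pdf')."""
    params = {"key": api_key, "cx": cx_id, "q": query, "start": start}
    if file_type:
        params["fileType"] = file_type
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)

def search(api_key, cx_id, query, **kwargs):
    """Fetch one page of results and return the parsed JSON body."""
    url = build_search_url(api_key, cx_id, query, **kwargs)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

    Each response page carries an `items` list whose entries include the result links, which is all a downloader really needs.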

    Once all that is taken care of, you can begin messing around with dpacker. The code was developed with my use cases in mind, so you may want to remove the pymongo-related content if you don’t want the additional baggage. Essentially, dpacker is meant to be a stand-alone class: you interface with it by setting up a query and choosing a location to save the output zip file. Everything else is randomized in the back-end and, again, fits my specific use case.

    [Screenshot: setting up a dpacker query and output location]
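    To make the query-plus-output-path interface concrete, here is a hypothetical sketch of that kind of stand-alone class; the names and the format list are illustrative, not dpacker's actual API.

```python
# Hypothetical sketch of a stand-alone packer class driven by a query
# and an output zip path; names are illustrative, not dpacker's own.
import random
import zipfile

class DocPacker:
    # document formats the packer rotates through; illustrative set
    FORMATS = ["pdf", "doc", "xls", "ppt"]

    def __init__(self, query, out_path):
        self.query = query          # search terms to feed the CSE
        self.out_path = out_path    # where the zip archive is written

    def plan(self):
        """Randomly decide how many files of each format to grab,
        mirroring the randomized back-end behaviour described above."""
        return {fmt: random.randint(1, 5) for fmt in self.FORMATS}

    def create(self, documents):
        """Write already-downloaded (name, bytes) pairs into the zip."""
        with zipfile.ZipFile(self.out_path, "w") as zf:
            for name, data in documents:
                zf.writestr(name, data)
        return self.out_path
```

    The point of the shape is that the caller only ever supplies the two inputs the post describes, a query and a save location, and everything else stays internal.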

    Once you “create” a dpack, the class takes your supplied query, randomly picks a set number of files of a specific type, downloads them, then swaps over to another format, and so on. Threading and queues were initially used to improve speeds, but were ultimately ditched in favor of the cleaner gevent library. Each request runs in its own greenlet, independent of the other requests, so no file download blocks another. A default kill timer of 60 seconds avoids any lagging downloads, and the entire process should really take no more than that. Below is an idea of the output traffic with debugging on.

    [Screenshot: dpacker output traffic with debugging enabled]

    If you decide to keep the mongodb code, a single additional call will spray your documents, along with their associated metadata, into GridFS for later reuse. What I like about this code is that I get a diverse collection of documents, based on a set of keywords, that should appear interesting to the operators I am attempting to engage. One thing to keep in mind, though: this API costs money once you go over a certain daily limit, so be careful how many times you run it in a day or you may be shelling out some cash.
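    A hypothetical sketch of that GridFS spray follows; the metadata fields are my own invention, and the pymongo import is deferred inside the function so the helper works without a running MongoDB.

```python
# Hypothetical sketch of pushing downloaded files plus metadata into
# GridFS. Field names in build_metadata are illustrative, and pymongo
# is imported lazily since it is an optional dependency here.
import datetime

def build_metadata(filename, query, content_type):
    """Metadata attached to each stored file; field names are my own."""
    return {
        "filename": filename,
        "query": query,
        "contentType": content_type,
        "fetched": datetime.datetime.utcnow().isoformat(),
    }

def spray_to_gridfs(documents, query, db_name="dpacker"):
    """documents: iterable of (filename, bytes, content_type) tuples.
    Extra keyword args to GridFS.put() are stored as file metadata."""
    from pymongo import MongoClient  # optional dependency
    import gridfs
    fs = gridfs.GridFS(MongoClient()[db_name])
    return [
        fs.put(data, **build_metadata(name, query, ctype))
        for name, data, ctype in documents
    ]
```

    With the metadata stored alongside each file, the same document pool can be re-queried and reused across environments instead of re-running (and re-paying for) the search.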