But is it Web Scale?

April 27, 2011

    I have been keeping a private collection of malicious PDFs (not for long – 😉 ) stored in my MongoDB repository. The collection started off with 30 documents and grew as I pulled PDFs off the Internet and the local network. I now have a couple thousand PDFs and MongoDB is still going strong. My major concerns with MongoDB were how it would scale given that my documents are not the most conventional (some sit right at the maximum document size limit) and how it would handle querying. Well, those concerns haven't really materialized. My collection now holds over 11,000 documents (still small) and querying is really no different.

    I decided to run a couple of map/reduce jobs to see whether their speed had changed with the larger document set, but I still got good timing.
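MongoDB map/reduce jobs are written as JavaScript functions that the server executes over the collection; the same map-emit / reduce-fold pattern can be sketched in plain Python. The documents and the `flags` field below are hypothetical stand-ins, not the actual schema of my collection:

```python
from collections import defaultdict

# Hypothetical documents standing in for PDF records in the collection.
docs = [
    {"name": "a.pdf", "flags": ["/JS", "/OpenAction"]},
    {"name": "b.pdf", "flags": ["/JS"]},
    {"name": "c.pdf", "flags": ["/Launch", "/JS"]},
]

def map_fn(doc):
    # Emit one (flag, 1) pair per suspicious PDF tag,
    # like calling emit() in Mongo's JavaScript map function.
    for flag in doc["flags"]:
        yield flag, 1

def reduce_fn(key, values):
    # Fold all emitted values for a single key,
    # like Mongo's JavaScript reduce function.
    return sum(values)

# Group emitted pairs by key, then reduce each group.
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        grouped[key].append(value)

result = {key: reduce_fn(key, values) for key, values in grouped.items()}
print(result)  # counts per flag
```

On the server the grouping step happens inside MongoDB itself; this sketch just shows why the job's cost grows with the number of emitted pairs rather than raw document count.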

    I still stand firm behind my decision to use MongoDB and think it has been one of the major factors my tool seems to work better than any other I have seen out there (web wise). As of now my only complaint is that some documents get dropped during batch ingesting without any warning or error; it's as if they were never even attempted for insert. I have only seen this behavior on bulk inserts (hundreds and thousands of documents), but it is difficult to deal with nonetheless.
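One way to catch silent drops like this is to ingest in fixed-size batches and compare collection counts before and after each batch. A minimal sketch, where `insert_batch` and `count_collection` are hypothetical stand-ins for the real driver calls (here backed by an in-memory list, not MongoDB):

```python
def chunked(items, size):
    # Yield successive fixed-size batches from a list.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def ingest_with_check(docs, insert_batch, count_collection, batch_size=500):
    """Insert docs in batches and tally any silently dropped documents.

    insert_batch and count_collection are stand-ins for the real
    database calls (a driver's bulk insert and collection count).
    """
    dropped = 0
    for batch in chunked(docs, batch_size):
        before = count_collection()
        insert_batch(batch)
        after = count_collection()
        # Anything the count didn't pick up was dropped without error.
        dropped += len(batch) - (after - before)
    return dropped

# Usage with an in-memory list standing in for the collection:
store = []
n = ingest_with_check(
    [{"i": i} for i in range(1200)],
    insert_batch=store.extend,
    count_collection=lambda: len(store),
    batch_size=500,
)
print(n)  # documents dropped (0 for the in-memory stand-in)
```

The count round-trips add overhead per batch, but at least the drops stop being invisible.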

    In other web scale news, there is something to be said about execution time when processing data. It goes without saying that code should be written in an efficient manner, but I think more often than not we like to take the easy way out. I mean, what is a second or two extra, right? Well, I happened to learn this the hard way and quickly realized that shaving seconds and even milliseconds greatly improved performance on bulk updates. Processing a single PDF file was taking roughly 10 to 30 seconds depending on the length of the file. This isn't terribly long, but multiply it by 11,000 and you quickly see that the time needs to be reduced.
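The arithmetic makes the point on its own. At those per-document times, a full pass over the collection runs from roughly a day to nearly four days:

```python
docs = 11_000
for per_doc in (10, 20, 30):                # seconds per PDF
    hours = docs * per_doc / 3600
    print(f"{per_doc}s/doc -> {hours:.1f} hours ({hours / 24:.1f} days)")
```

At the 30-second end that works out to about 92 hours, which lines up with the multi-day rebuild mentioned below. A one-second saving per document shaves roughly three hours off the whole run.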

    To handle this issue I started timing various functions and pieces of code I thought could be problem areas. I started high level to find the major focus areas, then worked down into more specific blocks as the true issues emerged. In the end I managed to shave several seconds off and identify the poor coding practices that were accounting for the extra time. Even now I am still going through pieces and trying to optimize wherever possible. Also, I should note that the ingesting time was never a big deal until I had to rebuild the entire collection from scratch. It was only then that I realized waiting 4 days to ingest everything was a bit too long.
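The timing itself doesn't need anything fancy. A small decorator that records wall-clock time per call is enough to spot the expensive functions before reaching for a full profiler; `parse_pdf` here is a hypothetical stand-in for a real processing step:

```python
import time
from functools import wraps

def timed(fn):
    # Wrap a function so each call records its wall-clock duration
    # on wrapper.last_elapsed for later inspection.
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = 0.0
    return wrapper

@timed
def parse_pdf(data):
    # Stand-in for a real per-PDF processing step.
    time.sleep(0.01)
    return len(data)

parse_pdf(b"%PDF-1.4 ...")
print(f"parse_pdf took {parse_pdf.last_elapsed:.3f}s")
```

Once the slow outer functions are identified this way, Python's built-in `cProfile` module can break the hot ones down line by line.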

    Expect big news to come soon. Oh, and if you are going to be at MongoDC then shoot me an email. I will be giving a small talk on Malware, MongoDB and Map/Reduce.