A few weeks ago I needed a way to grab a bunch of different articles matching a set of basic keywords for one of my projects. Essentially, I wanted to feed a whole bunch of RSS feeds into a program, have it download the articles, summarize them, and then store the important details in a database. The end goal was that I could provide a keyword such as “Gaza” and get back all the news stories covering the recent Gaza conflicts.
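To give a feel for the first stage of that pipeline, here is a minimal sketch of pulling titles and links out of an RSS 2.0 document. It uses only the standard library’s `xml.etree.ElementTree`; the actual script may well use a dedicated feed library instead, so treat the function name and shape as illustrative, not as the real code.

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Extract (title, link) pairs from a basic RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    # Every <item> element in an RSS feed represents one article.
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items
```

Each `(title, link)` pair would then be handed to a downloader and summarizer before anything touches the database.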
I ended up finding this article, where the author generated summaries by running frequency distributions against an article. It was a nice hack, but the solution didn’t scale well and didn’t take advantage of Python’s NLTK library. I decided to roll my own quick project and named it “bookworm”. Below is some output and a database entry from running the cronned Python script.
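The idea behind frequency-distribution summarization is simple: count how often each non-stopword appears, score every sentence by the summed frequency of its words, and keep the top-scoring sentences. Here is a stdlib-only sketch of that technique; the real script uses NLTK’s tokenizers, stopword corpus, and `FreqDist` in place of the regex splitting and the tiny hand-rolled stop list below, which are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny stand-in for nltk.corpus.stopwords (assumption, not the real list).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "as", "was", "with", "are", "be"}

def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences by word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    # Frequency distribution over the non-stopword vocabulary.
    freq = Counter(w for w in words if w not in STOPWORDS)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Rank sentences by score, then restore original order for readability.
    ranked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in ranked]
```

Sentences packed with the article’s most common content words float to the top, which is a crude but surprisingly serviceable summary.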
Checks are performed before each article is processed to ensure no duplicates make it into the database. I have identified some bugs in the summarization flow where it picks very poor words, but that is something you can tune to your liking. The script is cronned so it is always pulling down and summarizing new articles. Feel free to fork or download here.
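One cheap way to implement that duplicate check is to let the database enforce it: make the article URL the primary key and use `INSERT OR IGNORE`, so re-processing a feed can never insert the same story twice. The table name and columns below are assumptions for illustration, not the project’s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE articles (
    url TEXT PRIMARY KEY,  -- the URL doubles as the duplicate check
    title TEXT,
    summary TEXT)""")

def store(conn, url, title, summary):
    """Insert an article unless its URL is already in the table."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, summary) VALUES (?, ?, ?)",
        (url, title, summary))
    return cur.rowcount == 1  # True only if a new row was actually added
```

Because the constraint lives in the schema, the cron job can blindly re-run over old feeds and only genuinely new articles will land in the table.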