Hacking pdf-parser.py to Grab Data Objects

    December 31, 2010 • Uncategorized

    I can’t say enough good things about Didier Stevens and the PDF tools he has created. The couple of tools he has released pack in a lot of functionality, and they’re great for getting the data I want. As I stated in my last post, I need JSON and prefer that format for all my data, but some of the tools print their results straight to the screen, which makes the valuable data hard to reuse. To solve this I did an ugly but useful hack.

    When I looked into the pdf-parser.py file I had a couple hundred lines to filter through to work out what specifically was going on (no comments = sadness). After getting the general idea I spotted the areas where I needed to add my code. After a couple of tweaks I had a working version that would take each object, send it to my jsonify function, add the result to an array and then output it as JSON when all was said and done. The problem I still had was that this did not decouple me from the original program: I would essentially need to call it through the command line to get what I wanted.
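
    The collect-and-dump flow described above can be sketched roughly as follows. This is only an illustration, not the code from the hacked pdf-parser.py: the signature of jsonify, the dump_objects helper and the (id, version, content) tuples are all assumptions, as is the idea that the MD5 is taken over the raw object content. The output shape mirrors the JSON shown later in this post.

```python
import hashlib
import json


def jsonify(obj_id, version, content):
    """Summarize one PDF object as a JSON-friendly dict.

    Hypothetical helper: assumes `content` is the object's raw bytes
    and that the hash is computed over exactly those bytes.
    """
    return {
        "id": obj_id,
        "version": version,
        "length": len(content),
        "md5": hashlib.md5(content).hexdigest(),
    }


def dump_objects(raw_objects):
    """Turn an iterable of (id, version, content) tuples into one JSON string."""
    records = [jsonify(obj_id, version, content)
               for obj_id, version, content in raw_objects]
    return json.dumps({"objects": {"object": records}})


# Made-up sample objects, standing in for whatever the parser yields.
sample = [
    (1, 0, b"<< /Type /Catalog >>"),
    (2, 0, b"<< /Type /Pages >>"),
]
print(dump_objects(sample))
```

    Building the whole list first and serializing once at the end keeps the output a single valid JSON document, instead of a stream of fragments printed as each object is parsed.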

    After thinking about it, I remembered I could just import a Python file (module) into a new Python file. What I needed to do now was extract everything necessary to make pdf-parser work and give me the hashes. I threw this into a function, added the component I needed to get the output in JSON and had my results.
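
    One wrinkle with importing the script directly is that a file name containing a hyphen, like pdf-parser.py, cannot be used in a plain import statement. A minimal sketch of one way around that on a current Python 3 is to load the file by path with importlib. The load_module_from_path helper and the usage shown in the comment are assumptions for illustration; loading a script this way also runs its top-level code, so it only makes sense if the script guards its command-line entry point.

```python
import importlib.util


def load_module_from_path(module_name, path):
    """Load a Python source file as a module, whatever its file name is.

    Hypothetical helper: `module_name` is the name the module gets in
    this process, `path` is the location of the .py file on disk.
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the file's top-level code
    return module


# Hypothetical usage, assuming pdf-parser.py sits in the working directory:
# pdf_parser = load_module_from_path("pdf_parser", "pdf-parser.py")
```

    Copying the needed functions into a new file, as described above, avoids the naming problem entirely; loading by path is the alternative when you would rather leave the original script untouched.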

    {"objects": {"object": [{"length": 68, "version": 0, "id": 25, "md5": "b549e54140cc0aa0944a8eddc20569a4"}, {"length": 304, "version": 0, "id": 35, "md5": "896103845dbfbc117269b51ec55d55be"}, {"length": 176, "version": 0, "id": 68, "md5": "8a768590e37398de917dec3921110ebd"}, {"length": 193, "version": 0, "id": 26, "md5": "9908a9b7a1de4016ba89a8357190197a"}, {"length": 220, "version": 0, "id": 27, "md5": "82ebc3f82348778279dbf80124e5a140"}, {"length": 496, "version": 0, "id": 28, "md5": "5f02133a3bb4dcd3713216852b6ef8c3"}, {"length": 395, "version": 0, "id": 29, "md5": "0f5a56f8b4a7c22df483105a7194f160"}, {"length": 496, "version": 0, "id": 30, "md5": "92cf43c1257a02b9cde8bebb8fefb97b"}, {"length": 160, "version": 0, "id": 31, "md5": "17d41601c9dccf37382eb91c0315d204"}, {"length": 47, "version": 0, "id": 32, "md5": "8d7a28e99d782a34f8436bd04a4da6ac"}, {"length": 851, "version": 0, "id": 33, "md5": "e7531914b6a03a18847a1899d6c7121e"}, {"length": 213, "version": 0, "id": 34, "md5": "1f2d547361cc40d87db516005d04ed6f"}, {"length": 169, "version": 0, "id": 1, "md5": "71682d3c5ffbde42a6fddaf8ca0adb16"}, {"length": 1300, "version": 0, "id": 2, "md5": "6ee4205c92bff53681216f0bfa802422"}, {"length": 12062, "version": 0, "id": 3, "md5": "41e2db909692bebb638dc1279cf1ee9d"}, {"length": 1005877, "version": 0, "id": 4, "md5": "cf14d6655e04e123909d461cca3e4d0d"}, {"length": 30625, "version": 0, "id": 5, "md5": "22e976fbf741b293b21f49501a6d52aa"}, {"length": 3565, "version": 0, "id": 6, "md5": "c8d003b29928f2796096bc3eedf7a703"}, {"length": 459, "version": 0, "id": 7, "md5": "200d9996f90e03f8296cea0f388ece26"}, {"length": 231, "version": 0, "id": 8, "md5": "4dfcd943b31f047abce813e68aca75bb"}, {"length": 287, "version": 0, "id": 9, "md5": "b4c391711d784f86d072a165c40a1289"}]}}

    The above JSON is a dump of all objects within a PDF where each object contains an ID, version, length and most importantly, an MD5 hash. This information can now be added to the structure object (see last post) where I can query other malicious files for matching object hashes. 
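
    As a rough sketch of what such a query could look like, the snippet below compares two dumps in the format shown above and reports the MD5 hashes they share. The function names are assumptions, and the "suspect" dump is made up for the example (only its first hash is taken from the output above); how the structure object from the last post actually stores and queries these values is not shown here.

```python
import json


def object_hashes(dump):
    """Return the set of MD5 hashes found in one JSON dump (a string)."""
    data = json.loads(dump)
    return {obj["md5"] for obj in data["objects"]["object"]}


def shared_object_hashes(dump_a, dump_b):
    """Return the sorted list of object hashes present in both dumps."""
    return sorted(object_hashes(dump_a) & object_hashes(dump_b))


# First two records from the dump above.
known = json.dumps({"objects": {"object": [
    {"length": 68, "version": 0, "id": 25, "md5": "b549e54140cc0aa0944a8eddc20569a4"},
    {"length": 304, "version": 0, "id": 35, "md5": "896103845dbfbc117269b51ec55d55be"},
]}})

# Made-up second file that happens to reuse one object.
suspect = json.dumps({"objects": {"object": [
    {"length": 68, "version": 0, "id": 4, "md5": "b549e54140cc0aa0944a8eddc20569a4"},
    {"length": 10, "version": 0, "id": 5, "md5": "00000000000000000000000000000000"},
]}})

print(shared_object_hashes(known, suspect))
```

    Matching on the hash alone ignores the object ID, which is deliberate in this sketch: the same content can be stored under different object numbers in different files.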

    If you are interested in the code then please click here.
