PDFiD.py Output to JSON

    by  • December 10, 2010 • Uncategorized

    I want to store as much data as possible about the malware I am collecting, and I realized that a database would be the best place to keep it. One idea I had been toying with was preserving these detailed PDFiD scans without having to parse them into a database schema. While the XML output of a scan could be stored in the database, I felt JSON would be a much cleaner and lighter approach. I doubt putting a blob of JSON in a database is the most efficient thing to do, but it lets me store a whole scan that can be parsed later with a relatively small footprint.
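    As a rough illustration of the "JSON blob in a database" idea, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and the helper functions store_scan and fetch_scan are all hypothetical, and an in-memory SQLite database is only an assumption for the example; it is not the database setup used for this project.

```python
import json
import sqlite3

# In-memory database for illustration only; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, filename TEXT, scan_json TEXT)"
)


def store_scan(conn, filename, scan):
    """Serialize a scan (a dict or list) to JSON and store it as a single text blob."""
    cur = conn.execute(
        "INSERT INTO scans (filename, scan_json) VALUES (?, ?)",
        (filename, json.dumps(scan)),
    )
    conn.commit()
    return cur.lastrowid


def fetch_scan(conn, row_id):
    """Load a stored scan back into Python objects, or return None if it is missing."""
    row = conn.execute(
        "SELECT scan_json FROM scans WHERE id = ?", (row_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Usage: store a tiny stand-in for a scan and read it back.
row_id = store_scan(conn, "/tmp/1003.pdf", {"pdfid": {"isPdf": "True"}})
print(fetch_scan(conn, row_id))
```

    The whole scan lives in one column, so nothing has to be mapped to a schema up front; the trade-off is that querying individual fields means loading and parsing the blob first.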

    Rather than writing a function to build the JSON from scratch, I decided to just convert the XML document created by PDFiD to JSON. The mapping was very similar to the CSV output, but a lot cleaner. All I did was add a PDFiD2JSON function, add a new switch, and pass in the document. Those who are interested can grab the version of PDFiD.py that includes the JSON output switch here.
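    The conversion described above can be sketched roughly as follows. This is not the actual PDFiD2JSON code: the function name pdfid_xml_to_json, the sample XML, and its element and attribute names are assumptions made for illustration, and only a subset of the fields is mapped. The JSON key names follow the sample output in this post.

```python
import json
from xml.dom import minidom

# Hypothetical XML resembling a PDFiD scan; the element and attribute
# names here are assumptions, not copied from PDFiD itself.
SAMPLE_XML = """<PDFiD ErrorOccured="False" ErrorMessage="" Filename="/tmp/1003.pdf"
       Header="%PDF-1.4" IsPDF="True" Version="0.0.11" Entropy="">
  <Keywords>
    <Keyword Count="46" HexcodeCount="0" Name="obj"/>
    <Keyword Count="0" HexcodeCount="0" Name="/JS"/>
  </Keywords>
  <Dates>
    <Date Value="D:20100212134314-05'00" Name="/CreationDate"/>
  </Dates>
</PDFiD>"""


def pdfid_xml_to_json(xml_doc):
    """Convert a minidom Document shaped like SAMPLE_XML into a JSON string."""
    root = xml_doc.documentElement

    # Root attributes are copied over as strings.
    data = {
        "filename": root.getAttribute("Filename"),
        "header": root.getAttribute("Header"),
        "version": root.getAttribute("Version"),
        "isPdf": root.getAttribute("IsPDF"),
        "entropy": root.getAttribute("Entropy"),
        "errorOccured": root.getAttribute("ErrorOccured"),
        "errorMessage": root.getAttribute("ErrorMessage"),
    }

    # Each keyword element becomes a small dict with integer counts.
    data["keywords"] = {
        "keyword": [
            {
                "count": int(node.getAttribute("Count")),
                "hexcodecount": int(node.getAttribute("HexcodeCount")),
                "name": node.getAttribute("Name"),
            }
            for node in root.getElementsByTagName("Keyword")
        ]
    }

    # Dates are kept as name/value string pairs.
    data["dates"] = {
        "date": [
            {"name": node.getAttribute("Name"), "value": node.getAttribute("Value")}
            for node in root.getElementsByTagName("Date")
        ]
    }

    return json.dumps([{"pdfid": data}])


print(pdfid_xml_to_json(minidom.parseString(SAMPLE_XML)))
```

    Working from the already-built XML document means the scanning logic stays untouched; the converter only walks attributes and child elements.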

    Here is a sample of the expected output:

    [{"pdfid": { "countChatAfterLastEof": "0", "errorMessage": "", "dates": { "date": [ {"name": "/CreationDate", "value": "D:20100212134314-05'00"}, {"name": "/ModDate", "value": "D:20100212134454-05'00"} ] }, "filename": "/tmp/1003.pdf", "nonStreamEntropy": "4.965772", "header": "%PDF-1.4", "version": "0.0.11", "entropy": "", "errorOccured": "False", "isPdf": "True", "keywords": { "keyword": [ {"count": 46, "hexcodecount": 0, "name": "obj"}, {"count": 46, "hexcodecount": 0, "name": "endobj"}, {"count": 24, "hexcodecount": 0, "name": "stream"}, {"count": 24, "hexcodecount": 0, "name": "endstream"}, {"count": 2, "hexcodecount": 0, "name": "xref"}, {"count": 2, "hexcodecount": 0, "name": "trailer"}, {"count": 2, "hexcodecount": 0, "name": "startxref"}, {"count": 2, "hexcodecount": 0, "name": "/Page"}, {"count": 0, "hexcodecount": 0, "name": "/Encrypt"}, {"count": 1, "hexcodecount": 0, "name": "/ObjStm"}, {"count": 0, "hexcodecount": 0, "name": "/JS"}, {"count": 0, "hexcodecount": 0, "name": "/JavaScript"}, {"count": 0, "hexcodecount": 0, "name": "/AA"}, {"count": 0, "hexcodecount": 0, "name": "/OpenAction"}, {"count": 0, "hexcodecount": 0, "name": "/AcroForm"}, {"count": 0, "hexcodecount": 0, "name": "/JBIG2Decode"}, {"count": 0, "hexcodecount": 0, "name": "/RichMedia"}, {"count": 0, "hexcodecount": 0, "name": "/Launch"}, {"count": 0, "hexcodecount": 0, "name": "/Colors > 2^24"} ] }, "countEof": "2", "streamEntropy": "7.852893", "totalEntropy": "7.783445" } }] ;
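    To show how little code it takes to consume output like the sample above, here is a small sketch that loads a scan and reports which watched keywords appear at least once. The helper names load_scan and nonzero_keywords and the WATCHED set are hypothetical, and the input is a shortened stand-in for a real scan. The sample above ends with " ;", so the sketch strips a trailing semicolon before parsing in case it is part of the actual output.

```python
import json

# Illustrative selection of keyword names to watch; adjust to taste.
WATCHED = {"/JS", "/JavaScript", "/OpenAction", "/Launch", "/ObjStm"}


def load_scan(text):
    """Parse a scan blob, tolerating a trailing ';' after the JSON array."""
    cleaned = text.strip().rstrip(";").rstrip()
    return json.loads(cleaned)[0]["pdfid"]


def nonzero_keywords(scan, names):
    """Return {name: count} for watched keywords that occur at least once."""
    return {
        kw["name"]: kw["count"]
        for kw in scan["keywords"]["keyword"]
        if kw["name"] in names and kw["count"] > 0
    }


# Shortened stand-in for a full scan, keeping the same structure.
sample = (
    '[{"pdfid": {"filename": "/tmp/1003.pdf", "keywords": {"keyword": ['
    '{"count": 1, "hexcodecount": 0, "name": "/ObjStm"}, '
    '{"count": 0, "hexcodecount": 0, "name": "/JS"}]}}}] ;'
)

scan = load_scan(sample)
print(scan["filename"], nonzero_keywords(scan, WATCHED))
# prints: /tmp/1003.pdf {'/ObjStm': 1}
```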