PDFid.py Output to CSV

    December 2, 2010 • Uncategorized

    A lot of my initial data collection and storage has relied on pdfid.py from Didier Stevens. The tool is simple and quick, and it provides a lot of useful information in a single pass. After going through the code I saw that the output is parsed from a generated XML document. Having the data in XML is great for passing things around, but I wanted CSV output as another option. As time goes on it may be worthwhile to add switches to the tool to output the raw XML, JSON, or a text format like CSV. Being able to consume this data in aggregate, across many files, seems more useful than looking at individual files one at a time.

    def PDFiD2CSV(xmlDoc, force):
        result = '%s,' % (xmlDoc.documentElement.getAttribute('Filename'))  # filename
        result += '%s,' % xmlDoc.documentElement.getAttribute('Header')  # header
        for node in xmlDoc.documentElement.getElementsByTagName('Keywords')[0].childNodes:
            result += '%d,' % (int(node.getAttribute('Count')))
        if xmlDoc.documentElement.getAttribute('CountEOF') != '':
            result += '%d,' % (int(xmlDoc.documentElement.getAttribute('CountEOF')))  # count eof
        if xmlDoc.documentElement.getAttribute('CountCharsAfterLastEOF') != '':
            result += '%d,' % (int(xmlDoc.documentElement.getAttribute('CountCharsAfterLastEOF')))  # chars after eof
        if xmlDoc.documentElement.getAttribute('TotalEntropy') != '':
            result += '%s,' % (xmlDoc.documentElement.getAttribute('TotalEntropy'))  # total ent
            result += '%s,' % (xmlDoc.documentElement.getAttribute('TotalCount'))  # total ent bytes
        if xmlDoc.documentElement.getAttribute('StreamEntropy') != '':
            result += '%s,' % (xmlDoc.documentElement.getAttribute('StreamEntropy'))  # stream ent
            result += '%s,' % (xmlDoc.documentElement.getAttribute('StreamCount'))  # stream ent bytes
        if xmlDoc.documentElement.getAttribute('NonStreamEntropy') != '':
            result += '%s,' % (xmlDoc.documentElement.getAttribute('NonStreamEntropy'))
            result += '%s' % (xmlDoc.documentElement.getAttribute('NonStreamCount'))
        return result
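    Two things can go wrong when rows from the function above are aggregated across many files. A filename containing a comma splits into extra fields, and any attribute that is missing shortens the row so the columns no longer line up. The following is a minimal sketch of one way around both, not part of pdfid.py itself. It uses Python's csv module for quoting and always emits every column. The names pdfid_xml_to_row, rows_to_csv, OPTIONAL_ATTRS and SAMPLE_XML are hypothetical, and the sample XML is only assumed to resemble what the function above reads; real PDFiD output may differ.

```python
import csv
import io
from xml.dom import minidom

# Hypothetical sample shaped like the attributes the function above reads;
# real PDFiD XML may contain more attributes or differently named elements.
SAMPLE_XML = (
    '<PDFiD Filename="a,b.pdf" Header="%PDF-1.4" CountEOF="1" '
    'CountCharsAfterLastEOF="0">'
    '<Keywords>'
    '<Keyword Name="obj" Count="7"/>'
    '<Keyword Name="/JS" Count="0"/>'
    '</Keywords>'
    '</PDFiD>'
)

# Attributes that may be absent; they are always emitted, in this order.
OPTIONAL_ATTRS = [
    'CountEOF', 'CountCharsAfterLastEOF',
    'TotalEntropy', 'TotalCount',
    'StreamEntropy', 'StreamCount',
    'NonStreamEntropy', 'NonStreamCount',
]


def pdfid_xml_to_row(xml_doc):
    """Return one fixed-width row (list of strings) for a parsed document."""
    root = xml_doc.documentElement
    row = [root.getAttribute('Filename'), root.getAttribute('Header')]
    keywords = root.getElementsByTagName('Keywords')[0]
    for node in keywords.childNodes:
        # Skip whitespace text nodes; only elements carry a Count attribute.
        if node.nodeType == node.ELEMENT_NODE:
            row.append(node.getAttribute('Count'))
    # getAttribute returns '' for a missing attribute, so the column stays
    # in place as an empty field instead of disappearing.
    row.extend(root.getAttribute(name) for name in OPTIONAL_ATTRS)
    return row


def rows_to_csv(rows):
    """Serialize rows to CSV text, letting csv.writer handle quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()


doc = minidom.parseString(SAMPLE_XML)
row = pdfid_xml_to_row(doc)
csv_text = rows_to_csv([row])
print(csv_text, end='')
```

    Because every row has the same number of fields, the output from many PDFs can be concatenated into one file and loaded into a spreadsheet or database without per-row fixes.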