Update: VirusTotal reached out on Twitter with the following:
@9bplus hi there. your proposal looks reasonable to me, but we can’t implement that changes in the short term. maybe in a 2.0 release
@jcanto > awesome. Do you guys have a time of when that may occur or a roadmap you would share? I really appreciate the quick response.
@9bplus not really sorry :/ we’re working on other matters a bit more urgent so I can’t really tell you when we’ll be able to handle that
@jcanto > I understand. I’ll re-parse into my format until then. Thanks again and now I’m glad to know I can reach out to you guys.
@9bplus no, thanks to you. not everybody makes so constructive feedback, believe me
In the next day or so I am going to re-parse the VirusTotal results into the proposed format myself and include that in malpdfobj. I'll post an update when this is done; in the meantime, you can download the latest changes by syncing up with GitHub.
I have been writing map/reduce jobs for the past few days now (more on this later) and have made my way to the scan results for each PDF file I have. For most of them I have a report from VirusTotal that I get using their Python API. It makes me happy that they have an API at all, and I am grateful for that in itself, but the output format is ugly and a royal pain in the ass to work with. One struggles to even use the words "format" and "result" in the same sentence when thinking about it. I have proposed a new format that is easy to parse without wanting to kill yourself or completely ditch the results, but before I get into that, let me justify my gripes.
Here is the current format in all its glory:
Why is this format better, you ask? Well, aside from the obvious improvement in getting at the data you actually want, it accounts for future enhancements or extra data that anti-virus engines may provide. The report itself is now a top-level object with the report data contained in it. Our date now has a label that actually tells us what it means. We also now have a scans object holding an array of results. The results all share the same shape: each contains a scanner key and a result key. The scanner corresponds to the anti-virus engine used, while the result gives you what its output was.
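To make that structure concrete, here is a minimal sketch of how a report in the proposed layout could be consumed from Python. The keys `report`, `scans`, `scanner`, and `result` follow the description above; the date label `scan_date`, the engine names, and all of the values are placeholders I made up for illustration, not real VirusTotal output.

```python
import json

# Illustrative sample only. "scan_date", the engine names, and the values
# are made-up placeholders; "report", "scans", "scanner", and "result"
# follow the structure described in the post.
SAMPLE = """
{
  "report": {
    "scan_date": "2010-01-01 00:00:00",
    "scans": [
      {"scanner": "ExampleAV", "result": "Exploit.PDF.Generic"},
      {"scanner": "OtherAV", "result": ""}
    ]
  }
}
"""


def detections(raw):
    """Return {scanner: result} for every scanner with a non-empty result."""
    report = json.loads(raw)["report"]
    return {s["scanner"]: s["result"] for s in report["scans"] if s["result"]}


print(detections(SAMPLE))
```

Because every entry in `scans` has the same two keys, pulling out the detections is a single pass over a labeled list rather than a hunt for the right index.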
So how does this take future anti-virus changes into account, you ask? Aside from scan results, we may want to know how long an anti-virus engine took to run, or maybe we want to see the confidence it places on its verdict for the sample. I am not certain whether these are available in the output that VirusTotal gets, but if they are, they can easily be included within each scan result. Even if some of the results are missing a given key, it won't break the parsing of the data. Overall, it is much easier to deal with because I know my results are now in scans rather than at some random, unlabeled index that I need to work to locate.
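A small sketch of why a missing key does no harm: the consumer reads the two required keys directly and treats anything else as optional. The `runtime_ms` field is hypothetical (as noted, I don't know whether VirusTotal has this data), and treating an empty result as "no detection" is my assumption as well.

```python
def summarize(scan):
    """Build a one-line summary of a single scan result."""
    # Assumption: an empty result string means the engine flagged nothing.
    line = f'{scan["scanner"]}: {scan["result"] or "no detection"}'
    # "runtime_ms" is a hypothetical optional field; .get() returns None
    # when it is absent, so older or sparser results still parse.
    runtime = scan.get("runtime_ms")
    if runtime is not None:
        line += f" ({runtime} ms)"
    return line


scans = [
    {"scanner": "ExampleAV", "result": "Exploit.PDF.Generic", "runtime_ms": 120},
    {"scanner": "OtherAV", "result": ""},  # no optional field here
]
for scan in scans:
    print(summarize(scan))
```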
Some may read this and tell me to pipe down. I mean, VirusTotal is a free service and they don't have to provide an API at all, so I should take what I can get. Yes, I agree with that, and maybe this post is a little attacking in nature, but it doesn't change the fact that the implementation is poor and easy to fix. This is a simple change of structure and a label or two in the generation of the results. It is not difficult to see that the data is hard to work with, so rather than accepting it, it should be changed.
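To give a sense of how small that change is, here is a rough sketch of the re-parse. It assumes the existing report arrives as a two-element list, an unlabeled date string followed by a mapping of scanner name to result; that input shape, the function name `reparse`, and the `scan_date` label are all assumptions for illustration, so adjust them to whatever the API actually hands back.

```python
def reparse(old_report):
    """Convert an assumed [date, {scanner: result}] list into the proposed layout."""
    # Assumed input shape: index 0 is the date, index 1 is the results mapping.
    scan_date, results = old_report
    return {
        "report": {
            "scan_date": scan_date,  # hypothetical label for the date
            "scans": [
                {"scanner": name, "result": result}
                for name, result in sorted(results.items())
            ],
        }
    }


old = ["2010-01-01 00:00:00", {"OtherAV": "", "ExampleAV": "Exploit.PDF.Generic"}]
new = reparse(old)
print(new["report"]["scans"][0])
```

Sorting by scanner name is only there to keep the output stable between runs; nothing in the proposed format requires it.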