A couple of months back I remember reading a post from Symantec about visualizing entropy to identify infected Microsoft documents. At the time it didn’t really dawn on me to visualize the PDF samples I had, but I did take a brief look at whether entropy could be useful in detecting malicious PDFs. Specifically, I compared entropy values (stream, non-stream and total) between a public dataset and a completely malicious dataset. During that period I found entropy to be quite useless for detecting malicious samples, as it never showed a pattern.
While looking back through some of my testing, I realized that the data mentioned above was always a single composite entropy value for the whole file, not a measure of how entropy changed over the course of the file. I wanted to see the randomness throughout the file, so I quickly hacked up two different entropy generators. The first computed entropy for each line of the PDF, whereas the second used byte chunks. The interesting thing about the line-based entropy was that while it was no help in identifying anything malicious, it was able to find PDFs whose content matched everywhere except the payload. I thought this was pretty cool, and while the byte chunks did the same thing, the result was not as well defined.
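For readers who want to try this themselves, here is a minimal sketch of the two approaches. It is not the original code: it assumes Shannon entropy over byte values (the post does not say which measure was used), and the function names and the 256-byte default chunk size are illustrative choices.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    # Sum p * log2(1/p) over every byte value that occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_by_line(data: bytes) -> List[float]:
    """One entropy value per line of the file (hypothetical helper)."""
    return [shannon_entropy(line) for line in data.splitlines()]


def entropy_by_chunk(data: bytes, chunk_size: int = 256) -> List[float]:
    """One entropy value per fixed-size chunk; 256 bytes is an assumed default."""
    return [
        shannon_entropy(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]
```

Either list can be plotted against its index to see how randomness changes across a file. Reading the PDF in binary mode (`open(path, "rb").read()`) keeps compressed streams intact before they are passed to these functions.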
Visual Entropy using Bytes