Consuming Raw or Unstructured Data is Bad for your Health

    December 27, 2011 • Uncategorized

    No, seriously, it is. Think about all the time wasted reformatting someone else’s data, or dealing with management requesting some magical composite report built from five of your primary tools, none of which produce structured output you can consume. Late nights, stress and banging your head against the wall surely can’t be good for your health.

    Jokes aside though, the two examples above are just a fraction of the problems you run into with raw or unstructured data. The usual term for taking that raw data and making it usable by your tools is a “hack”. This is the same word used for when someone breaks into your system (yes, I know there is an old-school definition, but you know what I mean), and it always feels dirty when you have to use it to describe your own application.

    So who is causing this problem, you ask? It’s the developers making the tools you use. To keep this argument simple, I want to focus strictly on open source software because, at the end of the day, commercial products are slow to change and their vendors are unlikely to care about empowering users to take their data and feed it to another product.

    A tool is defined as “anything used as a means of accomplishing a task or purpose”. I think it is fair to say a good portion of the tools we currently use for analysis are not truly tools, because they don’t fully accomplish the task at hand: we are not able to easily persist the output data and use it later for comparison. Implementing such a feature starts with structured output like XML, JSON or even CSV (I don’t consider CSV structured, but some do).
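
    As a rough sketch of what such an output option might look like, here is a small Python example. Everything in it is an assumption made for illustration: the record fields (src, dst, dport, bytes), the render() and main() helpers and the --format flag are hypothetical and do not come from any particular PCAP tool.

```python
import argparse
import csv
import io
import json

# Hypothetical findings a PCAP tool might report; the field names are made up.
RECORDS = [
    {"src": "10.0.0.5", "dst": "192.0.2.10", "dport": 80, "bytes": 1432},
    {"src": "10.0.0.5", "dst": "198.51.100.7", "dport": 443, "bytes": 820},
]


def render(records, fmt="text"):
    """Return the records as human-readable text, JSON, or CSV."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        if not records:
            return ""
        buf = io.StringIO()
        # Column order follows the keys of the first record.
        writer = csv.DictWriter(
            buf, fieldnames=list(records[0]), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    if fmt == "text":
        return "\n".join(
            f"{r['src']} -> {r['dst']}:{r['dport']} ({r['bytes']} bytes)"
            for r in records
        )
    raise ValueError(f"unknown format: {fmt}")


def main(argv=None):
    """Parse a hypothetical --format flag and print the rendered records."""
    parser = argparse.ArgumentParser(description="Illustrative output option")
    parser.add_argument(
        "--format", choices=["text", "json", "csv"], default="text"
    )
    args = parser.parse_args(argv)
    print(render(RECORDS, args.format))
```

    Calling main(["--format", "json"]) prints the same findings as a JSON array, which can be saved to disk and loaded again later for comparison instead of being scraped out of free-form text.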

    To illustrate what I am talking about, take a look at the following two output snippets from PCAP tools:




    In terms of malware analysis and threat research, times have changed significantly. Most anti-virus companies, commercial groups and contractors are bound by legal contracts, agreements and processes that make them less agile, whereas independent researchers, or those fortunate enough to be in a loosely organized group, can move much faster with little headache. I believe that individuals and small teams can begin to overtake these larger commercial entities by streamlining processes and chaining tools together to provide faster insight with more detail.

    If we can begin to improve our tool output, we can start to identify new patterns and share our data more easily. Tools could consume each other’s data and make sense of it even when they are not directly related, and malware analysis overall could become much easier, with richer results. Open source software is typically free and provided with no guarantees, so it is hard to beat up on the developer, but an option to structure your output is something that should be included.
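
    To make the idea of unrelated tools consuming each other’s data a little more concrete, here is a minimal Python sketch. It assumes two hypothetical tools, a PCAP summarizer and a malware sandbox, that both emit JSON; the field names (dst, contacted, sample) and the correlate() helper are invented for illustration only.

```python
import json

# Hypothetical JSON output from two unrelated tools; all fields are made up.
PCAP_JSON = """[
  {"src": "10.0.0.5", "dst": "192.0.2.10", "dport": 80},
  {"src": "10.0.0.5", "dst": "198.51.100.7", "dport": 443}
]"""

SANDBOX_JSON = """[
  {"sample": "a.exe", "contacted": ["192.0.2.10"]},
  {"sample": "b.dll", "contacted": ["203.0.113.9"]}
]"""


def correlate(pcap_json, sandbox_json):
    """Attach sandbox sample names to network flows that share an address."""
    flows = json.loads(pcap_json)
    reports = json.loads(sandbox_json)

    # Index the sandbox reports by every address a sample contacted.
    samples_by_ip = {}
    for report in reports:
        for ip in report["contacted"]:
            samples_by_ip.setdefault(ip, []).append(report["sample"])

    # Keep only the flows whose destination appears in a sandbox report.
    return [
        {**flow, "samples": samples_by_ip[flow["dst"]]}
        for flow in flows
        if flow["dst"] in samples_by_ip
    ]


matches = correlate(PCAP_JSON, SANDBOX_JSON)
```

    Because both inputs are structured, the join is a dictionary lookup on a shared field rather than a pile of regular expressions written against two different text layouts.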

    Most of my tools already include these options, but going forward I intend to make this a requirement before I release anything, and it would be nice to see others do the same. Cheers to sharing data and empowering analysts.