• ClassyPDF Tool Up for Grabs

    by  • July 31, 2012 • Uncategorized

    Back at the tail end of April I posted about data mining PDF data in order to classify whether or not a document was malicious. In that post I talked about data and an API, but never released the tool to the public. After a few months and some closed testing, I now feel comfortable releasing it.

    The only difficulty I had with the tool was building the Orange library, mainly due to poor build documentation for Ubuntu. I have included a PDF document in the code repository that explains the tool in some detail and covers considerations for using it. Please feel free to modify the code as you see fit and explore what Orange has to offer.

    Below are the additional notes from the PDF just in case you don’t want to open a PDF from me!

    Building Orange on Ubuntu

    Swapping classifiers

    Classifiers can be adjusted near the top of the judge.py file, where the standard KNN classifier is defined. See the documentation on the Orange website for other available classifiers. It is possible to combine multiple classifiers, which may give better results.
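
    As a rough illustration of combining classifiers, here is a minimal majority-vote sketch in plain Python. It does not use Orange and is not the code in judge.py: the three rule functions and the feature names (js_count, obj_count, open_action) are hypothetical stand-ins for trained classifiers.

```python
from collections import Counter

# Hypothetical stand-ins for trained classifiers: each takes a feature
# dict and returns a label. Real ones would come from Orange learners.
def has_javascript(features):
    return "malicious" if features.get("js_count", 0) > 0 else "benign"

def many_objects(features):
    return "malicious" if features.get("obj_count", 0) > 100 else "benign"

def has_open_action(features):
    return "malicious" if features.get("open_action", 0) > 0 else "benign"

# Swap or extend this list to change which classifiers are consulted.
# An odd number of voters avoids ties between two labels.
CLASSIFIERS = [has_javascript, many_objects, has_open_action]

def majority_vote(features, classifiers=CLASSIFIERS):
    """Return the label predicted by most of the classifiers."""
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]

sample = {"js_count": 2, "obj_count": 40, "open_action": 1}
print(majority_vote(sample))  # two of three vote "malicious"
```

    Keeping the list of classifiers in one place near the top of the file mirrors the idea above: changing the classifier means editing one definition rather than hunting through the code.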

    False positives

    Apart from feature extraction, not much has been done in regards to tuning the classifier or pruning features. The best way to deal with false positives is to define an acceptable threshold for the classifier decision. Workflows should be put in place so that properly or manually classified documents are appended to the end of the complete dataset. This will improve the classifier over time and ensure it continues functioning as intended.
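
    The threshold and the append step could look something like the sketch below. This is an assumption-laden example, not the tool's code: the 0.8 cutoff, the function names, and the three-value feature row are all made up, and an in-memory buffer stands in for the real dataset file.

```python
import csv
import io

def decide(p_malicious, threshold=0.8):
    """Flag a document only when the classifier is confident enough.

    The 0.8 default is an arbitrary example value; tune it to the
    false-positive rate you can tolerate.
    """
    return "malicious" if p_malicious >= threshold else "benign"

def append_labeled(handle, feature_values, label):
    """Append one verified document as a tab-separated row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(list(feature_values) + [label])

# In-memory stand-in for the dataset; a real workflow would open the
# tab file in append mode instead.
dataset = io.StringIO()
append_labeled(dataset, [2, 40, 1], "malicious")

print(decide(0.65))  # below the cutoff, so not flagged
print(decide(0.93))  # above the cutoff, so flagged
print(repr(dataset.getvalue()))
```

    Raising the threshold trades false positives for missed detections, so it is worth checking both numbers whenever the cutoff changes.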

    Feature relevance

    The current classifier data includes 35 total features to represent the PDF. Feature relevance can be calculated against the dataset to identify features that provide little to no value. It is recommended to adjust these based on your working needs. Note that this should be done through the Orange library rather than by deleting columns from the actual data tab file.
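
    To make "feature relevance" concrete, here is a small plain-Python sketch of one common measure, information gain, on discrete features. Orange has its own feature scoring, which is what should be used on the real dataset; the feature names and the four sample rows here are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting the rows on one discrete feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Invented sample: has_js separates the classes, has_title does not.
rows = [
    {"has_js": 1, "has_title": 1},
    {"has_js": 1, "has_title": 0},
    {"has_js": 0, "has_title": 1},
    {"has_js": 0, "has_title": 0},
]
labels = ["malicious", "malicious", "benign", "benign"]

scores = {f: information_gain(rows, labels, f) for f in rows[0]}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

    A feature that scores near zero, like has_title in this toy data, tells the classifier almost nothing about the class and is a candidate for exclusion.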

    Testing accuracy 

    Several tests for accuracy exist within the Orange framework. It is recommended that these tests be run on the data after any change, major or minor. This testing ensures that newly introduced data does not severely skew the outcome of the classifier.
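
    One such test is cross-validation. The sketch below shows the idea in plain Python with a tiny k-nearest-neighbors classifier; it is only an illustration of the technique, not Orange's implementation, and the two-feature data is made up and deliberately well separated.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Label a query by majority vote of its k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, folds=5, k=3):
    """Fraction of rows predicted correctly when each row is held out once."""
    correct = 0
    for i in range(folds):
        # Every folds-th row starting at i is held out for testing.
        test = data[i::folds]
        train = [row for j, row in enumerate(data) if j % folds != i]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)

# Made-up feature vectors: benign rows cluster near 0, malicious near 50.
data = [((float(i), float(i % 3)), "benign") for i in range(10)]
data += [((float(i + 50), float(i % 3)), "malicious") for i in range(10)]

print(cross_val_accuracy(data))
```

    Running a check like this before and after adding new rows to the dataset gives a quick signal if the new data has hurt the classifier.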