Back at the tail end of April I posted about mining PDF data in order to classify whether or not a document was malicious. In that post I talked about data and an API, but never released the tool to the public. It has been a few months since the post, and after some closed testing I now feel comfortable releasing the tool for general use.
The only difficulty I had with the tool was building the Orange library, and that was mainly due to poor build documentation for Ubuntu. I have included a PDF document in the code repository that explains the tool in some detail and covers considerations for using it. Please feel free to modify the code as you see fit and explore what Orange has to offer.
Below are the additional notes from the PDF just in case you don’t want to open a PDF from me!
Building Orange on Ubuntu
- svn co http://orange.biolab.si/svn/orange/trunk/orange/
- svn co http://orange.biolab.si/svn/orange/trunk/source
- Follow the rest of the directions here
Classifiers can be adjusted near the top of the judge.py file, where the standard KNN classifier is defined. See the documentation on the Orange website for other available classifiers. It is possible to combine multiple classifiers, which may give better results.
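One simple way to combine classifiers is a majority vote. The sketch below is plain Python and does not use Orange or the code in judge.py. The three stand-in classifiers and the feature names (js_count, page_count, size_kb) are hypothetical and exist only to show the idea.

```python
from collections import Counter


def majority_vote(classifiers, features):
    """Return the label chosen by most classifiers.

    On a tie, the label that was voted for first wins.
    """
    votes = [classify(features) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-in classifiers: each takes a dict of features and
# returns a label. Real ones would be trained models, not fixed rules.
def by_javascript(features):
    return "malicious" if features["js_count"] > 0 else "benign"


def by_pages(features):
    return "malicious" if features["page_count"] <= 1 else "benign"


def by_size(features):
    return "malicious" if features["size_kb"] < 50 else "benign"


ensemble = [by_javascript, by_pages, by_size]
sample = {"js_count": 2, "page_count": 1, "size_kb": 400}
label = majority_vote(ensemble, sample)  # two of three vote "malicious"
```

An odd number of classifiers avoids ties when there are only two labels.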
Apart from feature extraction, not much has been done in regard to tuning the classifier or pruning features. The best way to deal with false positives is to define an acceptable threshold for the classifier decision. Workflows should be put in place so that documents whose classification has been confirmed, whether by the tool or manually, are appended to the end of the complete dataset. This will improve the classifier over time and ensure it continues functioning as intended.
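As a rough sketch of both ideas, assuming the classifier can report a probability that a document is malicious: the 0.8 default threshold, the function names, and the row layout (feature values followed by the label in the last column) are assumptions for illustration, not what the released tool does. The column order would have to match the existing dataset.

```python
import csv


def decide(p_malicious, threshold=0.8):
    """Flag a document only when the malicious probability reaches the threshold.

    The 0.8 default is an arbitrary example; raising it trades false
    positives for false negatives.
    """
    if not 0.0 <= p_malicious <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "malicious" if p_malicious >= threshold else "benign"


def append_confirmed(path, feature_values, label):
    """Append one confirmed document as a tab-separated row.

    Assumes the label is the last column and that feature_values are
    already in the same order as the dataset's existing columns.
    """
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(list(feature_values) + [label])
```

For example, decide(0.65) returns "benign" with the default threshold, while decide(0.65, threshold=0.6) returns "malicious".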
The current classifier data includes 35 features to represent each PDF. Feature relevance can be calculated against the dataset to identify features that provide little to no value. It is recommended to adjust the feature set based on your working needs. Note that this should be done through the Orange library and not by deleting features from the actual data tab file.
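To give a feel for what a relevance score measures, here is a minimal sketch of information gain, one common relevance measure, for a discrete feature. It is written in plain Python rather than with Orange's own scoring, and the two toy features (has_js, has_title) are made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Drop in label entropy after splitting the rows by a feature's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Made-up toy data: four documents, two hypothetical binary features.
labels = ["malicious", "malicious", "benign", "benign"]
has_js = [1, 1, 0, 0]     # separates the classes perfectly
has_title = [1, 0, 1, 0]  # tells us nothing about the class

gain_js = information_gain(has_js, labels)
gain_title = information_gain(has_title, labels)
```

A feature whose gain stays near zero across the dataset is a candidate for removal.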
Several tests for accuracy exist within the Orange framework. It is recommended that these tests be run on the data after any change, major or minor. This testing ensures that no newly introduced data severely influences the outcome of the classifier.
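Cross-validation is one such accuracy test. The sketch below shows the idea in plain Python rather than through Orange. The function names, the tiny 1-nearest-neighbour trainer, and the six toy rows are all assumptions for illustration.

```python
def k_fold_accuracy(rows, labels, train, k=5):
    """Average accuracy over k folds.

    `train(rows, labels)` must return a predict(row) function.
    """
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    scores = []
    for fold in range(k):
        # Every k-th row, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, len(rows), k))
        train_rows = [row for i, row in enumerate(rows) if i not in test_idx]
        train_labels = [lab for i, lab in enumerate(labels) if i not in test_idx]
        predict = train(train_rows, train_labels)
        hits = sum(predict(rows[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


def train_1nn(train_rows, train_labels):
    """Toy 1-nearest-neighbour trainer using squared Euclidean distance."""
    def predict(row):
        distances = [
            sum((a - b) ** 2 for a, b in zip(row, other)) for other in train_rows
        ]
        return train_labels[distances.index(min(distances))]
    return predict


# Made-up data: two well-separated clusters of numeric feature vectors.
rows = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
labels = ["benign"] * 3 + ["malicious"] * 3
accuracy = k_fold_accuracy(rows, labels, train_1nn, k=3)
```

Running the same check before and after adding new rows makes it easy to spot a change that drags accuracy down.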