Anomaly detection using Python [closed] Anomaly detection using Python [closed] python python

Anomaly detection using Python [closed]


Here's a simple machine learning approach to the problem, and is what I'd do to get started on this problem and develop a baseline classifier:

Build up a corpus of scripts and attach a label either 'good' (label= 0) or 'bad' (label = 1) the more the better. Try to ensure that the 'bad' scripts are a reasonable fraction of the total corpus,50-50 good/bad is ideal.

Develop binary features that indicate suspicious or bad scripts. For example, the presence of 'eval', the presence of 'base64_decode'.Be as comprehensive as you can be and don't be afraid of including afeature that might capture some 'good' scripts too. One way to help to do this might be to calculate the frequency counts of words in the two classes of script and select as features words that appear prominently in 'bad' but less prominently in 'good'.

Run the feature generator over the corpus and build up a binary matrix of features with labels.

Split the corpus into train (80% of examples) and test sets (20%). Using the scikit learn library, train a few different classification algorithms (random forests, support vector machines, naive bayes etc) with the training set and test their performance on the unseen test set.

Hopefully I have a reasonable classification accuracy to benchmark against. I'd then look at improving the features, some unsupervised methods (without labels) and more specialised algorithms to get better performance.

For resources, Andrew Ng's Coursera course on Machine Learning (which includes example spam classification, I believe) is a good start.