Classification Evaluation Metrics


In this article, we discuss several important metrics used to evaluate classification algorithms in supervised learning. Many metrics can be used to measure the performance of a classification model; some of the main ones are listed below:

  • Confusion matrix– This is one of the most important and most commonly used tools for evaluating classification performance. It is a table that counts how often each true class was predicted as each class. Typically the “true classes” are shown on the x-axis and the “predicted classes” on the y-axis. The confusion matrix is applicable to both binary and multi-class classification. Please see the cat and dog classification example later in this article.



  • Receiver operating characteristic (ROC) curve– This curve is typically used for binary classification. On the x-axis we plot the “False Positive Rate (FPR)” and on the y-axis the “True Positive Rate (TPR)”, each computed over a range of decision thresholds. An ideal classifier has a TPR of 100% and an FPR of 0%, but this rarely happens in practice. Also note that ROC is not directly applicable to multiclass classification; it has to be extended, for example by treating each class against the rest.


  • Area under the curve (AUC)– Typically this is the area under the ROC curve. The higher the AUC, the better the binary classifier. The best possible value of AUC is 1 (or 100%). A value of 0.5 corresponds to random guessing, and a value of 0 (0%) means the model ranks every negative above every positive, i.e. its predictions are perfectly inverted. There are two main limitations of AUC: first, it is not directly applicable to multiclass classification, and second, it can be misleading for unbalanced data, i.e. data where one class is far more frequent than the other. An example is fraud classification, where the fraud incidence rate is typically less than 1%.




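  • Accuracy– This measures the model’s overall performance as the fraction of all observations that are classified correctly. This metric is valid for both binary and multi-class classification; however, it is not very robust for unbalanced data, where Precision and Recall are more informative. For example, if only 1% of transactions are fraudulent, a model that never predicts fraud is still 99% accurate.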


  • Precision– Of all the observations that the model identifies as positive, this metric measures the fraction that are truly positive, i.e. true positives divided by the sum of true positives and false positives. This is a robust metric for multiclass classification and unbalanced data. The closer the Precision value is to 1, the better the model.


  • Recall– Of all the observations that are actually positive, this metric measures the fraction that the model identifies as positive, i.e. true positives divided by the sum of true positives and false negatives. The closer the Recall value is to 1, the better the model. As with Precision, this is a robust metric for multi-class classification and unbalanced data.
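
To make the ROC and AUC descriptions above more concrete, here is a minimal sketch in plain Python. The function names `roc_points` and `auc`, as well as the labels and scores, are made up for illustration and are not taken from any library. The idea is to lower the decision threshold step by step, record an (FPR, TPR) point each time, and then integrate the resulting curve with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from (0, 0) to (1, 1), one per distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative example")

    # Sort by descending score: lowering the threshold admits examples in this order
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example in a group of tied scores
        is_last = i == len(ranked) - 1
        if is_last or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = positive class) and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
roc_auc = auc(curve)
print(f"AUC = {roc_auc:.3f}")
```

With these made-up scores, 8 of the 9 possible (positive, negative) pairs are ranked correctly, so the AUC comes out at 8/9. In practice a library routine would normally be used; this sketch is only meant to show what is being computed.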

In the example below, let’s pretend that we have built a classification algorithm to identify Dogs (Positive Class) from a total of 100 animals, of which 70 are actually Dogs (Positive Class) and 30 are Cats (Negative Class).
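
As a rough sketch of how the metrics fall out of this setup, the Python snippet below fills in a confusion matrix. Only the totals (100 animals, 70 dogs, 30 cats) come from the example; the split into correct and incorrect predictions is an assumption made up purely for illustration.

```python
# From the example: 100 animals, 70 dogs (positive class), 30 cats (negative class).
# ASSUMPTION: the prediction counts below are hypothetical, chosen only to
# illustrate the arithmetic. They are not results from a real model.
tp = 60  # dogs predicted as dogs  (true positives)
fn = 10  # dogs predicted as cats  (false negatives)
fp = 5   # cats predicted as dogs  (false positives)
tn = 25  # cats predicted as cats  (true negatives)

# Rows = predicted class, columns = true class (true classes on the x-axis)
confusion_matrix = [
    [tp, fp],  # predicted Dog
    [fn, tn],  # predicted Cat
]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.850 precision=0.923 recall=0.857
```

Note that some libraries lay the matrix out the other way round (rows for true classes, columns for predicted classes), so it is worth checking the convention before reading values off a matrix.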




  • Fscore– This is the harmonic mean of the Precision and Recall metrics. The closer the Fscore is to 1, the better the model.
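
A minimal Python sketch of the harmonic mean described above follows. The helper name `f_score` and the input values are hypothetical. It is meant to show why the harmonic mean is used: unlike a simple average, it stays low when either Precision or Recall is low.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (returns 0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, for illustration only
balanced = f_score(0.8, 0.8)   # both metrics are reasonably high
lopsided = f_score(1.0, 0.1)   # perfect precision, very poor recall

print(f"balanced={balanced:.3f} lopsided={lopsided:.3f}")
# prints: balanced=0.800 lopsided=0.182
```

In the lopsided case a simple average would give 0.55, whereas the harmonic mean gives about 0.18, which better reflects a model that misses most of the positive cases.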


For more details on any of the above metrics, please read this excellent article from Wikipedia.

Thanks for reading! Please share.