Performance Measurement of Models

Pratik Kumar Roy
9 min read · Apr 16, 2020

Evaluating your model is an essential part of any project. Your model may give a satisfactory result on one metric, say accuracy, yet perform poorly on others. We will look at several performance metrics. Let’s start one by one.

1. Accuracy

Accuracy is defined as “the number of correctly classified points divided by the total number of points in the test set”.

Accuracy = Number of correctly classified points / Total number of points in the test set

Accuracy ranges between 0 and 1, where 0 is the worst accuracy and 1 is the best.

Let’s see an example first,

So we have 100 points, of which 60 are positive and 40 are negative. After passing them through our model M, 53 of the 60 positive points are correctly classified and 7 are classified as negative; of the 40 negative points, 35 are correctly classified and 5 are classified as positive. Therefore we have a total of 53 + 35 = 88 correctly classified points out of 100, which gives us an accuracy of 88%.
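To make the arithmetic concrete, here is a minimal sketch in plain Python that rebuilds the example above. The function name `accuracy` and the way the two label lists are laid out are my own choices for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of points whose predicted label matches the actual label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 60 positive points followed by 40 negative points.
y_true = [1] * 60 + [0] * 40
# Positives: 53 predicted 1, 7 predicted 0. Negatives: 35 predicted 0, 5 predicted 1.
y_pred = [1] * 53 + [0] * 7 + [0] * 35 + [1] * 5

print(accuracy(y_true, y_pred))  # 0.88
```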

Problems with Accuracy

(A) Imbalanced Data

Accuracy is computed on the test data, and it fails when that data is imbalanced. In the case of an imbalanced data-set, we should never use accuracy as the measure, because a DUMB model can give high accuracy.

Let’s see an example. Suppose we have a dataset with 90% negative points and 10% positive points, and a dumb model M that returns negative for every new point x. The accuracy of this model will be 90%, even though it never finds a single positive point. That’s why “we should never use accuracy as a measure on imbalanced data and in real-world we see lots of imbalanced data”.
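A small sketch of this situation, assuming a test set of 100 points with the 90/10 split described above. The function `always_negative` is a hypothetical stand-in for the dumb model.

```python
def always_negative(x):
    """A 'dumb' model: ignores its input and always predicts the negative class (0)."""
    return 0

# 90 negative points and 10 positive points.
y_true = [0] * 90 + [1] * 10
y_pred = [always_negative(x) for x in range(len(y_true))]

acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(acc)              # 0.9
print(positives_found)  # 0
```

The accuracy looks respectable, yet not one positive point is detected.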

(B) Probability Score

When a model outputs a probability score, we choose a threshold (typically around 0.5) and assign points scoring above it to one class and points scoring below it to the other.

The predicted class labels are exactly the same for the two models (M1 and M2), though the models are different. As we can see, model M1 is better than model M2, but if accuracy is used as the performance metric, both give the same accuracy and we cannot tell which of M1 and M2 is better.
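Here is a rough illustration of that blind spot. The scores below are made up for the sketch (they are not the values from the original comparison): the first model is confident and the second barely lands on the right side of the threshold, yet accuracy sees no difference.

```python
THRESHOLD = 0.5

y_true = [1, 1, 0, 0]
# Hypothetical probability scores for class 1 (illustrative only).
scores_m1 = [0.95, 0.90, 0.10, 0.05]  # confident and correct
scores_m2 = [0.60, 0.55, 0.45, 0.40]  # correct, but only just

def to_labels(scores, threshold=THRESHOLD):
    """Turn probability scores into class labels using a fixed threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

acc_m1 = accuracy(y_true, to_labels(scores_m1))
acc_m2 = accuracy(y_true, to_labels(scores_m2))
print(acc_m1, acc_m2)  # 1.0 1.0
```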

2. Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).

A confusion matrix cannot take probability scores. It only takes class labels (binary values in the two-class case).

Here, TN = number of points such that actual class label = 0 and predicted class label = 0

FN = number of points such that actual class label = 1 and predicted class label = 0

FP = number of points such that actual class label = 0 and predicted class label = 1

TP = number of points such that actual class label = 1 and predicted class label = 1
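The four definitions translate almost line by line into code. This is a minimal sketch for binary labels; the function name `confusion_counts` is mine, and the data reuses the 100-point accuracy example from earlier.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FN, FP and TP for binary labels (0 = negative, 1 = positive)."""
    tn = fn = fp = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:  # actual == 1 and predicted == 1
            tp += 1
    return tn, fn, fp, tp

y_true = [1] * 60 + [0] * 40
y_pred = [1] * 53 + [0] * 7 + [0] * 35 + [1] * 5

print(confusion_counts(y_true, y_pred))  # (35, 7, 5, 53)
```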

In case of multiple-class classification:

We can draw a matrix of the predicted and actual values. If the model is sensible, the counts along the principal diagonal should be larger than the off-diagonal counts.

For better models, both TPR (recall, or sensitivity) and TNR (specificity) should be high, and FPR and FNR should be as low as possible.
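The four rates follow directly from the confusion-matrix counts. The sketch below uses the standard definitions (each rate is normalised by the number of actual positives or actual negatives); the function name `rates` and the counts, taken from the 100-point accuracy example, are my own choices.

```python
def rates(tn, fn, fp, tp):
    """Compute the four rates from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # share of actual positives predicted positive
        "TNR": tn / (tn + fp),  # share of actual negatives predicted negative
        "FPR": fp / (fp + tn),  # share of actual negatives predicted positive
        "FNR": fn / (fn + tp),  # share of actual positives predicted negative
    }

r = rates(tn=35, fn=7, fp=5, tp=53)
for name, value in r.items():
    print(name, round(value, 3))
```

Note that TPR + FNR = 1 and TNR + FPR = 1, so pushing one rate in each pair up automatically pushes the other down.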



In the case of imbalanced data-sets, the confusion matrix helps in making good inference from the model.

For example, in medical applications like diagnosing cancer, we must have a very low FNR, as we can’t afford to tell an actual patient that he does not have cancer. A slightly higher FPR is acceptable: even if we tell a non-cancer patient that he may have cancer, further, more powerful tests can correct that later. Missing someone is extremely dangerous.

3. Precision, Recall and F1-score

Precision: Precision looks only at the points predicted to be positive. Of all the points the model declared to be positive, what percentage are actually positive?

Precision = TP / (TP+FP)

Recall: Recall looks only at the points that are actually positive. Of all the points which actually belong to class ‘1’, how many are predicted to be positive? This is the same as TPR.

Recall = TPR = TP/(TP+FN)

We want both the precision and recall to be high, and that’s where the F1-score comes in.

F1 Score: F1 Score is the harmonic mean of precision and recall.

F1 score = 2* (precision*recall)/(precision+recall)

Precision and recall are more interpretable than F1-score as we can interpret precision and recall in simple English.
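As a quick worked example of the three formulas, here is a sketch using the counts from the earlier 100-point example (TP = 53, FP = 5, FN = 7). The function name `precision_recall_f1` is just a label I picked.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=53, fp=5, fn=7)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.914 0.883 0.898
```

Because F1 is a harmonic mean, it sits closer to the smaller of the two values, so a model cannot score well on F1 by excelling at only one of precision or recall.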

4. Receiver Operating Characteristic (ROC) Curve & AUC (Area Under Curve)

The ROC curve was designed by electronics and radio engineers. It is mostly used in binary classification tasks, although there are extensions of ROC for multiclass classification.

Let’s understand ROC through an example. We have a model M which predicts a class label in {0, 1} and also gives us a score, like a probability score, such that the higher the score, the higher the chance that the point belongs to class 1.

Step 1: Sort the values in descending order of the score Y’. My table is already sorted.

Step 2: Thresholding (𝞽):

In this step, we take a threshold value of the score and use it to assign a class label to each data point. For example, take 𝞽 = 0.95. If Y’ >= 𝞽 we mark the point as 1, else we mark it as 0.

So now our table looks like this,

Next, we take 𝞽2 = 0.92 and do as above:

And this step is performed for all points taking the score of each point as a threshold.

The threshold gives us TPR and FPR. For each 𝞽 we calculate corresponding TPR and FPR and then plot it.
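The two steps above can be sketched in a few lines. The labels and scores here are invented for illustration (only the first two thresholds, 0.95 and 0.92, echo the walkthrough), and `roc_points` is a hypothetical helper name.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, using each distinct score as a threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for tau in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= tau else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical data, already sorted by descending score.
y_true = [1, 1, 0, 1, 0]
scores = [0.95, 0.92, 0.80, 0.76, 0.71]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

The curve always starts at (0, 0) and ends at (1, 1), since the lowest threshold marks every point as positive.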

The TPR vs FPR plot looks like this and this gives us ROC curve:

Let’s compare ROC curves:

AUC is the Area Under the ROC Curve. In this basic form, ROC and AUC apply to binary classification tasks.

The value of AUC lies between 0 and 1. The higher the AUC, the better the model.
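One simple way to get the area is the trapezoidal rule over the (FPR, TPR) points. This is a sketch under that assumption; the function name `auc_trapezoid` and the example points are hypothetical.

```python
def auc_trapezoid(points):
    """Area under a curve given as (FPR, TPR) points, via the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical ROC points for a small five-point test set.
roc = [(0.0, 0.0), (0.0, 1 / 3), (0.0, 2 / 3), (0.5, 2 / 3), (0.5, 1.0), (1.0, 1.0)]
print(round(auc_trapezoid(roc), 3))  # 0.833

# Reference shapes: the diagonal and a perfect classifier.
print(auc_trapezoid([(0.0, 0.0), (1.0, 1.0)]))              # 0.5
print(auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))  # 1.0
```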

Drawbacks of AUC:

A. In the case of imbalanced data, AUC can look high even when the model performs poorly on the minority class.

Please go through this link for more information about ROC on imbalanced data.

B. AUC doesn’t care about the actual values of the scores; it only cares about their ordering. Several models can have the same AUC, because once the scores are sorted, different scores can produce the same ordering of the classes.

=> AUC for a random model will be 0.5. A model is called a random model if, for a point x, it randomly decides whether the class is 0 or 1.

Swapping class labels: Another property of AUC is that if you ever see a model M with AUC < 0.5, let’s say 0.2, you can just swap its predicted class labels (0 becomes 1 and 1 becomes 0). Your new model will have an AUC of 1 - 0.2 = 0.8, which is a better model.
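Both the ordering property and the swapping property are easy to check with a small sketch. Here AUC is computed in its equivalent rank form, the fraction of (positive, negative) pairs in which the positive point gets the higher score, with ties counting as half. The data and the name `auc_pairwise` are made up for illustration.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0]
scores = [0.95, 0.92, 0.80, 0.76, 0.71]

base = auc_pairwise(y_true, scores)

# Same ordering, very different values: AUC does not change.
squashed = auc_pairwise(y_true, [s / 10 for s in scores])

# Swapping the classes (score for class 1 becomes 1 - score) gives 1 - AUC.
flipped = auc_pairwise(y_true, [1 - s for s in scores])

print(round(base, 3), round(squashed, 3), round(flipped, 3))
```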


5. Log-loss

Log-loss uses the actual probability score. In the metrics above, we haven’t really used the probability score itself. In AUC we used the scores for ordering, but not their exact values.

So log-loss uses the probability score. It ranges from 0 to ∞. The smaller the log-loss, the better.

Let’s first see log-loss for binary classification task and then we will extend it to multiclass classification.

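For a test set of n points in binary classification, where Yi is the actual label and Pi is the predicted probability that point i belongs to class 1,

$$\text{Log-loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[Y_i \log(P_i) + (1 - Y_i)\log(1 - P_i)\right]$$

In simple English, log-loss is the average negative log of the probability of the correct class label.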

After applying log-loss to the labels above, the table looks like this:

Log-loss penalizes even small deviations in the probability score.

As the probability P assigned to the correct class comes towards 0, -log(P) moves towards ∞.
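A minimal sketch of binary log-loss, assuming the natural logarithm (the base is my assumption) and made-up scores. It reads exactly as the plain-English definition: take the probability given to the correct class, take its negative log, and average.

```python
import math

def log_loss(y_true, probs):
    """Average negative log of the probability assigned to the correct class.

    probs[i] is the predicted probability that point i belongs to class 1.
    A probability of exactly 0 for the correct class would make math.log fail,
    which mirrors the loss going to infinity.
    """
    total = 0.0
    for y, p in zip(y_true, probs):
        p_correct = p if y == 1 else 1 - p
        total += -math.log(p_correct)
    return total / len(y_true)

y_true = [1, 1, 0, 0]
confident = [0.95, 0.90, 0.10, 0.05]  # hypothetical scores
hesitant = [0.60, 0.55, 0.45, 0.40]   # same labels at threshold 0.5

print(round(log_loss(y_true, confident), 3))
print(round(log_loss(y_true, hesitant), 3))
```

Both sets of scores give identical labels at a 0.5 threshold, but log-loss separates them: the confident model gets the lower (better) value.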

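Log-loss for multiclass classification with C classes, where Yij is 1 if point i belongs to class j (else 0) and Pij is the predicted probability of class j for point i,

$$\text{Log-loss} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{C} Y_{ij} \log(P_{ij})$$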
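For the multiclass case, only the probability given to the true class of each point contributes to the sum. A small sketch, again assuming the natural logarithm, with hypothetical probabilities and a function name of my own:

```python
import math

def multiclass_log_loss(y_true, prob_rows):
    """Average negative log of the probability given to each point's true class.

    y_true[i] is the index of the true class of point i.
    prob_rows[i] is the list of predicted class probabilities for point i.
    """
    total = 0.0
    for label, row in zip(y_true, prob_rows):
        total += -math.log(row[label])
    return total / len(y_true)

# Three classes, three points (hypothetical probabilities).
y_true = [0, 2, 1]
prob_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.5, 0.2],
]
print(round(multiclass_log_loss(y_true, prob_rows), 3))

# A model that spreads probability evenly over 3 classes scores log(3).
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
print(round(multiclass_log_loss(y_true, uniform), 3))
```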


Log-loss is hard to interpret. It ranges from 0 to ∞, so if a model has a log-loss of 1.0 we cannot easily say whether that is good or bad in absolute terms. We only know that the best case is 0 and that, between two models evaluated on the same data, the lower value is better.

=> If you care about a class which is smaller in number, regardless of whether it is positive or negative, go for the ROC-AUC score. If you care about the absolute probabilistic difference, go with log-loss.

Check out this excellent blog to see the comparison between metrics with an example.

6. R-squared or Coefficient of Determination

This is a regression metric. In regression, we predict a real-valued output (Y’).

So for each point {Xi, Yi} we have a prediction Yi’, which is a real number, and an error Ei = Yi - Yi’ = actual value - predicted value.

Let’s first define some terms,

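Sum-of-squares (SS) = $\sum_{i=1}^{n}(Y_i - \bar{y})^2$, where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ is the average value of Yi in the test data.

So the sum-of-squares is the sum of squared errors of a simple mean model, one that always predicts ȳ.

Sum-of-residuals (SR) = $\sum_{i=1}^{n}(Y_i - Y_i')^2 = \sum_{i=1}^{n}E_i^2$

Residual: $E_i = Y_i - Y_i'$, where Yi is the actual value and Yi’ is the predicted value.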

Now, let’s define R-squared,

$$R^2 = 1 - \frac{SR}{SS}$$


Case 1: If SR = 0, then every Ei = 0 => R² = 1 {best value}

Case 2: If SR < SS, then R² will be between 0 and 1

Case 3: If SR = SS, then R² = 0

This means the model is the same as the simple mean model.

Case 4: If SR > SS, then R² < 0

R-squared will be negative, which means the model is worse than the simple mean model.
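The cases above can be checked numerically with a short sketch that computes R-squared as 1 - SR/SS. The data is invented so that the numbers come out round, and `r_squared` is simply the name I chose.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - SR / SS, with SS measured against the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss = sum((y - mean_y) ** 2 for y in y_true)              # simple mean model
    sr = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))  # our model
    return 1 - sr / ss

y_true = [1.0, 2.0, 3.0, 4.0]  # mean is 2.5, SS is 5.0

print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0  (Case 1: SR = 0)
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (Case 3: SR = SS)
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (Case 4: SR > SS)
```

Note that this sketch divides by SS, so it fails if every actual value is identical (SS = 0).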

One problem with R-squared is that a single large Ei affects it heavily, because the errors are squared. In other words, R-squared is not very robust to outliers.

So to overcome this we have a metric called Median absolute deviation(MAD).

7. Median Absolute Deviation (MAD)

This metric is only weakly affected by outliers. In MAD, instead of taking the mean, we take the median of the Ei’s. The median and MAD are robust to outliers, whereas the mean and standard deviation are not.

Think of Ei as a random variable. Then,

Median(Ei) = central value of errors

So MAD(Ei) = Median(|Ei - Median(Ei)|), where Ei - Median(Ei) is the deviation, and that’s why it’s called Median Absolute Deviation.

If Median(Ei) and MAD(Ei) are small then by looking at the two we can say errors are small.
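To see the robustness in action, here is a small sketch using Python's standard `statistics` module. The error values are invented, with one deliberate outlier, and the function name is my own.

```python
from statistics import mean, median, pstdev

def median_absolute_deviation(errors):
    """MAD = median of the absolute deviations from the median error."""
    center = median(errors)
    return median(abs(e - center) for e in errors)

# Hypothetical errors Ei; the last value is an outlier.
errors = [1, -1, 2, -2, 100]

print(median(errors))                     # 1
print(median_absolute_deviation(errors))  # 2

# For contrast: the mean and standard deviation are dragged by the outlier.
print(mean(errors))
print(round(pstdev(errors), 2))
```

The median and MAD stay small and describe the four typical errors, while the mean and standard deviation are dominated by the single value of 100.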

Thank you all for reading. I hope you have got some knowledge from this post.

If you have any suggestions or want me to add something to this post, please reach out.



Will see you soon with another post.