# Everything you need to know about log loss in machine learning

Loss functions measure how far a machine learning model's predictions are from the expected outcome. The terms cost function and loss function refer to closely related things: both describe the quantity that training, typically via backpropagation, tries to reduce, namely the difference between the predicted and expected outcomes. The log loss function measures the cross-entropy between two probability distributions. This article focuses on the log loss function and covers the following topics.

## Contents

1. What is a loss function?
2. What is log loss?
3. Mathematical explanation

The logarithmic loss function is part of the maximum likelihood framework. Let’s talk about the loss function first.

## What is a loss function?

The term “loss” refers to the penalty for failing to produce the expected output. If the discrepancy between the values predicted by our model and the expected values is large, the loss function returns a larger number; if the discrepancy is small and the prediction is close to the expected value, it returns a lower number.

A loss function turns a theoretical objective into a practical, measurable one. Building a highly accurate predictor requires iterating continuously: posing the question, modeling the problem with the selected technique, and testing.

The key criterion used to evaluate a statistical model is its performance, that is, the accuracy of the model’s predictions. This requires a way to measure the distance between the output of a specific iteration of the model and the actual values. This is where loss functions come in.

Loss functions calculate the distance between an estimated value and its actual value. A loss function relates decisions to their costs. Loss functions vary depending on the task at hand and the goal to be achieved.
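As a small illustration of a loss acting as a distance, the sketch below uses squared error, which is one common choice and not one the article singles out. The function name `squared_error` and the sample values are hypothetical.

```python
def squared_error(predicted: float, actual: float) -> float:
    """Squared distance between a prediction and the actual value."""
    return (predicted - actual) ** 2


actual = 10.0
near_miss = squared_error(9.5, actual)  # prediction close to the actual value
far_miss = squared_error(4.0, actual)   # prediction far from the actual value

# The further the prediction is from the actual value, the larger the loss.
print(near_miss, far_miss)
```

A different task would typically call for a different loss; classification with predicted probabilities, for instance, usually relies on log loss rather than squared error.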


## What is log loss?

When modeling a classification problem in which inputs must be assigned to one of several classes, the task can be framed as predicting the probability of belonging to each class. The model predicts these probabilities for the training data based on its current weights, and training adjusts the weights to minimize the difference between the predicted probabilities and the probability distribution of the training data. This difference is measured by cross-entropy.
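To make the idea concrete, here is a minimal sketch of cross-entropy between a true class distribution and a predicted one. The function name `cross_entropy`, the clipping constant `eps`, and the three-class example values are assumptions made for illustration.

```python
import math


def cross_entropy(true_dist, predicted_dist, eps=1e-15):
    """Cross-entropy H(p, q) = -sum_k p_k * log(q_k), in nats."""
    total = 0.0
    for p_k, q_k in zip(true_dist, predicted_dist):
        q_k = min(max(q_k, eps), 1.0)  # clip to avoid log(0)
        total -= p_k * math.log(q_k)
    return total


# Hypothetical three-class example: the true class is the second one.
one_hot = [0.0, 1.0, 0.0]
confident = cross_entropy(one_hot, [0.05, 0.90, 0.05])
unsure = cross_entropy(one_hot, [0.30, 0.40, 0.30])

# A prediction that puts more mass on the true class has lower cross-entropy.
print(confident, unsure)
```

With a one-hot true distribution, only the probability assigned to the true class contributes, so the value reduces to the negative log of that probability.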

The phrase “cross-entropy” is sometimes used to refer specifically to the negative log-likelihood of a Bernoulli or softmax distribution, although this usage is too narrow. Any loss that consists of a negative log-likelihood is a cross-entropy between the empirical distribution derived from the training set and the probability distribution defined by the model. Mean squared error, for example, is the cross-entropy between an empirical distribution and a Gaussian model.
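The link between mean squared error and a Gaussian model can be checked numerically. The sketch below assumes a Gaussian with fixed standard deviation `sigma = 1`, in which case the average negative log-likelihood equals half the mean squared error plus a constant. The function names and the data are hypothetical.

```python
import math


def gaussian_nll(y_true, y_pred, sigma=1.0):
    """Average negative log-likelihood of y_true under N(y_pred, sigma^2)."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    terms = [const + (y - m) ** 2 / (2 * sigma ** 2) for y, m in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - m) ** 2 for y, m in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 1.5, 3.5]

nll = gaussian_nll(y_true, y_pred)
# With sigma = 1: NLL = 0.5 * MSE + 0.5 * log(2 * pi)
reconstructed = 0.5 * mse(y_true, y_pred) + 0.5 * math.log(2 * math.pi)
print(nll, reconstructed)
```

Because the two quantities differ only by a positive scale factor and an additive constant, minimizing one minimizes the other.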

Whenever an algorithm is trained by maximum likelihood estimation, its loss function is a cross-entropy loss. Cross-entropy loss is what guides the updates to the model weights during training. The goal is to minimize the loss: the lower the loss, the better the model. The cross-entropy loss of a perfect model is zero.

## Mathematical explanation

Consider the example of a loss function for a binary classification problem. The objective is to predict a binary label (y), and the model outputs the predicted probability (p) that the label is 1. A loss function, here the binary cross-entropy function (log loss), is used to evaluate the quality of the prediction. The loss is a function of the predicted probability and the binary label. A prediction incurs a loss whenever the probability it assigns to the actual label, whether 0 or 1, is less than 1.

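The formula is:

$$
\text{Log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log\big(p(y_i)\big) + (1 - y_i) \log\big(1 - p(y_i)\big) \right]
$$

where

- y_i is the label of the i-th data point (0 or 1 for binary classification)
- p(y_i) is the predicted probability that the i-th data point is 1, with the sum running over all N points.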
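A minimal sketch of binary log loss as it is usually defined, the mean of -[y log p + (1 - y) log(1 - p)] over N points, might look like the following. The function name `binary_log_loss`, the clipping constant `eps`, and the sample labels and probabilities are assumptions for illustration, not values from the article.

```python
import math


def binary_log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy over N points, in nats (natural log)."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log() never receives 0
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Hypothetical data: true labels and predicted probabilities of class 1.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
score = binary_log_loss(labels, probs)
print(score)
```

Clipping is a practical safeguard: a predicted probability of exactly 0 or 1 for the wrong class would otherwise produce an infinite loss.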

For each observation, the log loss value is determined using the observation’s true value (y) and the predicted probability (p). To evaluate and characterize a classification model’s performance, its log loss score is reported as the average of the log losses of all observations. The average of the log loss values of the three predictions is 0.646, as shown in the table.

The log loss score of a model with perfect skill is 0. In other words, the model assigns a probability of 1 to the true class of every observation. If two models are evaluated on the same dataset, the model with the lower log loss score outperforms the one with the higher score. The log loss scores of two models evaluated on two separate datasets are not directly comparable.
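The sketch below illustrates both points on a single hypothetical dataset: a perfect model scores (numerically) zero, and the lower of two scores on the same labels indicates the better model. The helper `log_loss`, the labels, and the model outputs are all made up for illustration.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary log loss, in nats; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


labels = [1, 0, 1, 1, 0]  # hypothetical true labels

# A perfect model assigns probability 1 to the true class every time.
perfect = log_loss(labels, [1.0, 0.0, 1.0, 1.0, 0.0])

# Two hypothetical models scored on the same labels.
model_a = log_loss(labels, [0.9, 0.1, 0.8, 0.7, 0.2])
model_b = log_loss(labels, [0.6, 0.4, 0.6, 0.5, 0.5])

# perfect is zero up to the clipping constant; model_a beats model_b.
print(perfect, model_a, model_b)
```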

If every observation is assigned the same constant predicted probability, the lowest log loss achievable this way serves as the baseline score. In the image, this is the minimum of the curve. The naive classification model, which assigns every observation a constant probability equal to the fraction of class 1 observations in the data, determines the baseline log loss score for a dataset. For example, on a nearly balanced dataset with a 49:51 ratio of class 0 to class 1, a naive model with a constant probability of 0.51 yields a log loss score of about 0.693, which serves as the baseline score for that dataset.
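The baseline can be computed directly. The sketch below assumes a model that predicts one constant probability for every observation and evaluates its log loss from the class 1 fraction alone; the function name `constant_log_loss` and the chosen class ratios are illustrative.

```python
import math


def constant_log_loss(positive_rate, p):
    """Log loss (nats) when every observation is assigned probability p of class 1.

    positive_rate is the fraction of observations that truly belong to class 1.
    """
    return -(positive_rate * math.log(p) + (1 - positive_rate) * math.log(1 - p))


# 49:51 split: predicting the class 1 fraction itself gives the baseline.
near_balanced = constant_log_loss(0.51, 0.51)  # about 0.693

# Any other constant on the same data does worse than the baseline.
off_target = constant_log_loss(0.51, 0.25)     # about 0.848

# 90:10 split: the baseline is much lower on imbalanced data.
imbalanced = constant_log_loss(0.10, 0.10)     # about 0.325

print(near_balanced, off_target, imbalanced)
```

This is also why scores from different datasets are hard to compare: the baseline itself shifts with the class balance, so a score that looks good on a balanced dataset may be no better than the naive model on an imbalanced one.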