Reason for minimising the Cost Function

The primary set-up for learning neural networks is to define a cost function (also known as a loss function) that measures how well the network predicts outputs on the training set. The goal is then to find a set of weights and biases that minimizes the cost. One common choice is the mean squared error, which measures the difference between the actual value of y and the estimated value of y (the prediction). The equation of the regression line below is hθ(x) = θ0 + θ1x, which has only two parameters: weight (θ1) and bias (θ0). The goal of any Machine Learning model is to minimize the Cost Function. Our goal is to move from the mountain in the top right corner (high cost) to the dark blue sea in the bottom left (low cost). In order to get the lowest error value, we need to adjust the weights θ0 and θ1 to reach the smallest possible error. This is because a lower error between the actual and predicted values means the algorithm has done a good job of learning.

The calculation method of Gradient Descent

Gradient descent is an efficient optimization algorithm that attempts to find a local or global minimum of a function. It runs iteratively to find the optimal values of the parameters corresponding to the minimum value of the given cost function, using calculus. Mathematically, the derivative is extremely important for minimising the cost function, because it helps us find the minimum point. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) in which to move the coefficient values in order to get a lower cost on the next iteration. In this way θ1 gradually converges towards a minimum value. The derivative of a function (in our case, J(θ)) with respect to each parameter (in our case, the weight θ) tells us the sensitivity of the function to that variable, i.e. how changing the variable impacts the function value. Gradient descent therefore enables the learning process to make corrective updates to the learned estimates, moving the model toward an optimal combination of parameters (θ). The cost is calculated over the entire training dataset for each iteration of the gradient descent algorithm. In Gradient Descent, one iteration of the algorithm is called one batch, where the batch denotes the total number of samples from the dataset used to calculate the gradient in that iteration. It helps to have some basic understanding of calculus here, because partial derivatives and the chain rule are applied. To solve for the gradient, we iterate through our data points using our new weight (θ1) and bias (θ0) values and compute the partial derivatives. This new gradient tells us the slope of the cost function at our current position (current parameter values) and the direction we should move in to update our parameters. The size of our update is controlled by the learning rate: θj := θj − α · ∂J(θ)/∂θj. Note that we use ':=' to denote an assignment or update. The size of these steps is called the learning rate (α), which gives us some additional control over how large a step we take. With a large learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate we can confidently move in the direction of the negative gradient, since we are recalculating it so frequently; a low learning rate is more precise, but calculating the gradient that often is time-consuming, so it will take a very long time to get to the bottom.

Now let's discuss the three variants of the gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step, and the trade-off between them is the accuracy of the gradient versus the time complexity of performing each parameter update (learning step). However, there is a disadvantage to applying the typical Gradient Descent optimization technique to our dataset: it becomes computationally very expensive, because we have to use all of the one million samples to complete one iteration, and this has to be done for every iteration until the minimum point is reached. This problem can be solved by Stochastic Gradient Descent. The word 'stochastic' refers to a system or process that is linked with random probability. Stochastic gradient descent uses this idea to speed up the process of performing gradient descent. Hence, unlike typical Gradient Descent, instead of using the whole dataset for each iteration, we use the cost gradient of only one example at each iteration (details are shown in the graph below).
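The update rule and the batch-vs-stochastic distinction described above can be sketched in plain Python for the two-parameter linear model hθ(x) = θ0 + θ1x with a mean-squared-error cost. This is a minimal illustration, not the article's own code: the function names, the toy dataset (generated from y = 1 + 2x), and the learning-rate values are assumptions made for the example.

```python
import random

def predict(theta0, theta1, x):
    # Hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def batch_gradient_step(theta0, theta1, xs, ys, alpha):
    # One step of (batch) Gradient Descent: the partial derivatives of the
    # MSE cost J(theta) are averaged over the ENTIRE training set.
    m = len(xs)
    errors = [predict(theta0, theta1, x) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                              # dJ/d(theta0)
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/d(theta1)
    # theta := theta - alpha * gradient  (':=' is the update from the text)
    return theta0 - alpha * grad0, theta1 - alpha * grad1

def sgd_step(theta0, theta1, xs, ys, alpha):
    # One step of Stochastic Gradient Descent: the gradient is computed
    # from a single randomly chosen example instead of the whole dataset.
    i = random.randrange(len(xs))
    error = predict(theta0, theta1, xs[i]) - ys[i]
    return theta0 - alpha * error, theta1 - alpha * error * xs[i]

# Toy data lying exactly on y = 1 + 2x, so both variants should
# converge toward theta0 = 1 and theta1 = 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = 0.0, 0.0
for _ in range(5000):
    theta0, theta1 = batch_gradient_step(theta0, theta1, xs, ys, alpha=0.05)

random.seed(0)
t0, t1 = 0.0, 0.0
for _ in range(20000):
    t0, t1 = sgd_step(t0, t1, xs, ys, alpha=0.01)
```

Each batch step here touches all five samples, while each SGD step touches only one; with a million samples instead of five, that difference is exactly the computational saving the article describes.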