This article is part of the notes for Andrew Ng's deep learning course [1].

Author: Huang Haiguang [2]

Main writers: Huang Haiguang, Lin Xingmu (all manuscripts of course 4, the first two weeks of course 5, and the first three sections of week 3), Zhu Yansen (all manuscripts of course 3), He Zhiyao (week 3 of course 5), Wang Xiang, Hu Hanwen, Yu Xiao, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, Cao Yue, Lu Haoxiang, Qiu Muchen, Tang Tianze, Zhang Hao, Chen Zhihao, You Ren, Ze Lin, Shen Weichen, Jia Hongshun, Shi Chao, Chen Zhe, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian

Participating editors: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jiayong, Wang Xiang, Xie Shichen, Jiang Peng

Note: notes, assignments (including data and the original assignment files), and videos can all be downloaded from github [3].

I will successively post the course notes on the public account "Machine Learning Beginners", so stay tuned.

Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks

### 3.1 Tuning process

Hello everyone, and welcome back. By now, you have seen that training a neural network involves setting many different hyperparameters. So how do you find a good setting for these hyperparameters? In this video, I want to share with you some guidelines, some tips, on how to systematically organize your hyperparameter tuning process, which I hope will let you focus more efficiently on a good hyperparameter setting.

One of the painful things about training a deep network is the sheer number of hyperparameters you have to deal with, from the learning rate α to the **Momentum** parameter β (if you use **Momentum**), or the parameters of the **Adam** optimization algorithm β1, β2 and ε. Maybe you also have to pick the number of layers, maybe you have to pick the number of hidden units in the different layers, and maybe you want to use learning rate decay, so you are not using a single learning rate α. And then, of course, you may also need to choose the **mini-batch** size.

It turns out that some of these hyperparameters are more important than others. For the most widely used learning applications, I would say the learning rate α is the most important hyperparameter to tune.

Besides α, there are a few other hyperparameters I would tune next. For example, the **Momentum** parameter β, for which 0.9 is a good default value. I would also tune the **mini-batch** size to make sure the optimization algorithm runs efficiently, and I often tune the number of hidden units as well. These are the three I circled in orange — second in importance, relatively speaking. Third in importance are the others: the number of layers can sometimes make a big difference, and so can learning rate decay. And when applying the **Adam** algorithm, I actually never tune β1, β2 and ε — I always use 0.9, 0.999 and 10^-8 respectively — though you can tune them too if you want.

But I hope this gives you a rough sense of which hyperparameters matter more: α is undoubtedly the most important, next come the ones I circled in orange, and then the ones I circled in purple. This is not a hard and fast rule, though, and other deep learning researchers may well disagree with me or have different intuitions.

Now, if you are trying to tune some set of hyperparameters, how do you choose the values to try? In the earlier generations of machine learning algorithms, if you had two hyperparameters — call them hyperparameter 1 and hyperparameter 2 — the common practice was to sample the points in a grid, like this, and systematically try out these values. Here I placed a 5×5 grid; in practice the grid could be bigger or smaller, but for this example you would try out all 25 points and then pick whichever setting works best. This practice works OK when the number of hyperparameters is relatively small.

In deep learning, what we tend to do — and what I recommend you do — is choose the points at random. So you pick the same number of points, say 25, and try out the hyperparameters on these randomly chosen points. The reason you do this is that it's difficult to know in advance which hyperparameter is going to be the most important for the problem you're solving, and, as you saw in the previous slide, some hyperparameters are actually much more important than others.

To take an example, suppose hyperparameter 1 is α (the learning rate), and — to take an extreme case — suppose hyperparameter 2 is ε, the term in the denominator of the **Adam** algorithm. In this case the value of α matters a lot, while the value of ε hardly matters at all. If you sample the points in a grid, then when you try the 5 values of α you'll find that no matter what value of ε you use, the result is basically the same. So even though you trained 25 models, you only got to try 5 distinct values of α, and I think that really matters.

In contrast, if you were to choose the points at random, you would have tried out 25 distinct values of α, and you're more likely to find a value that works really well.
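To make this concrete, here is a small sketch (my own illustration, not from the course) comparing the two sampling schemes in NumPy; the [0, 1] ranges for the two hyperparameters are arbitrary placeholders:

```python
import numpy as np

np.random.seed(0)

# Grid sampling: a 5x5 grid over two hyperparameters.
# Only 5 distinct values of hyperparameter 1 are ever tried.
grid_h1, grid_h2 = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
grid_points = np.stack([grid_h1.ravel(), grid_h2.ravel()], axis=1)  # 25 points

# Random sampling: 25 points chosen uniformly at random.
# All 25 values of hyperparameter 1 are (almost surely) distinct.
random_points = np.random.rand(25, 2)

print("distinct h1 values on the grid:", len(np.unique(grid_points[:, 0])))
print("distinct h1 values at random: ", len(np.unique(random_points[:, 0])))
```

If hyperparameter 2 turns out to be irrelevant, the grid wasted 20 of its 25 runs, whereas the random scheme still explored 25 distinct values of the one that matters.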

I've described this for two hyperparameters. In practice you might be searching over more than two. If you have three hyperparameters, you search over a cube rather than a square, with hyperparameter 3 on the third dimension, and by sampling at random inside the three-dimensional cube you get to try out many more distinct values of each of the three hyperparameters.

In practice you may be searching over even more than three hyperparameters, and sometimes it's hard to predict in advance which will turn out to be the most important one for your application. Sampling at random rather than on a grid means that, whatever the outcome, you will have explored more potential values of the hyperparameters that matter most.

When you sample hyperparameter values, another common practice is to use a coarse-to-fine search strategy.

For example, in the two-dimensional case, suppose you sample these points and find that a certain point works best, and maybe a few other points around it also work well. What you do next is zoom in to a smaller region (the small blue box) and sample more densely — or again at random — within it, concentrating more resources on searching inside this blue square, if you suspect that the best hyperparameter setting lies in this region. So after a coarse search over the whole grid, you learn that you should next focus on a smaller square, and within that smaller square you can sample points more densely. This kind of coarse-to-fine search is also used frequently.
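As a sketch of the coarse-to-fine idea (my own illustration: `dev_error` is a made-up stand-in for an expensive training run, and the 0.1 box half-width is an arbitrary choice):

```python
import numpy as np

np.random.seed(1)

# A made-up objective standing in for "train the model and measure dev-set
# error"; in a real search each evaluation would be an expensive training run.
def dev_error(h1, h2):
    return (h1 - 0.62) ** 2 + 0.1 * (h2 - 0.30) ** 2

# Coarse phase: 25 random points over the whole square [0, 1] x [0, 1].
coarse = np.random.rand(25, 2)
errs = np.array([dev_error(h1, h2) for h1, h2 in coarse])
best = coarse[errs.argmin()]

# Fine phase: keep the best coarse point and sample 25 more points in a
# small box around it (half-width 0.1, clipped to stay inside the square).
lo, hi = np.clip(best - 0.1, 0, 1), np.clip(best + 0.1, 0, 1)
fine = np.vstack([best, lo + (hi - lo) * np.random.rand(25, 2)])
fine_errs = np.array([dev_error(h1, h2) for h1, h2 in fine])

# Since the incumbent is kept, the fine phase can only match or improve it.
print("best coarse error:", round(errs.min(), 4))
print("best fine error:  ", round(fine_errs.min(), 4))
```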

By trying out different values of the hyperparameters, you can pick whichever value gives the best result on your training set objective, or on the dev set, or whatever it is you most want to optimize in your hyperparameter search process.

I hope this gives you a way to systematically organize your hyperparameter search process. The key points are: sample at random rather than on a grid, and consider using a coarse-to-fine search process. But there is more to hyperparameter search than this; in the next video I'll keep explaining how to choose a reasonable scale on which to sample the hyperparameter values.

### 3.2 Using an appropriate scale to pick hyperparameters

In the previous video, you saw how sampling at random over the range of hyperparameters can make your search more efficient. But sampling at random doesn't mean sampling uniformly at random over the range of valid values; instead, it's important to pick the appropriate scale on which to explore the hyperparameters. In this video, I'll show you how to do that.

Suppose you are choosing the number of hidden units, and you think a good range is somewhere from 50 to 100. In that case, picking points uniformly at random along the number line from 50 to 100 is a perfectly reasonable way to search this hyperparameter. Or if you are choosing the number of layers of the network, which we call L, you might believe values from 2 to 4 are reasonable, and sampling uniformly at random among 2, 3 and 4 is sensible — you could even use a grid search here, since 2, 3 and 4 are all reasonable values. These are examples where sampling uniformly at random over the range you're considering is quite reasonable. But that is not the case for all hyperparameters.

Look at this example instead: suppose you are searching for the learning rate α, and you suspect 0.0001 is the low end and 1 is the high end of the range. If you draw the number line from 0.0001 to 1 and sample uniformly at random along it, about 90% of the values will fall between 0.1 and 1. So you are spending 90% of your resources searching between 0.1 and 1, and only 10% between 0.0001 and 0.1, which doesn't seem right.

Instead, it makes more sense to search for this hyperparameter on a logarithmic scale. So rather than using a linear axis, mark off 0.0001, 0.001, 0.01, 0.1 and 1, and sample uniformly at random on this log axis. That way you devote as many search resources to the range 0.0001 to 0.001 as to the range 0.001 to 0.01, and so on.

So in **Python** , you can do this: set r = -4 * np.random.rand(), and then α = 10^r. After the first line, r is a random number between -4 and 0, and so α lies between 10^-4 = 0.0001 and 10^0 = 1.

More generally, if you are sampling between 10^a and 10^b on the logarithmic axis — in this case the left value is 10^a = 0.0001, so you can compute a = log10(0.0001) = -4, and the right value is 10^b = 1, so b = log10(1) = 0 — what you do is sample r uniformly at random in the interval [a, b], here [-4, 0], and then set the hyperparameter to α = 10^r, based on the randomly sampled r.

So to summarize, to sample on the logarithmic axis: take the logarithm of the minimum value to get a, take the logarithm of the maximum value to get b, sample r uniformly at random in the interval [a, b], and set the hyperparameter to 10^r. That is the process of sampling on the logarithmic axis.
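Here is that recipe as runnable Python (a sketch of the approach described above; the helper name and the large sample count used to check the distribution are my own):

```python
import numpy as np

np.random.seed(2)

def sample_log_uniform(low, high, n):
    """Sample n values uniformly on a log scale between low and high."""
    a, b = np.log10(low), np.log10(high)   # e.g. a = -4, b = 0
    r = a + (b - a) * np.random.rand(n)    # r uniform in [a, b]
    return 10.0 ** r                       # hyperparameter = 10^r

alphas = sample_log_uniform(0.0001, 1, 100000)

# Equal resources per decade: about a quarter of the samples fall in each of
# [1e-4, 1e-3], [1e-3, 1e-2], [1e-2, 1e-1] and [1e-1, 1].
print("fraction below 0.001:", round(np.mean(alphas < 0.001), 2))
```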

Finally, one other tricky case is sampling the hyperparameter β, used for computing exponentially weighted averages. Suppose you suspect β should be somewhere between 0.9 and 0.999; maybe that's the range you want to search. Keep in mind that, when computing exponentially weighted averages, taking β = 0.9 is like averaging over the last 10 values — a bit like a 10-day temperature average — whereas taking β = 0.999 is like averaging over the last 1000 values, since 1/(1-β) gives the effective window.

So, similar to the previous slide, if you want to search between 0.9 and 0.999, sampling on a linear scale is not the right thing to do. The best way to think about this is to explore values of 1-β, which ranges from 0.1 down to 0.001, and sample 1-β on a log scale between 0.1 and 0.001, using the method from the previous slide: here a = -3 and b = -1. It's worth noting that on the previous slide we wrote the smaller value on the left and the larger on the right, but here the order is reversed: on the left is the larger value of 1-β (0.1) and on the right the smaller (0.001). So what you do is sample r uniformly at random in [-3, -1], set 1-β = 10^r, and thus β = 1 - 10^r, and that gives you a random value of the hyperparameter on the appropriate scale. Done this way, you spend as many resources exploring the range 0.9 to 0.99 as exploring 0.99 to 0.999.
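The same trick for β, as a short Python sketch (sampling 1-β on a log scale; the large sample count is only there to check the distribution):

```python
import numpy as np

np.random.seed(3)

# Sample beta in [0.9, 0.999] by sampling 1 - beta on a log scale in
# [0.001, 0.1]: r uniform in [-3, -1], then beta = 1 - 10^r.
r = -3 + 2 * np.random.rand(100000)
beta = 1 - 10.0 ** r

# Half the samples land in [0.9, 0.99] and half in [0.99, 0.999],
# instead of 90% piling up near 0.9 as a linear scale would give.
print("fraction with beta > 0.99:", round(np.mean(beta > 0.99), 2))
```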

As for a more formal mathematical justification for why we do it this way — why it's not a good idea to sample on a linear axis — it is because the sensitivity of the results changes as β approaches 1, even for very small changes in β. So if β goes from 0.9 to 0.9005, it hardly matters; your results will barely change.

But if β goes from 0.999 to 0.9995, this will have a huge impact on your algorithm, right? In the first case, both values average over approximately 10 values; but here, the exponentially weighted average goes from being based on about 1000 values to about 2000 values, because the formula 1/(1-β) becomes very sensitive to small changes in β when β is close to 1. So the whole sampling process needs to place your points more densely in the region where β is close to 1 (equivalently, where 1-β is close to 0), so that you distribute your sample points more efficiently and search more effectively.

I hope this helps you choose the appropriate scale on which to sample your hyperparameters. And if you don't end up making the right scale decision, don't worry too much: even sampling on a uniform scale will give you decent results if the total number of samples is large, especially if you apply a coarse-to-fine search, because later iterations will still home in on the range of useful hyperparameter values.

Hope this will be helpful to your hyperparameter search. In the next video, we will share some thoughts on how to organize the search process, and hope it will make your work more efficient.

### 3.3 Hyperparameters tuning in practice: Pandas vs. Caviar

So far, you have heard quite a lot about how to search for good hyperparameters. Before wrapping up our discussion of hyperparameter search, I would like to share a few final tips and tricks on how to organize your hyperparameter search process.

Deep learning today is applied to many different fields, and the hyperparameter intuitions from one application area may or may not carry over to another; the different application domains cross-pollinate each other. For example, I have seen ingenious methods developed in computer vision, such as **ConvNets** or **ResNets** (which we will cover in a later course), successfully applied to speech recognition, and I have also seen ideas that originated in speech recognition successfully applied to **NLP** , and so on.

In the field of deep learning, one thing that has developed very well is that people in different application fields will read more and more articles in other research fields to find inspiration across fields.

As far as hyperparameter settings go, I have seen intuitions go stale. So even if you work on just one problem — say, logistics — you may have found a good setting of the hyperparameters and kept developing the algorithm, but over the course of several months your data may gradually change, or maybe you just upgrade the servers in your data center, and because of those changes your original hyperparameter settings no longer work well. So my advice is to retest or re-evaluate your hyperparameters at least once every several months, to make sure you are still happy with the values you have.

Finally, on the question of how to search for hyperparameters: I have seen roughly two major schools of thought, two important but different ways in which people go about it.

One is that you babysit a single model. Usually you have a huge dataset but not much computational resource — not many **CPUs** and **GPUs** — so basically you can only afford to train one model, or a very small number of models, at a time. In that case you nurse the model along even as it trains. For example, on day 0 you initialize the parameters at random and start training, and you watch your learning curve — maybe the cost function J, or the dev-set error, or something else — gradually decrease over the first day. At the end of that day you might say: look, it's learning pretty well, let me try increasing the learning rate a little and see how it does; and maybe it turns out to do better. That's your day-two performance. Two days later you say it's still doing well, so maybe now I'll add a bit of **Momentum** or decrease the learning rate a little. Then comes day three. Every day you watch the model and keep nudging your parameters up or down. Maybe one day you find the learning rate was too high and you go back to the previous day's model. But the point is that you look after this one model day by day, even as it trains over many days or weeks. So that's one approach: watch one model, observe its performance, and patiently adjust it — usually because you don't have enough computational capacity to train many models at the same time.

The other approach is to train many models in parallel. You pick one setting of the hyperparameters and let the model run by itself for a day or more, and you get a learning curve like this — it could be the cost function J, or the training error, or the dev-set error, some metric whose trajectory you are tracking. At the same time you can start a second model with a different setting of the hyperparameters, which produces a different learning curve, maybe one like this (the purple curve) — I'd say that one looks better. Meanwhile you train a third model, which might produce a learning curve like this (the red curve), and another one (the green curve) that maybe diverges, like so, and so on. You can train many different models in parallel, where the orange lines are different models, try a lot of different hyperparameter settings, and at the end simply pick the one that works best — in this example, perhaps this one (the lower green curve).

I call the approach on the left the panda approach. When pandas have children, they have very few — usually only one at a time — and then they put a lot of effort into raising the baby panda to make sure it survives; that really is babysitting, one model like one baby panda. The approach on the right is more like what fish do; I call it the caviar approach. Some fish lay a hundred million eggs in one mating season, but the way fish reproduce is to produce a huge number of eggs and not pay much attention to any one of them, just hoping that one of them, or a group of them, will do well. I guess this is the difference between how mammals reproduce and how fish and many reptiles reproduce. I'll call them the panda approach and the caviar approach, because it's fun and easier to remember.

Which of these two approaches you choose is really a function of how much computational resource you have. If you have enough computers to train many models in parallel, then by all means use the caviar approach: try lots of different hyperparameter settings and see what works. But in some application domains — I've seen this in online advertising settings and in some computer vision applications — there is so much data and the models you want to train are so big that training many models at once is difficult; it really depends on the application. Where that's the case, I see organizations use the panda approach much more, where you look after one model like a baby, adjusting the parameters and trying to make it work. Although, of course, even in panda mode you might train one model, see whether it works, and maybe after the second or third week decide to start a different model (the green curve) and babysit that one instead — pandas, I suppose, can raise several children in a lifetime, even if only one, or very few, at a time.

So I hope this gives you a sense of how to carry out the hyperparameter search process. Now, there is one more technique that can make your neural network much more robust. It doesn't apply to every neural network, but when it does, it can make hyperparameter search much easier and training much faster. Let's go over that technique in the next video.

### 3.4 Normalizing activations in a network

In the rise of deep learning, one of the most important ideas has been an algorithm called **Batch** normalization, created by two researchers, **Sergey Ioffe** and **Christian Szegedy** . **Batch** normalization makes your hyperparameter search problem much easier, makes the neural network much more robust to the choice of hyperparameters — a much bigger range of hyperparameters will work well — and will also enable you to much more easily train even very deep networks. Let's take a look at how **Batch** normalization works.

When training a model such as **logistic** regression, you may remember that normalizing the input features can speed up learning: you compute the mean, subtract the mean from your training set, compute the variance, and then normalize your data set by the variance. In an earlier video we saw how this turns the contours of your learning problem from something very elongated into something more round, which makes it easier for an algorithm like gradient descent to optimize. So normalizing the input feature values works for **logistic** regression and for neural networks.

What about a deeper model? You have not just the input features x, but activations in this layer, activations in that layer, and so on. If you want to train the parameters of a later layer, say w^[3], b^[3], wouldn't it be nice if you could normalize the mean and variance of a^[2] to make that training more efficient? In the **logistic** regression example we saw how normalizing x1, x2, x3 helps you train w and b more effectively.

So the question is: for any hidden layer, can we normalize the values of a — in this case, say, a^[2], but it could be any hidden layer — so as to train w^[3], b^[3] faster? Since a^[2] is the input to the next layer, it affects the training of w^[3], b^[3]. In a nutshell, this is what **Batch** normalization does. Although, strictly speaking, what we really normalize is not a^[2] but z^[2] — there is some debate in the deep learning literature about whether you should normalize the value before the activation function, z^[2], or the value after the activation function is applied, a^[2]. In practice, normalizing z^[2] is done much more often, so that is the version I'll present, and I recommend it as the default choice. Here is how to implement **Batch** normalization.

Given some intermediate values in your neural network — suppose you have the hidden unit values z^(1) through z^(m), which come from some hidden layer l, so strictly it would be more accurate to write z^[l](i) for i from 1 to m, but I'll omit the [l] to simplify the notation on this line — you compute the mean μ = (1/m) Σ_i z^(i) (again, all of this is for layer l, with the square brackets omitted), then the variance σ² = (1/m) Σ_i (z^(i) - μ)², and then you take each value z^(i) and normalize it: z_norm^(i) = (z^(i) - μ) / √(σ² + ε), subtracting the mean and dividing by the standard deviation. A small ε is added to the denominator for numerical stability, in case σ² turns out to be zero.

So now we have normalized these values to have mean 0 and unit variance — every component of z_norm has mean 0 and variance 1. But we don't want the hidden units to always have mean 0 and variance 1; maybe it makes sense for the hidden units to have a different distribution. So what we do instead is compute z̃^(i) = γ z_norm^(i) + β, where γ and β are learnable parameters of your model: using gradient descent, or some other algorithm such as **Momentum** or **Nesterov** or **Adam** , you update γ and β just as you would update the weights of the neural network.

Note that the effect of γ and β is that they let you set the mean of z̃ to whatever you want. In fact, if γ = √(σ² + ε) — that is, γ equal to the denominator term in the expression for z_norm^(i) — and β = μ, the mean in that expression, then the effect of γ z_norm^(i) + β is to exactly invert the normalization equation: if those equalities hold, then z̃^(i) = z^(i).

So with that setting of γ and β, the normalization process — these four equations — is essentially just computing the identity function. But by choosing other values of γ and β, you can make the hidden unit values have any other mean and variance.
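The four equations can be sketched in NumPy as follows (my own minimal implementation, vectorized over a mini-batch of m examples; the layer size and the γ, β values are arbitrary):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """The four Batch Norm equations for one layer's pre-activations.

    z: shape (n_units, m) -- one column per example in the mini-batch.
    gamma, beta: learnable scale and shift, shape (n_units, 1).
    """
    mu = np.mean(z, axis=1, keepdims=True)      # mean over the mini-batch
    sigma2 = np.var(z, axis=1, keepdims=True)   # variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(sigma2 + eps)   # mean 0, variance 1
    z_tilde = gamma * z_norm + beta             # learned mean and variance
    return z_tilde, mu, sigma2

np.random.seed(4)
z = np.random.randn(3, 64) * 5 + 2              # some raw pre-activations
gamma = np.ones((3, 1)) * 1.5
beta = np.ones((3, 1)) * 0.5

z_tilde, mu, sigma2 = batch_norm_forward(z, gamma, beta)
print(np.round(z_tilde.mean(axis=1), 3))        # each unit's mean is beta
print(np.round(z_tilde.std(axis=1), 3))         # each unit's std is gamma

# Identity check: gamma = sqrt(sigma^2 + eps), beta = mu recovers z exactly.
zi, _, _ = batch_norm_forward(z, np.sqrt(sigma2 + 1e-8), mu)
assert np.allclose(zi, z)
```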

So the way you fit this into your network is: wherever you previously used the value z^(i), you now use z̃^(i) instead for the subsequent computations in the neural network. If you want to make explicit which layer it belongs to, you can write it as z̃^[l](i).

So what I hope you take away from this is a sense of how normalizing the input features helps a neural network learn, and that **Batch** normalization applies that same normalization process not just to the input layer but to values deep in a hidden layer of the network: you use **Batch** normalization to normalize the mean and variance of some hidden units' values z. One difference between the training input and these hidden unit values, though, is that you might not want the hidden unit values to be forced to mean 0 and variance 1.

For example, if you have a **sigmoid** activation function, you may not want your values always clustered around here; you may want them to have a larger variance, or a mean other than 0, in order to better exploit the nonlinearity of the **sigmoid** function rather than having all the values stay in its roughly linear regime. That is why, with the two parameters γ and β, the values z̃ can take on whatever range you want to give them: what the normalization really does is standardize the mean and variance of the hidden unit values, and then the mean and variance are controlled by the two explicit parameters γ and β, which the learning algorithm can set to anything. So the hidden unit values have a fixed mean and variance — that mean and variance can be 0 and 1, or any other value — controlled by γ and β.

I hope you have learned how to use **Batch** normalization, at least for a single layer of a neural network. In the next video, I'll show you how to fit **Batch** normalization into a neural network, even a deep neural network — how to apply it to the many different layers of a network. And after that, I'll explain why **Batch** normalization can help you train a neural network. So if the reason **Batch** normalization works still seems a bit mysterious, stay with me; in the next two videos we will figure it out.

### 3.5 Fitting Batch Norm into a neural network

You have seen the equations that implement **batch** normalization for a single hidden layer. Next, let's see how it fits into the training of a deep network.

Suppose you have a neural network like this. As I've said before, you can think of each unit as computing two things: first it computes z, then it applies the activation function to compute a; so each circle represents a two-step computation. Likewise, the next layer computes z^[2], a^[2], and so on. So if you were not applying **Batch** normalization, you would feed the input x into the first hidden layer and first compute z^[1], governed by the parameters w^[1] and b^[1], and then you would feed z^[1] into the activation function to compute a^[1]. But with **batch** normalization, what you do is take the value z^[1] and **batch** normalize it — abbreviated **BN** — a process governed by the parameters γ^[1] and β^[1]; this gives you the new normalized value z̃^[1], which you then feed into the activation function to get a^[1] = g^[1](z̃^[1]).

Now you have done the computation for the first layer, where the **Batch** normalization step happens between the computation of z^[1] and the computation of a^[1]. Next, you take the value a^[1] and use it to compute z^[2], a process governed by w^[2] and b^[2]. Similar to what you did in the first layer, you **Batch** normalize z^[2] — again abbreviated **BN** — governed by the next layer's **Batch** normalization parameters, γ^[2] and β^[2]; this gives you z̃^[2], from which you compute a^[2] through the activation function, and so on.

So the point to emphasize is that **Batch** normalization happens between computing z and computing a. The intuition is that, instead of using the un-normalized value z^[1], it is better to use the normalized value z̃^[1] — that's the first layer. And in the second layer likewise: instead of the un-normalized z^[2], use the value z̃^[2], normalized by the mean and variance. So the parameters of your network are now w^[1], b^[1] up to w^[L], b^[L], and in addition we add to this new network the parameters γ^[1], β^[1] up to γ^[L], β^[L], for every layer to which **Batch** normalization is applied. To be clear, these β's have nothing to do with the hyperparameter β from before — the latter is used for **Momentum** and for computing exponentially weighted averages, and the authors of the **Adam** paper used β1 and β2 to denote their hyperparameters, whereas the authors of the **Batch** normalization paper happened to use β for this parameter — and these are two completely different β's. I decided to use β in both cases so that you can read the original papers, but the β^[l] learned by **Batch** normalization is different from the β used in the **Momentum** , **Adam** and **RMSprop** algorithms.

So these are now the parameters of your algorithm, and you can then use whatever optimization algorithm you want — gradient descent, for example — to train them.

For example, for a given layer l, you would compute dβ^[l], and then update the parameter as β^[l] := β^[l] - α dβ^[l]. You can also use **Adam** or **RMSprop** or **Momentum** to update the parameters γ^[l] and β^[l], not just plain gradient descent.

Even though I have explained in this video how **batch** normalization works — compute the mean and variance, subtract the mean, divide by the standard deviation — if you are using a deep learning programming framework, you usually won't have to implement the **Batch** normalization step yourself; the framework can reduce it to a single line of code. For example, in the **TensorFlow** framework you can use the function tf.nn.batch_normalization to implement **Batch** normalization. We will talk more about programming frameworks later, but in practice you don't have to carry out all of these details yourself; knowing how it works just helps you better understand what the code is doing. In a deep learning framework, the **Batch** normalization process is often something like one line of code.

So far, we have talked about **Batch** normalization as if you were training on the entire training set at once, as if you were using **Batch** gradient descent.

In practice, **Batch** normalization is usually applied with **mini-batches** of your training set. The way you apply it is: you take your first **mini-batch** (X^{1}) and compute z^[1], same as we did on the previous slide, using the parameters w^[1], b^[1] on this **mini-batch** . Then **batch** normalization subtracts the mean of z^[1] computed on this **mini-batch** , divides by its standard deviation, and rescales with γ^[1], β^[1], giving you z̃^[1] — all of this on the first **mini-batch** . You then apply the activation function to get a^[1], then compute z^[2] using w^[2], b^[2], and so on. So everything you do is in order to perform one step of gradient descent on the first **mini-batch** (X^{1}).

You do similar work on the second **mini-batch** (X^{2}): you compute z^[1] and then use **batch** normalization to compute z̃^[1] — and in this **batch** normalization step, you normalize using the data in the second **mini-batch** (X^{2}) only: the mean and variance are computed on that **mini-batch** , and after rescaling by γ and β you get z̃^[1], and so on.

Then you do the same on the third **mini-batch** (X^{3}) and keep training.

Now, I want to clarify one detail about these parameters. Earlier I said the parameters of each layer are w^[l] and b^[l], plus γ^[l] and β^[l]. Note that z^[l] is computed as z^[l] = w^[l] a^[l-1] + b^[l], but what **Batch** normalization does is look at the **mini-batch** and first normalize z^[l] to mean 0 and standard variance, and then rescale by γ and β. This means that whatever the value of b^[l] is, it gets subtracted out: in the **Batch** normalization step you compute the mean of z^[l] and subtract it, so adding any constant to all the examples in the **mini-batch** changes nothing — any constant you add is cancelled by the mean subtraction.

So, if you are using **Batch** normalization, you can actually eliminate the parameter $b^{[l]}$, or equivalently set it permanently to 0. The computation then becomes $z^{[l]} = w^{[l]}a^{[l-1]}$; you compute the normalized $z^{[l]}_{\text{norm}}$, and you end up using the parameter $\beta^{[l]}$ in order to determine the shift of $\tilde z^{[l]}$. That is the reason.

So to summarize: because **Batch** normalization zeroes out the mean of the $z^{[l]}$ values in a layer, the parameter $b^{[l]}$ is meaningless, so you can remove it and replace it with $\beta^{[l]}$ — the parameter that now controls the shift or bias terms.
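
Reconstructing the stripped formulas in the course's usual notation, the point is:

$$z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}, \qquad \mu = \frac{1}{m}\sum_{i} z^{[l](i)}$$

Since every $z^{[l](i)}$ in the **mini-batch** contains the same constant $b^{[l]}$, the difference $z^{[l](i)} - \mu$ is independent of $b^{[l]}$. So one simply sets

$$z^{[l]} = W^{[l]}a^{[l-1]}, \qquad \tilde z^{[l]} = \gamma^{[l]}\, z^{[l]}_{\text{norm}} + \beta^{[l]},$$

with $\beta^{[l]}$ playing the role the bias used to play.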

Finally, remember the dimensions. In this example, $z^{[l]}$ has dimension $(n^{[l]}, 1)$, and so does $b^{[l]}$, where $n^{[l]}$ is the number of hidden units in layer $l$. The dimensions of $\beta^{[l]}$ and $\gamma^{[l]}$ are also $(n^{[l]}, 1)$, because they are used to scale the mean and variance of each hidden unit of the layer to whatever values the network wants.

Let us sum up how to apply gradient descent with **Batch** normalization. Assuming you are using **mini-batch** gradient descent, you run a **for** loop over the number of **mini-batches** $t = 1, \dots$. On **mini-batch** $X^{\{t\}}$ you apply forward **prop**, with every hidden layer using **Batch** normalization to replace $z^{[l]}$ with $\tilde z^{[l]}$. This ensures that within the **mini-batch**, the $z$ values have a normalized mean and variance before rescaling. Next, you use back **prop** to compute $dw^{[l]}$, $d\beta^{[l]}$, and $d\gamma^{[l]}$ for all the parameters of every layer $l$ (strictly speaking $db^{[l]}$ as well, but since $b^{[l]}$ has been removed, that part actually drops out). Finally, you update the parameters: $w^{[l]} := w^{[l]} - \alpha\, dw^{[l]}$ as before, $\beta^{[l]} := \beta^{[l]} - \alpha\, d\beta^{[l]}$, and similarly for $\gamma^{[l]}$.
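
The loop above can be sketched end-to-end on a toy problem. The following is a minimal sketch, not the course's assignment code: a single linear layer (with no bias, as discussed) followed by **Batch** normalization and a sigmoid output, trained with plain gradient descent on synthetic data; the standard **Batch** norm backward-pass formula is used for $dZ$:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 2 features, one mini-batch of 256 examples, binary labels
X = rng.normal(size=(2, 256))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

W = rng.normal(size=(1, 2)) * 0.1           # note: no bias b -- beta replaces it
gamma, beta = np.ones((1, 1)), np.zeros((1, 1))
eps, alpha, m = 1e-8, 0.5, X.shape[1]
losses = []

for step in range(200):
    # forward prop, with Batch Norm replacing z by z_tilde
    Z = W @ X
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    std = np.sqrt(var + eps)
    Z_norm = (Z - mu) / std
    Z_tilde = gamma * Z_norm + beta
    A = 1.0 / (1.0 + np.exp(-Z_tilde))      # sigmoid output unit

    losses.append(float(-(Y * np.log(A + 1e-12)
                          + (1 - Y) * np.log(1 - A + 1e-12)).mean()))

    # back prop: dL/dZ_tilde for sigmoid + cross-entropy, averaged over batch
    dZt = (A - Y) / m
    dgamma = (dZt * Z_norm).sum(axis=1, keepdims=True)
    dbeta = dZt.sum(axis=1, keepdims=True)
    # standard Batch Norm backward pass through the normalization
    dZn = dZt * gamma
    dZ = (dZn - dZn.mean(axis=1, keepdims=True)
          - Z_norm * (dZn * Z_norm).mean(axis=1, keepdims=True)) / std
    dW = dZ @ X.T

    # update W, gamma, beta (there is no db to update)
    W -= alpha * dW
    gamma -= alpha * dgamma
    beta -= alpha * dbeta

accuracy = float(((A > 0.5) == (Y > 0.5)).mean())
```

The loss decreases over the 200 steps, and $\beta$ plays the role the bias would otherwise have played.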

If you compute the gradients this way, you can use gradient descent — that is what I have written here — but this works equally well with gradient descent with **Momentum**, **RMSprop**, or **Adam**. Instead of plain gradient descent on each **mini-batch**, you can use any of these other optimization algorithms, which we discussed in previous weeks' videos, to update the $\beta$ and $\gamma$ parameters that **Batch** normalization adds to the algorithm.

I hope this teaches you how to implement **Batch** normalization from scratch, if you want to. If you use one of the deep learning programming frameworks, which we will talk about later, you can usually just call someone else's implementation, which makes **Batch** normalization easy to use.

Now, in case **Batch** normalization still seems a bit mysterious — especially if you still do not know why it speeds up training so significantly — let's move on to the next video and discuss in detail why **Batch** normalization works so well and what it is really doing.

### 3.6 Why does Batch Norm work?

Why does **batch** normalization work?

One reason is one you have already seen: normalizing the input features $x$ to mean 0 and variance 1 speeds up learning. Rather than having some features range from 0 to 1 and others from 1 to 1000, normalizing all the input features to a similar range of values speeds up learning. So one intuition for why **Batch** normalization works is that it is doing the same kind of thing — not just for the input features, but for the values in the hidden units as well. But this is only the tip of the iceberg; there are some deeper principles that will give you an even better understanding of what **Batch** normalization is doing. Let's take a look.

A second reason **Batch** normalization works is that it makes weights in later or deeper layers of your network more robust to changes in earlier layers. For example, the weights of layer 10 become better able to withstand changes to the weights of an earlier layer of the neural network, say layer 1. To explain what I mean, let's look at this vivid example.

Here is the training of a network — maybe a shallow network, like **logistic** regression, or a deep network built for our famous cat-face detection task. Suppose you have trained your classifier on a data set consisting only of images of black cats. If you now try to apply this network to colored cats — where the positive examples are no longer just the black cats on the left, but also the cats of other colors on the right — your classifier may not do very well.

If, drawn as a picture, your training set looks like this — positive examples here and negative examples there (left image) — but the data you want to generalize to has positive examples here and negative examples there (right image), you cannot expect a model trained well on the left to also run well on the right, even if there exists a single function that works well on both. And you would not want your learning algorithm to have to discover the green decision boundary just by looking at the data on the left.

This idea of the distribution of your data changing goes by the somewhat strange name **covariate shift**. The idea is this: if you have learned some $x$ to $y$ mapping, and the distribution of $x$ changes, then you may need to retrain your learning algorithm. This holds even if the ground-truth function mapping $x$ to $y$ remains unchanged — as in this example, where the true function is whether a picture is a cat — and the need to retrain becomes even more acute if the true function changes as well.

How does the problem of **covariate shift** apply to a neural network? Imagine a deep network like this, and let's look at the learning process from the perspective of one layer — say, the third hidden layer. This network has learned the parameters $w^{[3]}$ and $b^{[3]}$. From the perspective of the third hidden layer, it gets some values from the earlier layers, and then it has to do something with them so that the output $\hat y$ ends up close to the true value $y$.

Let me cover up the left part for a moment. From the perspective of the third hidden layer, it receives some values — call them $a^{[2]}_1, a^{[2]}_2, a^{[2]}_3, a^{[2]}_4$ — which it might as well treat as input features. The job of the third hidden layer is to take these values and find a way to map them to $\hat y$. You can imagine the later layers learning the parameters $w^{[3]}, b^{[3]}$, $w^{[4]}, b^{[4]}$, $w^{[5]}, b^{[5]}$, so that the network does a good job mapping from the values I drew in black on the left to the output $\hat y$.

Now let us uncover the left side of the network. The network also has parameters $w^{[1]}, b^{[1]}$ and $w^{[2]}, b^{[2]}$, and if these parameters change, the values of $a^{[2]}$ will change too. So from the perspective of the third hidden layer, the values of its input hidden units are changing all the time, and it therefore suffers from the problem of **covariate shift** that we talked about on the previous slide.

What **Batch** normalization does is reduce the amount by which the distribution of these hidden unit values shifts around. If we were to plot the distribution of these hidden unit values — really the normalized $z$ values, say $z^{[2]}_1$ and $z^{[2]}_2$; I'll plot two values instead of four so we can visualize this in **2D** — **Batch** normalization is saying that the values of $z^{[2]}_1$ and $z^{[2]}_2$ can change, and they will indeed change as the neural network updates the parameters in the earlier layers. But **Batch** normalization ensures that no matter how they change, their mean and variance stay the same: mean 0 and variance 1, or not necessarily mean 0 and variance 1, but whatever values are determined by $\beta^{[2]}$ and $\gamma^{[2]}$. If the neural network so chooses, it can force them to be mean 0 and variance 1, or any other mean and variance. What this does is limit the extent to which updating the parameters in the earlier layers can shift the distribution of values that the third layer sees, and therefore has to learn on.

**Batch** normalization reduces the problem of these input values changing. It makes them more stable, so the later layers of the neural network have more solid ground to stand on. Even when the input distribution changes a little, it changes less, and what happens is that even as the earlier layers keep learning, the amount the later layers have to adapt is reduced. You can think of it this way: it weakens the coupling between what the earlier layers' parameters do and what the later layers' parameters do. It allows each layer to learn somewhat independently of the other layers, and this helps speed up learning in the whole network.

So, I hope this gives you better intuition. The takeaway is that **Batch** normalization means that, especially from the perspective of one of the later layers of the neural network, the earlier layers do not shift around as much, because they are constrained to keep the same mean and variance — and this makes the job of learning in the later layers easier.

**Batch** normalization has one more effect: a slight regularization effect. One slightly non-intuitive thing about **Batch** normalization is that each **mini-batch** has its $z^{[l]}$ values scaled by the mean and variance computed on just that **mini-batch**. Because the mean and variance are computed on the **mini-batch** rather than on the entire data set, they contain a little noise — they are estimated from only, say, 64 or 128 or 256 or however many training examples. So the scaling process from $z^{[l]}$ to $\tilde z^{[l]}$ is also somewhat noisy, because it is computed using a mean and variance that are themselves slightly noisy, being estimated from a small portion of the data.

So, similar to **dropout**, **Batch** normalization adds some noise to each hidden layer's activations. **Dropout** adds noise in a particular way: it multiplies each hidden unit by 0 with some probability and by 1 otherwise, so **dropout** has multiplicative noise because it multiplies by 0 or 1.

In contrast, **Batch** normalization contains several sources of noise: multiplicative noise from scaling by the standard deviation, and additive noise from subtracting the mean — the estimates of the mean and standard deviation here are both noisy. So, similar to **dropout**, **Batch** normalization has a slight regularization effect: because noise is added to the hidden units, it forces the downstream units not to rely too heavily on any one hidden unit. Because the added noise is quite small, this is not a huge regularization effect, and you can use **Batch** normalization together with **dropout** if you want the more powerful regularization effect of **dropout**.

Perhaps another slightly non-intuitive effect is that if you use a larger **mini-batch** size — say 512 instead of 64 — you reduce the noise and therefore reduce the regularization effect. This is a strange property of this kind of noise-based regularization, shared with **dropout**: using a larger **mini-batch** size reduces the regularization effect.

Having said that, I would not turn to **Batch** normalization as a form of regularization. That is really not its purpose, though it sometimes has this additional, intended or unintended, effect on your learning algorithm. So do not think of **Batch** normalization as a regularizer; think of it as a way to normalize your hidden unit activations and thereby speed up learning — I think of the regularization as an almost unintended side effect.

So I hope this gives you a better understanding of what **Batch** normalization is doing. Before we wrap up the discussion of **Batch** normalization, there is one more detail I want to make sure you know. **Batch** normalization handles one **mini-batch** of data at a time; it computes the mean and variance on that **mini-batch**. But at test time, when you try to make predictions and evaluate the neural network, you may not have a **mini-batch** of examples — you may be processing one single example at a time. So at test time you need to do something slightly different to make sure your predictions make sense.

In the next and final video on **Batch** normalization, let's talk about the details you need to take care of so that your neural network can apply **Batch** normalization at prediction time.

### 3.7 Batch Norm at test time

**Batch** normalization processes your data one **mini-batch** at a time, but at test time you may need to process examples one at a time. Let's see how to adapt your network to do that.

Recall that during training, these are the equations used to perform **Batch** normalization. Within one **mini-batch**, you sum up the $z^{(i)}$ values of that **mini-batch** to compute the mean — so here you are only adding up the examples in that one **mini-batch**, and I use $m$ to denote the number of examples in the **mini-batch**, not the entire training set. Then you compute the variance, and then compute $z^{(i)}_{\text{norm}}$ by scaling with the mean and standard deviation, with $\epsilon$ added for numerical stability. Finally, $\tilde z^{(i)}$ is obtained by rescaling $z^{(i)}_{\text{norm}}$ with $\gamma$ and $\beta$.
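
For reference, the training-time equations referred to here, written out in the course's standard notation with $m$ the **mini-batch** size, are:

$$\mu = \frac{1}{m}\sum_{i} z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i}\left(z^{(i)} - \mu\right)^2$$

$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \tilde z^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta$$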

Please note that $\mu$ and $\sigma^2$, which you need for the scaling, are computed on the entire **mini-batch** — but at test time you might not have a **mini-batch** of 64, 128, or 256 examples to process at the same time, so you need some other way of getting $\mu$ and $\sigma^2$. And if you have just one example, the mean and variance of a single example do not make sense. So what is actually done, in order to use your neural network at test time, is to come up with separate estimates of $\mu$ and $\sigma^2$. In typical implementations of **Batch** normalization, you estimate them using an exponentially weighted average, where the average is taken across the **mini-batches**. Let me describe this in detail.

Pick some layer $l$, and suppose we are going through **mini-batches** $X^{\{1\}}, X^{\{2\}}, X^{\{3\}}, \dots$ with the corresponding labels $Y^{\{1\}}, Y^{\{2\}}, \dots$ and so on. When training on $X^{\{1\}}$ for layer $l$, you get some mean — I will write it as $\mu^{\{1\}[l]}$, for the first **mini-batch** and this layer. When you train on the second **mini-batch**, for this layer and this **mini-batch** you get a second value $\mu^{\{2\}[l]}$, and then on the third **mini-batch** of this hidden layer you get a third value $\mu^{\{3\}[l]}$. Just as we used exponentially weighted averages to compute the running mean of $\theta_1, \theta_2, \theta_3$ when we were estimating the current temperature, you keep a running average of the mean vector you have seen for this layer, and this exponentially weighted average becomes your estimate of the mean of $z$ in this hidden layer. Similarly, you use an exponentially weighted average to track the $\sigma^{2\{1\}[l]}$ you see on the first **mini-batch** of this layer, the $\sigma^{2\{2\}[l]}$ you see on the second **mini-batch**, and so on. So while training the neural network across different **mini-batches**, you get a running, real-time estimate of the mean and variance that each layer is seeing.
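
A minimal **NumPy** sketch of this bookkeeping follows (the decay constant 0.9 and the toy distribution are my own choices for illustration; frameworks pick their own defaults):

```python
import numpy as np

rng = np.random.default_rng(1)
decay = 0.9                          # decay for the exponentially weighted averages
running_mu = np.zeros((4, 1))        # one running mean per hidden unit of the layer
running_var = np.ones((4, 1))        # one running variance per hidden unit

# simulate the per-mini-batch statistics seen while training
for t in range(1000):
    Z = rng.normal(loc=2.0, scale=3.0, size=(4, 64))  # one mini-batch of z values
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = decay * running_mu + (1 - decay) * mu
    running_var = decay * running_var + (1 - decay) * var

# at test time, a single example is normalized with the running statistics
z_single = rng.normal(loc=2.0, scale=3.0, size=(4, 1))
z_norm = (z_single - running_mu) / np.sqrt(running_var + 1e-8)
```

The running averages settle near the true statistics of the simulated activations (mean 2, variance 9), so a lone test example can be normalized without a **mini-batch**.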

Finally, at test time, corresponding to the equation $z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$, you just compute $z_{\text{norm}}$ for your single example using your exponentially weighted averages of $\mu$ and $\sigma^2$ — whatever the latest values at hand are — and then use the $\gamma$ and $\beta$ parameters you learned during neural network training to compute $\tilde z$ for your test example.

To sum up: during training, $\mu$ and $\sigma^2$ are computed on an entire **mini-batch** of, say, 64 or 128 or some other number of examples, but at test time you may need to process examples one at a time. The way to do that is to estimate $\mu$ and $\sigma^2$ from your training set, and there are many ways to do it. In theory, you could run your whole training set through your final network to get $\mu$ and $\sigma^2$, but in practice what people usually do is track an exponentially weighted average — sometimes also called a moving average — of the $\mu$ and $\sigma^2$ values seen during training, and then use those values at test time to scale the hidden unit $z$ values you need. In practice, this process is fairly robust to exactly how you estimate $\mu$ and $\sigma^2$, so I would not worry too much about your specific way of doing it; and if you are using a deep learning framework, it will usually have a default way of estimating $\mu$ and $\sigma^2$ that should work reasonably well. Any reasonable way of estimating the mean and variance of your hidden unit $z$ values should work fine at test time.

That's all for **Batch** normalization. Using it, you will be able to train deeper networks and get your learning algorithms to run faster. Before ending this week's course, I want to share with you a few more thoughts about deep learning frameworks — let's discuss that topic together in the next video.

### 3.8 Softmax regression

So far, the classification examples we have talked about have used binary classification, where the labels take only two possible values, 0 or 1: is this a cat or is it not a cat? What if we have multiple possible classes? There is a generalization of **logistic** regression called **Softmax** regression that lets you make predictions over one of multiple classes, not just recognize two classes. Let's take a look.

Suppose that instead of just recognizing cats, you want to recognize cats, dogs, and baby chicks. I will call cats class 1, dogs class 2, and baby chicks class 3, and if an image belongs to none of the above, I will put it in an "other" or "none of the above" class, which I call class 0. The pictures shown here illustrate the classes: this picture is a baby chick, so it is class 3; cats are class 1; dogs are class 2; I guess this is a koala, which matches none of the above, so that is class 0; the next is class 3; and so on. As notation, I will use a capital $C$ to denote the number of classes your inputs can fall into. In this example we have $C = 4$ possible classes, including "other" or "none of the above". When there are 4 classes, the class labels run from 0 to $C - 1$ — in other words, 0, 1, 2, 3.

In this example, we will build a neural network whose output layer has 4 output units. So $n^{[L]}$, the number of units in the output layer $L$, equals 4, or in general equals $C$. We want the units of the output layer to tell us the probability of each of these 4 classes: the first node here (the first output node of the final layer) should output — or we want it to output — $P(\text{other} \mid x)$, the probability of the "other" class given the input $x$. The second node outputs $P(\text{cat} \mid x)$, the probability of a cat given the input $x$. The third outputs $P(\text{dog} \mid x)$, and the fourth the probability of a baby chick given $x$ — I abbreviate baby chick as **bc**. So the output $\hat y$ here will be a $4 \times 1$ dimensional vector, because it must output four numbers giving these four probabilities — and since they are probabilities, the four numbers in the output should sum to 1.

The standard way to get your network to do this is to use a **Softmax** layer as the output layer to generate these outputs. Let me write down the formulas; then we will come back and get a little intuition for what **Softmax** is doing.

In the final layer of the neural network, you compute the linear part of the layer as usual: $z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$ — remember this is the $z$ variable of the final layer, capital $L$. Having computed $z^{[L]}$, you then apply the **Softmax** activation function, which is somewhat different from the activation functions you have seen, and works like this. First, we compute a temporary variable, which we call $t$, equal to $e^{z^{[L]}}$, applied element-wise. Since $z^{[L]}$ here is, in our example, a $4 \times 1$ vector, $t$ — the element-wise exponential of $z^{[L]}$ — is also a $4 \times 1$ vector. The output $a^{[L]}$ is then basically the vector $t$, but normalized so that its entries sum to 1. So, in other words, $a^{[L]} = \frac{t}{\sum_{j=1}^{4} t_j}$ is also a $4 \times 1$ vector, and its $i$-th element — let me write this down — is $a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{4} t_j}$. In case this calculation is not clear, we will go through a concrete example in a second.

Let's work through an example in detail. Suppose you have computed $z^{[L]}$, a four-dimensional vector: say $z^{[L]} = \begin{bmatrix} 5 \\ 2 \\ -1 \\ 3 \end{bmatrix}$. What we do is compute $t = e^{z^{[L]}}$ with element-wise exponentiation, so if you punch this into a calculator you get $t = \begin{bmatrix} e^5 \\ e^2 \\ e^{-1} \\ e^3 \end{bmatrix} = \begin{bmatrix} 148.4 \\ 7.4 \\ 0.4 \\ 20.1 \end{bmatrix}$. To go from the vector $t$ to the vector $a^{[L]}$, we just normalize these entries so that they sum to 1. If you add up all four numbers, you get 176.3; finally, $a^{[L]} = \frac{t}{176.3}$.
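
The arithmetic in this example can be checked in a couple of lines of **NumPy**:

```python
import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])   # z^[L] from the example
t = np.exp(z)                          # element-wise exponentiation
a = t / t.sum()                        # normalize so the four entries sum to 1
# t is about [148.4, 7.4, 0.4, 20.1], t.sum() is about 176.3,
# and a is about [0.842, 0.042, 0.002, 0.114]
```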

For example, the first node here outputs $\frac{e^5}{176.3} = 0.842$. So for this image, if this is the $z^{[L]}$ you get, the probability of class 0 is 84.2%. The next node outputs $\frac{e^2}{176.3} = 0.042$, a 4.2% chance. The next one is $\frac{e^{-1}}{176.3} = 0.002$. And the last one is $\frac{e^3}{176.3} = 0.114$ — an 11.4% probability of class 3, the baby chick class, right? So these are the chances of the image belonging to class 0, class 1, class 2, or class 3.

The output of the neural network $a^{[L]}$, which is also $\hat y$, is a $4 \times 1$ dimensional vector whose elements are the four numbers we just computed. So this algorithm computes, via the vector $t$, four probabilities that sum to 1.

If we summarize the computation from $z^{[L]}$ to $a^{[L]}$ — taking the element-wise powers, getting the temporary variable $t$, and then normalizing — we can wrap the whole thing up into a **Softmax** activation function and write $a^{[L]} = g^{[L]}(z^{[L]})$. What is unusual about this activation function is that it takes a $4 \times 1$ vector as input and outputs a $4 \times 1$ vector. Previously, our activation functions took single real-number inputs: the **Sigmoid** and **ReLU** activation functions take a real number in and put a real number out. The special feature of the **Softmax** activation function is that, because it needs to normalize across all the possible outputs, it takes a vector in and puts a vector out.

So what can a **Softmax** classifier represent? Let me give you a few examples. Suppose you have two inputs $x_1, x_2$ that feed directly into a **Softmax** layer with three or four or more output nodes that output $\hat y$ — a neural network with no hidden layer. All it does is compute $z^{[1]} = W^{[1]} x + b^{[1]}$ and output $a^{[1]} = \hat y = g(z^{[1]})$, the **Softmax** activation function applied to $z^{[1]}$. This neural network with no hidden layer should give you a sense of what the **Softmax** function can represent.

In this example (left image), the raw inputs are only $x_1$ and $x_2$, and a **Softmax** layer with three output classes can represent this kind of decision boundary. Note that these are several linear decision boundaries, but they still allow the data to be divided into three classes. In this chart, what we did was take the training set shown in the figure and train a **Softmax** classifier with the data's three output labels. The colors in the figure show the classifier's output, with each input colored according to which of the three outputs has the highest probability. So we can see that this is a generalization of **logistic** regression, with similar linear decision boundaries, but with more than two classes: instead of just 0 and 1, the class labels can be 0, 1, or 2.

Here is (middle image) another example of the decision boundaries a **Softmax** classifier can represent when trained on a data set with three classes, and here is one more (right image). Right — but the intuition is that the decision boundary between any two classes is linear. That is why you see, for example, that the decision boundary between the yellow and red classes is linear, the boundary between purple and red is also linear, and the boundary between purple and yellow is a linear decision boundary as well — yet the classifier can use these different linear functions to split the space into three classes.

Let's look at examples with more classes. In this example (left image) there are more classes, including the green class, and **Softmax** can still represent these kinds of linear decision boundaries between multiple classes; the middle and right images show examples with still more classes. This illustrates what a **Softmax** classifier with no hidden layer can do. Of course, a deeper neural network with $x$, then some hidden units, then more hidden units, and so on, can learn much more complex non-linear decision boundaries to separate multiple classes.

I hope you now understand what the **Softmax** layer, or the **Softmax** activation function, in a neural network does. In the next video, let's look at how you can train a neural network that uses a **Softmax** layer.

### 3.9 Training a Softmax classifier

In the previous video, we learned about the **Softmax** layer and the **Softmax** activation function. In this video, you will deepen your understanding of **Softmax** classification and learn how to train a model that uses a **Softmax** layer.

Recall our earlier example, in which the output layer computes $z^{[L]}$ as follows. With four classes, $z^{[L]}$ can be a $4 \times 1$ dimensional vector; we compute the temporary variable $t = e^{z^{[L]}}$ with element-wise exponentiation, and finally, if the activation function of your output layer is the **Softmax** activation function, the output is $a^{[L]} = \hat y = \frac{t}{\sum_j t_j}$.

Simply put, the temporary variable $t$ is normalized so that its entries sum to 1, and that becomes $a^{[L]}$. Notice that the largest element of the vector $z^{[L]}$ is 5, and the largest probability ends up being the first one.

The name **Softmax** comes from a contrast with what is called a **hardmax**. **Hardmax** maps the vector $z$ to the vector $\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}$: it looks at the elements of $z$ and puts a 1 in the position of the largest element and 0 everywhere else. So that is a very hard max — the output for the biggest element is 1 and everything else is 0 — whereas the mapping from $z$ to these probabilities computed by **Softmax** is gentler. I don't know if it is a great name, but at least that is the intuition behind the name **Softmax**, as opposed to **hardmax**.
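
A two-line contrast of the two mappings on the same $z^{[L]}$ (the variable names `hard` and `soft` are mine):

```python
import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])
hard = (z == z.max()).astype(float)   # hardmax: 1 at the largest entry, 0 elsewhere
soft = np.exp(z) / np.exp(z).sum()    # softmax: every class keeps some probability
```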

One thing I did not go into detail on, but mentioned earlier, is that **Softmax** regression, or the **Softmax** activation function, generalizes the **logistic** activation function to $C$ classes rather than just two. It turns out that if $C = 2$, **Softmax** effectively reduces to **logistic** regression. I will not prove this in this video, but the rough outline of the proof is this: if $C = 2$ and you apply **Softmax**, the output layer $a^{[L]}$ will output two numbers — maybe 0.842 and 0.158, right? These two numbers must always sum to 1, so they are redundant: you do not really need to compute both, only one of them, and the way you end up computing that one number reduces to the way **logistic** regression computes its single output. This is not a proof, but the conclusion we can draw is that **Softmax** regression generalizes **logistic** regression to more than two classes.

Next, let's look at how to train a neural network with a **Softmax** output layer. Specifically, let's first define the loss function used to train the network. Take one example in your training set whose target output — the ground-truth label — is $y = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$. Using the example from the previous video, this means it is a picture of a cat, since it belongs to class 1. Now suppose your neural network outputs $\hat y = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.1 \\ 0.4 \end{bmatrix}$, a vector of probabilities summing to 1 — you can check that they sum to 1. The neural network is doing badly on this example: this is actually a cat, but it assigns only a 20% probability to the cat class, so it performs poorly here.

So what loss function would you want to use to train this neural network? In **Softmax** classification, the loss we typically use is $L(\hat y, y) = -\sum_{j=1}^{4} y_j \log \hat y_j$. Let's look at the single example above to understand what happens. Notice that in this example $y_1 = y_3 = y_4 = 0$ and only $y_2 = 1$, so if you look at this summation, all the terms with a 0 value of $y_j$ drop out, and in the end you are left with only $-y_2 \log \hat y_2 = -\log \hat y_2$ — because when you add up the terms over the index $j$, every term is 0 except the one with $j = 2$, where $y_2 = 1$.

This means that if your learning algorithm is trying to make this loss small — and gradient descent is used to reduce the loss on the training set — the only way to make $-\log \hat y_2$ small is to make $\hat y_2$ large. Since these are probabilities, $\hat y_2$ can never be greater than 1, but this makes sense: in this example, $x$ is a picture of a cat, and you want the output probability of the cat class ($\hat y_2$, the second element of $\hat y$) to be as large as possible.
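
For the poorly-performing prediction above, the loss works out as follows:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0, 0.0])        # ground truth: class 1 (cat)
y_hat = np.array([0.3, 0.2, 0.1, 0.4])    # the network's prediction

# L(y_hat, y) = -sum_j y_j * log(y_hat_j); only the true class's term survives
loss = -np.sum(y * np.log(y_hat))
# here this equals -log(0.2), about 1.61
```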

More generally, what this loss function does is look at whatever is the true class in your training set and try to make the corresponding probability of that class as high as possible. If you are familiar with maximum likelihood estimation in statistics, this is in fact a form of maximum likelihood estimation. But if you do not know what that means, do not worry — the algorithmic intuition we just went through is enough.

That is the loss on a single training example. What about the cost $J$ on the entire training set — the cost of your settings of the parameters $W$, $b$, and so on? As you might guess, it is defined as the average of your learning algorithm's losses over all the training examples: $J = \frac{1}{m} \sum_{i=1}^{m} L(\hat y^{(i)}, y^{(i)})$.

So all you have to do is use gradient descent to minimize this cost $J$.

Finally, one implementation detail. Note that since $y$ is a $4 \times 1$ vector, and $\hat y$ is also a $4 \times 1$ vector, in a vectorized implementation you stack them into matrices. For example, if the example above is your first training example, then $Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}$, which ends up being a $4 \times m$ dimensional matrix. Similarly, $\hat Y = \begin{bmatrix} \hat y^{(1)} & \hat y^{(2)} & \cdots & \hat y^{(m)} \end{bmatrix}$ — the stacked outputs on the training examples — is also a $4 \times m$ dimensional matrix.

Finally, let's see how to implement gradient descent when you have a **Softmax** output layer. The output layer computes $z^{[L]}$, which is $C \times 1$ — in our example, $4 \times 1$ — and you then apply the **Softmax** activation function to get $a^{[L]}$, or $\hat y$, from which the loss can be computed. We have already talked about how to implement the forward-propagation steps of the neural network to get these outputs and compute the loss. What about back propagation, for gradient descent? It turns out that the key step, or key equation, you need to initialize back propagation is the expression $dz^{[L]} = \hat y - y$. You subtract the $4 \times 1$ vector $y$ from the $4 \times 1$ vector $\hat y$, so as you can see $dz^{[L]}$ is also a $4 \times 1$ vector. In general, this matches our usual definition: $dz^{[L]}$ denotes the partial derivative of the cost with respect to $z^{[L]}$, i.e. $\frac{\partial J}{\partial z^{[L]}}$. If you are comfortable with calculus, you can try to derive it yourself — but if you just need the formula, using it as given works just as well when implementing from scratch.

With this, you can compute $dz^{[L]}$ and then carry out the back-propagation process to compute all the derivatives you need throughout the neural network.
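
The claim $dz^{[L]} = \hat y - y$ can also be sanity-checked numerically. Here is a small finite-difference check (the helper functions `softmax` and `loss` are my own):

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())        # shift by the max for numerical stability
    return t / t.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([5.0, 2.0, -1.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 0.0])

dz_analytic = softmax(z) - y       # the expression from the lecture: dz = y_hat - y

# central finite-difference estimate of each component of dL/dz
eps = 1e-6
dz_numeric = np.array([
    (loss(z + eps * e, y) - loss(z - eps * e, y)) / (2 * eps)
    for e in np.eye(4)
])
```

The two gradients agree to numerical precision, confirming the analytic expression.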

But in this week's programming exercise, we will start using a deep learning programming framework. With these frameworks, you usually only need to focus on getting forward propagation right: as long as you specify the forward pass to the framework, it will figure out how to do back propagation and implement the backward pass for you. So this expression ($dz^{[L]} = \hat y - y$) is worth remembering in case you ever need to implement **Softmax** regression or **Softmax** classification from scratch — but you will not actually need it in this week's exercise, because the programming framework takes care of the derivative computations for you.

**That's it for Softmax** classification. With it, you can use the learning algorithm to classify inputs into not just one of two categories, but one of $C$ different categories. Next, I want to show you some deep learning programming frameworks that can make you more efficient at implementing deep learning algorithms. Let us discuss them in the next video.

### 3.10 Deep Learning frameworks

You have learned to implement deep learning algorithms using **Python** and **NumPy** almost from scratch, and I am glad you did, because I hope you understand what these deep learning algorithms are actually doing. But you will find that once you move on to more complex models, such as convolutional neural networks or recurrent neural networks, or to very large models, implementing everything from scratch becomes less and less practical; at least for most people, it is not realistic.

Fortunately, there are many good deep learning software frameworks that can help you implement these models. By analogy, I guess you know how to multiply matrices, and you could program the multiplication of two matrices yourself, but when you are building a large application, you probably don't want to use your own matrix multiplication function; instead you call a numerical linear algebra library, which is more efficient, although it is still useful to understand how two matrices are multiplied. I think deep learning is now mature enough that using a deep learning framework is the more practical choice and will make your work more effective. Let's take a look at the frameworks.

There are many deep learning frameworks that make it easier to implement neural networks; let's talk about the main ones. Each framework is aimed at a certain group of users or developers, and I think each framework here is a reliable choice for a certain type of application. Many people have written articles comparing these deep learning frameworks and how well they are developing, and because these frameworks tend to evolve continuously and improve every month, I leave it to you to search online for discussions of their advantages and disadvantages. Many frameworks are progressing quickly and getting better and better, so rather than make a strong recommendation, let me share with you the criteria I would use to recommend a framework.

An important criterion is ease of programming, which includes both developing and iterating on the neural network, and deploying it in production for actual use by thousands or even hundreds of millions of users, depending on what you want to do.

The second important criterion is running speed, especially when training large data sets. Some frameworks allow you to run and train neural networks more efficiently.

There is another criterion that people don't often mention, but that I think is very important: whether the framework is truly open. For a framework to be truly open, it not only needs to be open source, but also needs good governance. Unfortunately, in the software industry, some companies have a history of open-sourcing software while keeping full control over it; after a few years, once people have come to depend on the software, some of these companies gradually close off resources that were once open, or move functionality into their proprietary cloud services. So one thing I pay attention to is whether you can trust that the framework will remain open source for a long time, rather than being under the control of a single company that, for whatever reason, may choose to stop open-sourcing it in the future, even if the software is currently released in open-source form. At least in the short term, the choice also depends on your language preference, whether you prefer **Python**, **Java**, **C++**, or something else, and on the application you are developing, whether it is computer vision, natural language processing, online advertising, and so on. I think multiple frameworks here are good choices.

That's it for programming frameworks. By providing a higher level of abstraction than a numerical linear algebra library, each of these frameworks can make you more efficient as you develop machine learning applications.

### 3.11 TensorFlow

Welcome to the last video of this week. There are many great deep learning programming frameworks; one of them is **TensorFlow**, and I look forward to helping you start learning to use it. In this video I want to show you the basic structure of a **TensorFlow** program, and then leave you to practice on your own and learn more details by applying it to this week's programming exercises. The programming exercises will take some time to do, so please make sure to leave yourself some spare time.

Let me start with a motivating problem. Suppose you have a loss function $J$ that needs to be minimized. In this example, I will use the highly simplified loss function $J(w) = w^2 - 10w + 25$. Maybe you have noticed that this function is actually $(w - 5)^2$: if you expand the quadratic, you get the expression above, so the value of $w$ that minimizes it is $w = 5$. But suppose we don't know that, and you only have this function; let's see how to minimize it with **TensorFlow**, because a very similar program structure can be used to train neural networks, where the loss function can be some complex function of all the parameters of the network. Similarly, you can then use **TensorFlow** to automatically find the parameter values that minimize the loss. But let's start with this simpler example.
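For reference (a short derivation of my own, not from the transcript): the derivative of this cost, which TensorFlow will effectively compute for us, is

$$\frac{dJ}{dw} = 2w - 10,$$

so each gradient-descent step with learning rate $\alpha = 0.01$ performs the update

$$w := w - 0.01\,(2w - 10).$$

Starting from $w = 0$, the first step gives $w = 0 - 0.01 \times (-10) = 0.1$, and repeated steps move $w$ toward the minimizer $w = 5$.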

I run **Python** in my **Jupyter notebook**:

```
import numpy as np
import tensorflow as tf   # import TensorFlow

# Define the parameter w. In TensorFlow, you define a parameter with tf.Variable().
w = tf.Variable(0, dtype=tf.float32)

# Define the loss function J(w) = w^2 - 10w + 25.
cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)

# Use a learning rate of 0.01; the goal is to minimize the loss.
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

# The last few lines below are idiomatic expressions:
init = tf.global_variables_initializer()
session = tf.Session()   # this opens a TensorFlow session
session.run(init)        # initialize the global variables

# The lines above initialize w to 0, define the loss function, and define
# train as the learning algorithm (a gradient-descent optimizer minimizing
# the loss), but we have not actually run the learning algorithm yet.
# To have TensorFlow evaluate w, we use:
print(session.run(w))
```

So if we run this, it evaluates `w` to 0, because we haven't run the learning algorithm yet.

Now let us enter:

```
session.run(train)  # runs one step of the gradient descent method
```

Next, after running one step of gradient descent, let us evaluate the value of `w` again:

```
print(session.run(w))  # after one step of gradient descent, w is now 0.1
```

Now we run 1000 iterations of gradient descent:

After running 1000 iterations of gradient descent, `w` comes out as 4.99999. Remember we said the function is minimized at $w = 5$, so this result is very close.
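The loop itself is not reproduced in these notes; in plain NumPy-style Python (my own equivalent sketch, not the TensorFlow code from the video), 1000 gradient-descent steps on $J(w) = w^2 - 10w + 25$ with learning rate 0.01 give the same result:

```python
w = 0.0
alpha = 0.01                 # learning rate, as in the TensorFlow example

for _ in range(1000):
    grad = 2 * w - 10        # dJ/dw for J(w) = w^2 - 10w + 25
    w = w - alpha * grad     # one gradient-descent step

print(w)                     # converges very close to the optimum w = 5
```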

I hope this gives you a sense of the general structure of a **TensorFlow** program. As you do the programming exercises and use more **TensorFlow** code, you will become familiar with some of the functions I use here. One thing to note: $w$ is the parameter we want to optimize, so it is declared as a variable. All we need to do is define the loss function using these add and multiply operations, and **TensorFlow** knows how to take derivatives with respect to them; that is how it knows how to minimize the loss.

By the way, if you think this way of writing the cost is ugly, **TensorFlow** actually overloads the usual arithmetic operations, so you can also write it in a better-looking form, such as `cost = w**2 - 10*w + 25`. Once `w` is a **TensorFlow** variable, squaring, multiplication, addition, and subtraction are all overloaded, so you don't have to use the unsightly syntax above.

There is another feature of **TensorFlow** I want to show you. This example minimized a fixed function of $w$. What if the function you want to minimize depends on your training set? When you train a neural network, the training data changes, so how do you get training data into a **TensorFlow** program?

I will define $x$ and think of it as playing the role of the training data. (In fact, training data has both $x$ and $y$, but in this example there is only $x$.) We define $x$ as a **placeholder**: the **placeholder** function tells **TensorFlow** that you will provide values for it later.

We then define another array holding the coefficients of the quadratic, $(1, -10, 25)$; this is the data we feed in when we run the training step.

Okay, I hope there are no syntax errors. Let's run it again, and hopefully we get the same result as before.

Now, if you want to change the coefficients of this quadratic function, suppose you change them from $(1, -10, 25)$ to $(1, -20, 100)$. The function then becomes $w^2 - 20w + 100 = (w - 10)^2$, and if I re-run it, I hope to get a minimizing value of 10. Let's take a look. Very good: after 1000 iterations of gradient descent, we get a value close to 10.

What you will see more of when doing the programming exercises is that a **placeholder** in **TensorFlow** is a variable whose value you assign later. It is a convenient way to get your training data into the loss function, and you supply the data with the `feed_dict` syntax. When you run training iterations with **mini-batch** gradient descent, on each iteration you need to feed in a different **mini-batch**: on each iteration, you use the feed dictionary to supply the current **mini-batch** of data wherever the loss function needs it.
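To make the idea concrete (a plain-Python sketch of my own, standing in for the placeholder/`feed_dict` mechanism, not the TensorFlow code): the data enters as a parameter of the cost, and changing the data you feed in changes what gets minimized:

```python
def minimize_quadratic(coefficients, steps=1000, alpha=0.01):
    """Minimize J(w) = a*w^2 + b*w + c by gradient descent.

    `coefficients` plays the role of the data fed in at run time.
    """
    a, b, c = coefficients
    w = 0.0
    for _ in range(steps):
        grad = 2 * a * w + b    # dJ/dw
        w = w - alpha * grad
    return w

print(minimize_quadratic([1., -10., 25.]))   # close to 5
print(minimize_quadratic([1., -20., 100.]))  # close to 10
```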

I hope this gives you an idea of what **TensorFlow** can do. What makes it so powerful is that you only need to specify how to compute the loss function, and it takes care of the derivatives; with one or two lines of code you can use a gradient-descent optimizer, the **Adam** optimizer, or other optimizers.

This is the code from just now, tidied up a bit. Although these functions and variables may look a bit mysterious, you will become familiar with them after a few more programming exercises.

There is one last point I want to mention. These three lines (in the blue braces) follow idiomatic **TensorFlow** usage; some programmers use a **with** form instead, and the effect is basically the same.

This **with** structure is used in many **TensorFlow** programs. It means basically the same thing as the version on the left, but Python's **with** statement is better at cleaning up, in case there is an error or exception while executing the inner loop, so you will also see this style in the programming exercises. So what exactly does this code do? Let's look at this equation:

The heart of a **TensorFlow** program is something that computes the loss function, and **TensorFlow** then automatically works out the derivatives and how to minimize the loss. What this equation, or this line of code, does is have **TensorFlow** build a computation graph. The computation graph takes $w$, squares it, multiplies by the coefficient, and so on, assembling the whole computation step by step until finally you get the loss function $J$.

The advantage of **TensorFlow** is that by writing the computation of the loss, you have essentially implemented forward propagation through the computation graph, and **TensorFlow** already has all the necessary backward functions built in. Recall that training a deep neural network needs a set of forward functions and a set of backward functions; programming frameworks like **TensorFlow** have the backward functions built in. That is why, once the forward computation is specified in terms of built-in functions, the framework can automatically use the backward functions to implement backpropagation and compute the derivatives for you, even when the function is very complicated. This is why you don't need to explicitly implement backpropagation, and it is one of the reasons programming frameworks can make you efficient.

If you look at the **TensorFlow** documentation, I should point out that it draws the computation graph with a set of conventions different from mine: it labels the nodes with operations rather than with values. So here one node would be the squaring operation, two inputs together would point into the multiplication operation, and so on, with the final node, which I guess would be an addition, combining everything to produce the final value.

For the purposes of this course, I think the first way of drawing the computation graph is easier to understand, but if you look at the **TensorFlow** documentation and see a computation graph there, you will see this other representation, in which nodes are labeled with operations rather than values. The two representations express the same computation graph.

In a programming framework, you can do a lot with one line of code. For example, if you don't want to use gradient descent but would rather use the **Adam** optimizer, you only need to change one line of code to quickly swap in a better optimizer. All modern deep learning programming frameworks support this, making it easy to write complex neural networks.

I hope I have helped you understand the typical structure of a **TensorFlow** program. To summarize the content of this week: you learned how to systematically organize the hyperparameter search process; we also talked about **Batch** normalization and how to use it to speed up neural network training; and finally, we talked about deep learning programming frameworks. There are many great programming frameworks, and in this last video we focused on **TensorFlow**. I hope you enjoy this week's programming exercises, which will help you become more familiar with these concepts.

Reference