Sharing my thoughts here.
There’s an important concept missing from this course doc: background conversion rate, also called background CTR or overall CTR. Here “background” means the global CTR measured over the whole dataset.
Imagine a training set of 1,000 samples: 20 positive and 980 negative. The background CTR of this dataset is 20/1000 = 2%. This represents the overall distribution of the “truth” we’ve fed the model.
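To make that concrete, here’s the arithmetic as a tiny sketch (the variable names and the toy dataset are mine, matching the numbers above):

```python
# Toy dataset: 20 positives, 980 negatives, matching the example above.
labels = [1] * 20 + [0] * 980

# Background CTR is just the positive rate over the whole dataset.
background_ctr = sum(labels) / len(labels)
print(background_ctr)  # 0.02
```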
This pre-existing label distribution acts as a snapshot of current user behavior, and when user behavior changes, the background CTR changes with it.
On that note: as we optimize the model to predict the positive labels better, and those positives are very rare, the raw cross-entropy loss reflects that imbalance and is hard to interpret on its own. That’s why we want to moderate the loss by a factor that accounts for the overall distribution of labels across the training set, and this is where normalized cross entropy comes in. The normalization is exactly the moderation I mentioned above.
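A minimal sketch of one common formulation of normalized cross entropy: average log loss divided by the entropy of the background CTR (i.e., the log loss of a baseline that always predicts the background CTR). The function name and toy data are my own; this is an illustration, not the course’s exact definition:

```python
import math

def normalized_cross_entropy(labels, preds):
    """Average log loss normalized by the entropy of the background CTR.

    Values below 1.0 mean the model beats the trivial baseline of
    always predicting the background CTR; values near 1.0 mean it
    adds little information beyond the overall positive rate.
    """
    n = len(labels)
    avg_log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, preds)
    ) / n
    ctr = sum(labels) / n  # background CTR of this dataset
    background_entropy = -(
        ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr)
    )
    return avg_log_loss / background_entropy
```

As a sanity check, a “model” that always predicts the background CTR gets a normalized cross entropy of exactly 1.0, regardless of how rare the positives are; that’s what makes the metric comparable across datasets with different background CTRs.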
As to your points:
- What is the relationship between CTR and changes in user behavior?
CTR, or more specifically background CTR, is a snapshot of overall user behavior. It can’t capture drift within a particular cohort of users, though, unless we split the dataset further.
- Why does being less sensitive to CTR help the model adapt to changing behavior?
By using losses like normalized cross entropy, we take drift in the overall label distribution into account when calculating the loss.
Hope that helps, and if anything above sounds weird, happy to discuss!