educative.io

Training Data Generation

"Let’s assume that the human rater will be presented with 100,000 queries, each having ten results. They will then be asked to rate the results for each query. As a result, we would have ten million training examples."
Why is it 10 million? Shouldn’t it be 1 million (100,000 queries × 10 results)? Or, since it is a pairwise comparison, 45 comparisons (9+8+7+6+5+4+3+2+1) for the 10 results of each query, which for 100,000 queries gives 4.5 million comparisons or training examples?


Course: Grokking the Machine Learning Interview - Learn Interactively
Lesson: Training Data Generation

Hi @Rishabh_Sheoran !!
Let’s break down the numbers for the calculation of training examples in the context of human rater evaluations:

  1. Pointwise Approach (Assuming 100,000 Queries with 10 Results Each):

    • In a pointwise approach, each query-result pair is an individual training example.
    • If there are 100,000 queries and each query has 10 results, then the total number of training examples is simply the product of the two: \( 100,000 \text{ queries} \times 10 \text{ results per query} = 1,000,000 \text{ training examples} \).
  2. Pairwise Approach (Assuming 100,000 Queries with 10 Results Each):

    • In a pairwise approach, we consider pairs of results for each query.
    • For 10 results, there are \( \binom{10}{2} = \frac{10 \times 9}{2} = 45 \) unique pairs. Each pair is counted once regardless of order, because comparing result A with result B is the same comparison as B with A. This matches your sum 9+8+7+6+5+4+3+2+1 = 45.
    • Therefore, for 100,000 queries, the total number of pairwise comparisons would be \( 100,000 \times 45 = 4,500,000 \text{ training examples} \).
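
If it helps, both counts can be checked with a short Python sketch. The function names here (`pointwise_examples`, `pairwise_examples`) are made up for illustration and are not from the course; the sketch only assumes that every query has the same number of results.

```python
from itertools import combinations
from math import comb

# Numbers taken from the lesson's scenario.
NUM_QUERIES = 100_000
RESULTS_PER_QUERY = 10


def pointwise_examples(num_queries: int, results_per_query: int) -> int:
    """One training example per (query, result) pair."""
    return num_queries * results_per_query


def pairwise_examples(num_queries: int, results_per_query: int) -> int:
    """One training example per unordered pair of results within a query."""
    return num_queries * comb(results_per_query, 2)


# Enumerate the unordered pairs for a single query to confirm the count.
pairs_for_one_query = list(combinations(range(RESULTS_PER_QUERY), 2))

print(len(pairs_for_one_query))                            # 45
print(pointwise_examples(NUM_QUERIES, RESULTS_PER_QUERY))  # 1000000
print(pairwise_examples(NUM_QUERIES, RESULTS_PER_QUERY))   # 4500000
```

Neither count comes out to 10 million for these inputs.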

So, in the case of the pairwise approach, you are correct: with 100,000 queries and 10 results per query, there are 4.5 million pairwise comparisons, not 10 million. The lesson's figure of 10 million does not match the pointwise approach either, since treating each query-result pair as an individual training example gives 1 million. For the given numbers, the number of training examples is 1 million (pointwise) or 4.5 million (pairwise).
I hope it helps. Happy Learning :blush: