Embedding generation and KNN confusion?

Could the content creator please explain why the neural network in question is set up as a two-tower ("double tower") model? Why not feed all the features into a single network, i.e. learn the optimal nonlinear mapping over all the features directly, as opposed to some combination of a sub-combination (one tower for the media features and one for the user features)?
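To make the question concrete, here is my rough mental model of the two setups, as a minimal NumPy sketch (the dimensions, weights, and tower functions are all hypothetical, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature/embedding dimensions, for illustration only.
USER_DIM, ITEM_DIM, EMB_DIM = 8, 6, 4

# Two-tower setup: each side is mapped to a shared embedding space
# independently; the score is just a dot product of the two embeddings.
W_user = rng.normal(size=(USER_DIM, EMB_DIM))
W_item = rng.normal(size=(ITEM_DIM, EMB_DIM))

def user_tower(u):
    return np.tanh(u @ W_user)   # user embedding

def item_tower(v):
    return np.tanh(v @ W_item)   # media/item embedding

def two_tower_score(u, v):
    return user_tower(u) @ item_tower(v)

# Single joint network: concatenate all features and learn one
# nonlinear mapping over everything at once.
W1 = rng.normal(size=(USER_DIM + ITEM_DIM, 16))
W2 = rng.normal(size=(16,))

def joint_score(u, v):
    h = np.tanh(np.concatenate([u, v]) @ W1)
    return h @ W2

u = rng.normal(size=USER_DIM)
v = rng.normal(size=ITEM_DIM)
print(two_tower_score(u, v), joint_score(u, v))
```

The joint network can model arbitrary user-item feature interactions, while the two-tower version restricts interaction to a dot product, which is why I'm asking what the two-tower setup buys us.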

Secondly, why do we use the embeddings with KNN? Why not simply run the neural network on a new example and use its output directly?
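For reference, this is what I understand the embedding-plus-KNN step to look like (a brute-force sketch; the embedding matrix and query here are made up, and real systems would presumably use an approximate nearest-neighbour index rather than a full scan):

```python
import numpy as np

rng = np.random.default_rng(1)
EMB_DIM, N_ITEMS = 4, 1000

# Precomputed item embeddings (in practice, the item tower would be
# applied offline to every candidate item).
item_embs = rng.normal(size=(N_ITEMS, EMB_DIM))

def knn(query_emb, k=5):
    # Brute-force nearest neighbours by dot-product similarity.
    scores = item_embs @ query_emb
    return np.argsort(-scores)[:k]

query = rng.normal(size=EMB_DIM)   # e.g. a user embedding at request time
top5 = knn(query)
print(top5)
```

My confusion is why this nearest-neighbour lookup over stored embeddings is preferred over just evaluating the network itself on each new (user, item) pair.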