Why is sharding by tweet-id better than sharding by word?

The author argues that if we shard by word, a few “hot” words could make their shards grow much larger than the rest. However, the index could always cap the number of tweets it keeps for a particular word (e.g. keep them sorted by tweet popularity, so that lower-priority tweets are simply not returned by the search).
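
To make the capping idea concrete, here is a rough sketch of what I have in mind. It is my own illustration, not the author's design: the class name `CappedWordIndex`, the `cap` parameter and the popularity score are all hypothetical. Each word keeps a min-heap of at most `cap` entries, so the least popular tweet is dropped once the limit is reached.

```python
import heapq
from collections import defaultdict


class CappedWordIndex:
    """Toy per-word index that keeps only the `cap` most popular tweets (hypothetical)."""

    def __init__(self, cap=1000):
        self.cap = cap
        # word -> min-heap of (popularity, tweet_id); the heap root is the least popular entry
        self.postings = defaultdict(list)

    def add(self, word, tweet_id, popularity):
        heap = self.postings[word]
        entry = (popularity, tweet_id)
        if len(heap) < self.cap:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry and drop the smallest one, which may be the new entry itself.
            heapq.heappushpop(heap, entry)

    def search(self, word):
        # Return tweet ids, most popular first; unknown words give an empty list.
        return [tweet_id for _, tweet_id in sorted(self.postings.get(word, []), reverse=True)]
```

With a cap like this, the size of a "hot" word's posting list is bounded, although the shard holding it would still receive more write and read traffic than the others.
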
In my opinion, the disadvantage of sharding by tweet-id is that we optimize for tweet storage (we know exactly which shard a tweet should be indexed in) at the expense of tweet search (every query has to be sent to all shards). I would have thought that optimizing for search is better in this case.
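
Here is a small sketch of the trade-off as I understand it, counting which shards a single write and a single query would touch under each scheme. Everything in it is an assumption for illustration: `NUM_SHARDS`, the `shard_of` hash and both function names are made up, and a real system would likely use a different partitioning function.

```python
import zlib

NUM_SHARDS = 8  # hypothetical cluster size


def shard_of(key):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SHARDS


def fan_out_by_tweet_id(tweet_id, tweet_words, query_words):
    # Write: the whole tweet is indexed on exactly one shard.
    write_shards = {shard_of(tweet_id)}
    # Read: any shard may hold a matching tweet, so the query goes to all of them.
    read_shards = set(range(NUM_SHARDS))
    return write_shards, read_shards


def fan_out_by_word(tweet_id, tweet_words, query_words):
    # Write: the tweet is indexed once per distinct word, possibly on several shards.
    write_shards = {shard_of(word) for word in tweet_words}
    # Read: only the shards that own the query words are contacted.
    read_shards = {shard_of(word) for word in query_words}
    return write_shards, read_shards
```

For a single-word query, `fan_out_by_word` touches one shard on read, while `fan_out_by_tweet_id` always touches all `NUM_SHARDS` of them. The sketch ignores the cost of merging results and of intersecting posting lists across shards for multi-word queries.
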
Any opinions?