Heap order in Top K Frequent Words

Isaac_Haseley · March 4, 2024, 10:11pm

The proposed solution involves a max heap. Since we’re returning the k most frequent words, should we use a min heap instead?

Course: Grokking Coding Interview Patterns in Python - Learn Interactively
Lesson: Top K Frequent Words - Grokking Coding Interview Patterns in Python

Dian_Us_Suqlain · March 15, 2024, 8:43am

Hello @Isaac_Haseley,

The proposed solution uses a max heap because the in the problem statement, we are asked to sort the frequencies from highest to lowest. With a max heap, we can extract the maximum element (highest frequency) whenever the heap size exceeds k, effectively maintaining the k most frequent words in descending order of frequency. On the other hand, with a min heap, we extract the minimum element (lowest frequency) whenever the heap size exceeds k, ensuring that we only keep the k most frequent words.

I hope this explanation helps. Please feel free to share further suggestions and feedback. We’d be happy to help!

Happy learning!

Isaac_Haseley · March 18, 2024, 6:29pm

Hey @Dian_Us_Suqlain!

With a max heap, we can extract the maximum element (highest frequency) whenever the heap size exceeds k, effectively maintaining the k most frequent words in descending order of frequency.

If we’re extracting the most frequent element, then will our heap of size k will no longer contain that most frequent element, and therefore not contain the k most frequent elements so far?

On the other hand, with a min heap, we extract the minimum element (lowest frequency) whenever the heap size exceeds k, ensuring that we only keep the k most frequent words.

This sounds like what we want, what do you think?

Ethan_Davis · March 29, 2024, 4:32am

@Isaac_Haseley Consider a custom comparator since frequencies and words sort in opposite order.

from collections import Counter
from heapq import heappush, heappop

def top_k_frequent(words, k):
    """
    n: number of unique words
    time complexity: O(nlogk)
    space complexity: O(n)
    """
    min_heap = list()
    counter = Counter(words)
    for word, freq in counter.items():
        heappush(min_heap, FreqWord(freq, word))
        if len(min_heap) > k:
            heappop(min_heap)
    return [heappop(min_heap).word for i in range(k)][::-1]
    
class FreqWord:
    def __init__(self, freq, word):
        self.freq = freq
        self.word = word
    def __lt__(self, other):
        if self.freq != other.freq:
            return self.freq < other.freq
        else:
            return other.word < self.word

Isaac_Haseley · March 30, 2024, 1:00am

Hey @Ethan_Davis!

Yes! I took this approach. It seems to contradict what the course and @Dian_Us_Suqlain recommend due to the min heap, hence the forum post

Dian_Us_Suqlain · March 30, 2024, 1:46am

Yes, @Isaac_Haseley, that is indeed true, and I do appreciate your insights. We can solve this problem using either of the approaches: max heap and min heap, each with pros and cons.

Using a max heap (as currently required in the challenge lesson), we can insert all word-frequency pairs and then extract the top k most frequent words by popping from the heap k times at the end of the insertion process. This method focuses on directly finding the highest frequencies without maintaining a fixed heap size.

On the other hand, the min heap approach allows us to dynamically maintain a heap of size k during the insertion process, ensuring it contains the top k most frequent words by removing the least frequent word when the heap size exceeds k. I realize now that this approach is efficient regarding time complexity. The custom comparison logic, as discussed by @Ethan_Davis for the heap, ensures that the order is maintained by frequency and, for ties, by lexicographical order.

We’ll make the necessary changes as suggested by you, and I do apologize for the oversight earlier that the min heap approach may offer a more straightforward solution regarding time efficiency and adherence to the problem’s sorting criteria. Thank you for pointing out this out. Your feedback helps us improve our solutions and explanations.

Let me know if there’s more to discuss on this topic!