Python program to find duplicate words in a file

In this post, we will learn how to find the duplicate words in a file in Python. Python provides several built-in methods for working with files. We can use these methods to open a file, read its content, and write content to it.

We will write a program that takes the path of a file as the input and prints out all duplicate words in that file.

Before moving to the program, let’s check the algorithm first.

Algorithm:

The program follows the algorithm below:

  • Open the file in read mode.
  • Initialize two empty sets: one to hold every word seen so far and another to hold the duplicate words. Sets are used because a set stores each value only once.
  • Iterate through the lines of the file with a loop.
  • For each line, get the list of words by using split.
  • Iterate through the words of each line with an inner loop. Check whether the current word is already in the first set.
    • If it is, add it to the second set, because it is a duplicate word.
    • If it is not, add it to the first set, because this is the first time the word has been seen.
  • Once the loops are completed, print the contents of the second set, which holds only the duplicate words.
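The duplicate check in these steps relies on two set behaviours: a fast membership test with in, and add ignoring a value that is already present. Here is a minimal sketch of that idea, using a made-up word list instead of a file. The names seen and duplicates are chosen only for this illustration.

```python
# Hypothetical word list standing in for the words read from a file.
words = ["hello", "world", "hello", "again", "hello"]

seen = set()        # every word encountered so far
duplicates = set()  # words encountered more than once

for word in words:
    if word in seen:
        # "hello" reaches this branch twice, but the set keeps one copy.
        duplicates.add(word)
    else:
        seen.add(word)

# sorted() is used only to display the sets in a stable order.
print(sorted(seen))        # ['again', 'hello', 'world']
print(sorted(duplicates))  # ['hello']
```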

Python program:

Let’s write down the program:

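# words_set holds every word seen so far,
# duplicate_set holds the words seen more than once.
words_set = set()
duplicate_set = set()

# Read all lines of the file into a list.
with open('input.txt') as input_file:
    file_content = input_file.readlines()

for line in file_content:
    # split() with no arguments splits the line on whitespace.
    words = line.split()
    for word in words:
        if word in words_set:
            duplicate_set.add(word)
        else:
            words_set.add(word)

# Print each duplicate word on its own line.
for word in duplicate_set:
    print(word)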

Here,

  • words_set and duplicate_set are two sets that hold all the words and the duplicate words of the file.
  • The first with block reads the content of the file. The readlines method returns the lines of the file in a list and this value is stored in the file_content variable.
  • The for loop iterates through the lines in the list and gets the words in each line by using split().
  • The inner for loop iterates through the words of each line. For each word, it checks whether the word is already in words_set. If it is, the word is a duplicate and is added to duplicate_set. Otherwise, it is added to words_set.
  • Once the loops are completed, it uses another loop to print the words of duplicate_set.

For example, if input.txt holds the following text:

hello world
hello universe
hello again
hello world !!

The program will print the following output (the order of the lines may vary):

hello
world
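The program above reads a fixed file name. As a sketch of how the same logic could take the file path as an input, it can be wrapped in a function. The function name find_duplicate_words and the temporary sample file are assumptions made for this illustration, not part of the original program.

```python
import os
import tempfile


def find_duplicate_words(file_path):
    """Return the set of words that occur more than once in the file."""
    words_seen = set()
    duplicates = set()
    with open(file_path) as input_file:
        # Iterating over the file object yields one line at a time.
        for line in input_file:
            for word in line.split():
                if word in words_seen:
                    duplicates.add(word)
                else:
                    words_seen.add(word)
    return duplicates


# Usage: write the sample text to a temporary file and scan it.
sample_text = "hello world\nhello universe\nhello again\nhello world !!\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

result = find_duplicate_words(sample_path)
os.remove(sample_path)

# sorted() gives a stable order for display, since a set is unordered.
print(sorted(result))  # ['hello', 'world']
```

Returning the set instead of printing inside the function lets the caller decide how to display or reuse the result.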

Method 2: By using a dictionary:

If you run the above program several times, the output may appear in a different order from run to run, because a set does not maintain the order of its elements. If you want to keep the order, you can use a dictionary, which preserves insertion order in Python 3.7 and later.
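To illustrate the difference, here is a small sketch, assuming Python 3.7 or later. The word list is made up for the example, and dict.fromkeys is used only to show that dictionary keys keep their insertion order.

```python
# Hypothetical word list with repeated entries.
words = ["world", "hello", "again", "hello", "world"]

# A set gives no ordering guarantee, so this output can differ between runs.
print(set(words))

# dict.fromkeys keeps one key per word, in the order of first insertion.
ordered_unique = list(dict.fromkeys(words))
print(ordered_unique)  # ['world', 'hello', 'again']
```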

Dictionaries are used to hold key-value pairs. For this example, the key will be the word and the value will be its number of occurrences in the file.

The program iterates through the words. If a word is not yet in the dictionary, it is added with the value 0. The value for that word is then incremented by 1, so a word seen for the first time ends up with a count of 1.

To find the duplicate words, the program iterates through the dictionary and picks out every word with a value greater than 1.

Below is the complete program:

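# Maps each word to the number of times it occurs in the file.
words_dict = {}

# Read all lines of the file into a list.
with open('input.txt') as input_file:
    file_content = input_file.readlines()

for line in file_content:
    words = line.split()
    for word in words:
        # Start the count at 0 the first time a word is seen.
        if word not in words_dict:
            words_dict[word] = 0
        words_dict[word] += 1

# A word is a duplicate if it occurs more than once.
for word, count in words_dict.items():
    if count > 1:
        print(word)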

If you run this program, it will print the duplicate words in the order in which they first appear in the file.
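As an alternative sketch, the counting step can also be done with collections.Counter from the standard library, which is a different technique from the manual dictionary update shown above. The example uses an in-memory string as a stand-in for the file content and assumes Python 3.7 or later for the ordering.

```python
from collections import Counter

# Sample text standing in for the contents of input.txt.
sample_text = "hello world\nhello universe\nhello again\nhello world !!\n"

# str.split() with no arguments splits on any whitespace, including newlines.
word_counts = Counter(sample_text.split())

# Counter is a dict subclass, so it keeps first-seen order on Python 3.7+.
duplicates = [word for word, count in word_counts.items() if count > 1]
print(duplicates)  # ['hello', 'world']
```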
