Python tutorial to remove duplicate lines from a text file

In this tutorial, we will learn how to remove duplicate lines from a text file using Python. The program reads the lines of an input text file and writes them to an output text file. While writing, it checks whether each line has already been written; if a line is a duplicate, it is skipped. For example, for the following text file:

First Line
Second Line
First Line
First Line
First Line

The output will be:

First Line
Second Line

Let’s take a look at the algorithm first:

1. First, open the input file in read mode because we are only reading its contents.
2. Open the output file in write mode because we are writing contents to this file.
3. Read the input file line by line and check whether the same line was already written to the output file.
4. If not, write the line to the output file and save the hash value of the line. We check and store each line’s hash value instead of the full line, which saves a lot of space.
5. If the line was already written, skip it.
6. After everything is done, the output file will contain all the contents of the input file without any duplicate lines.

Python program to remove duplicate lines from a text (.txt) file:

import hashlib

# 1 : paths of the input and output files
output_file_path = "C:/out.txt"
input_file_path = "C:/in.txt"

# 2 : a Set holds only unique values, so it tracks which lines we have seen
completed_lines_hash = set()

# 3 : open the output file in write mode
output_file = open(output_file_path, "w")

# 4 : read the input file line by line
for line in open(input_file_path, "r"):
    # 5 : strip trailing whitespace and hash the line
    hashValue = hashlib.md5(line.rstrip().encode('utf-8')).hexdigest()
    # 6 : write the line only if its hash has not been seen before
    if hashValue not in completed_lines_hash:
        output_file.write(line)
        completed_lines_hash.add(hashValue)

# 7 : close the output file
output_file.close()

Explanation:

The commented numbers in the above program denote the step numbers below:

1. First, save the paths of the input and output files in two variables.
2. Create one Set variable. We are using a Set because it can hold only unique values; no duplicate values can be added to a Set.
3. Open the output file in write mode.
4. Start one for loop to read the input file line by line.
5. Find the hash value of the current line. We remove any spaces and the newline character from the end of the line before calculating the hash.
6. Check whether this hash value is already in the Set. If not, the line has not been written to the output file yet: write the line to the output file and add its hash value to the Set.
7. Finally, close the output text file.
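The same logic can also be wrapped in a small reusable function. The sketch below uses temporary files purely for demonstration (the function name and demo paths are illustrative, not part of the original program):

```python
import hashlib
import os
import tempfile

def remove_duplicate_lines(input_path, output_path):
    """Copy input to output, skipping any line whose hash was already seen."""
    seen = set()
    with open(input_path, "r") as infile, open(output_path, "w") as outfile:
        for line in infile:
            # hash the line with trailing whitespace removed
            h = hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()
            if h not in seen:
                outfile.write(line)
                seen.add(h)

# demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in.txt")
    out_path = os.path.join(tmp, "out.txt")
    with open(in_path, "w") as f:
        f.write("First Line\nSecond Line\nFirst Line\nFirst Line\n")
    remove_duplicate_lines(in_path, out_path)
    with open(out_path) as f:
        print(f.read())
```

Using `with` blocks also guarantees both files are closed even if an error occurs while writing.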

Sample Output:

For the following input file:

This is a line
This is a line
This is 3 a line
This is 4 a line
This is5  a line
This is a line

The output will be:

This is a line
This is 3 a line
This is 4 a line
This is5  a line

9 Replies to “Python tutorial to remove duplicate lines from a text file”

  1. Hi, thank you for this amazing post.

    I have a question. I’m trying to do it like this:


    First Line
    Second Line
    First Line
    First Line
    First Line


    first line second

    How can I do that?

    1. Hi Nevan,
      Could you please explain a little more about what you want to achieve? It seems like changing the first line to all lowercase and then appending the first word of the second line?

      1. Sorry, I wasn’t clear.

        I have a very large corpus. I want to delete duplicate words from it, so every repeated word is written only once.

        I ran the following code, but it didn’t finish and more than 12 hours have already passed. I don’t know if there is a better way to do it?

        def unique_list(l):
            ulist = []
            [ulist.append(x) for x in l if x not in ulist]
            return ulist

        with open('outfile.txt', 'r', encoding='utf-8') as myfile:
            data = myfile.read()
        cleaned_file = ' '.join(unique_list(data.split()))
        with open('cleaned_output.txt', 'w', encoding='utf-8') as cleaned:
            cleaned.write(cleaned_file)
        1. Sorry for replying late. I am not sure why it is not working; Python should work fine for a large data set. Maybe will help?

          1. Hi again,

            I solved the problem. I used sets instead of lists. It worked amazingly and finished in less than 5 minutes.
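            For reference, a set-based version of that word-level dedup might look like the sketch below (the function name and sample text are illustrative). The set gives O(1) membership tests, while checking `x not in ulist` against a list is O(n) per word, which is why the original version was so slow on a large corpus:

```python
def unique_words(text):
    """Return the words of text in first-seen order, each written only once."""
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:  # O(1) membership test, unlike a list
            seen.add(word)
            kept.append(word)
    return " ".join(kept)

print(unique_words("the cat sat on the mat the cat"))  # the cat sat on mat
```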

          2. That’s great.. 🙂 👍

  2. What if I want to remove lines where only part of the line is the same?

    Let’s say I have a txt file with thousands of lines; some of the lines start with the same substring, and those are the ones I want to remove.

    I have written some code similar to this one, but it doesn’t work and I can’t figure out how to do it.

    1. In that case, don’t check the hash value. For each line, just check whether the substring exists in the current line. If it does, skip that line.
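    One way to read “start with the same substring” is to dedupe on a fixed-length leading prefix instead of the whole line. A hedged sketch of that idea (the function name, prefix length, and sample lines are all illustrative):

```python
def dedupe_by_prefix(lines, length):
    """Keep only the first line seen for each distinct leading prefix."""
    seen = set()
    kept = []
    for line in lines:
        key = line[:length]  # the prefix is the dedup key, not the full line
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

sample = ["user01 logged in", "user01 logged out", "user02 logged in"]
print(dedupe_by_prefix(sample, 6))
```

If instead you want to drop every line containing a known substring, `if substring in line: continue` inside the loop is enough.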

  3. Hey, your script works fine, but it also deletes the blank lines. Is there a way to keep the blank lines?
