Python tutorial to remove duplicate lines from a text file

Published by admin on

In this tutorial, we will learn how to remove duplicate lines from a text file using Python. The program reads the input text file line by line and writes each line to an output text file. Before writing a line, it checks whether the same line has already been written; if so, the line is skipped. For example, for the following text file:

First Line
Second Line
First Line
First Line
First Line

The output will be:

First Line
Second Line

Let’s take a look at the algorithm first:

1. First, open the input file in ‘read’ mode, because we only read the content of this file.

2. Open the output file in ‘write’ mode, because we write content to this file.

3. Read the input file line by line, and for each line check whether an identical line has already been written to the output file.

4. If not, write the line to the output file and save the hash value of the line. We check each line’s hash value instead of storing and comparing the full line. For long lines this saves a lot of memory, because the hash has a fixed size.

5. If the line was already written, skip it.

6. When all lines have been processed, the output file contains the content of the input file without any duplicate lines.
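
To make the space argument in step 4 concrete, here is a small sketch. It assumes SHA-256 as the hash function, which is only one possible choice, since the steps above do not name an algorithm. It shows that the digest has the same size no matter how long the line is.

```python
import hashlib

# Two sample lines of very different length (illustrative data only).
short_line = "First Line"
long_line = "x" * 10_000

# A SHA-256 digest is always 32 bytes, regardless of the input length.
short_digest = hashlib.sha256(short_line.encode("utf-8")).digest()
long_digest = hashlib.sha256(long_line.encode("utf-8")).digest()

print(len(long_line), len(long_digest))    # 10000 32
print(len(short_line), len(short_digest))  # 10 32
```

The saving only appears when lines are longer than the digest. For very short lines, such as the 10-character example above, the stored hash is larger than the line itself.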

Python program to remove duplicate lines from a text (.txt) file:

The source code is available here.
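
The listing itself is not reproduced in this text version of the post. The following is a minimal sketch that follows the numbered steps of the explanation; it is not the author’s original code. The function name `remove_duplicate_lines`, the use of SHA-256 and UTF-8, and the decision to write the stripped line are all assumptions.

```python
import hashlib


def remove_duplicate_lines(input_path, output_path):
    # 1. The two file paths arrive as arguments; pass your own paths here.

    # 2. Set holding the hash of every line written so far.
    seen_hashes = set()

    # 3. Open the output file in write mode.
    output_file = open(output_path, "w", encoding="utf-8")
    try:
        # 4. Read the input file line by line in read mode.
        with open(input_path, "r", encoding="utf-8") as input_file:
            for line in input_file:
                # 5. Strip trailing whitespace (including the newline)
                #    and hash the result. SHA-256 is an assumed choice.
                stripped = line.rstrip()
                line_hash = hashlib.sha256(stripped.encode("utf-8")).hexdigest()

                # 6. Write the line only if its hash is new.
                if line_hash not in seen_hashes:
                    output_file.write(stripped + "\n")
                    seen_hashes.add(line_hash)
    finally:
        # 7. Close the output file, even if reading failed.
        output_file.close()
```

It could be called as `remove_duplicate_lines("input.txt", "output.txt")`, where both file names are placeholders for your own paths.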

Explanation:

The numbered comments in the program above correspond to the steps below:

1. First, store the paths of the input and output files in two variables. Change these values to your own input and output file paths. You can drag and drop a file onto the terminal to find its path.

2. Create one Set variable. We use a Set because it holds only unique values; adding a value that is already present leaves the Set unchanged.

3. Open the output file in write mode by passing ‘w’ to open(), the built-in function used to open a file. We open the output file in write mode because we are going to write to it.

4. Start a for loop to read the input file line by line. We open this file in read mode, for which ‘r’ is used.

5. Find the hash value of the current line. Before calculating the hash, we remove any trailing whitespace, including the newline, from the end of the line. The hashlib library is used to calculate the hash value.

6. Check whether this hash value is already in the Set. If it is not, the line has not been written to the output file yet, so write the line to the output file and add the hash value to the Set.

7. Finally, close the output text file.
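
Steps 2, 5 and 6 can be tried in isolation. The sketch below is an illustration under assumptions: the helper name `line_hash` and the choice of SHA-256 are hypothetical, since the explanation only says that hashlib is used.

```python
import hashlib


def line_hash(line):
    # Strip trailing whitespace (spaces, tabs, newline) before hashing,
    # so lines that differ only at the end get the same hash.
    return hashlib.sha256(line.rstrip().encode("utf-8")).hexdigest()


seen = set()
seen.add(line_hash("First Line\n"))
seen.add(line_hash("First Line   \n"))  # same hash as above, set is unchanged
seen.add(line_hash("Second Line\n"))

print(len(seen))                        # 2
print(line_hash("First Line") in seen)  # True
```

Because the hash is taken after stripping, two lines that differ only in trailing spaces count as duplicates.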

Conclusion:

I hope that you have found this article helpful. Try running the program, and leave a comment below if you have any questions.

Categories: python

9 Comments

Nevan · June 18, 2018 at 5:18 pm

Hi, thank you for this amazing post.

I have a question. I’m trying to do something like this:

Input:

First Line
Second Line
First Line
First Line
First Line

Output:

first line second

How can I do that?

    admin · June 18, 2018 at 5:28 pm

    Hi Nevan,
    Could you please explain a bit more about what you want to achieve? It looks like you want to change the first line to lowercase and then add the first word of the second line?

      Talat · June 19, 2018 at 6:14 pm

      Sorry, I wasn’t very clear.

      I have a very large corpus. I want to delete duplicate words from it, so that every repeated word is written only once.

      I ran the following code, but it still hadn’t finished after more than 12 hours. Is there a better way to do this?

      def unique_list(l):
          ulist = []
          [ulist.append(x) for x in l if x not in ulist]
          return ulist

      with open('outfile.txt', 'r', encoding='utf-8') as myfile:
          data = myfile.read()
      cleaned_file = ' '.join(unique_list(data.split()))
      with open('cleaned_output.txt', 'w', encoding='utf-8') as cleaned:
          cleaned.write(cleaned_file)

        admin · June 21, 2018 at 3:34 am

        Sorry for the late reply. I am not sure why it is not working. Python should work fine for a large data set. Maybe http://pandas.pydata.org/ will help?

          Nevan · June 21, 2018 at 12:23 pm

          Hi again,

          I solved the problem. I used sets instead of a list. It worked amazingly well and finished in less than 5 minutes.

          admin · June 21, 2018 at 7:53 am

          That’s great.. 🙂 👍
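
The exact set-based code from this thread was not posted. One possible version, sketched here as an assumption, keeps a set for membership checks and a list for the output order. The list-only version scans the whole list for every word, which is what makes it slow on a large corpus; a set lookup takes roughly constant time. The function name `unique_words` is hypothetical.

```python
def unique_words(text):
    """Return the words of text with repeats removed, keeping first-seen order."""
    seen = set()   # fast membership checks
    result = []    # preserves the order in which words first appear
    for word in text.split():
        if word not in seen:
            seen.add(word)
            result.append(word)
    return " ".join(result)


print(unique_words("First Line Second Line First Line"))  # First Line Second
```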

lukas · July 13, 2018 at 1:25 pm

What if I want to remove lines where only part of the line is the same?

Let’s say I have a txt file with thousands of lines. Some of the lines start with the same substring, and these are the ones I want to remove.

I have written some code similar to this, but it doesn’t work and I can’t figure out how to do it.

    admin · July 13, 2018 at 8:32 am

    In that case, don’t check the hash value. For each line, just check whether the substring exists in the current line. If it does, remove that line.
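
A minimal sketch of that suggestion, assuming the unwanted substring is known in advance and must appear at the start of the line. The function name `drop_lines_starting_with` and the sample data are hypothetical.

```python
def drop_lines_starting_with(lines, prefix):
    """Keep only the lines that do not begin with the given prefix."""
    return [line for line in lines if not line.startswith(prefix)]


# Illustrative data only.
sample = ["ERROR disk full", "INFO started", "ERROR timeout", "INFO stopped"]
print(drop_lines_starting_with(sample, "ERROR"))
# ['INFO started', 'INFO stopped']
```

To match the substring anywhere in the line instead of only at the start, the condition could be changed to `prefix not in line`.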

thrinath · October 19, 2018 at 8:58 pm

Hey, your script works fine, but it also deletes the blank lines. Is there a way to keep the blank lines?
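
One possible approach, sketched under the assumption that the script hashes each line after stripping trailing whitespace, as step 5 of the explanation describes: every blank line then has the same hash, so only the first one survives. Letting blank lines bypass the duplicate check keeps them all. The function name `unique_lines_keep_blank` and the SHA-256 choice are hypothetical.

```python
import hashlib


def unique_lines_keep_blank(lines):
    """Drop repeated lines but let every blank line through unchanged."""
    seen = set()
    kept = []
    for line in lines:
        stripped = line.rstrip()
        if stripped == "":
            kept.append(line)  # blank lines are never treated as duplicates
            continue
        digest = hashlib.sha256(stripped.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(line)
    return kept


print(unique_lines_keep_blank(["a\n", "\n", "a\n", "\n", "b\n"]))
# ['a\n', '\n', '\n', 'b\n']
```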
