Python tutorial to remove duplicate lines from a text file
In this tutorial, we will learn how to remove duplicate lines from a text file using Python. The program reads the lines of an input text file and writes them to an output text file. Before writing each line, it checks whether the same line has already been written and skips it if so. For example, for the following text file:
First Line
Second Line
First Line
First Line
First Line
The output will be:
First Line
Second Line
Let’s take a look at the algorithm first:
1. Open the input file in ‘read’ mode, because we only read the content of this file.
2. Open the output file in ‘write’ mode, because we write content to this file.
3. Read the input file line by line and check whether an identical line was already written to the output file.
4. If not, write the line to the output file and save its hash value. We check and store each line’s hash value instead of the full line. An MD5 hex digest is always 32 characters long, so this saves space when the lines are long.
5. If the line was already written, skip it.
6. When all lines have been processed, the output file contains the content of the input file without any duplicate lines.
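The steps above can be tried on an in-memory list of lines before any files are involved. This is only a minimal sketch: the function name `remove_duplicate_lines` and the variable names are made up for illustration, and it assumes that trailing whitespace should be ignored when two lines are compared.

```python
import hashlib


def remove_duplicate_lines(lines):
    """Return the lines in their original order, keeping only the first copy of each."""
    seen_hashes = set()  # MD5 hex digests of the lines kept so far
    unique_lines = []
    for line in lines:
        # Assumption of this sketch: trailing whitespace is stripped before hashing
        digest = hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            unique_lines.append(line)
            seen_hashes.add(digest)
    return unique_lines


sample = ["First Line\n", "Second Line\n", "First Line\n", "First Line\n", "First Line\n"]
result = remove_duplicate_lines(sample)
print(result)  # ['First Line\n', 'Second Line\n']
```

Only the first copy of each line is kept, and the order of the remaining lines does not change.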
Python program to remove duplicate lines from a text (.txt) file:
import hashlib

#1
output_file_path = "C:/out.txt"
input_file_path = "C:/in.txt"

#2
completed_lines_hash = set()

#3
output_file = open(output_file_path, "w")

#4
for line in open(input_file_path, "r"):
    #5
    hashValue = hashlib.md5(line.rstrip().encode('utf-8')).hexdigest()
    #6
    if hashValue not in completed_lines_hash:
        output_file.write(line)
        completed_lines_hash.add(hashValue)

#7
output_file.close()
The numbered comments in the program above correspond to the steps below:
1. Save the paths of the input and output files in two variables. Change these values to your own input and output file paths. In many terminals, you can drag and drop a file onto the window to find its path.
2. Create a set. We use a set because it holds only unique values: adding a value that is already present has no effect.
3. Open the output file in write mode by passing ‘w’ to open(), because we are going to write to this file.
4. Start a for loop that reads the input file line by line. The file is opened in read mode with ‘r’.
5. Compute the hash value of the current line with the hashlib library. Before hashing, rstrip() removes any trailing whitespace, including the newline, so two lines that differ only in trailing whitespace count as duplicates.
6. Check whether this hash value is already in the set. If it is not, the line has not been written to the output file yet, so write the line to the output file and add the hash value to the set.
7. Finally, close the output file.
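As a possible variation on the same steps, the files can be opened with `with` statements so that both are closed automatically, even if an error occurs while reading or writing. The sketch below is one way to do this, not the program from this tutorial: the function name `dedupe_file` is hypothetical, UTF-8 encoding is an assumption, and the demo uses temporary files so that no real paths are needed.

```python
import hashlib
import os
import tempfile


def dedupe_file(input_path, output_path):
    """Copy input_path to output_path, skipping lines that were already written."""
    seen_hashes = set()
    # Both files are closed automatically when the with block ends
    with open(input_path, "r", encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            digest = hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                dst.write(line)
                seen_hashes.add(digest)


# Demo on temporary files (hypothetical names in.txt and out.txt)
with tempfile.TemporaryDirectory() as tmp_dir:
    in_path = os.path.join(tmp_dir, "in.txt")
    out_path = os.path.join(tmp_dir, "out.txt")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("First Line\nSecond Line\nFirst Line\nFirst Line\nFirst Line\n")
    dedupe_file(in_path, out_path)
    with open(out_path, "r", encoding="utf-8") as f:
        output_text = f.read()

print(output_text)
```

With this layout there is no separate close step, and the input file is closed as well, which the loop over a bare open() call does not guarantee.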
I hope that you have found this article helpful. Try running the program, and leave a comment below if you have any questions.