Python tutorial to remove duplicate lines from a text file :
In this tutorial, we will learn how to remove duplicate lines from a text file using Python. The program will read the lines of an input text file one by one and write them to an output text file. Before writing each line, we will check whether the same line was already written. If it was, we will skip that line. For example, for the following text file :
First Line
Second Line
First Line
First Line
First Line
The output will be :
First Line
Second Line
Let’s take a look into the algorithm first :
1. First, open the input file in read mode, because we are only reading the content of this file.
2. Open the output file in write mode, because we are writing content to this file.
3. Read the input file line by line and, for each line, check whether the same line was already written to the output file.
4. If not, write this line to the output file and save the hash value of the line. We will check each line’s hash value instead of storing and comparing the full line. Each MD5 hex digest is a fixed 32 characters, so this saves memory when the lines are long.
5. If the line was already added, skip it.
6. After all lines are processed, the output file will contain the contents of the input file without any duplicate lines.
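The steps above can also be sketched as one reusable function. This is only a minimal sketch: the function name `remove_duplicate_lines`, the explicit `utf-8` encoding and the returned count are assumptions made for illustration. It uses `with` blocks so that both files are closed automatically.

```python
import hashlib


def remove_duplicate_lines(input_path, output_path):
    """Copy input_path to output_path, keeping only the first copy of each line.

    Hypothetical helper for illustration; the name and the utf-8
    encoding are assumptions.
    """
    seen_hashes = set()  # MD5 digests of the lines written so far
    with open(input_path, "r", encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Ignore trailing whitespace so "a\n" and "a" count as the same line
            digest = hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                dst.write(line)
                seen_hashes.add(digest)
    return len(seen_hashes)  # number of unique lines written
```

Calling `remove_duplicate_lines("in.txt", "out.txt")` on the first example file would write two lines to `out.txt` and return 2.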
Python program to remove duplicate lines from a text (.txt) file :
import hashlib

#1
output_file_path = "C:/out.txt"
input_file_path = "C:/in.txt"

#2
completed_lines_hash = set()

#3
output_file = open(output_file_path, "w")

#4
for line in open(input_file_path, "r"):
    #5
    hashValue = hashlib.md5(line.rstrip().encode('utf-8')).hexdigest()
    #6
    if hashValue not in completed_lines_hash:
        output_file.write(line)
        completed_lines_hash.add(hashValue)

#7
output_file.close()
The commented numbers in the above program denote the step number below :
1. First, save the paths of the input and output files in two variables.
2. Create one Set variable. We are using a Set because it can hold only unique values. No duplicate values can be added to a Set.
3. Open the output file in write mode.
4. Start one for loop to read from the input file line by line.
5. Find the hash value of the current line. We remove any trailing whitespace, including the newline, from the end of the line before calculating the hash.
6. Check whether this hash value is already in the Set variable. If it is not, the line has not been written to the output file yet. Write the line to the output file and add the hash value to the Set variable.
7. Finally, close the output text file.
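To see steps 2 and 5 in isolation, the short sketch below hashes a few strings and collects the results in a set. The helper name `line_hash` and the sample strings are made up for illustration. It shows that lines differing only in trailing whitespace produce the same hash, and that adding an existing value to a set changes nothing.

```python
import hashlib


def line_hash(line):
    # Hypothetical helper: MD5 hex digest of the line without trailing whitespace
    return hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()


seen = set()
for text in ["First Line\n", "First Line", "First Line   \n", "Second Line\n"]:
    seen.add(line_hash(text))  # adding an existing value leaves the set unchanged

print(len(seen))                        # 2: the three "First Line" variants collapse
print(len(line_hash("Second Line\n")))  # 32: an MD5 hex digest is always 32 characters
```

Note that `rstrip()` only touches the end of the line, so a line with extra leading spaces would still be treated as a different line.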
Sample Output :
For the following input file :
This is a line
This is a line
This is 3 a line
This is 4 a line
This is5 a line
This is a line
The output will be :
This is a line
This is 3 a line
This is 4 a line
This is5 a line
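The sample above can be checked without creating any files. The sketch below applies the same hash-and-skip idea to a list of strings; the generator name `unique_lines` is an assumption made for this example, and the list stands in for the lines of the input file.

```python
import hashlib


def unique_lines(lines):
    """Yield each line the first time it appears (hypothetical helper)."""
    seen_hashes = set()
    for line in lines:
        digest = hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            yield line


# The sample input file, one list item per line
sample = [
    "This is a line\n",
    "This is a line\n",
    "This is 3 a line\n",
    "This is 4 a line\n",
    "This is5 a line\n",
    "This is a line\n",
]

result = list(unique_lines(sample))
print("".join(result), end="")
# This is a line
# This is 3 a line
# This is 4 a line
# This is5 a line
```

Because `unique_lines` accepts any iterable of strings, an open file object could be passed to it in place of the list.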