In this tutorial, we will learn how to remove duplicate lines from a text file using Python. The program reads the input text file line by line and writes each line to a separate output text file.
While writing, it checks whether the same line has already been written and, if so, skips it. For example, for the following text file:
First Line
Second Line
First Line
First Line
First Line
The output will be:
First Line
Second Line
Let’s take a look at the algorithm first:
- First, open the input file in read mode, since we only read from it.
- Open the output file in write mode, since we write to it.
- Read the input file line by line and, for each line, check whether an identical line has already been written to the output file.
- If not, write the line to the output file and save its hash value in a set. We store and compare each line’s hash value rather than the full line. Every hash has the same fixed size, so this saves memory when the file is large and its lines are long.
- If the hash value is already in the set, skip the line.
- When all lines have been processed, the output file contains the contents of the input file without any duplicate lines.
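The hashing idea in the steps above can be tried in isolation before touching any files. The following is only a minimal sketch: the helper name `line_digest` and the in-memory list of lines are assumptions made for illustration, and MD5 is used simply because it is a widely available digest in `hashlib`.

```python
import hashlib


def line_digest(line):
    """Return the MD5 hex digest of a line, ignoring trailing whitespace.

    Hypothetical helper, introduced only for this illustration.
    """
    return hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()


seen_digests = set()   # digests of the lines kept so far
kept_lines = []        # stands in for the output file in this sketch

for line in ["First Line\n", "Second Line\n", "First Line\n"]:
    digest = line_digest(line)
    if digest not in seen_digests:
        # First time this line appears: keep it and remember its digest.
        seen_digests.add(digest)
        kept_lines.append(line)

print(kept_lines)  # ['First Line\n', 'Second Line\n']
```

An MD5 hex digest is always 32 characters long, whatever the length of the line, which is why the set stays compact when the lines themselves are long.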
import hashlib

#1
output_file_path = "C:/out.txt"
input_file_path = "C:/in.txt"

#2
completed_lines_hash = set()

#3
output_file = open(output_file_path, "w")

#4
for line in open(input_file_path, "r"):
    #5
    hashValue = hashlib.md5(line.rstrip().encode('utf-8')).hexdigest()
    #6
    if hashValue not in completed_lines_hash:
        output_file.write(line)
        completed_lines_hash.add(hashValue)

#7
output_file.close()
The numbered comments in the program above correspond to the steps below:
1. First, save the paths of the input and output files in two variables. Change these values to your own input and output file paths. You can drag and drop a file onto the terminal to find its path.
2. Create a set. We use a set because it holds only unique values; adding a value that is already present has no effect.
3. Open the output file in write mode. The built-in open() function opens a file, and the mode ‘w’ opens it for writing. Note that ‘w’ overwrites the file if it already exists.
4. Start a for loop that reads the input file line by line. The input file is opened in read mode, ‘r’.
5. Compute the hash value of the current line. rstrip() removes trailing whitespace, including the newline, before the hash is calculated, so lines that differ only in trailing whitespace count as duplicates. The hashlib library computes the hash, here an MD5 digest.
6. Check whether this hash value is already in the set. If not, the line has not been written to the output file yet, so write it to the output file and add its hash value to the set.
7. Finally, close the output file.
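As a variation on the same steps, the files can be opened with `with` statements so that both are closed automatically, even if an error occurs partway through. This is only a sketch under a few assumptions: the function name `remove_duplicate_lines` and its return value are made up for illustration, and the files are assumed to be UTF-8 encoded (the program above relies on the platform's default encoding instead).

```python
import hashlib


def remove_duplicate_lines(input_file_path, output_file_path):
    """Copy the input file to the output file, keeping only the first
    occurrence of each line. Returns the number of lines written.

    Hypothetical helper; assumes both files are UTF-8 text.
    """
    seen_digests = set()
    lines_written = 0
    # Both files are closed automatically when the with block ends.
    with open(input_file_path, "r", encoding="utf-8") as input_file, \
            open(output_file_path, "w", encoding="utf-8") as output_file:
        for line in input_file:
            # Same hashing as above: trailing whitespace is ignored.
            digest = hashlib.md5(line.rstrip().encode("utf-8")).hexdigest()
            if digest not in seen_digests:
                output_file.write(line)
                seen_digests.add(digest)
                lines_written += 1
    return lines_written
```

It could be called as `remove_duplicate_lines("C:/in.txt", "C:/out.txt")`, using the same paths as in the program above. Wrapping the logic in a function also makes it easier to reuse for several files.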
I hope that you have found this article helpful. Try to run the program and please contact us if you have any queries.