We all have those tedious little tasks in our everyday workflows that crop up over and over again. I often find myself needing to verify the integrity of files, or to wrangle data obtained from different systems in multiple formats. In this post, I will share five Python code snippets for working with files that I regularly use to simplify common tasks and save time.
Most of the code snippets in this post build on the humble Python open
built-in function, which takes the file name and, optionally, the mode for opening the file (such as 'r' for read, 'w' for write or 'a' for append, defaulting to 'r'). Various operations can then be performed on the file before it is closed. For example, the code below opens a file called “file-1.csv”, reads and prints its contents to the screen, and then closes it.
file = "file-1.csv" # basic file open and close f = open(file, 'r') print(f.read()) f.close()
A problem with the above code is that any error while working with the file could cause the script to exit before the file is safely closed. A better way to open a file is to use the with
context manager, which closes the file automatically, even if an exception occurs.
file = "file-1.csv" # better way using context manager with open(file, 'r') as f: print(f.read())
In a previous post, I covered some basic examples of interacting with Excel using Python. But what if we need to perform some tasks on our raw data files or directories before we import the data into Excel or a data analytics application?
Let’s have a look at a few helpful code snippets for working with basic text files:
Concatenating files
It often happens that I have a number of text files with the same layout that I need to consolidate into a single file. For example, CSV data split across several files that would make more sense to work with as a single source data file.
The code below takes a list of files in the infiles
variable and concatenates them in the order they appear in the list, removing blank rows in the process and writing the result to the filename specified by the outfile
variable:
# name of output file
outfile = 'heroes.csv'

# list of files to concatenate, in order of concatenation
infiles = ["file-1.csv", "file-2.csv", "file-3.csv"]

# function to concatenate files
def file_append(infile, outfile):
    with open(infile, 'r') as a:
        with open(outfile, 'a') as b:
            for line in a.readlines():
                # only write non-blank lines and make line endings consistent
                if line.strip() != '':
                    b.write(line.strip() + '\n')

# process files and concatenate
for filename in infiles:
    print(f"concatenating {filename}")
    file_append(filename, outfile)
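One wrinkle with concatenating CSV files this way is that any header rows get repeated in the output. If your files share a common header, a small variation of the function above can write the header only once. This is a sketch, assuming every input file starts with the same single header line (the function name file_append_skip_header is just for illustration):

# variation that writes the header row only once
# assumes every input file starts with the same single header line
def file_append_skip_header(infile, outfile, write_header):
    with open(infile, 'r') as a:
        with open(outfile, 'a') as b:
            for i, line in enumerate(a.readlines()):
                if line.strip() == '':
                    continue  # skip blank lines as before
                if i == 0 and not write_header:
                    continue  # drop the repeated header row
                b.write(line.strip() + '\n')

# only the first file in the list contributes its header
for n, filename in enumerate(infiles):
    print(f"concatenating {filename}")
    file_append_skip_header(filename, outfile, write_header=(n == 0))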
Counting the number of rows in a file
When extracting data from an external system, it is common to count the number of rows in the extracted data to see if it matches the expected export result. For smaller files, this can easily be determined by opening the file in a text editor. However, for larger files this can be a bit tedious, especially where the row counts of multiple files need to be checked.
The code below counts the number of rows in a text file, with the filename passed as a command-line argument to the script rowcount.py
. To count the rows in the output file generated by the previous Python snippet, and assuming that the script and the target file are in the same directory, we would issue the following command at the terminal: python3 rowcount.py heroes.csv
.
import sys

def count_rows(filename):
    count = 0
    with open(filename, 'r') as a:
        for row in a:
            count += 1
    print(count)

if __name__ == "__main__":
    count_rows(sys.argv[1])
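Since the row counts of several exports often need checking at once, the script can easily be extended to accept multiple filenames. A quick sketch, with count_rows returning the count rather than printing it so the caller controls the output:

import sys

def count_rows(filename):
    count = 0
    with open(filename, 'r') as a:
        for row in a:
            count += 1
    return count

if __name__ == "__main__":
    # report the row count for every filename passed on the command line
    for filename in sys.argv[1:]:
        print(f"{filename}: {count_rows(filename)}")

This would be invoked as, for example, python3 rowcount.py file-1.csv file-2.csv file-3.csv.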
Hashing a file
You may have noticed when downloading a file from the internet that it often has a hash checksum associated with the download. This hash is unique to the particular file and is used to verify that the downloaded file is identical to the original. Checksums are also useful when transferring data between systems, to ensure that data integrity is intact between source and destination. The hash function used may differ between data sources, but MD5 and SHA256 are common.
Linux, Mac and Windows systems have built-in hash-checking tools (such as md5sum and sha256sum on Linux, md5 and shasum on Mac, and certutil on Windows), but I sometimes find it easier to use these simple Python code snippets to achieve the same result.
To use the script md5_checksum.py
, in a terminal with both the script and the target file in the same directory, execute the command python3 md5_checksum.py heroes.csv
, where heroes.csv
is the target file. This should print an MD5 hash of the file to the terminal, for example e4e2b4114006b1df9b3d470db83270a6
(your result will differ).
import sys
from hashlib import md5

def md5_checksum(file):
    with open(file, 'rb') as f:
        return md5(f.read()).hexdigest()

if __name__ == "__main__":
    print(md5_checksum(sys.argv[1]))
Similarly, to calculate a SHA256 hash of the input file, the terminal command would be python3 sha256_checksum.py heroes.csv
:
import sys
from hashlib import sha256

def sha256_checksum(file):
    with open(file, 'rb') as f:
        return sha256(f.read()).hexdigest()

if __name__ == "__main__":
    print(sha256_checksum(sys.argv[1]))
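Both scripts read the whole file into memory before hashing, which is fine for small files but can be wasteful for very large ones. A chunked variant keeps memory usage flat regardless of file size. This is a sketch; the 64 KB block size is an arbitrary choice:

import sys
from hashlib import sha256

def sha256_checksum(file, blocksize=65536):
    h = sha256()
    with open(file, 'rb') as f:
        # feed the hash in fixed-size blocks so the whole file
        # never needs to fit in memory at once
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

if __name__ == "__main__":
    print(sha256_checksum(sys.argv[1]))

The same pattern works for md5_checksum.py by swapping sha256 for md5.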
Get a list of files and directories
When working with multiple files, it is often impractical to hard-code every filename in our Python code.
For example, I sometimes find that where I have a lot of files to iterate through, or where I need to run the same code against different filenames, it becomes tedious to keep updating the file names in the code.
In such instances, it makes far more sense to dynamically get a list of files or directories from within the code and then work with those file objects.
The Python code snippet below extracts the file and directory names in the same directory as the script, returning the list of files as files
and the list of directories as directories
.
The code can be modified to only extract filetypes that are applicable to the use case, such as filenames ending in “.txt” or “.csv” (see the example at the end of this post).
Any actions that need to be performed on either the files or directories can then be done by iterating through the lists.
import os

# set current directory
currentdir = os.sys.path[0]

# blank lists to store files and directories
files = []
directories = []

# parse the current directory and add file names to the list
for name in os.listdir(currentdir):
    # join the directory so the checks work regardless of where the script is run from
    fullpath = os.path.join(currentdir, name)
    if os.path.isfile(fullpath):
        files.append(name)
    elif os.path.isdir(fullpath):
        directories.append(name)

print(files)
print(directories)
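As an aside, when you only need files matching a pattern, the standard-library glob module can do the filtering in a single call. A minimal sketch of the same idea:

import glob
import os

# collect only ".csv" files in the script's directory;
# glob returns full paths because the pattern includes the directory
currentdir = os.sys.path[0]
csv_files = glob.glob(os.path.join(currentdir, "*.csv"))
print(csv_files)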
Putting it all together
I hope you found the above code snippets helpful. Feel free to play around with modifying and mixing the code into your own scripts.
The example below combines some of the code from above into a script that concatenates all “.csv” files in the current directory and prints the number of rows in the output file to the terminal:
import os

# function to concatenate files
def file_append(infile, outfile):
    with open(infile, 'r') as a:
        with open(outfile, 'a') as b:
            for line in a.readlines():
                # only write non-blank lines and make line endings consistent
                if line.strip() != '':
                    b.write(line.strip() + '\n')

# function to count rows
def count_rows(filename):
    count = 0
    with open(filename, 'r') as a:
        for row in a:
            count += 1
    return count

# set current directory
currentdir = os.sys.path[0]

# blank list to store files
files = []

# name of output file
outfile = "heroes.csv"

# parse the current directory and add file names of type ".csv" to the list,
# skipping the output file itself in case the script has been run before
for name in os.listdir(currentdir):
    fullpath = os.path.join(currentdir, name)
    if os.path.isfile(fullpath) and name.endswith(".csv") and name != outfile:
        files.append(fullpath)

# process files and concatenate
for filename in files:
    print(f"concatenating {filename}")
    file_append(filename, outfile)

print(count_rows(outfile))
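As an optional final step, the checksum code from earlier can be bolted on to fingerprint the combined file, which is handy if you need to verify later that the output has not changed. A sketch, appended to the end of the script above:

from hashlib import md5

# optional: print an MD5 fingerprint of the combined file for later verification
with open(outfile, 'rb') as f:
    print(md5(f.read()).hexdigest())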
All of the above Python code snippets for working with files can be found in my GitHub repository.