Before I start this week’s “byte” (which is actually my first experiment with a series!) I want to thank everyone who made the last post such a success. Nothing gives me more pleasure than seeing people have fun with something I wrote. That is the greatest reward. So, a BIG THANK YOU!
This week, and in the following few weeks, we will look into an alternative way of dealing with file I/O.
You may ask, “Why?” Given that Python already has a simple yet elegant way of dealing with files, why would one need to look into an alternative?
To answer this question we will need a really LARGE text file. I am not going to download one; instead I will generate a file full of random text (this trick is Linux-only, so Windows folks will have to get a large file some other way, or see the Python sketch below). To do that, I just use the following command—
tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w 100 | head -n 1000000 > bigfile.txt
On my Ubuntu machine this generates a test file of roughly 100MB.
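If you are on Windows (or simply prefer Python), here is a rough equivalent of that pipeline: a sketch that writes a million lines of 100 random characters each. It is noticeably slower than the shell version, but it produces the same kind of file.

import random
import string

# Mimic the tr/fold/head pipeline: 1,000,000 lines of
# 100 random characters (letters, digits, spaces) each.
chars = string.ascii_letters + string.digits + " "

with open("bigfile.txt", "w") as f:
    for _ in range(1_000_000):
        f.write("".join(random.choices(chars, k=100)) + "\n")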
Now, I am going to perform the simplest experiment possible. I will read the whole file, all at once, and then return the content. I design a simple function to do that—
def read_file_normally(filename):
    with open(filename, "r") as f:
        return f.read()
And then I try to benchmark the performance—
import timeit
timeit.repeat('read_file_normally("bigfile.txt")', setup='from __main__ import read_file_normally', number=100)
On my machine the outputs look like the following—
[8.12128299199685,
8.05440250999527,
8.057159557996783,
8.04197331299656,
8.028334376998828]
So each batch of 100 calls takes roughly 8 seconds. Keep in mind that timeit.repeat with number=100 reports the total time for 100 executions, so that works out to about 80ms per call. The larger the file, the bigger the numbers. Feel free to generate even bigger files with the command above and repeat the experiment. (I was using an IPython shell for this work, FYI.)
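Speaking of IPython: if you are following along there, the %timeit magic gives you the same kind of measurement with less typing. A quick sketch (the -r and -n flags mirror the repeat and number arguments above):

# In an IPython session, after defining the function:
%timeit -r 5 -n 100 read_file_normally("bigfile.txt")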
Can we do better?
Enter mmap.
What mmap is, what kind of underlying system machinery it uses, what its memory management policy is, and why it almost always works better on a 64-bit system with a 64-bit Python implementation are some of the topics we are going to discuss in the coming weeks. But for now, let’s just enjoy the magic.
We design a second function to read the same file, only this time we “memory map” this file—
import mmap

def read_file_mmap(filename):
    # Open in binary mode: mmap works on the raw bytes of the file,
    # and mf.read() returns bytes rather than str.
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mf:
            return mf.read()
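(A quick teaser of why this object is interesting beyond raw speed: a memory map behaves like a bytes-like object, so you can slice into it or search it without pulling the whole file into memory. A small sketch; much more on this in the coming weeks.)

with open("bigfile.txt", "rb") as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mf:
        first_chunk = mf[:100]    # slice without reading the whole file
        pos = mf.find(b"Python")  # search the mapping like bytes (-1 if not found)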
We perform the exact same benchmark again—
import timeit
timeit.repeat('read_file_mmap("bigfile.txt")', setup='from __main__ import read_file_mmap', number=100)
And here is the output
[4.305824749986641,
4.3376961079920875,
4.288884380992386,
4.287004768004408,
4.274775956990197]
OH!! We have cut the time down by almost 50%!! That is a huge performance gain. (At least part of the win, by the way, comes from the fact that mf.read() hands back raw bytes, while the text-mode f.read() also has to decode them into str; where the rest comes from is exactly what we will unpack in the coming weeks.)
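If you want to convince yourself that both functions really see the same data, here is a quick sanity check. Remember the type difference: the normal version returns str while the mmap version returns bytes, hence the encode.

normal = read_file_normally("bigfile.txt")
mapped = read_file_mmap("bigfile.txt")

# Text-mode read gives str, mmap gives bytes, so encode before comparing.
# (On Windows, text mode also translates newlines, so this check is
# only guaranteed to pass on Linux/macOS.)
assert normal.encode() == mapped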
Windows users: you have a slightly different version of mmap, so please check the official docs here. That said, this code works in my Windows 10 VirtualBox VM, where the time is around 55 seconds for the normal version and 15 seconds for the mmap version (almost a 4x speedup! As the French say, “Pas mal” :P).
Hope you enjoyed the article and are looking forward to learning more in the coming weeks. Meanwhile, please do like, share, and subscribe. See you soon :)
From around the web
Handling exceptions in Python like a pro - https://blog.guilatrova.dev/handling-exceptions-in-python-like-a-pro/
MUM: A new AI milestone for understanding information - https://blog.google/products/search/introducing-mum/
Hacker's guide to deep-learning side-channel attacks: the theory - https://elie.net/blog/security/hacker-guide-to-deep-learning-side-channel-attacks-the-theory/
What Facebook Software Engineers Actually Do - https://miraan.co.uk/posts/what-facebook-software-engineers-actually-do/