Huge Email Clean-up and Verification




"There will always be a time that your company will have a dirty stack of email addresses from way back that you need to clean and scrub to see which email addresses are still alive and kicking."
- Son Goku 1884








In this guide, we will tackle how we can clean-up email addresses over 700,000 rows of user accounts, that is 300MB worth of data.

I am the company’s web developer so at first, I looked if a web service can fulfill the requirement. The server can surely handle the processing BUT the upload time will be long and boring. Since I like things fast and instant, I decided to go with Python 2.7.

Our Goals are as follows:
1. To check if the email address string is without typos.
2. The email address should not be a role email nor a disposable one.
3. The email server actually exist and has a Mail Exchanger Record or MX Record.
4. Then if all passed, we will determine if the email address actually is deliverable, actually exists, or is active.

For the lazy ones, you may get skim.py from my Gitlab snippet.


But for the hardcore ones who lets their hands get dirty for the sake of knowledge, continue reading below.



Skimming the Sea S.V. (CSV)

So my colleague just gave me this enormous 350MB CSV file that contains 700k plus rows of data. What is he thinking!? Anyway, My script’s job is to skim through the file looking for valid email addresses. Too much for a manual labor, right?

Actual Coding FTW!

For this task, I created a python script that uses the following libraries:
 
import dns.resolver
import csv
import re
 
We import the CSV library so that the script will be able to read our huge CSV file, While the RE library is to ensure that we will be able to use Regular Expression. This is to check if the email string is valid.
The DNS on the other hand, will be used to determine if the domain does have an MX Record. (we will use this in Part 2. For now, just let it sit there.)

Now under those imports, the variables are declared:

 

#path to csv file
csvfile = 'path/to/your/HugeEmailList.csv'
#Column index number of the Email address.
emailcol = 7 
#Same as above, but for Username. This is optional.
usercol = 5 
#Final file to be generated for valid emails
validEmails = 'desired/path/to/the/valid_emails.csv' 
#Final file to be generated for invalid emails
invalidEmails = 'desired/path/to/the/invalid_emails.csv' 


So as the variable says, we will pluck the email addresses and usernames from the CSV file and separate the valid and the invalid ones. Simple.
Before we get in to the next part, let’s first create the file outputs of the script. This will be the placeholder of the final output.

f = open(invalidEmails,"w")
f.write("username,email\n");
f.close()
f = open(validEmails,"w")
f.write("username,email\n");
f.close()

Now let’s declare a function that will handle the email string validation:

Python indentation will be intact, each box is a part of a whole.

def isValidEmail(email):
    if len(email) > 7:
        if re.match(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$", email) != None:
            roles = ['postmaster','admin','webmaster','info','support','help','sales','marketing','developer']

Here we defined and opened a function called isValidEmail(email) which needs 1 string parameter, obviously it needs the email string and obviously it will return a Boolean. Valid emails will always have more than 7 characters, ie. gmail.com is already 8 characters… That’s the reason of the len condition.
We then used a simple Regular Expression matching to check if the email string resembles an email address (string-String_string123@string.string). If you have a better RegEx, please do post it on the comments oh enlightened one.


Said no normie programmer ever

 

Anyhow…

As you see, we don’t want Role emails in our list. There is a HUGE portion of websites that uses these emails as honeypot spam trap. The thing for them is “All emails that sends to these role emails are automatically a spammer.” Which is not wrong. So we will invalidate any email that has these words in them just to keep our domain name’s reputation good.


Now from this ancient list, let’s put the disposable email servers in to a huge array:

 

“Add more, I insist. The list above is 5 years old when this blog post was made and I’m too lazy to look for new ones. You can’t be the only lazy person here.”
- Einstein

            disposables = ['example.com','0-mail.com','0815.ru','0clickemail.com','0wnd.net','0wnd.org','10minutemail.com','20minutemail.com','2prong.com','30minutemail.com','3d-painting.com','4warding.com','4warding.net','4warding.org','60minutemail.com','675hosting.com','675hosting.net','675hosting.org','6url.com','75hosting.com','75hosting.net','75hosting.org','7tags.com','9ox.net','a-bc.net','afrobacon.com','ajaxapp.net','amilegit.com','amiri.net','amiriindustries.com','anonbox.net','anonymbox.com','antichef.com','antichef.net','antispam.de','baxomale.ht.cx','beefmilk.com','binkmail.com','bio-muesli.net','bobmail.info','bodhi.lawlita.com','bofthew.com','brefmail.com','broadbandninja.com','bsnow.net','bugmenot.com','bumpymail.com','casualdx.com','centermail.com','centermail.net','chogmail.com','choicemail1.com','cool.fr.nf','correo.blogos.net','cosmorph.com','courriel.fr.nf','courrieltemporaire.com','cubiclink.com','curryworld.de','cust.in','dacoolest.com','dandikmail.com','dayrep.com','deadaddress.com','deadspam.com','despam.it','despammed.com','devnullmail.com','dfgh.net','digitalsanctuary.com','discardmail.com','discardmail.de'.... this list goes on, you should update it...

So we just declared a huge pile of disposable email servers. “Disposable” as the word plainly state, any email from these servers are GARBAGE and USELESS for the marketing team at the least. I love these servers since I use them for app testing and of course for Stuff… BUT here in our script, they are not allowed to stay. It is also wise to add EXAMPLE.COM and ASDF.COM as well as QWERTY.COM just to screen those lazy gibberish already and not allowing them to even be checked for MX Records.


 


So the script continues…

 

            server = email.split('@')[1]
            for role in roles:
                if role in email:
                    return False
            if server in disposables:
                return False
            return true
        else
            return false


Completing our isValidEmail() function, we now split the email address in to 2 since it passed the RegEx condition. We will now have the LOCAL-PART and the Domain name.
[image]
I bet you didn’t know what Local-part was…
For each role in the roles array, we’ll check if there’s a match inside our email address. If there is, then the email is invalid right and we will then proceed hitting it with a bat or a chair. But if it is not a role email, we will then check if its Domain name is inside our Disposable domains array. If all are good, the function will return TRUE! And there shall be a way, and a life.


Okay… Now we have the heart of the script done, we can now proceed on the next part of the script.

 

Reading and Iterating the CSV Data

 

“Again, indentation will be intact. Each box is still part of the whole script.”
- Bach


with open(csvfile, 'rb') as csvfile:
    print " Skimming "+str(csvfile)+" CSV File, EMAIL column is index "+str(emailcol)+". Hit Ctrl+C and modify.\n A file named "+str(validEmails)+" will be created.\n\n"
    emailreader = csv.reader(csvfile, delimiter=',')
    line = 0
    for row in emailreader:
        line = line+1
        try:
            email = row[emailcol]
            username = row[usercol]
            usermail = email.strip()
            email = email.split('@')
            re.sub('\s+', '', username)
            username = username.strip()


The script just opened the CSV and is RB Reading the file in Binary. It then proceed to use CSV.reader() method, the delimiter parameter value is the one that separate your CSV columns. You should check your file how the columns are separated. We declared a variable named “line” just for the sole purpose of error logging.
[image]
Wow! You can actually make the script say something using the command “print”



Now for each row inside emailreader, we start a try-catch condition.
We then tried to clean the username string, this was optional but I did it anyway because I can.


            if email[0] != "email":
                try:
                    with open(validEmails, 'ab') as validCSV:
                        fieldnames = ['username', 'email']
                        writer = csv.DictWriter(validCSV, fieldnames=fieldnames,delimiter=',')
                        if isValidEmail(usermail):
                            writer.writerow({'username': username, 'email': usermail})
                            print "added "+usermail
                        else:
                            with open(invalidEmails, 'a') as invalidCSV:
                                fieldnames = ['username', 'email']
                                writer = csv.DictWriter(invalidCSV, fieldnames=fieldnames,delimiter=',')                                
                                writer.writerow({'username': username, 'email': usermail})

Now in the condition above, we have if email[0] != “email”. We asked this because in our CSV file, the first row was the declaration of the column names and it so happened that our email column name is “email“. Just a co-incidence really. So, we don’t want that row.

We then started a try-catch condition again, here we tried to open the validEmails file created earlier as a CSV file and we declared it as “validCSV”, we then declared our anticipated fieldnames “Username” and “Email”.

With this, we opened the validCSV variable for writing. To clarify, the validEmails variable is the path to the valid_emails.txt file and the validCSV is the variable where we will write the data we manipulated.

We then use the isValidEmail() function to ask if is the email passes our criteria. If it does, we will then write the value in to the CSV file. We do this by passing the values using JSON string format. Else, if it is invalid, we will write the value to the invalidEmails CSV file. It’s not that complicated.


                except Exception, e:
                    d = open("error.log","a")
                    d.write("Line number "+str(line)+": "+str(e)+"\n");
                    print "Error 1"
                    print e
                    d.close()               
                    pass
                continue          
        except Exception, e:
            d = open("error.log","a")
            d.write("Line number "+str(line)+": "+str(e)+"\n");
            print "Error 2"
            print e
            d.close()       
            pass
        continue
    print "Task Done!"
 

The code chunk above is just to catch errors. Only because I wanted to log the errors in to a log file. Just a simple file open-write-close commands.

The PASS statement allows you to handle the condition without the loop being distrupted; all code will continue unless a break or other statement occurs.

The CONTINUE statement gives you the option to skip over the part of a loop where an external condition is triggered, but to go on to complete the rest of the loop. The current iteration of the loop will be disrupted, but the program will return to the top of the loop.
 
We then printed “Task Done!” to indicate that the objective of our endeavor has come to a desirable result. And you get to keep your job 50%

The SKIM.PY script above has cleaned our CSV file from 700,000 plus accounts to just 352,972 valid email strings. Those damned fake emails man, they just don’t die, don't they?

Why 50%?

Take note, we only had the valid email strings. Not exactly a working, active nor an existing email. In the next guide, we shall tackle how to check if an email is working or not.


Comments

Popular posts from this blog

Terraform: Merge a List ofObjects into One