Monday, June 22, 2009

Email Scraper - In Python with urllib and regular expressions

The past few weeks I have been messing around on a freelance programming site. Most of my work at Sandia lately has been writing (documents in English), with some IT work and configuration thrown in. This site seems to be a cool way to make some extra cash ($20 so far, from my only successful bid) and have fun writing programs.

One potential customer wanted someone to manually go to a bunch of web pages and harvest contact information. I started an email conversation with the guy, mentioning that I thought the job could be automated instead. So far I have a quick prototype that I put together using urllib and regular expressions in Python. If he picks up the project, I think I can find or create a better regular expression for email addresses and clean up the data. For now, I just wanted to mess with writing some sort of email harvester because I thought it would be fun (I have no aspirations towards becoming a spammer).

The code takes a list of fully qualified URLs, one per line. Here is the list the potential customer gave me.
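This is not the prototype itself, just a minimal sketch of the general approach described above: read URLs from a file, fetch each page with urllib, and pull out anything that looks like an email address with a regular expression. It assumes Python 3 (`urllib.request`), and the pattern and the function names (`read_urls`, `extract_emails`, `fetch`, `harvest`) are all illustrative assumptions. The regular expression is deliberately simple and will miss some valid addresses and accept some invalid ones.

```python
import re
import urllib.request

# Deliberately loose pattern (an assumption, not the regex from the prototype).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def read_urls(path):
    """Return the non-blank lines of a text file, one URL per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def extract_emails(text):
    """Return the unique, lowercased addresses in text, in first-seen order."""
    found = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in found:
            found.append(address)
    return found


def fetch(url, timeout=10):
    """Download a page and decode it leniently as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def harvest(urls, fetcher=fetch):
    """Map each URL to the addresses found on it; failed fetches map to []."""
    results = {}
    for url in urls:
        try:
            results[url] = extract_emails(fetcher(url))
        except (OSError, ValueError):
            # Network errors (URLError is an OSError) or a malformed URL.
            results[url] = []
    return results
```

Calling `harvest(read_urls("urls.txt"))` (where `urls.txt` is a hypothetical file name) would return a dictionary keyed by URL. The `fetcher` argument exists only so the extraction logic can be tried out without touching the network.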

Here are the results after running my program:

brian@ubuntu-bind:~/tmp/other_programs/rent_a_coder/web_grabber$ time ./

real 0m9.527s
user 0m0.312s
sys 0m0.208s

Not too great, but pretty good for about an hour of work and a couple of questions to my friend Aaron, who is awesome at Python. If I ever seriously want to write an email scraper (either for myself or a customer), I'll get a better regular expression, clean the output up, make it multithreaded, and dump the email addresses to a database.
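The timing above hints at why threads would help: about 9.5 seconds of wall-clock time against roughly half a second of CPU time means the program spends most of its run waiting, presumably on the network. Here is a rough sketch of what the multithreaded, database-backed version could look like. It assumes Python 3, a thread pool from `concurrent.futures`, and SQLite as the database; the table layout and the names `scrape` and `harvest_to_db` are made up for illustration.

```python
import re
import sqlite3
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same deliberately loose pattern as a simple prototype might use (an assumption).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def fetch(url, timeout=10):
    """Download a page and decode it leniently as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(url, fetcher=fetch):
    """Return (url, set of lowercased addresses); an empty set on failure."""
    try:
        page = fetcher(url)
    except (OSError, ValueError):
        return url, set()
    return url, {match.lower() for match in EMAIL_RE.findall(page)}


def harvest_to_db(urls, connection, fetcher=fetch, workers=8):
    """Fetch pages in worker threads and store the addresses in SQLite."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS emails ("
        "address TEXT NOT NULL, source_url TEXT NOT NULL, "
        "UNIQUE (address, source_url))"
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Downloads run in the pool; results are consumed here, so only
        # the calling thread ever touches the SQLite connection.
        for url, addresses in pool.map(lambda u: scrape(u, fetcher), urls):
            connection.executemany(
                "INSERT OR IGNORE INTO emails (address, source_url) VALUES (?, ?)",
                [(address, url) for address in sorted(addresses)],
            )
    connection.commit()
```

Keeping all database writes in the calling thread sidesteps SQLite's restrictions on sharing a connection across threads, and the `UNIQUE` constraint with `INSERT OR IGNORE` makes reruns over the same URL list harmless.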

I may or may not ever actually post the code to this one, depending on how the sales go. If you would like to see the code, leave me a comment with how to get in touch with you.
