Monday, June 22, 2009

Email Scraper - In Python with urllib and regular expressions




For the past few weeks I have been messing around on rentacoder.com. Most of my work at Sandia lately has been writing (documents in English), with some IT work and configuration thrown in. The site seems to be a cool way to make some extra cash ($20 so far, from just one success) and have fun writing programs.

One potential customer wanted someone to manually go to a bunch of web pages and harvest contact information. I started an email conversation with the guy, mentioning that I thought there was a better way: automate it. So far I have a quick prototype that I put together using urllib and regular expressions in Python. If he picks up the project, I think I can find or write a better regular expression for email addresses and clean up the data. For now, I just wanted to mess with writing some sort of email harvester because I thought it would be fun (I have no aspirations towards becoming a spammer).

The code takes a list of fully qualified URLs, one per line. Here is the list the potential customer gave me:

http://weprintbarcodes.com
http://accstation.com
http://escan3d.com
http://edealsdepot.com
http://sandboxthreads.com
http://wildlifewonders.com
http://foreverbamboo.com
http://topsecretautomaticmoney.com
http://armormount.com
http://myjones.com
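
Reading a list like that and pulling down each page takes only a few lines. What follows is a minimal sketch rather than the prototype itself: it assumes Python 3's urllib.request (a 2009 script would have used the older urllib module), and the names load_urls, fetch, main and the file urls.txt are made up for illustration.

```python
from urllib.request import urlopen


def load_urls(lines):
    """Return the non-blank lines, stripped of surrounding whitespace."""
    return [line.strip() for line in lines if line.strip()]


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Decode leniently: one bad byte should not abort the whole run.
        return response.read().decode("utf-8", errors="replace")


def main(path="urls.txt"):
    """Fetch every URL listed in `path` (a hypothetical file name)."""
    with open(path) as handle:
        for url in load_urls(handle):
            try:
                page = fetch(url)
            except OSError as exc:  # URLError and socket timeouts are OSErrors
                print(f"skipped {url}: {exc}")
                continue
            print(url, len(page), "characters")
```

Calling main() with the list above saved as urls.txt would print one line per site; the regular expression step then runs over each downloaded page.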

Here are the results after running my program:

brian@ubuntu-bind:~/tmp/other_programs/rent_a_coder/web_grabber$ time ./grabber.py
customerservice@weprintbarcodes.com
href="mailto:feedback@edealsdepot.com">Contact
freebies@sandboxthreads.com
src="https://p10.secure.hostingprod.com/@sandboxthreads.com/ssl/ecomby_128bit2.gif"
src="https://p10.secure.hostingprod.com/@sandboxthreads.com/ssl/paypal.gif"
Sculpture","http://ep.yimg.com/ip/I/wildlifegifts_2055_31879747","795","-@NULL@-");var
href="mailto:info@wildlifewonders.com">info@wildlifewonders.com

real 0m9.527s
user 0m0.312s
sys 0m0.208s
brian@ubuntu-bind:~/tmp/other_programs/rent_a_coder/web_grabber$


Not too great, but pretty good for about an hour of work and a couple of questions to my friend Aaron, who is awesome at Python. If I ever seriously want to write an email scraper (either for myself or a customer), I'll get a better regular expression, clean the output up, make it multithreaded, and dump the email addresses to a database.
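
Those improvements could look roughly like the sketch below. It is only one way to do it, under stated assumptions: Python 3, a deliberately simplified address pattern (nowhere near full RFC 5322, and it can still be fooled by things like image names containing an @), a thread pool from concurrent.futures for the multithreading, and SQLite as the database. The names EMAIL_RE, extract_emails, harvest, save_emails and the emails table are all invented for this example.

```python
import re
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Simplified pattern (an assumption, not full RFC 5322): a local part, "@",
# one or more dot-terminated domain labels, then an alphabetic TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique addresses found in text, lowercased and sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def harvest(urls, fetch_page=fetch, workers=8):
    """Fetch pages concurrently and return the set of addresses found."""
    def scrape(url):
        try:
            return extract_emails(fetch_page(url))
        except OSError:
            return []  # unreachable site: skip it and keep going

    found = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for emails in pool.map(scrape, urls):
            found.update(emails)
    return found


def save_emails(conn, emails):
    """Store addresses in SQLite; the primary key drops duplicates."""
    conn.execute("CREATE TABLE IF NOT EXISTS emails (address TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO emails (address) VALUES (?)",
        [(address,) for address in emails],
    )
    conn.commit()
```

Matching only the address itself, instead of whatever whitespace-delimited chunk happens to contain an @, is what gets rid of the href= and src= noise, and lowercasing into a set collapses repeats like the doubled wildlifewonders.com entry. Threads fit here because nearly all of the nine-second run is spent waiting on the network (note the tiny user and sys times).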

I may or may not ever actually post the code to this one, depending on how the rentacoder.com sales go. If you would like to see the code, leave me a comment with how to get in touch with you.
