Monday, June 22, 2009

Email Scraper - In Python with urllib and regular expressions




For the past few weeks I have been messing around on rentacoder.com. Most of my work at Sandia lately has been writing (documents in English, not code), with some IT work and configuration thrown in. The site seems to be a cool way to make some extra cash ($20 so far, with only one success) and have fun writing programs.

One potential customer wanted someone to manually go to a bunch of web pages and harvest contact information. I started an email conversation with the guy, mentioning that I thought the job could be automated instead. So far I have a quick prototype that I put together using urllib and regular expressions in Python. If he picks up the project, I think I can find or create a better regular expression for email addresses and clean up the data. For now, I just wanted to mess with writing some sort of email harvester because I thought it would be fun (I have no aspirations towards becoming a spammer).

The code takes a list of fully qualified URLs, one per line. Here is the list the potential customer gave me:

http://weprintbarcodes.com
http://accstation.com
http://escan3d.com
http://edealsdepot.com
http://sandboxthreads.com
http://wildlifewonders.com
http://foreverbamboo.com
http://topsecretautomaticmoney.com
http://armormount.com
http://myjones.com
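
For anyone curious about the general shape of such a prototype, here is a minimal sketch. It is not the actual grabber.py. It assumes Python 3's urllib.request (a 2009 script would have used Python 2's urllib), a deliberately loose pattern that grabs any whitespace-delimited token containing an "@", and a hypothetical input file named urls.txt. The function names are made up for illustration.

```python
import re
import urllib.request

# Assumed loose pattern: any run of non-whitespace characters containing "@".
# This is an illustration only; the real prototype's pattern is not shown.
LOOSE_EMAIL = re.compile(r"\S+@\S+")


def find_candidates(html):
    """Return every whitespace-delimited token in html that contains an '@'."""
    return LOOSE_EMAIL.findall(html)


def fetch(url, timeout=10):
    """Download a page and decode it leniently; return '' on URL or socket errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers bad URLs.
        return ""


def harvest(url_file):
    """Yield candidate addresses for each URL listed one per line in url_file."""
    with open(url_file) as handle:
        for line in handle:
            url = line.strip()
            if url:
                yield from find_candidates(fetch(url))


if __name__ == "__main__":
    for candidate in harvest("urls.txt"):  # hypothetical file name
        print(candidate)
```

A pattern this loose is quick to write, but it keeps whatever markup happens to be glued to the address, so tokens such as href="mailto:..." come through untouched.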

Here are the results after running my program:

brian@ubuntu-bind:~/tmp/other_programs/rent_a_coder/web_grabber$ time ./grabber.py
customerservice@weprintbarcodes.com
href="mailto:feedback@edealsdepot.com">Contact
freebies@sandboxthreads.com
src="https://p10.secure.hostingprod.com/@sandboxthreads.com/ssl/ecomby_128bit2.gif"
src="https://p10.secure.hostingprod.com/@sandboxthreads.com/ssl/paypal.gif"
Sculpture","http://ep.yimg.com/ip/I/wildlifegifts_2055_31879747","795","-@NULL@-");var
href="mailto:info@wildlifewonders.com">info@wildlifewonders.com

real 0m9.527s
user 0m0.312s
sys 0m0.208s
brian@ubuntu-bind:~/tmp/other_programs/rent_a_coder/web_grabber$


Not too great (several of those lines are HTML fragments rather than addresses), but pretty good for about an hour of work and a couple of questions to my friend Aaron, who is awesome at Python. If I ever seriously want to write an email scraper (either for myself or a customer), I'll get a better regular expression, clean up the output, make it multithreaded, and dump the email addresses to a database.
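
Those improvements could be sketched roughly as follows. This is only an illustration under stated assumptions: the EMAIL pattern is a simplified one (not RFC 5322 compliant), the function names and the emails table are hypothetical, and fetch stands for any callable that maps a URL to page text. It uses only the standard library (re, concurrent.futures, sqlite3).

```python
import re
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Simplified address pattern (an assumption, not a full RFC 5322 matcher):
# local part, "@", one or more domain labels, then an alphabetic TLD.
EMAIL = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_emails(text):
    """Return unique, lowercased addresses in order of first appearance."""
    return list(dict.fromkeys(m.lower() for m in EMAIL.findall(text)))


def harvest(urls, fetch, workers=8):
    """Fetch pages concurrently, then extract addresses from all of them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetch, urls)  # results come back in input order
        return extract_emails("\n".join(pages))


def save(connection, addresses):
    """Insert addresses into a hypothetical emails table, skipping duplicates."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS emails (address TEXT PRIMARY KEY)"
    )
    connection.executemany(
        "INSERT OR IGNORE INTO emails (address) VALUES (?)",
        [(address,) for address in addresses],
    )
    connection.commit()
```

Run over the raw output lines above, extract_emails keeps only the four real addresses and drops the hostingprod.com image URLs and the "-@NULL@-" fragment, because none of those has a valid local part followed by a dotted domain.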

I may or may not ever actually post the code to this one, depending on how the rentacoder.com sales go. If you would like to see the code, leave me a comment with how to get in touch with you.
