in reply to Reaped: Perl Programs that can retrieve email addresses from web pages

You might start with HTML::Parser or HTML::TokeParser (the latter is a "simple" version of the former).
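
For example, a rough HTML::TokeParser sketch that pulls mailto: addresses out of a saved page might look like this (the file name is a placeholder; in practice you'd fetch the page with LWP first):

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    # 'page.html' is a placeholder for whatever page you fetched.
    my $parser = HTML::TokeParser->new('page.html')
        or die "Can't open page.html: $!";

    my %seen;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href} or next;          # attribute hash is element 1
        next unless $href =~ /^mailto:(.+)$/i;
        my $addr = $1;
        $addr =~ s/\?.*$//;                          # strip any ?subject=... part
        print "$addr\n" unless $seen{lc $addr}++;    # report each address once
    }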

Grabbing all <a href="mailto:*"> type tags is the only "ethical" way to grab emails from the web-- and even then, email addresses often show up in such tags without the explicit consent of the owner of the email address. Make sure to validate them, since many of us hate getting email at auto-harvested addresses and butcher our mailto tags accordingly. My suggestion is to drop all invalid addresses, since that is a signal that these addresses are meant only for human consumption.
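
One way to do that validation (a sketch using the Email::Valid module from CPAN rather than rolling your own regex) is to run every harvested string through Email::Valid->address(), which returns false for anything that isn't syntactically a real address-- so mangled addresses with spaces, "NOSPAM", or spelled-out "at"/"dot" get dropped automatically. The sample addresses here are made up:

    use strict;
    use Email::Valid;

    # Hypothetical harvested strings; real input would come from the
    # mailto extraction above.
    my @candidates = ('jdoe@example.com', 'jdoe at example dot com NOSPAM');

    for my $addr (@candidates) {
        if (Email::Valid->address($addr)) {
            print "keep: $addr\n";
        }
        else {
            # Invalid or mangled: meant for humans only, so drop it.
            print "drop: $addr\n";
        }
    }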

Update: I want to second davorg's sentiment below. Only once, a very long time ago, did I ever receive an email from a web bot that was acceptable-- it told me about some broken links on my page... and then, of course, tried to sell me something. Which got the sender into the killfile pretty quickly.

Re: Re: Perl Programs that can retrieve email addresses from web pages
by davorg (Chancellor) on Jan 03, 2001 at 20:36 UTC
    My suggestion is to drop all invalid addresses, since that is a signal that these addresses are meant only for human consumption.

    My suggestion would be not to harvest email addresses from web pages at all, as people should be entitled to put email addresses on a web page without worrying about being attacked by spammers.

    If you want to get email addresses, then ask visitors to your website to register - but let them opt out of receiving mailings.

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

(kudra: mailto is not opt-in) Re: Re: Perl Programs that can retrieve email addresses from web pages
by kudra (Vicar) on Jan 03, 2001 at 20:48 UTC
    I have to disagree with the implication that if an email address isn't mauled, it is ethical to grab it from a mailto. In practice, giving out your email address may be akin to begging for junk mail, but in theory (and ethics), I think that's different from requesting junk email.

    Update: strredwolf, exactly what I meant. I have my address on my site because I want people to be able to use it, not because I want junk mail.

    Update: ichimunki, I think we do agree. I was writing my comment at the same time as Dave wrote his, and his sums up my point well enough.

      Soliciting comments is one thing. Getting junk mail which burns up time, money, and bandwidth is another. I get too many junk mails in relation to the "Hey! Good artwork!" or "Have you tried this technique?" or "I want to commission you!" e-mails. Are they being drowned out?

      --
      $Stalag99{"URL"}="http://stalag99.keenspace.com";

      I don't really want to get into an ethics debate, and I thought I was pretty clear about what I thought of using harvested emails for the purpose of spam. Soliciting is soliciting, whether electronic, by phone, snail mail, or door-to-door. I am not interested in using the "ethics" club to bludgeon free speech, whether it's opinions I don't like or offers to buy more crap. I should be able to request in any medium that I not be contacted again, once that initial solicitation has been made.

      I also think sending anonymous spam should be a felony-- I put it on the same level as cracking passwords without permission-- attempts to subvert systems for unauthorized use. Other than mauling my email address to inhibit simple (or even the new-and-improved) harvesting, I cannot think of a single way to post information in public, and not expect the public to use that information if they want. Does robots.txt have an email solicitation "opt-in" flag?
        No, robots.txt only specifies pages on a server, and even though it's a standard, some broken pieces of cruft programming don't care for it. Wget cares, though. ;)
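
        For what it's worth, if the bot is written in Perl, LWP::RobotUA is a drop-in subclass of LWP::UserAgent that fetches and obeys robots.txt for you. A rough sketch (bot name, contact address, and URL are placeholders):

            use strict;
            use LWP::RobotUA;
            use HTTP::Request;

            # Identify the bot and give a real contact address (placeholders here).
            my $ua = LWP::RobotUA->new('example-checker/0.1', 'you@example.com');
            $ua->delay(1);    # wait at least one minute between hits on a host

            my $res = $ua->request(HTTP::Request->new(GET => 'http://www.example.com/'));
            if ($res->is_success) {
                print length($res->content), " bytes fetched\n";
            }
            else {
                # Pages disallowed by robots.txt come back as 403 errors.
                print "Not fetched: ", $res->status_line, "\n";
            }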

        --
        $Stalag99{"URL"}="http://stalag99.keenspace.com";