PerlMonks  

Parse file for email address

by sri1230 (Novice)
on Jan 22, 2010 at 08:56 UTC ( [id://818921]=perlquestion )

sri1230 has asked for the wisdom of the Perl Monks concerning the following question:

I have a big file containing a lot of text, a few phone numbers, and an email address somewhere in there. I am trying to grab the email address (which may not be a hyperlink). If I use a regex, it is very slow. Any better ideas? If there are multiple addresses, I just want to grab the first one I hit.

Replies are listed 'Best First'.
Re: Parse file for email address
by zentara (Archbishop) on Jan 22, 2010 at 12:37 UTC
    The obvious thing is to look for the @ symbol. When you find it, extract the string through its word boundaries, then run the string through Email::Valid, and while you are there, look at Email::Address.
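    A minimal sketch of that idea, using only core Perl: find each `@` with `index`, widen the span left and right until a character that cannot belong to an address, and return the first candidate that passes a cheap shape check. The sub name `first_email`, the two character classes, and the shape check are assumptions made for illustration; the shape check only stands in for a real validator module and is far looser than the email RFCs.

```perl
use strict;
use warnings;

# Characters accepted on each side of the '@' in this simplified sketch.
my $LOCAL_CHARS  = qr/[A-Za-z0-9._%+-]/;
my $DOMAIN_CHARS = qr/[A-Za-z0-9.-]/;

# Return the first plausible address in $text, or undef if none is found.
sub first_email {
    my ($text) = @_;
    my $pos = 0;

    while ( ( my $at = index( $text, '@', $pos ) ) >= 0 ) {
        # Walk outwards from the '@' to the edges of the surrounding token.
        my ( $start, $end ) = ( $at, $at );
        $start--
            while $start > 0
            && substr( $text, $start - 1, 1 ) =~ $LOCAL_CHARS;
        $end++
            while $end < length($text) - 1
            && substr( $text, $end + 1, 1 ) =~ $DOMAIN_CHARS;

        my $candidate = substr( $text, $start, $end - $start + 1 );
        $candidate =~ s/\.+$//;    # drop a sentence-ending period

        # Cheap shape check (hypothetical stand-in for a validator module).
        return $candidate
            if $candidate =~ /^[A-Za-z0-9._%+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$/;

        $pos = $at + 1;            # not an address; look for the next '@'
    }
    return;
}

my $first = first_email('PLEASE RESPOND TO: careers@flatironssolutions.com</div>');
print "$first\n" if defined $first;
```

    Because the scan stops at the first acceptable candidate, the rest of the file is never examined, which matters when only the first hit is wanted.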

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku
      A portion of the file is below. I need to grab the first valid email address I find. In this case, careers@....
      <div style="clear:both; border-top:1px solid #999; padding:1em 0;">
        <div class="rel100">
          <div class="padRow">
            <label class="leftlabel" style="font-size:1.15em; width: 7em;">Description:</label>
            <div class="clear" style="clear:both;"></div>
            <div style="display:block; margin-left:8.65em;">
              <div class="article" id="detailDescription" style="margin-top:0;padding-top:0;">Job Description:<br>We are currently seeking <b style="background:url(/assets/images/detail/default/highlite.gif); font-weight: bold;">Java</b> Developers interested in learning new technologies in the enterprise integration and content management arena.<br><br>Responsibilities:<br>Successful candidate will design and develop software solutions that enable customers to develop enterprise integration solutions that deliver content to business processes designed for net-centric operations. Other duties will include participating as part of an agile development team within an enterprise SOA center of excellence for a major federal agency, participating as part of agile scrum development team, working with system engineers to develop user stories and acceptance tests. Candidate will be expected to elaborate details of acceptance tests, develop UML diagrams of fundamental design features, unit test code development, and develop performance benchmarks for load testing.<br><br>Requirements:<br>At least 2 years experience with relation database design, preferably Oracle and SQL. Experience with installation and configuration of web application servers: Tomcat, BEA Weblogic, and IBM Websphere required. Preferred experience includes the following: <br>* Agile or Rational Unified Process (RUP)<br>* .NET and/or C++<br>* Javascript and Browser clent development<br><br>Educational Requirements:<br>BS/MS in Computer Science or related field. (Experience can substitute for educational requirements.)<br><br>Flatirons Solutions provides expert consulting and systems integration services to commercial and government clients. For additional information, refer to www.FlatironsSolutions.com<br><br>Flatirons Solutions is a successful small business providing industry-leading solutions and outstanding customer satisfaction to our clients. In addition to interesting and challenging work, we provide outstanding benefits including medical/dental/vision, Short-term & long-term Disability, 401(k) with employer matching contributions, and much more.<br><br>Flatirons Solutions is an Equal Opportunity Employer. <br>PLEASE RESPOND TO: careers@flatironssolutions.com</div>
            </div>
            <div class="clear" style="clear:both;">hr@flatironssolutions.com</div>
          </div>
        </div>
      </div>
        When you download your file, the email address in question is buried in HTML, so you first need to strip the HTML. I use lynx below, but there are Perl modules that do the same job. Once you have the text, split it into lines and test each line.

        This doesn't account for emails that somehow get newlines in them, such as through bad cutting and pasting, and the regex may leave something to be desired.

        The script below collects 2 addresses in the array; select the first array element.

        OUTPUT: careers@flatironssolutions hr@flatironssolutions
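        #!/usr/bin/perl
        use warnings;
        use strict;

        # URL or local HTML file to scan, given on the command line.
        # It is interpolated into a shell command, so pass trusted input only.
        my $html = $ARGV[0];

        # Let lynx render the HTML as plain text
        my $content = `lynx --dump $html`;
        my @lines   = split /\n/, $content;

        my @addrs;
        for my $line (@lines) {
            # \w does not match '.', so the capture stops at the first dot
            # after the @ and the ".com" part is dropped
            while ( $line =~ /(\b\w+\@\w+\b)/g ) {
                push @addrs, $1;
            }
        }
        print "@addrs\n";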
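        For comparison, here is a sketch of the same strip-then-scan steps without lynx, using only core Perl. The tag stripping is a deliberately crude regex (an assumption for illustration; a dedicated HTML parser from CPAN would be more robust), and the pattern is widened so that the dotted part of the domain is kept. The sub names `html_to_text` and `find_emails` are hypothetical.

```perl
use strict;
use warnings;

# Crude HTML-to-text conversion: replace tags with spaces, decode two entities.
sub html_to_text {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1>/ /gis;    # drop script/style bodies
    $html =~ s/<[^>]*>/ /g;                         # drop remaining tags
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&amp;/&/g;
    return $html;
}

# Collect every address-shaped token, line by line, in order of appearance.
sub find_emails {
    my ($html) = @_;
    my @addrs;
    for my $line ( split /\n/, html_to_text($html) ) {
        # Unlike \w+@\w+, this keeps dotted domains such as example.com
        while ( $line =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g ) {
            push @addrs, $1;
        }
    }
    return @addrs;
}

my @found = find_emails(
    '<div>PLEASE RESPOND TO: careers@flatironssolutions.com</div> '
  . '<div class="clear">hr@flatironssolutions.com</div>'
);
print "$found[0]\n" if @found;    # the first array element is the first hit
```

        Like the lynx version, this does not rejoin addresses that were split across lines.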

Re: Parse file for email address
by amir_e_a (Hermit) on Jan 22, 2010 at 09:50 UTC

    Can you show an example of the file?

    What do you mean by "slow"? Do you need to do this often or only once?

    How big is the file?

    Can you show the regular expression that you used?

Re: Parse file for email address
by Anonymous Monk on Jan 22, 2010 at 09:03 UTC
    1) Show your code. 2) Use something from CPAN; it has "mail" and "find" in the name.
Re: Parse file for email address
by RyuMaou (Deacon) on Jan 22, 2010 at 15:31 UTC
    Have you looked in the Code Catacombs? There are a couple of scripts in there for finding e-mail in files. I know because I wrote them. They're not perfect, and I'm not sure if they'll be faster or slower than anything you've already tried, but it might be a place to start.

    (I will warn you, though, I'm a Network Admin, not a Perl Programmer, so the code reflects the, um, "utilitarian" nature of the effort and the speed with which I needed a solution.)
      Thank you both. The files are large, and I have many that get processed in a loop. It takes forever to check line by line. I also looked in the Code Catacombs but did not find anything that does what I am looking for. Please let me know if you have any other thoughts. Basically, I would like to LWP "get" a web page and look for the first valid email address in the web page source.
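      One possible way to avoid the per-line loop, sketched under assumptions: read each file in a single slurp and run one scalar-context match, which returns as soon as the first address-shaped string is found. The sub names and the `$EMAIL_RE` pattern are hypothetical, the pattern is a loose shape check rather than a validator, and whether this is actually faster depends on the files involved. The URL variant relies on `LWP::Simple::get`, loaded only when that sub is called.

```perl
use strict;
use warnings;

# Loose address shape: local part, '@', then a dotted domain.
my $EMAIL_RE = qr/[\w.+-]+\@[\w-]+(?:\.[\w-]+)+/;

# Return the first address-shaped string in $content, or undef.
sub first_email_in {
    my ($content) = @_;
    return unless defined $content;
    return $content =~ /($EMAIL_RE)/ ? $1 : undef;
}

# Read a whole file in one go (no per-line loop) and scan it once.
sub first_email_in_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                  # slurp mode: read everything at once
    my $content = <$fh>;
    close $fh;
    return first_email_in($content);
}

# Fetch a page and scan its source; returns undef if the fetch fails.
sub first_email_at_url {
    my ($url) = @_;
    require LWP::Simple;       # loaded lazily, only for URL lookups
    return first_email_in( LWP::Simple::get($url) );
}
```

      A loop over many files then reduces to calling `first_email_in_file($path)` once per path and checking whether the result is defined.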
        Oh, from the description of your initial question, I thought you had the files already, which is what my scripts were all about doing. Though, there is one in there for verifying the e-mail addresses after they've been gathered.

        What you're talking about, though, is an e-mail harvester. Because so many spammers use them, I doubt too many people are going to be willing to help with that.
        Good luck, though, and be sure to post your results for everyone to see!

Node Type: perlquestion [id://818921]
Approved by marto