sri1230 has asked for the wisdom of the Perl Monks concerning the following question:
I have a big file containing a lot of text, plus a few phone numbers and email addresses somewhere in it. I am trying to grab an email address (which may not be hyperlinked). Using a regex is very slow. Is there a better approach? If there are several addresses, I just want the first one I hit.
Re: Parse file for email address
by zentara (Archbishop) on Jan 22, 2010 at 12:37 UTC
|
The obvious thing is to look for the @ symbol. When you find it, extract the string through its word boundaries, then run the string through Email::Valid... and while there, look at Email::Address.
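A minimal sketch of that idea, using only core Perl: jump from one '@' to the next with index(), widen the span to the surrounding address-like characters, and return the first candidate that passes a simplified check. The helper name first_email and the character classes are assumptions made for illustration; the final pattern test is only a stand-in for a proper validator such as the CPAN modules mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first address-like token in $text, or undef.
sub first_email {
    my ($text) = @_;
    my $pos = 0;
    while ( ( my $at = index( $text, '@', $pos ) ) >= 0 ) {
        # Walk left and right from the '@' over characters that this
        # simplified notion of an address allows.
        my $start = $at;
        $start-- while $start > 0 && substr( $text, $start - 1, 1 ) =~ /[\w.+-]/;
        my $end = $at + 1;
        $end++ while $end < length($text) && substr( $text, $end, 1 ) =~ /[\w.-]/;

        my $candidate = substr( $text, $start, $end - $start );
        $candidate =~ s/\.+\z//;    # drop trailing sentence punctuation

        # Simplified sanity check, not a full RFC 5322 validation.
        return $candidate
            if $candidate =~ /\A[\w.+-]+\@[\w-]+(?:\.[\w-]+)+\z/;

        $pos = $at + 1;             # not an address; continue after this '@'
    }
    return;
}

my $sample = 'x <br>PLEASE RESPOND TO: careers@flatironssolutions.com</div> hr@flatironssolutions.com';
print first_email($sample), "\n";
```

Because index() only visits the positions where an '@' actually occurs, most of a large file is never examined character by character, and the function returns as soon as one candidate passes.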
|
Here is a portion of the file. I need to grab the first valid email address I find, in this case careers@....
<div style="clear:both; border-top:1px solid #999; padding:1em 0;">
<div class="rel100">
<div class="padRow">
<label class="leftlabel" style="font-size:1.15em; width: 7em;">Description:</label>
<div class="clear" style="clear:both;"></div>
<div style="display:block; margin-left:8.65em;">
<div class="article" id="detailDescription" style="margin-top:0;padding-top:0;">Job Description:<br>We are currently seeking <b style="background:url(/assets/images/detail/default/highlite.gif); font-weight: bold;">Java</b> Developers interested in learning new technologies in the enterprise integration and content management arena.<br><br>Responsibilities:<br>Successful candidate will design and develop software solutions that enable customers to develop enterprise integration solutions that deliver content to business processes designed for net-centric operations. Other duties will include participating as part of an agile development team within an enterprise SOA center of excellence for a major federal agency, participating as part of agile scrum development team, working with system engineers to develop user stories and acceptance tests. Candidate will be expected to elaborate details of acceptance tests, develop UML diagrams of fundamental design features, unit test code development, and develop performance benchmarks for load testing.<br><br>Requirements:<br>At least 2 years experience with relation database design, preferably Oracle and SQL. Experience with installation and configuration of web application servers: Tomcat, BEA Weblogic, and IBM Websphere required. Preferred experience includes the following: <br>* Agile or Rational Unified Process (RUP)<br>* .NET and/or C++<br>* Javascript and Browser clent development<br><br>Educational Requirements:<br>BS/MS in Computer Science or related field. (Experience can substitute for educational requirements.)<br><br>Flatirons Solutions provides expert consulting and systems integration services to commercial and government clients. For additional information, refer to www.FlatironsSolutions.com<br><br>Flatirons Solutions is a successful small business providing industry-leading solutions and outstanding customer satisfaction to our clients. In addition to interesting and challenging work, we provide outstanding benefits including medical/dental/vision, Short-term & long-term Disability, 401(k) with employer matching contributions, and much more.<br><br>Flatirons Solutions is an Equal Opportunity Employer. <br>PLEASE RESPOND TO: careers@flatironssolutions.com</div>
</div>
<div class="clear" style="clear:both;">hr@flatironssolutions.com</div>
</div>
</div>
</div>
|
When you download your file, the email address in question is buried in HTML, so you first need to strip the HTML. I use lynx below, but there are Perl modules that do the same. Once you have the text, split it into lines and test each line.
This doesn't account for emails that somehow get newlines in them, for example through bad cutting and pasting, and the regex may leave something to be desired.
The script below prints the 2 addresses it collects in the array; select the first array element.
OUTPUT: careers@flatironssolutions hr@flatironssolutions
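#!/usr/bin/perl
use warnings;
use strict;

# Path or URL of the HTML page, passed as the first argument.
my $html = $ARGV[0] or die "Usage: $0 file-or-url\n";

# Let lynx render the page as plain text, which strips the HTML tags.
my $content = `lynx --dump $html`;
#print "$content\n";

my @lines = split(/\n/, $content);
#print "@lines\n";

my @addrs;
# Iterate over the lines directly; <@lines> would be treated as a glob.
foreach my $line (@lines) {
    # Crude pattern: \w does not match '.', so a match stops before ".com".
    push @addrs, $line =~ /(\b\w+\@\w+\b)/g;
}
print "@addrs\n";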
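As a hedged variation on the script above, the sketch below skips lynx, strips tags with a crude substitution, and uses a pattern that allows dots in the domain so that the ".com" is kept. The function name first_address_in_html and the pattern are illustrative assumptions; the substitution is not a real HTML parser, and the pattern is a simplification rather than a full address grammar.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first address-like string in an HTML
# fragment, or undef if there is none.
sub first_address_in_html {
    my ($html) = @_;

    # Crude tag stripping (an assumption for brevity, not a real parser).
    ( my $text = $html ) =~ s/<[^>]*>/ /g;

    # No /g modifier: the match stops at the first address found.
    my ($addr) = $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/;
    return $addr;
}

my $fragment = '<div>PLEASE RESPOND TO: careers@flatironssolutions.com</div>'
             . '<div class="clear">hr@flatironssolutions.com</div>';
my $first = first_address_in_html($fragment);
print defined($first) ? "$first\n" : "no address found\n";
```

Matching once against the whole string, rather than testing line by line, also means the search ends as soon as the first address is seen.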
Re: Parse file for email address
by amir_e_a (Hermit) on Jan 22, 2010 at 09:50 UTC
|
Can you show an example of the file?
What do you mean by "slow"? Do you need to do this often or only once?
How big is the file?
Can you show the regular expression that you used?
Re: Parse file for email address
by Anonymous Monk on Jan 22, 2010 at 09:03 UTC
|
1) show your code
2) use something from cpan, it has "mail" and "find" in the name
Re: Parse file for email address
by RyuMaou (Deacon) on Jan 22, 2010 at 15:31 UTC
|
Have you looked in the Code Catacombs? There are a couple of scripts in there for finding e-mail in files. I know because I wrote them. They're not perfect, and I'm not sure if they'll be faster or slower than anything you've already tried, but it might be a place to start.
(I will warn you, though, I'm a Network Admin, not a Perl Programmer, so the code reflects the, um, "utilitarian" nature of the effort and the speed with which I needed a solution.)
|
Thank you both.
The files are large, and I have many of them that get processed in a loop. It takes forever to check line by line.
I also looked in the Code Catacombs but did not find anything that does what I am looking for. Please let me know if you have any other thoughts. Basically, I would like to LWP "get" a web page and look for the first valid email address in the page source.
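One possible way around the line-by-line cost, sketched here under assumptions: read each file in a single slurp and run one non-global match, so the search stops at the first hit. The helper name first_address_in_file and the pattern $EMAIL_RE are hypothetical, the pattern is a simplification rather than a full validator, and the sketch works on local files only; fetching pages is left out.

```perl
use strict;
use warnings;

# Simplified address pattern (an assumption, not a full RFC 5322 matcher).
my $EMAIL_RE = qr/[\w.+-]+\@[\w-]+(?:\.[\w-]+)+/;

# Hypothetical helper: slurp one file and return its first match, or undef.
sub first_address_in_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undefined record separator: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return unless defined $content;

    # A single match without /g returns as soon as one address is found.
    my ($addr) = $content =~ /($EMAIL_RE)/;
    return $addr;
}

# Process every file named on the command line.
for my $path (@ARGV) {
    my $addr = first_address_in_file($path);
    print "$path: ", ( defined($addr) ? $addr : '(none found)' ), "\n";
}
```

Whether this is actually faster depends on the files and the pattern, so it is worth timing against the existing loop before relying on it.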
|
Oh, from the description of your initial question, I thought you had the files already, which is what my scripts were all about doing. Though, there is one in there for verifying the e-mail addresses after they've been gathered.
What you're talking about, though, is an e-mail harvester. Because so many spammers use them, I doubt too many people are going to be willing to help with that.
Good luck, though, and be sure to post your results for everyone to see!