Re: Matching multiple patterns with regex
by Athanasius (Archbishop) on Oct 30, 2015 at 14:12 UTC
Hello Tiarcon, and welcome to the Monastery!
Just letting you know that you don’t have to use regular expressions for this. CPAN has dedicated modules for reading email. For example:
#! perl
use strict;
use warnings;
use Email::Simple;
my $file = 'YourEmailHere.eml';
open(my $fh, '<', $file)
    or die "Cannot open file '$file' for reading: $!";
my $text = do { local $/; <$fh>; };
close $fh
    or die "Cannot close file '$file': $!";
my $email = Email::Simple->new($text);
my $from = $email->header('From');
my $subject = $email->header('Subject');
print "FROM: $from\n";
print "SUBJECT: $subject\n";
See Email::Simple.
Hope that helps,
Thank you so much, I will give it a go
rgds
Tony
Re: Matching multiple patterns with regex
by Corion (Patriarch) on Oct 30, 2015 at 13:17 UTC
When do you think a single line will contain both From: and Subject:? You will need to correct your logic.
Re: Matching multiple patterns with regex
by Laurent_R (Canon) on Oct 30, 2015 at 14:51 UTC
You don't seem to understand the clues given to you by Corion.
You're basically looking for "From" and "Subject" on the same line, which usually does not occur in emails. Change your logical and to a logical or, and you'll match lines having "From" and lines having "Subject".
In other words, try to change:
while (my $line = <$fh>){
    if ($line =~ /From\:/ && $line =~ /Subject\:/) {
        print $line;
    }
}
To something like:
while (my $line = <$fh>){
    if ($line =~ /From\:/ or $line =~ /Subject\:/) {
        print $line;
    }
}
Update: And since, as ww mentioned above, there is no need to escape colons, you could simplify it this way:
while (my $line = <$fh>){
    if ($line =~ /From:/ or $line =~ /Subject:/) {
        print $line;
    }
}
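The two tests can also be folded into one regex with alternation. Here is a minimal sketch of that idea; the sample lines are made up for illustration and stand in for the lines read from your file:

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for lines read from the email file.
my @lines = (
    "From: alice\@example.com\n",
    "To: bob\@example.com\n",
    "Subject: Hello\n",
    "\n",
    "Body text\n",
);

# A single pattern with alternation keeps lines containing either label.
my @matched = grep { /From:|Subject:/ } @lines;
print @matched;
```

Like the loops above, this pattern is unanchored, so it would also keep a body line that happens to contain "From:" or "Subject:".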
Re: Matching multiple patterns with regex
by graff (Chancellor) on Oct 31, 2015 at 05:39 UTC
Athanasius has given the best answer so far, and that's the route you should take, but let's pretend for a moment that there wasn't a suitable module...
The regex approach needs to look for the "From:" and "Subject:" (and other field labels) only when they appear at the beginning of a line, and it can stop reading input as soon as all the desired fields are found.
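my @fields = qw/From: Subject:/;  # you can add more if/when you want
my $field_regex = join( "|", @fields );
my @field_lines;
while (<$fh>) {
    chomp;  # drop the newline so the join below does not double it
    push( @field_lines, $_ ) if ( /^(?:$field_regex) / );
    last if @field_lines == @fields;
}
@field_lines = sort @field_lines;
push @field_lines, "";  # empty last element gives the output a trailing newline
print join( "\n", @field_lines );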
Note the regex uses the initial anchor character (^) and non-capturing grouping parens. (You could just as well use the simpler capturing parens, without the "?:" -- this would add a slight bit of extra processing, but not enough to worry about here.)
(updated to include parens around the args for the first "push" call, just because I like using parens around function args)
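For anyone who wants to try the idea without a mail file at hand, here is a self-contained variation. It is only a sketch: the message text is invented, an in-memory filehandle stands in for the real file, and storing the values in a hash (%found) and stopping at the blank line that ends the headers are my own additions, not part of the snippet above.

```perl
use strict;
use warnings;

# Hypothetical message text; an in-memory filehandle stands in for the real file.
my $text = <<'END';
Received: from mail.example.com
Subject: Lunch on Friday?
To: bob@example.com
From: alice@example.com

From: this body line is never reached
END

open( my $fh, '<', \$text ) or die "Cannot open in-memory file: $!";

my @fields      = qw/From: Subject:/;
my $field_regex = join( "|", @fields );
my %found;

while ( my $line = <$fh> ) {
    last if $line =~ /^\r?$/;    # a blank line ends the header block
    if ( $line =~ /^($field_regex) (.*)/ ) {
        $found{$1} = $2;         # label => value, without the newline
    }
    last if keys(%found) == @fields;    # stop once every field was seen
}
close $fh;

print "$_ $found{$_}\n" for sort keys %found;
```

Here the capturing parens are used on purpose, so that the label itself can serve as the hash key.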
my @fields = qw/From: Subject:/; # you can add more if/when you want
my $field_regex = join( "|", map { quotemeta($_) } @fields );
It may be unneeded for From: and Subject:, but code may change over time, and adding quotemeta now prevents future bugs.
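To make the difference visible, here is a small sketch. The label "X-Spam.Score:" and the test line are hypothetical, chosen only because the label contains a dot:

```perl
use strict;
use warnings;

# Hypothetical field label containing a regex metacharacter (the dot).
my @fields = ( 'From:', 'X-Spam.Score:' );

my $raw    = join( "|", @fields );
my $quoted = join( "|", map { quotemeta($_) } @fields );

my $line = "X-SpamXScore: 5\n";    # not the header we asked for

# Without quotemeta the dot matches any character, including the "X".
my $raw_hit    = $line =~ /^(?:$raw) /    ? 1 : 0;
my $quoted_hit = $line =~ /^(?:$quoted) / ? 1 : 0;

print "unquoted pattern matches: $raw_hit\n";
print "quoted pattern matches:   $quoted_hit\n";
```

The unquoted pattern accepts the wrong header, while the quoted one rejects it.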
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Actually, I wouldn't recommend quotemeta in tasks like this. It's seldom or never the case that regex-magic characters will be needed as literals in the patterns being conjoined, but even if this comes up, it can still be better to escape them explicitly as needed (when assigning to the array), and allow some strings to use regex-magic where appropriate:
my @fields = ('From: ', 'Subject: ', 'Thread-\w+: ', 'What-if-there\'s-a-qmark\?');
# 3rd element matches "Thread-Topic: ", "Thread-Index: ", etc.
Obviously, in a context where strings are coming from a potentially tainted source (i.e. not from the source code itself), one must weigh the relative risk/benefit and coding-effort/ease-of-use trade-offs of prohibiting vs. allowing (and taint-checking) regex metacharacters in applications like this.
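A quick sketch of the hand-escaped approach in action, using a shortened version of the array above; the header lines are made up for illustration:

```perl
use strict;
use warnings;

# Patterns written as regex fragments, escaped by hand only where needed.
my @fields      = ( 'From: ', 'Subject: ', 'Thread-\w+: ' );
my $field_regex = join( "|", @fields );

# Hypothetical header lines for illustration.
my @headers = (
    "From: alice\@example.com",
    "Thread-Topic: Lunch",
    "Thread-Index: AQHx",
    "To: bob\@example.com",
);

# The \w+ in the third pattern lets one entry cover both Thread-* headers.
my @kept = grep { /^(?:$field_regex)/ } @headers;
print "$_\n" for @kept;
```

With quotemeta applied to every element, the \w+ would be taken literally and neither Thread-* line would be kept.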
Re: Matching multiple patterns with regex
by soonix (Chancellor) on Oct 31, 2015 at 11:56 UTC
I don't have much to add, but if it's a "standard compliant" mail message:
RFC 2822 seems to state the "Subject:" and other keywords are case-INsensitive, so you should match /From:/i and /Subject:/i, respectively. Email::Simple, as recommended by Athanasius, does already provide for this and other tiny quirks.
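As a rough sketch of what that looks like (the header lines are invented, and I have also anchored the pattern to the start of the line):

```perl
use strict;
use warnings;

# Hypothetical header lines with unusual capitalisation.
my @headers = (
    "FROM: alice\@example.com",
    "subject: Hello",
    "To: bob\@example.com",
);

# ^ ties the label to the start of the line; /i ignores case.
my @hits = grep { /^(?:From|Subject):/i } @headers;
print "$_\n" for @hits;
```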
RFC 2822 seems to state the "Subject:" and other keywords are case-INsensitive
... and to make things worse, "2.2.3. Long Header Fields" allows header lines to be wrapped ("folded"), so that the content is split over several lines. This is a single logical header line, split over several physical lines:
Subject: Mail
 parsing looks
 simple until you
 have
 read
 the
 RFCs
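The same section describes unfolding as removing any line break that is immediately followed by whitespace. A minimal sketch of that step, assuming the whole header block is already in one string (the sample text is made up):

```perl
use strict;
use warnings;

# Hypothetical folded header: continuation lines start with a space or tab.
my $raw = "Subject: Mail\n parsing looks\n\tsimple\nFrom: alice\@example.com\n";

# Unfold: drop any line break that is directly followed by whitespace.
( my $unfolded = $raw ) =~ s/\r?\n(?=[ \t])//g;

# Now the Subject is one physical line again and can be matched as a whole.
my ($subject) = $unfolded =~ /^Subject:[ \t]*(.*)$/mi;
print "$subject\n";
```

The whitespace that started each continuation line is kept, as it is part of the header value.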
Additionally, RFC 2047 allows non-ASCII characters to be encoded in header fields.
Any mail parser should be able to handle that, too.
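For the RFC 2047 part, the Encode distribution that ships with Perl provides a 'MIME-Header' encoding. A small sketch, with a made-up encoded word as input:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical encoded-word as it might appear in a raw Subject header.
my $raw = '=?ISO-8859-1?Q?Caf=E9?=';

# 'MIME-Header' is an encoding supplied by the Encode distribution.
my $decoded = decode( 'MIME-Header', $raw );

# The result is a Perl character string, so encode it again on output.
binmode( STDOUT, ':encoding(UTF-8)' );
print "$decoded\n";
```

This handles only the decoding of a single value; unfolding and splitting the headers still has to happen first.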
Alexander