Pattern matching and deriving the data between the "(double quotes) in HTML tag

sp4rperl has asked for the wisdom of the Perl Monks concerning the following question:


Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by davido (Cardinal) on Dec 05, 2016 at 05:23 UTC

This looks like an XML element. You might consider an XML parser too heavy for simply grabbing a couple of dates, but parsing libraries exist because XML is not as simple as people wish it were. Regular expressions, as powerful as they are, become the basis for fragile solutions when employed as simple XML parsers.

One problem is that a regular expression on its own treats a document as a string of characters, with no regard for its semantic meaning. XML parsers deal with the semantics, and consequently make parsing more reliable.

Here's an example using XML::Twig:

use strict;
use warnings;
use XML::Twig;

my $xml = q{<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>};

my $t = XML::Twig->new(
    twig_handlers => {
        timeLimit => sub {
            my $atts = $_->atts;
            foreach (keys %$atts) {
                /^(?:start|end)Time$/ && do {print "$_ => $atts->{$_}\n"; next;};
            }
        },
    },
);

$t->parse($xml);

The output is:

endTime => 2016-12-28T23:59:59
startTime => 2016-09-30T00:00:00

To get output similar to what your script seemed to be attempting, you might do it this way:

my @time_limits;

my $t = XML::Twig->new(
    twig_handlers => {
        timeLimit => sub {
            my $atts = $_->atts;
            if (exists $atts->{startTime} && exists $atts->{endTime}) {
                push @time_limits, [$atts->{startTime}, $atts->{endTime}];
            }
        },
    },
);

$t->parse($xml);
print "[$_->[0]], [$_->[1]]\n" foreach @time_limits;

This produces the following:

[2016-09-30T00:00:00], [2016-12-28T23:59:59]

Notice how it's now not a double-quote issue at all; it's a matter of deciding on a way to drill down to the specific attributes you are interested in and keep track of their content. By side-stepping the regex parsing altogether, we've also avoided issues such as whitespace, newlines showing up mid-element, embedded quotes, and a number of other problems that eventually break regexp-based approaches to scraping XML.

If this is actually HTML as your title states, then use one of the many capable HTML parsers, also on CPAN.

Dave


Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by AnomalousMonk (Archbishop) on Dec 05, 2016 at 05:00 UTC

Further to Athanasius's reply: NB: It's not quite right to say that the /g modifiers in the m//g matches are doing nothing. In fact, they're actively screwing you over (even if you get beyond the improper scoping of the lexical in the if-block).

Because the m//g matches are being called in scalar context in both cases in the OPed code, the /g modifier acts to leave the match position string pointer where it is after the first (successful) match, and to start matching from that position in the second match.

The first thing you search for in the string is 'startTime' followed by some stuff. Later, you search the same string for 'endTime' and some stuff, but you'll never find it because 'endTime' appears before 'startTime' and the regex engine (under the influence of the /g modifiers) has already passed by it in the string. This can be demonstrated by printing the pos match position of the string after the first match. (I've left out the chomp statements because I assume they really are useless.)

c:\@Work\Perl\monks>perl -wMstrict -le
"my $timeLimit =
   'xxx endTime=\"2016-12-28T23:59:59\" startTime=\"2016-09-30T00:00:00\" yyy';
 ;;
 $timeLimit =~ m/startTime=\"(.*?)\"/g;
 my $startTime = $1;
 print 'match position after 1st match: ', pos $timeLimit;
 ;;
 my $endTime;
 if ($timeLimit =~ m/endTime/)
 {
     $timeLimit =~ m/endTime=\"(.*?)\"/g;
     $endTime = $1;
     print 'match position after 2nd match: ', pos $timeLimit;
 }
 print \"[$startTime],[$endTime]\n\";
"
match position after 1st match: 65
Use of uninitialized value in print at -e line 1.
match position after 2nd match:
Use of uninitialized value $endTime in concatenation (.) or string at -e line 1.
[2016-09-30T00:00:00],[]
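
For anyone not on Windows, here is a rough restatement of the same demonstration as a standalone script. It is only a sketch: the variable names $pos_after_first, $found_end and $pos_after_second are mine, not from the OPed code.

```perl
use strict;
use warnings;

my $timeLimit = 'xxx endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00" yyy';

# Scalar-context m//g: pos($timeLimit) is left just after the match.
$timeLimit =~ m/startTime="(.*?)"/g;
my $pos_after_first = pos $timeLimit;

# The next m//g resumes from that position, so the earlier endTime is missed.
my $found_end = $timeLimit =~ m/endTime="(.*?)"/g ? 1 : 0;

# A failed m//g match (without /c) resets pos to undef.
my $pos_after_second = pos $timeLimit;

print "pos after 1st match: $pos_after_first\n";
print "endTime found: ", ($found_end ? 'yes' : 'no'), "\n";
print "pos after 2nd match: ",
    (defined $pos_after_second ? $pos_after_second : 'undef'), "\n";
```

Dropping the /g from both matches makes the second search start from the beginning of the string again.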


Update: FWIW, my own preference in cases like this is to extract sub-strings from strings in list context and at the same time to generate an "extraction success" flag for possible later use:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $timeLimit =
   'xxx endTime=\"2016-12-28T23:59:59\" startTime=\"2016-09-30T00:00:00\" yyy';
 ;;
 my $got_start = my ($start) = $timeLimit =~ m/startTime=\"(.*?)\"/;
 my $got_end   = my ($end)   = $timeLimit =~ m/endTime=\"(.*?)\"/;
 ;;
 print qq{start [$start], end [$end]} if $got_start and $got_end;
"
start [2016-09-30T00:00:00], end [2016-12-28T23:59:59]

Give a man a fish: <%-{-{-{-<


Re^2: Pattern matching and deriving the data between the "(double quotes) in HTML tag

by sp4rperl (Initiate) on Dec 06, 2016 at 03:45 UTC

Hi AnomalousMonk, thanks for the comment. I understand now that the problem was the /g (global) modifier in the m//g patterns. I got the desired output, 2016-09-30T00:00:00,2016-12-28T23:59:59, by using the code below.

#!/usr/bin/perl
use strict;
use warnings;
my $timeLimit = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';
$timeLimit =~ m/startTime="(.*?)"/g;
my $startTime = $1;
chomp($timeLimit);
my $endTime ='';
if ($timeLimit =~ m/endTime/)
{
        $timeLimit =~ m/endTime="(.*?)"/;
        $endTime = $1;
}
print "[$startTime],[$endTime]\n";

Re^3: Pattern matching and deriving the data between the "(double quotes) in HTML tag

by AnomalousMonk (Archbishop) on Dec 06, 2016 at 13:32 UTC

It's good that you've found a solution to your problem, but you should realize that the /g match modifier in the
$timeLimit =~ m/startTime="(.*?)"/g;
statement does nothing more than pose a potential pitfall for future code, either in execution or development. Why leave it in?

Also, please pay attention to other replies advocating an XML-parsing approach to what is essentially XML.

And if you choose to stick with regexes, please consider kcott's wise advice here about using ([^"]*) to capture the unescaped body of a double-quoted string.
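
A minimal sketch of how those two suggestions might combine, assuming the sample string from the question. The \b anchors are my own addition, not something from kcott's reply.

```perl
use strict;
use warnings;

# Sample input taken from the question.
my $timeLimit = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';

# No /g modifier; list context returns the capture directly.
# \b keeps the attribute name from matching inside a longer name.
my ($startTime) = $timeLimit =~ /\bstartTime="([^"]*)"/;
my ($endTime)   = $timeLimit =~ /\bendTime="([^"]*)"/;

print "[$startTime],[$endTime]\n"
    if defined $startTime and defined $endTime;
```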

Give a man a fish: <%-{-{-{-<


Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by kcott (Archbishop) on Dec 05, 2016 at 06:42 UTC

G'day sp4rperl,

Welcome to the Monastery.

I see tybalt89 has provided a fix for your specific problem and Athanasius has provided an explanation of that fix along with some additional information.

As a general rule for matching between delimiters, consider simply finding the start delimiter and then matching everything which follows that isn't the end delimiter. So, your captures would look like ([^"]*). I find this:

Makes it obvious what you want to capture (in this case, everything that isn't a double quote).
Means you don't have to worry about greediness (i.e. adding the '?' after '.*').
Avoids '.' not capturing a newline because an 's' modifier was forgotten (in fact, you haven't used an 's' modifier but your subsequent use of chomp suggests you thought you might capture a newline at the end). See perlre: Modifiers for more on this; also look at the 'm' modifier.

Here are some quick examples showing same/different delimiter pairs matching some/no enclosed text:

$ perl -E 'my ($s, $e) = qw{" "}; q{a"b"c} =~ /$s([^$e]*)/; say "|$1|"'
|b|
$ perl -E 'my ($s, $e) = qw{" "}; q{a""c} =~ /$s([^$e]*)/; say "|$1|"'
||
$ perl -E 'my ($s, $e) = qw{< >}; q{a<b>c} =~ /$s([^$e]*)/; say "|$1|"'
|b|
$ perl -E 'my ($s, $e) = qw{< >}; q{a<>c} =~ /$s([^$e]*)/; say "|$1|"'
||

Here are a few more examples, with embedded newlines, showing:

([^"]*) capturing text as is.
(.*?) capturing nothing as is.
(.*?) capturing text when the 's' modifier is added.

$ perl -E 'my ($s, $e) = qw{" "}; qq{a"b\n"c} =~ /$s([^$e]*)/; say "|$1|"'
|b
|
$ perl -E 'my ($s, $e) = qw{" "}; qq{a"b\n"c} =~ /$s(.*?)$e/; say "|$1|"'
||
$ perl -E 'my ($s, $e) = qw{" "}; qq{a"b\n"c} =~ /$s(.*?)$e/s; say "|$1|"'
|b
|

When dealing with data where the enclosed text may include an escaped delimiter (e.g. "abc\"xyz") neither the (.*?) nor the ([^"]*) will work (for that example, both will capture 'abc\'). In these cases, you'll need a somewhat more complex regular expression: see perlre: Quantifiers and search for 'the typical "match a double-quoted string" problem'. [Note: You won't have this issue with HTML.]
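
As a rough illustration of that kind of expression (my own sketch, not the exact pattern from perlre), the body can be matched as a run of either "anything but a quote or backslash" or "a backslash followed by any character". The input string here is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: a double-quoted string containing an escaped quote.
my $text = 'say "abc\"xyz" now';

# Stops at the first double quote, escaped or not.
my ($naive) = $text =~ /"([^"]*)"/;

# Treats backslash-plus-character as a unit, so \" does not end the string.
my ($escaped) = $text =~ /"((?:[^"\\]|\\.)*)"/s;

print "naive:   |$naive|\n";
print "escaped: |$escaped|\n";
```

The captured text still contains the backslashes; unescaping it is a separate step.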

— Ken


Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by Athanasius (Archbishop) on Dec 05, 2016 at 03:56 UTC

Hello sp4rperl, and welcome to the Monastery!

To elaborate a little on tybalt89’s answer: By declaring $endTime with my, you make it a lexical variable whose scope is limited to the enclosing block. So when the print statement is reached, $endTime no longer refers to that lexical variable, but rather to an (undeclared) package global of the same name. If you begin your script with:

use strict;

then Perl will give you an error message describing the problem. It’s also a very good idea to add:

use warnings;

to the top of every script. Note also that the /g modifiers on your regular expressions do nothing useful (Update: thanks to AnomalousMonk for the correction below), as in each case you’re looking for a single match only. And you need only one regular expression for the endTime match:

use strict;
use warnings;

my $timeLimit = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';
my ($startTime) = $timeLimit =~ /startTime="(.*?)"/;
chomp($startTime);

if (my ($endTime) = $timeLimit =~ /endTime="(.*?)"/) 
{
    chomp($endTime); 
    print "[$startTime],[$endTime]\n";
}
else
{
    print "[$startTime]\n";
}

(I’m assuming that chomp($timeLimit); is a mistake for chomp($startTime);.)

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by haukex (Archbishop) on Dec 05, 2016 at 09:22 UTC

Hi sp4rperl,

Don't parse HTML with regexes. (Update: Ok, to put it a different way, the set of XML/HTML data where it might be appropriate to use a regex instead of a module is pretty small. To justify using a regex, you'd have to be absolutely certain of all of your input data. Also, your input data would have to be fairly large to justify an argument that using a regex is faster than a full parser. Unless that's the case here, if you're unsure about how to get a regex to work, then why not let a module take that off your hands? Also, in case this is a worry: Yes, even you can use CPAN.)

The following are all legal variations on that same exact tag (the last example depends on whether this is XML, which I'm guessing it is, because AFAIK timeLimit is not an HTML tag). Mix and match these as you please, but your parser would have to handle all of them:

<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>
<!-- order -->
<timeLimit startTime="2016-09-30T00:00:00" endTime="2016-12-28T23:59:59"></timeLimit>
<!-- quotes -->
<timeLimit endTime='2016-12-28T23:59:59' startTime='2016-09-30T00:00:00'></timeLimit>
<!-- mixed quotes -->
<timeLimit endTime="2016-12-28T23:59:59" startTime='2016-09-30T00:00:00'></timeLimit>
<!-- whitespace -->
<timeLimit  endTime = "2016-12-28T23:59:59"  startTime  =  "2016-09-30T00:00:00" ></timeLimit  >
<!-- newlines -->
<timeLimit
endTime="2016-12-28T23:59:59"
startTime="2016-09-30T00:00:00">
</timeLimit>
<!-- even more whitespace -->
<timeLimit  
  endTime  
  =  
  "2016-12-28T23:59:59"  
  startTime  
  =  
  "2016-09-30T00:00:00"  
  ></timeLimit  
  >
<!-- empty element tag -->
<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"/>

Now you might say that you assume your input isn't going to change. But can you really guarantee that in every case? What if who/whatever is generating this HTML/XML changes the output even a little bit? Also, since the appropriate modules are fairly easy to use, why not just use a module that can handle all of the above cases?
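
To make that concrete, here is a small sketch that runs the regex from the thread over three of the variations listed above. The hash names %variant and %matched are just for illustration.

```perl
use strict;
use warnings;

# Three of the single-line variations from the list above.
my %variant = (
    original      => q{<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>},
    single_quotes => q{<timeLimit endTime='2016-12-28T23:59:59' startTime='2016-09-30T00:00:00'></timeLimit>},
    whitespace    => q{<timeLimit  endTime = "2016-12-28T23:59:59"  startTime  =  "2016-09-30T00:00:00" ></timeLimit  >},
);

# Record which variants the double-quote regex still handles.
my %matched;
for my $name (sort keys %variant) {
    my ($start) = $variant{$name} =~ /startTime="(.*?)"/;
    $matched{$name} = defined $start ? 1 : 0;
    printf "%-13s %s\n", $name,
        $matched{$name} ? "matched [$start]" : 'no match';
}
```

Only the original form matches; the regex could be loosened to cover the other two, but each new variation needs another patch.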

That's why using an XML/HTML parser is better than regexes. For example, what davido showed works on all of these examples. Here are two more examples, the first assuming this is HTML (HTML::Parser), the second using a different XML module, XML::LibXML.

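# Sample input from the question; both examples below parse $data.
my $data = q{<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>};

# HTML::Parser: call start_tag() for every start tag.
use HTML::Parser;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [\&start_tag, "tagname, attr"],
    case_sensitive => 1,   # keep tag and attribute names as written
);
sub start_tag {
    my ($tag,$attr) = @_;
    if ($tag eq 'timeLimit') {
        print "start=$$attr{startTime} end=$$attr{endTime}\n";
    }
}
$p->parse($data);
$p->eof;

# XML::LibXML: select the elements with XPath.
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => $data);
for my $node ($dom->findnodes('//timeLimit')) {
    my $start = $node->getAttribute('startTime');
    my $end = $node->getAttribute('endTime');
    print "s=$start e=$end\n";
}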

Hope this helps,
-- Hauke D


Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by tybalt89 (Monsignor) on Dec 05, 2016 at 03:41 UTC

#!/usr/bin/perl

# http://perlmonks.org/?node_id=1177192

use strict;
use warnings;

my $timeLimit = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';
$timeLimit =~ m/startTime="(.*?)"/; 
my $startTime = $1; 
chomp($timeLimit); 
my $endTime;
if ($timeLimit =~ m/endTime/) 
{ 
    $timeLimit =~ m/endTime="(.*?)"/; 
    $endTime = $1; 
    chomp($endTime); 
} 
print "[$startTime],[$endTime]\n";