Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag

Hi sp4rperl,

Don't parse HTML with regexes. (Update: OK, to put it a different way, the set of XML/HTML data where a regex might be appropriate instead of a module is pretty small. To justify using a regex, you'd have to be absolutely certain of the format of all of your input data. Also, your input data would have to be fairly large to justify the argument that a regex is faster than a full parser. Unless that's the case here, if you're unsure how to get a regex to work, why not let a module take that off your hands? And in case this is a worry: Yes, even you can use CPAN.)

The following are all legal variations on that same exact tag (the last example is only legal if this is XML, which I'm guessing it is, because AFAIK timeLimit is not an HTML tag). Mix and match these as you please, but your parser would have to handle all of them:

<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>
<!-- order -->
<timeLimit startTime="2016-09-30T00:00:00" endTime="2016-12-28T23:59:59"></timeLimit>
<!-- quotes -->
<timeLimit endTime='2016-12-28T23:59:59' startTime='2016-09-30T00:00:00'></timeLimit>
<!-- mixed quotes -->
<timeLimit endTime="2016-12-28T23:59:59" startTime='2016-09-30T00:00:00'></timeLimit>
<!-- whitespace -->
<timeLimit  endTime = "2016-12-28T23:59:59"  startTime  =  "2016-09-30T00:00:00" ></timeLimit  >
<!-- newlines -->
<timeLimit
endTime="2016-12-28T23:59:59"
startTime="2016-09-30T00:00:00">
</timeLimit>
<!-- even more whitespace -->
<timeLimit  
  endTime  
  =  
  "2016-12-28T23:59:59"  
  startTime  
  =  
  "2016-09-30T00:00:00"  
  ></timeLimit  
  >
<!-- empty element tag -->
<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"/>

Now you might say that you assume your input isn't going to change. But can you really guarantee that in every case? What if whoever or whatever is generating this HTML/XML changes the output even a little bit? Also, since the appropriate modules are fairly easy to use, why not just use a module that can handle all of the above cases?
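To make that concrete, here is a minimal sketch using core Perl only. The pattern in $naive and the helper naive_extract() are hypothetical: they are just what someone might write with only the first form of the tag in mind, not anything from the original question. The script runs that pattern against a few of the variations listed above.

```perl
use strict;
use warnings;

# Hypothetical naive pattern, written with only the first variation in mind:
# fixed attribute order, double quotes, single spaces, separate end tag.
my $naive = qr/<timeLimit endTime="([^"]*)" startTime="([^"]*)">/;

# Hypothetical helper: returns (endTime, startTime) on a match,
# or an empty list if the pattern does not match.
sub naive_extract {
    my ($xml) = @_;
    return $xml =~ $naive ? ($1, $2) : ();
}

# A few of the legal variations from the list above.
my @variants = (
    [ 'original'      => '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>' ],
    [ 'order'         => '<timeLimit startTime="2016-09-30T00:00:00" endTime="2016-12-28T23:59:59"></timeLimit>' ],
    [ 'quotes'        => q{<timeLimit endTime='2016-12-28T23:59:59' startTime='2016-09-30T00:00:00'></timeLimit>} ],
    [ 'whitespace'    => '<timeLimit  endTime = "2016-12-28T23:59:59"  startTime  =  "2016-09-30T00:00:00" ></timeLimit  >' ],
    [ 'empty element' => '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"/>' ],
);

for my $v (@variants) {
    my ($name, $xml) = @$v;
    my @found = naive_extract($xml);
    if (@found) {
        print "$name: end=$found[0] start=$found[1]\n";
    } else {
        print "$name: no match\n";
    }
}
```

Only the first variation matches; the other four are the same data as far as an XML parser is concerned, but the regex silently misses them. You could keep extending the pattern to cover each case, but at that point you are writing your own parser.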

That's why using an XML/HTML parser is better than regexes. For example, what davido showed works on all of these examples. Here are two more examples, the first assuming this is HTML (HTML::Parser), the second using a different XML module, XML::LibXML.

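use HTML::Parser;
# $data is assumed to hold the markup to be parsed
my $p = HTML::Parser->new(
    api_version => 3,
    # call start_tag() with the tag name and a hashref of its attributes
    start_h => [\&start_tag, "tagname, attr"],
    # keep the case of tag and attribute names (they're lowercased by default)
    case_sensitive => 1,
);
sub start_tag {
    my ($tag,$attr) = @_;
    if ($tag eq 'timeLimit') {
        print "start=$$attr{startTime} end=$$attr{endTime}\n";
    }
}
$p->parse($data);
$p->eof;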

use XML::LibXML;
# parse the whole document from the string in $data
my $dom = XML::LibXML->load_xml(string => $data);
# XPath: find every timeLimit element anywhere in the document
for my $node ($dom->findnodes('//timeLimit')) {
    my $start = $node->getAttribute('startTime');
    my $end = $node->getAttribute('endTime');
    print "s=$start e=$end\n";
}

Hope this helps,
-- Hauke D
