Hi sp4rperl,

Don't parse HTML with regexes. (Update: OK, to put it a different way, the set of XML/HTML data where it might be appropriate to use a regex instead of a module is pretty small. To justify using a regex, you'd have to be absolutely certain of all of your input data. Also, your input data would have to be fairly large to justify an argument that a regex is faster than a full parser. Unless that's the case here, and you're unsure about how to get a regex to work, why not let a module take that off your hands? Also, in case this is a worry: Yes, even you can use CPAN.)

The following are all legal variations on that same exact tag (the last example is only legal if this is XML, which I'm guessing it is, because AFAIK timeLimit is not an HTML tag). Mix and match these as you please; your parser would have to handle all of them:

<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>
<!-- order -->
<timeLimit startTime="2016-09-30T00:00:00" endTime="2016-12-28T23:59:59"></timeLimit>
<!-- quotes -->
<timeLimit endTime='2016-12-28T23:59:59' startTime='2016-09-30T00:00:00'></timeLimit>
<!-- mixed quotes -->
<timeLimit endTime="2016-12-28T23:59:59" startTime='2016-09-30T00:00:00'></timeLimit>
<!-- whitespace -->
<timeLimit endTime = "2016-12-28T23:59:59" startTime = "2016-09-30T00:00:00" ></timeLimit >
<!-- newlines -->
<timeLimit
  endTime="2016-12-28T23:59:59"
  startTime="2016-09-30T00:00:00">
</timeLimit>
<!-- even more whitespace -->
<timeLimit
  endTime = "2016-12-28T23:59:59"
  startTime = "2016-09-30T00:00:00"
  ></timeLimit
>
<!-- empty element tag -->
<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"/>

Now you might say that you assume your input isn't going to change. But can you really guarantee that in every case? What if who/whatever is generating this HTML/XML changes the output even a little bit? Also, since the appropriate modules are fairly easy to use, why not just use a module that can handle all of the above cases?
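To make that concrete, here is a minimal sketch using only core Perl. The pattern in $naive is hypothetical: it is not from your post, just the kind of regex one might write against the original layout of the tag. The variant names and the %variant hash are likewise made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical naive pattern (an assumption, not the OP's regex): it expects
# endTime first, double quotes, single spaces and an explicit closing tag.
my $naive = qr{<timeLimit endTime="([^"]*)" startTime="([^"]*)"></timeLimit>};

# A few of the legal variations listed above, keyed by illustrative names.
my %variant = (
    original   => q{<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>},
    order      => q{<timeLimit startTime="2016-09-30T00:00:00" endTime="2016-12-28T23:59:59"></timeLimit>},
    quotes     => q{<timeLimit endTime='2016-12-28T23:59:59' startTime='2016-09-30T00:00:00'></timeLimit>},
    whitespace => q{<timeLimit endTime = "2016-12-28T23:59:59" startTime = "2016-09-30T00:00:00" ></timeLimit >},
    empty_tag  => q{<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"/>},
);

# Try the naive pattern on each variant and report what it finds.
for my $name (sort keys %variant) {
    if ($variant{$name} =~ $naive) {
        print "$name: start=$2 end=$1\n";
    } else {
        print "$name: no match\n";
    }
}
```

Only the original form matches; the other four are skipped silently. You could keep extending the pattern to cover each case, but at that point you are writing your own parser.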

That's why using an XML/HTML parser is better than regexes. For example, what davido showed works on all of these examples. Here are two more examples, the first assuming this is HTML (HTML::Parser), the second using a different XML module, XML::LibXML.

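# Assuming HTML: HTML::Parser ($data holds the input document)
use HTML::Parser;
my $p = HTML::Parser->new(
    api_version    => 3,
    start_h        => [\&start_tag, "tagname, attr"],
    case_sensitive => 1,   # keep "timeLimit" as written instead of lowercasing it
);
# Called for every start tag with the tag name and a hashref of its attributes
sub start_tag {
    my ($tag,$attr) = @_;
    if ($tag eq 'timeLimit') {
        print "start=$$attr{startTime} end=$$attr{endTime}\n";
    }
}
$p->parse($data);
$p->eof;

# Assuming XML: XML::LibXML
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => $data);
# Find every timeLimit element anywhere in the document
for my $node ($dom->findnodes('//timeLimit')) {
    my $start = $node->getAttribute('startTime');
    my $end = $node->getAttribute('endTime');
    print "s=$start e=$end\n";
}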

Hope this helps,
-- Hauke D


In reply to Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag by haukex
in thread Pattern matching and deriving the data between the "(double quotes) in HTML tag by sp4rperl
