sp4rperl has asked for the wisdom of the Perl Monks concerning the following question:
    #!/usr/bin/perl
    my $timeLimit = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';
    $timeLimit =~ m/startTime="(.*?)"/g;
    my $startTime = $1;
    chomp($timeLimit);
    if ($timeLimit =~ m/endTime/) {
        $timeLimit =~ m/endTime="(.*?)"/g;
        my $endTime = $1;
        chomp($endTime);
    }
    print "[$startTime],[$endTime]\n";
Hello fellow monks! I want to extract the dates enclosed in the double quotes. Please help me understand what change is needed in the code to obtain the desired output.

Desired output: [2016-09-30T00:00:00],[2016-12-28T23:59:59]

Unexpected output obtained: [2016-09-30T00:00:00],[]
Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by davido (Cardinal) on Dec 05, 2016 at 05:23 UTC
This looks like an XML element. You might consider an XML parser too heavy for simply grabbing a couple of dates, but parsing libraries exist because XML is not as simple as people wish it were. Regular expressions, as powerful as they are, become the basis for fragile solutions when employed as simple XML parsers. One problem is that regular expressions alone often are guided to examine a document as a string of characters, without considering its semantic meaning. XML parsers deal with the semantics, and consequently facilitate more reliable parsing. Here's an example using XML::Twig:
The output is:
To get output similar to what your script seemed to be attempting, you might do it this way:
This produces the following:
Notice how it's now not a double-quote issue at all; it's a matter of deciding on a way to drill down to the specific attributes you are interested in and keep track of their content. By side-stepping the regex parsing altogether, we've also avoided issues such as whitespace, newlines showing up mid-element, embedded quotes, and a number of other problems that eventually break regexp-based approaches to scraping XML.

If this is actually HTML as your title states, then use one of the many capable HTML parsers, also on CPAN.

Dave
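For readers following along, here is a minimal sketch of what an XML::Twig approach to this element might look like. It is not the listing from the reply above; the `%times` hash and the handler body are assumptions chosen for illustration, and it assumes XML::Twig is installed from CPAN.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';

# %times is a hypothetical holder for the attributes we care about.
my %times;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each timeLimit element has been fully parsed.
        timeLimit => sub {
            my ($t, $elt) = @_;
            $times{$_} = $elt->att($_) for qw(startTime endTime);
        },
    },
);
$twig->parse($xml);

print "[$times{startTime}],[$times{endTime}]\n";
```

The handler never looks at quote characters, so the order of the attributes and the quoting style in the input do not matter here.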
Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by AnomalousMonk (Archbishop) on Dec 05, 2016 at 05:00 UTC
Further to Athanasius's reply:

NB: It's not quite right to say that the /g modifiers in the m//g matches are doing nothing. In fact, they're actively screwing you over (even if you get beyond the improper scoping of the lexical in the if-block).

Because the m//g matches are being called in scalar context in both cases in the OPed code, the /g modifier acts to leave the match position string pointer where it is after the first (successful) match, and to start matching from that position in the second match. The first thing you search for in the string is 'startTime' followed by some stuff. Later, you search the same string for 'endTime' and some stuff, but you'll never find it because 'endTime' appears before 'startTime' and the regex engine (under the influence of the /g modifiers) has already passed by it in the string. This can be demonstrated by printing the pos match position of the string after the first match. (I've left out the chomp statements because I assume they really are useless.)

Removing either (or better yet, both!) of the confounding and potentially very confusing /g modifiers will get you what you want (if the lexical scoping problem is addressed too, of course).

Update: FWIW, my own preference in cases like this is to extract sub-strings from strings in list context and at the same time to generate an "extraction success" flag for possible later use:
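A minimal sketch of both points, using the string from the original question. The variable names (`$pos_after_start`, `$found_end`, `$got_start`, `$got_end`) are made up for illustration and are not taken from the reply. The first half shows where pos is left after a scalar-context m//g match; the second half shows list-context extraction with a success flag.

```perl
use strict;
use warnings;

my $timeLimit = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';

# Scalar-context m//g: a successful match leaves pos() just past the match.
$timeLimit =~ m/startTime="(.*?)"/g;
my $pos_after_start = pos $timeLimit;
print "pos after startTime match: $pos_after_start\n";
print "endTime begins at offset:  ", index($timeLimit, 'endTime'), "\n";

# The next scalar-context m//g resumes from that position, so it cannot see
# endTime, which sits earlier in the string. The failed match resets pos().
my $found_end = $timeLimit =~ m/endTime="(.*?)"/g ? 1 : 0;
print "endTime found with /g: $found_end\n";

# Without /g, in list context, each match starts from the beginning of the
# string and returns its captures. A list assignment evaluated in scalar
# context yields the number of values returned, which serves as the flag.
my $got_start = my ($startTime) = $timeLimit =~ m/startTime="(.*?)"/;
my $got_end   = my ($endTime)   = $timeLimit =~ m/endTime="(.*?)"/;

print "[$startTime],[$endTime]\n" if $got_start && $got_end;
```

The flags can be tested later without touching `$1`, which keeps the capture and the success check in one statement.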
Give a man a fish: <%-{-{-{-<
by sp4rperl (Initiate) on Dec 06, 2016 at 03:45 UTC
Hi AnomalousMonk, thanks for the comment. I understood that the problem was with the /g (global) modifier in the m//g patterns. I achieved the desired output, 2016-09-30T00:00:00,2016-12-28T23:59:59, by using the code below.
by AnomalousMonk (Archbishop) on Dec 06, 2016 at 13:32 UTC
It's good that you've found a solution to your problem, but you should realize that the /g match modifier in the

Also, please pay attention to other replies advocating an XML-parsing approach to what is essentially XML. And if you choose to stick with regexes, please consider kcott's wise advice here about using ([^"]*) to capture the unescaped body of a double-quoted string.

Give a man a fish: <%-{-{-{-<
Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by kcott (Archbishop) on Dec 05, 2016 at 06:42 UTC
G'day sp4rperl,

Welcome to the Monastery.

I see tybalt89 has provided a fix for your specific problem and Athanasius has provided an explanation of that fix along with some additional information.

As a general rule for matching between delimiters, consider simply finding the start delimiter and then matching everything which follows that isn't the end delimiter. So, your captures would look like ([^"]*). I find this:

Here's some quick examples showing same/different delimiter pairs matching some/no enclosed text:
Here's a few more examples, with embedded newlines, showing:
When dealing with data where the enclosed text may include an escaped delimiter (e.g. "abc\"xyz") neither the (.*?) nor the ([^"]*) will work (for that example, both will capture 'abc\'). In these cases, you'll need a somewhat more complex regular expression: see perlre: Quantifiers and search for 'the typical "match a double-quoted string" problem'. [Note: You won't have this issue with HTML.]

-- Ken
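A rough sketch of the cases discussed above: same and different delimiter pairs, empty enclosed text, an embedded newline, and an escaped delimiter. The sample strings and variable names are invented for illustration, and the escape-aware pattern is one common form rather than the exact expression from perlre.

```perl
use strict;
use warnings;

# Same delimiter on both sides: capture everything that is not the delimiter.
my ($quoted) = 'startTime="2016-09-30T00:00:00"' =~ /"([^"]*)"/;

# Different start/end delimiters: exclude only the end delimiter.
my ($bracketed) = 'see [some text] here' =~ /\[([^\]]*)\]/;

# No enclosed text: the match succeeds and captures an empty string.
my ($empty) = 'value=""' =~ /"([^"]*)"/;

# Embedded newline: '.' does not match "\n" without /s, so (.*?) fails here,
# while the negated character class matches newlines with no modifier.
my $multiline = qq{title="line one\nline two"};
my ($dot_capture)   = $multiline =~ /"(.*?)"/;
my ($class_capture) = $multiline =~ /"([^"]*)"/;

# Escaped delimiter: the simple form stops at the backslash-escaped quote.
my $escaped = '"abc\"xyz"';
my ($naive) = $escaped =~ /"([^"]*)"/;

# Escape-aware form: either a character that is neither a quote nor a
# backslash, or a backslash followed by any character.
my ($aware) = $escaped =~ /"((?:[^"\\]|\\.)*)"/s;

print "quoted:    $quoted\n";
print "bracketed: $bracketed\n";
print "empty:     '$empty'\n";
print "dot:       ", (defined $dot_capture ? 'matched' : 'no match'), "\n";
print "class:     ", ($class_capture =~ tr/\n// ? 'matched across newline' : 'single line'), "\n";
print "naive:     $naive\n";
print "aware:     $aware\n";
```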
Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by Athanasius (Archbishop) on Dec 05, 2016 at 03:56 UTC
Hello sp4rperl, and welcome to the Monastery!

To elaborate a little on tybalt89’s answer: By declaring $endTime with my inside the if block, you make it a lexical variable whose scope is limited to the enclosing block. So when the print statement is reached, $endTime no longer refers to that lexical variable, but rather to an (undeclared) package global of the same name. If you begin your script with use strict; then Perl will give you an error message describing the problem. It’s also a very good idea to add use warnings; to the top of every script.

Note also that the /g modifiers on your regular expressions do nothing useful (Update: thanks to AnomalousMonk for the correction below), as in each case you’re looking for a single match only. And you need only one regular expression for the endTime match:
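A minimal sketch of the script with those points applied, assuming the intent is simply to print both dates. `$endTime` is declared outside the `if` block so it is still in scope at the `print`, the /g modifiers are gone, and a single match both tests for and captures each attribute. The empty-string defaults are an assumption made here to keep `print` quiet under `use warnings` if an attribute is missing.

```perl
use strict;
use warnings;

my $timeLimit = '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>';

# Declare both variables at file scope so they outlive the if blocks.
my ($startTime, $endTime) = ('', '');

# One match per attribute: the test and the capture happen together.
if ($timeLimit =~ /startTime="(.*?)"/) {
    $startTime = $1;
}
if ($timeLimit =~ /endTime="(.*?)"/) {
    $endTime = $1;
}

print "[$startTime],[$endTime]\n";   # [2016-09-30T00:00:00],[2016-12-28T23:59:59]
```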
(I’m assuming that chomp($timeLimit); is a mistake for chomp($startTime);.) Hope that helps,
Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by haukex (Archbishop) on Dec 05, 2016 at 09:22 UTC
Hi sp4rperl,

Don't parse HTML with regexes. (Update: Ok, to put it a different way, the set of XML/HTML data where it might be appropriate to use a regex instead of a module is pretty small. To justify using a regex, you'd have to be absolutely certain of all of your input data. Also, your input data would have to be fairly large to justify an argument that using a regex is faster than a full parser. Unless that's the case here, if you're unsure about how to get a regex to work, then why not let a module take that off your hands? Also, in case this is a worry: Yes, even you can use CPAN.)

The following are all legal variations on that same exact tag (the last example depends on whether this is XML, which I'm guessing because AFAIK timeLimit is not an HTML tag). Mix and match these as you please, but your parser would have to handle all of them:
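To make the point concrete, here is a small sketch with a few hypothetical variants of the element. They are illustrative and not the list from the reply. Each is run through the simple double-quote regexes from the original question to see which ones still match.

```perl
use strict;
use warnings;

# Hypothetical variants of the same element (illustrative, not exhaustive).
my @variants = (
    # the form from the original question
    '<timeLimit endTime="2016-12-28T23:59:59" startTime="2016-09-30T00:00:00"></timeLimit>',
    # single-quoted attribute values
    q{<timeLimit endTime='2016-12-28T23:59:59' startTime='2016-09-30T00:00:00'></timeLimit>},
    # whitespace around the equals signs
    '<timeLimit endTime = "2016-12-28T23:59:59" startTime = "2016-09-30T00:00:00"></timeLimit>',
    # attributes reordered and split across lines
    qq{<timeLimit\n  startTime="2016-09-30T00:00:00"\n  endTime="2016-12-28T23:59:59"></timeLimit>},
    # self-closing form (XML only)
    '<timeLimit startTime="2016-09-30T00:00:00" endTime="2016-12-28T23:59:59"/>',
);

my $matched = 0;
for my $i (0 .. $#variants) {
    my ($start) = $variants[$i] =~ /startTime="(.*?)"/;
    my ($end)   = $variants[$i] =~ /endTime="(.*?)"/;
    my $ok = defined $start && defined $end;
    $matched++ if $ok;
    printf "variant %d: %s\n", $i + 1, $ok ? 'matched' : 'MISSED';
}
print "$matched of ", scalar @variants, " variants matched\n";
```

The single-quoted and spaced-equals variants are missed by these patterns, even though a parser would treat all five as the same element with the same attributes.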
Now you might say that you assume your input isn't going to change. But can you really guarantee that in every case? What if who/whatever is generating this HTML/XML changes the output even a little bit? Also, since the appropriate modules are fairly easy to use, why not just use a module that can handle all of the above cases? That's why using an XML/HTML parser is better than regexes. For example, what davido showed works on all of these examples. Here are two more examples, the first assuming this is HTML (HTML::Parser), the second using a different XML module, XML::LibXML.
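As a hedged illustration of the XML::LibXML route (a sketch, not the listing from this reply, and assuming XML::LibXML is installed from CPAN), the following reads both attributes from a deliberately awkward variant of the element with mixed quoting, extra whitespace, and a line break:

```perl
use strict;
use warnings;
use XML::LibXML;

# An awkward but legal variant: single quotes, spaces around '=', a newline,
# and the self-closing form.
my $xml = q{<timeLimit endTime = '2016-12-28T23:59:59'
    startTime="2016-09-30T00:00:00" />};

my $doc = XML::LibXML->load_xml(string => $xml);

# Take the first timeLimit element anywhere in the document.
my ($elem) = $doc->findnodes('//timeLimit');

my $startTime = $elem->getAttribute('startTime');
my $endTime   = $elem->getAttribute('endTime');

print "[$startTime],[$endTime]\n";
```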
Hope this helps,
Re: Pattern matching and deriving the data between the "(double quotes) in HTML tag
by tybalt89 (Monsignor) on Dec 05, 2016 at 03:41 UTC