in reply to Some portion of the text missing
Perhaps what the regex is doing is surprising you?
To debug something like this, capture and print the other terms in the regex.
#!/usr/bin/perl -w
use strict;

while (<DATA>) {
    if ($_ =~ /(<.+>)(<.+>)(.*)/) {
        print "$1\n\n";
        print "$2\n\n";
        print "$3\n\n";
    }
}
What I called $3 is what you called $1. Remember that regex quantifiers are "greedy" by default, meaning that an expression matches the longest thing it can while still allowing the rest of the regex to match. So each of these <.+> terms matches as much stuff as possible between an opening and a closing angle bracket. The first term takes everything up to the last ">" that is directly followed by another angle-bracket group, the second term (update: it only gets what the first term has to leave for it) runs from there to the last ">" on the line, and the third term gets whatever is left after the second.

$1 is:
<TITLE><![CDATA[<p>Dogs may not smarter than 6-year-olds, but researchers suggest canines might be on par with 2-year-olds.< Psychologist Stanley Coren says, "We do know that dogs understand far more than we credit them with, from about 165 words to 250 words." Even better than understanding our words, dogs know our hand gestures and body postures. Dogs may, in fact, far exceed 2-year-olds when it comes to reading emotions.<BODY>

$2 is:
<![CDATA[<p>Developmentally, 2-year-olds are generally more interested in themselves, while dogs do care how their people feel, and instantly recognize a change in emotion.< "While your dog can't comprehend that you just received a traffic violation, he can tell that you're upset the second you walk through the door," Coren says. "In fact, dogs can detect some subtle changes which even adults can't," adds Coren. "We can't smell cancer or predict seizures, as dogs can."< When I posted this story on my Facebook Fan page recently (<a href=" http://www.new.facebook.com/pages/ Steve-Dale/50057343596?ref=ts">

$3 is:
www.new.f acebook.com/pages/Steve-Dale/50057343596?ref=ts, or simply type Steve Dale into the Facebook search), I received some interesting responses:< Kelle: "Heck, my Italian Greyhound is smarter than most college students."< Karen: "Depends on how you define smart.
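The greedy behaviour is easier to see on a short string. The sketch below applies the same regex to a made-up line (the value of $line is an assumption that only mimics the shape of the original DATA, it is not the real input):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical short input standing in for one line of the original DATA.
my $line = '<TITLE><![CDATA[<p>first<BODY><![CDATA[<p>second <a href="x">rest';

# Same greedy regex as above; the captures are returned as a list.
my @greedy = $line =~ /(<.+>)(<.+>)(.*)/;

print "\$1 is: $greedy[0]\n";   # <TITLE><![CDATA[<p>first<BODY>
print "\$2 is: $greedy[1]\n";   # <![CDATA[<p>second <a href="x">
print "\$3 is: $greedy[2]\n";   # rest
```

The first term swallows everything up to <BODY>, because that is the last ">" with another "<" directly behind it, and the second term then runs on to the final ">" of the line.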
Update: Try the code below. You are still going to get the same result for the (.*) term, which is what I called $3 above.

#!/usr/bin/perl -w
use strict;

while (<DATA>) {
    if ($_ =~ /(<.+>)(.*)/) {
        print "$1\n\n";
        print "$2\n\n";
    }
}
Basically every char of text is "accounted for", nothing is "missing". We know what you called $1 matches. What are you trying to match?
Another Update with a minimal match example:
The regex below puts the ? modifier after each + to say: match the shortest thing possible between the angle brackets. $1 and $2 are then the first two angle-bracket things in your DATA, and $3 would be everything else following.
#!/usr/bin/perl -w
use strict;

while (<DATA>) {
    if ($_ =~ /(<.+?>)(<.+?>)(.*)/) {
        print "$1\n\n";   # prints <TITLE>
        print "$2\n\n";   # prints <![CDATA[<p>
    }
}
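One caveat with minimal matching: .+? starts short, but it will still grow past a ">" if that is the only way for the whole regex to succeed. The sketch below shows this on a made-up string (the value of $line is hypothetical) and compares it with a negated character class, [^>]+, which is a different technique from the one used above and can never cross a ">":

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: plain text sits between the first tag and the next two.
my $line = '<A>x<B><C>tail';

# Non-greedy: .+? prefers short matches but may extend so the rest can match.
my @lazy = $line =~ /(<.+?>)(<.+?>)(.*)/;

# Negated class: [^>]+ stops at the first '>', so each group is a single tag.
my @negated = $line =~ /(<[^>]+>)(<[^>]+>)(.*)/;

print "non-greedy:    \$1=$lazy[0] \$2=$lazy[1] \$3=$lazy[2]\n";
# non-greedy:    $1=<A>x<B> $2=<C> $3=tail
print "negated class: \$1=$negated[0] \$2=$negated[1] \$3=$negated[2]\n";
# negated class: $1=<B> $2=<C> $3=tail
```

On the DATA in this thread both forms should capture the same first two groups, since <TITLE> is followed directly by another "<". Which one fits better depends on what you are actually trying to match.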
Re^2: Some portion of the text missing
by Anonymous Monk on Oct 15, 2009 at 05:53 UTC
by Marshall (Canon) on Oct 15, 2009 at 06:57 UTC