*fixed* Problem with <> and regex

luxlunae has asked for the wisdom of the Perl Monks concerning the following question:

Re: Problem with <> and regex by choroba (Cardinal) on Mar 11, 2014 at 15:30 UTC
It seems you are trying to handle HTML with regexes. That is a painful way to do it. Instead, take a look at a real parser such as HTML::TreeBuilder or XML::LibXML. For example, in XML::XSH2, a wrapper around `XML::LibXML`, you can write just `open :F html file.html ; my $words = //span[@itemprop="author"]/text() ;` لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: Problem with <> and regex by AnomalousMonk (Archbishop) on Mar 11, 2014 at 22:57 UTC
People often object that using a full-blown HTML/XML parser on "just a simple string" is overkill: it's "too much code". The reply to this is that a "simple string" all too often becomes complicated (*ML is, after all, a complicated spec), and then the overhead of maintaining a regex-based solution can explode. Do you know of a tutorial or discussion on this or any site along the lines of Dominus's Why it's stupid to `use a variable as a variable name' that addresses "Why It's Stupid to Parse HTML/XML With Regexes"?
Re^3: Problem with <> and regex by choroba (Cardinal) on Mar 11, 2014 at 23:09 UTC
I usually link to this question on StackOverflow. Its top answer is quite funny, but some of the other answers are more informative. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Problem with <> and regex by golux (Chaplain) on Mar 11, 2014 at 15:15 UTC
Hi luxlunae, The `<*` part of your regex means "match any number of less-than (<), including zero". So the whole thing will get rid of any number of "<" immediately followed by a single ">". Closer (though still not correct) is: `$words =~ s/<.*>//g;` which means "get rid of '<' and '>' and anything between". It is still not correct because the greedy `.*` runs to the last ">" on the line, so a single match swallows multiple <...> ... <...> pairs, including the text between them (try it and see). That is to say, it matches (and deletes) this entire line: `<span class="author-name" itemprop="author">Romaxton</span>` A real solution would be: `$words =~ s/<[^>]+>//g;` where the "[^>]+" part means "1 or more of any character except greater-than (>)". That regex should therefore get rid of all occurrences of <...> in the line, without removing non-tag text in between. Edit: it's worth pointing out that another solution would be to add the "non-greedy" modifier "?" to the quantifier in the "still not correct" example I gave above: `$words =~ s/<.*?>//g;` which would have the effect of matching the shortest possible "<...>" each time, and thus avoid getting multiple pairs. Edit 2: fixed misspelling of "$word" to "$words". say substr+lc crypt(qw $i3 SI$),4,5
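The three substitutions discussed in this reply are easier to compare side by side. The following is only a minimal sketch run against the sample line from the thread; the variable names `$greedy`, `$negated` and `$lazy` are made up for illustration.

```perl
use strict;
use warnings;

my $line = '<span class="author-name" itemprop="author">Romaxton</span>';

# Greedy: .* runs to the LAST ">" on the line, so the text goes too.
(my $greedy = $line) =~ s/<.*>//g;

# Negated class: [^>]+ cannot cross a ">", so each tag is removed on its own.
(my $negated = $line) =~ s/<[^>]+>//g;

# Non-greedy: .*? stops at the FIRST ">" after each "<".
(my $lazy = $line) =~ s/<.*?>//g;

print "greedy:  '$greedy'\n";
print "negated: '$negated'\n";
print "lazy:    '$lazy'\n";
```

On this input the greedy version leaves an empty string, while the other two leave `Romaxton`. The last two are not fully interchangeable: without the `/s` modifier `.` does not match a newline, so `.*?` misses a tag that spans two lines, whereas `[^>]+` matches across the newline. Neither copes with a ">" inside an attribute value.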
Re^2: Problem with <> and regex by luxlunae (Novice) on Mar 11, 2014 at 15:20 UTC
The first solution did indeed delete my entire line, but the second option just crashes the script :(. This is what fails: `sub clean { my ($words) = @_; print "WordsBefore: $words \n"; $word =~ s/<[^>]+>//g; print "WordsAfter: $words \n"; return $words; }`
Re^3: Problem with <> and regex by golux (Chaplain) on Mar 11, 2014 at 15:23 UTC
How exactly is it "crashing your script"? Is it providing any error message? Any output? Edit: I just noticed that you're passing "$words", but then operating on "$word", which is probably your error. Granted, you probably cut and pasted what I wrote (so the error is actually mine -- sorry!). Change "$word" to "$words". You should also have: `use strict; use warnings;` at the top of your script (maybe you do, and that's why your script was failing). If not, add them; they'll tell you what you're doing wrong in exactly this type of situation. say substr+lc crypt(qw $i3 SI$),4,5
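To illustrate the point about `use strict`, here is a small sketch of the corrected sub, followed by a string `eval` that compiles the original typo so the compile-time error can be printed instead of stopping the program. The `eval` wrapper and the `$ok` and `$err` variables are only there for the demonstration and are not part of the original script.

```perl
use strict;
use warnings;

# Corrected version of the clean() sub: the substitution now acts on
# $words, the variable that was actually declared.
sub clean {
    my ($words) = @_;
    $words =~ s/<[^>]+>//g;
    return $words;
}

print clean('<span itemprop="author">Romaxton</span>'), "\n";

# Compile the original typo inside a string eval to see what strict reports.
my $ok = eval q{
    use strict;
    my $words = '<b>x</b>';
    $word =~ s/<[^>]+>//g;   # typo: $word was never declared
    1;
};
my $err = $@;
print "strict refused to compile it: $err" unless $ok;
```

Under `use strict` the typo is a compile-time error ("Global symbol "$word" requires explicit package name"), so the script stops before it runs anything. Without strict the substitution would quietly act on an unrelated, empty global and `$words` would come back unchanged.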
Re: Problem with <> and regex by kcott (Archbishop) on Mar 11, 2014 at 23:21 UTC
G'day luxlunae, This matches your requirements :-) `#!/usr/bin/env perl -l use strict; use warnings; my $html = '<span class="author-name" itemprop="author">Romaxton</span>'; my $re = qr{<[^>]+>([^<]*)<[^>]+>}; print "The idea is to reduce:\n", $html; $html =~ s/$re/$1/; print "to\n", $html;` Output: `The idea is to reduce: <span class="author-name" itemprop="author">Romaxton</span> to Romaxton` -- Ken
Re: fixedProblem with <> and regex by Laurent_R (Canon) on Mar 11, 2014 at 22:44 UTC
If you really want to reduce: `<span class="author-name" itemprop="author">Romaxton</span>` you could use something like this (untested): `s/<[^>]+(\w+)/$1/;`
Re^2: fixedProblem with <> and regex by AnomalousMonk (Archbishop) on Mar 11, 2014 at 22:54 UTC
That doesn't actually work: `c:\@Work\Perl>perl -wMstrict -le "my $s = '<span class=\"author-name\" itemprop=\"author\">Romaxton</span>'; ;; $s =~ s/<[^>]+(\w+)/$1/; print qq{'$s'}; "` prints `'r">Romaxton</span>'`
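The leftover `r` comes from backtracking: with no ">" in the pattern, the greedy `[^>]+` first takes everything up to the closing ">" of the opening tag, then gives characters back until `\w+` can match at least one word character. A small sketch that makes this visible; the extra capture group around `[^>]+` is added here purely for illustration and is not part of the posted regex.

```perl
use strict;
use warnings;

my $s = '<span class="author-name" itemprop="author">Romaxton</span>';

# Capture both halves to see where the split between them ends up.
if ($s =~ /<([^>]+)(\w+)/) {
    print "class part: '$1'\n";
    print "word part:  '$2'\n";
}
```

The class keeps everything through `autho`, and `\w+` is left with the single `r` just before the closing quote, which is why the substitution turns the tag into `r">`.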
Re^3: fixedProblem with <> and regex by Laurent_R (Canon) on Mar 12, 2014 at 18:43 UTC
Yes, you are right. I was not in a position to test when I posted and I missed parts of it. I was thinking about something like this (assuming the string is in $_): `s/<[^>]+>(\w+).*/$1/;` which does work, but there are actually some easier ways, such as: `print $1 if />(\w+)</;`
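The capture-based approach can also be written against a named variable rather than `$_`. This is a sketch only; the second input string, with a space in the name, is an invented example that shows where `\w+` falls short.

```perl
use strict;
use warnings;

my $html = '<span class="author-name" itemprop="author">Romaxton</span>';

# In list context a match returns its captures directly.
my ($name) = $html =~ />(\w+)</;
print "name: $name\n" if defined $name;

# \w+ stops at the first non-word character, here the space.
my $spaced = '<span itemprop="author">Mary Ann</span>';
(my $short = $spaced) =~ s/<[^>]+>(\w+).*/$1/;

# [^<]+ keeps everything up to the next tag instead.
my ($full) = $spaced =~ />([^<]+)</;
print "short: $short, full: $full\n";
```

With the thread's sample line both forms give `Romaxton`. Once the text contains spaces, hyphens or apostrophes, `\w+` truncates it, so a class such as `[^<]+` is the safer capture if a regex has to be used at all.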