luxlunae has asked for the wisdom of the Perl Monks concerning the following question:

So I'm stripping out information from some html and trying to remove information + tags from between tags.
$words =~ s/<*>//g;
The idea is to reduce:
<span class="author-name" itemprop="author">Romaxton</span>
to
Romaxton
Instead I end up with:
<span class="author-name" itemprop="author" Romaxton</span>
I've never been great with regex :(.

Replies are listed 'Best First'.
Re: Problem with <> and regex
by choroba (Cardinal) on Mar 11, 2014 at 15:30 UTC
    It seems you are trying to handle HTML with regexes, which is a painful way to do it. Instead, take a look at a real parser to help you: HTML::TreeBuilder or XML::LibXML.

    For example, in XML::XSH2, a wrapper around XML::LibXML, you can write just

    open :F html file.html ; my $words = //span[@itemprop="author"]/text() ;
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      People often object that using a full-blown HTML/XML parser on "just a simple string" is overkill: it's "too much code". The reply to this is that a "simple string" all too often becomes complicated (*ML is, after all, a complicated spec), and then the overhead of maintaining a regex-based solution can explode. Do you know of a tutorial or discussion on this or any site along the lines of Dominus's Why it's stupid to `use a variable as a variable name' that addresses "Why It's Stupid to Parse HTML/XML With Regexes"?

        I usually link to this question on StackOverflow. Its top answer is quite funny, but some of the other answers are more informative.
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Problem with <> and regex
by golux (Chaplain) on Mar 11, 2014 at 15:15 UTC
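    Hi luxlunae, The "<*" part of your regex means "match any number of less-than (<), including zero". So the whole thing will get rid of any number of "<" immediately followed by a single ">". Because zero "<" is allowed, a lone ">" matches too, so in practice the substitution deletes every ">" in the string and leaves everything else alone.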

    Closer (though still not correct) is:

    $words =~ s/<.*>//g;

    which means "get rid of "<" and ">" and anything between". It is still not correct because ".*" is greedy: it matches from the first "<" to the last ">" on the line, so it deletes several <...> ... <...> in one go, including the text between them (try it and see). That is to say, it matches (and deletes) this entire line:

    <span class="author-name" itemprop="author">Romaxton</span>

    A real solution would be:

    $words =~ s/<[^>]+>//g;

    where the "[^>]+" part means "1 or more of any character except greater-than ">". That regex should therefore get rid of all occurrences of <...> in the line, without removing non-tag text in between.

    Edit: it's worth pointing out that another solution would be to add the "non-greedy" modifier "?" to the quantifier in the "still not correct" example I gave above:

    $words =~ s/<.*?>//g;

    which would have the effect of matching the shortest possible "<...>" each time, and thus avoid getting multiple pairs.
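To see the four patterns discussed above side by side, here is a minimal sketch that applies each one to the sample line from the question. The helper name strip_with and the @patterns table are made up for this illustration; nothing else is assumed beyond core Perl.

```perl
use strict;
use warnings;

my $html = '<span class="author-name" itemprop="author">Romaxton</span>';

# Each entry: a label for printing and the compiled pattern to delete globally.
my @patterns = (
    [ '<*>'     => qr/<*>/     ],  # zero or more "<", then one ">"
    [ '<.*>'    => qr/<.*>/    ],  # greedy: first "<" to last ">"
    [ '<[^>]+>' => qr/<[^>]+>/ ],  # negated character class: one tag at a time
    [ '<.*?>'   => qr/<.*?>/   ],  # non-greedy: shortest "<...>" each time
);

# Hypothetical helper: delete every match of $re from a copy of $text.
sub strip_with {
    my ($re, $text) = @_;
    (my $copy = $text) =~ s/$re//g;
    return $copy;
}

for my $p (@patterns) {
    my ($label, $re) = @$p;
    printf "%-8s => '%s'\n", $label, strip_with($re, $html);
}
```

On this input the first pattern only removes the ">" characters, the greedy one leaves an empty string, and the last two both leave Romaxton.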

    Edit 2: fixed misspelling of "$word" to "$words".

    say  substr+lc crypt(qw $i3 SI$),4,5
      The first solution did indeed delete my entire line, but the second option just crashes the script :(. This is what fails:
      sub clean {
          my ($words) = @_;
          print "WordsBefore: $words \n";
          $word =~ s/<[^>]+>//g;
          print "WordsAfter: $words \n";
          return $words;
      }
        How exactly is it "crashing your script"? Is it providing any error message? Any output?

        Edit:   I just noticed that you're passing "$words", but then operating on "$word", which is probably your error. Granted you probably cut and paste what I wrote (so the error is actually mine -- sorry!). Change "$word" to "$words".

        You should also have:

        use strict; use warnings;
        at the top of your script (maybe you do, and that's why your script was failing). If not, add them; they'll tell you what you're doing wrong in exactly this type of situation.
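As a rough illustration of how strict catches this kind of typo, the sketch below compiles a buggy version of the sub from a string so the compile-time error can be captured and printed instead of stopping the script. The names clean_buggy and $buggy_source are invented for the example; the string eval is only a demonstration device, not something you would keep in real code.

```perl
use strict;
use warnings;

# The sub with the typo, held in a string so its compile error can be trapped.
my $buggy_source = <<'END';
sub clean_buggy {
    my ($words) = @_;
    $word =~ s/<[^>]+>//g;   # typo: $word was never declared
    return $words;
}
1;
END

# String eval inherits "use strict", so compilation fails and $@ holds the message.
my $compiled = eval $buggy_source;
my $error    = $@;
print "strict says: $error" unless $compiled;

# The corrected sub operates on the variable that was actually declared.
sub clean {
    my ($words) = @_;
    $words =~ s/<[^>]+>//g;
    return $words;
}

print clean('<span class="author-name" itemprop="author">Romaxton</span>'), "\n";
```

Without strict, the same typo compiles silently: the substitution runs on an empty global $word and $words comes back unchanged.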
        say  substr+lc crypt(qw $i3 SI$),4,5
Re: Problem with <> and regex
by kcott (Archbishop) on Mar 11, 2014 at 23:21 UTC

    G'day luxlunae,

    This matches your requirements :-)

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my $html = '<span class="author-name" itemprop="author">Romaxton</span>';
    my $re = qr{<[^>]+>([^<]*)<[^>]+>};

    print "The idea is to reduce:\n", $html;
    $html =~ s/$re/$1/;
    print "to\n", $html;

    Output:

    The idea is to reduce:
    <span class="author-name" itemprop="author">Romaxton</span>
    to
    Romaxton

    -- Ken

Re: *fixed*Problem with <> and regex
by Laurent_R (Canon) on Mar 11, 2014 at 22:44 UTC
    If you really want to reduce:
    <span class="author-name" itemprop="author">Romaxton</span>
    you could use something like this (untested):
    s/<[^>]+(\w+)/$1/;

      That doesn't actually work:

      c:\@Work\Perl>perl -wMstrict -le
      "my $s = '<span class=\"author-name\" itemprop=\"author\">Romaxton</span>';
       ;;
       $s =~ s/<[^>]+(\w+)/$1/;
       print qq{'$s'};
      "
      'r">Romaxton</span>'
        Yes, you are right. I was not in a position to test when I posted and I missed parts of it. I was thinking about something like this (assuming the string is in $_):
        s/<[^>]+>(\w+).*/$1/;
        which does work, but there are actually some easier ways, such as:
        print $1 if />(\w+)</;
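A small sketch of the capture approach, written with a named variable instead of $_ and $1. It also shows one limit of \w+ that may or may not matter for your data: it stops at the first non-word character. The second sample string and the name "Anne Example" are invented for the illustration.

```perl
use strict;
use warnings;

my $html = '<span class="author-name" itemprop="author">Romaxton</span>';

# A match in list context returns the captures, or an empty list on failure,
# so $name stays undef when nothing matched.
my ($name) = $html =~ />(\w+)</;
print defined $name ? "$name\n" : "no match\n";

# Hypothetical second input: the text contains a space, which \w does not match.
my $html2 = '<span itemprop="author">Anne Example</span>';
my ($word_only) = $html2 =~ />(\w+)</;     # no match, stays undef
my ($any_text)  = $html2 =~ />([^<]+)</;   # everything up to the next "<"

print defined $word_only ? "$word_only\n" : "no match with \\w+\n";
print "$any_text\n";
```

Checking defined before using the result avoids printing a stale $1 left over from an earlier successful match.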