koober has asked for the wisdom of the Perl Monks concerning the following question:

Using Git for Windows, which includes Perl v5.24.1, I have successfully used this code to test that my match works for specific occurrences of this sequence of redundant html:

open $HTML, '+<', $libFile . '.html' or die 'Failed to open' . $libFile . '.html';
my $matchspotter = 0;
my $remove = m/<div><div class="blue"><\/div><\/div>/;
while ( my $line = <$HTML> ) {
    if( $line =~ m/<div><div class="blue"><\/div><\/div>/ ) {
        $matchspotter += 1;
    }
    #$line =~ s/$remove//;
    #print $HTML $line;
}
print $matchspotter;
close $HTML or die 'Failed to close' . $libFile . '.html';

The match stops working when the variable $remove is used in the if statement, but I only use the variable to illustrate my intent. The commented out lines are the bit I'm struggling with.

Searching has suggested this is silly to try without a cpan solution, but I would particularly like to try. (I've not yet used any modules, I would like to share with other Git users, and narrowing the search to a cpan module reduces the useful hits.)

Re: Strip specific html sequence
by haukex (Archbishop) on Dec 10, 2017 at 16:57 UTC

    Please see Parsing HTML/XML with Regular Expressions for why it is indeed not a good idea to do this without a proper parser, especially look at the "spoiler" for lots of cases of perfectly valid HTML that will not be fun to parse with a regex. Here's an example with Mojo::DOM:

    use warnings;
    use strict;
    use Mojo::DOM;

    my $html = <<'ENDHTML';
    <html><head><title>Title</title></head>
    <body>
    <div><div> </div></div><div><div class="blue"></div></div>
    </body>
    </html>
    ENDHTML

    my $dom = Mojo::DOM->new($html);
    $dom->find('div > div.blue')
        ->each(sub{ $_->parent->remove });
    print $dom;

    __END__
    <html><head><title>Title</title></head>
    <body>
    <div><div> </div></div>
    </body>
    </html>

    I had a quick look at "Git for Windows", and it happens to include HTML::Parser. In the above thread, tangent showed an example with that module here, and because it's a fairly old but good module you will find lots of examples with it online as well. That Git distribution also appears to contain cpan as well, so you could try installing Mojo::DOM.

      That's a lot of good news to take in; I could have looked for that first, eh? Many thanks. I get the hint and will abandon this path. I'm also late in realizing that I could follow another path. The HTML is Perl generated anyway, and this bad bit is generated by two separate lines, hence my supposed shortcut to clean them up afterwards. I can also investigate a look-ahead to prevent these bits being written.

Re: Strip specific html sequence
by poj (Abbot) on Dec 10, 2017 at 17:33 UTC
    open $HTML, '+<',...
    print $HTML $line;

    You can't edit an existing file like that; use separate file handles for reading and writing:

    #!perl
    use strict;
    use warnings;

    my $libFile = 'testfile';
    my $filename = $libFile.'.html';
    my $remove = '<div><div class="blue"><\/div><\/div>';

    open my $in, '<', $filename
      or die "Failed to read '$filename': $!";
    my @array = <$in>;
    close $in;

    my $matchspotter = 0;
    for ( @array ) {
      if ( s/$remove//g ) {
        ++$matchspotter;
      }
    }
    print "\n$matchspotter lines replaced";

    $filename = $libFile.'_chg.html';
    open my $out, '>', $filename
      or die "Failed to write '$filename': $!";
    print $out $_ for @array;
    close $out;
    poj
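The same read-from-one-handle, write-to-another idea can also be done line by line instead of slurping the whole file. What follows is only a sketch: the sub name `strip_sequence` is made up for illustration, and in-memory handles stand in for the real input and output files so the snippet runs on its own.

```perl
use strict;
use warnings;

# Copy lines from one handle to another, deleting every occurrence of
# $pattern on the way. Returns the number of lines that were changed.
# (strip_sequence is a hypothetical helper, not part of any module.)
sub strip_sequence {
    my ( $in, $out, $pattern ) = @_;
    my $changed = 0;
    while ( my $line = <$in> ) {
        # s///g returns the number of substitutions made (false if none).
        $changed++ if $line =~ s/$pattern//g;
        print {$out} $line;
    }
    return $changed;
}

# In-memory handles stand in for the real input and output files here.
my $source = qq{<p>keep</p>\n</div></div><div><div class="blue"></div></div>\n};
my $result = '';
open my $in,  '<', \$source or die "Cannot read from string: $!";
open my $out, '>', \$result or die "Cannot write to string: $!";

my $count = strip_sequence( $in, $out, qr{<div><div class="blue"></div></div>} );

close $in;
close $out or die "Cannot close output: $!";

print "$count line(s) changed\n";
print $result;
```

With real files, the two `open` calls would take file names instead of scalar references, as in the code above.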
Re: Strip specific html sequence
by 1nickt (Canon) on Dec 10, 2017 at 14:14 UTC

    Hi, you are using the match operator, which runs a match against $_ immediately and returns only its success or failure, where you want to use a string or a compiled regular expression in a variable assignment:

    perl -wE ' my $remove = q{<div><div class="blue"></div></div>}; my $str = q{</div></div><div><div class="blue"></div></div>}; say $str =~ s/$remove//r;'
    Or: (update2: ++Laurent_R pointed out that I had the quote operators reversed for string and substring in my OP):
    perl -wE ' my $remove = qr{<div><div class="blue"></div></div>}; my $str = q{</div></div><div><div class="blue"></div></div>}; say $str =~ s/$remove//r;'
    See http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators.

    Note that if you were using warnings Perl would have told you about this:

    perl -wE ' my $remove = m/<div><div class="blue"><\/div><\/div>/;'
    Use of uninitialized value $_ in pattern match (m//) at -e line 1.
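To make the difference concrete, here is a small sketch of what ends up in the variable in each case. The names `$as_match` and `$as_regex` are illustrative only, and the value given to `$_` is an assumed stand-in for whatever it happens to hold at that point in a real program.

```perl
use strict;
use warnings;

# Hypothetical stand-in for whatever $_ happens to hold at that point.
local $_ = 'no markup here';

# m// runs a match against $_ right away and returns its success flag,
# so this stores an empty string (false), not a pattern.
my $as_match = m/<div><div class="blue"><\/div><\/div>/;

# qr// compiles the pattern and stores it for later use.
my $as_regex = qr{<div><div class="blue"></div></div>};

my $line = '</div></div><div><div class="blue"></div></div>';

my $hit_with_regex = $line =~ $as_regex ? 1 : 0;
printf "m// stored: '%s'\n", $as_match;
printf "qr// matches the line: %d\n", $hit_with_regex;
```

An empty pattern is a special case in Perl (see perlop), which is probably why the original loop seemed to match in odd places once `$remove` was used.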

    (update) As for your second question, you have to either apply the substitution to a copy of the string (the parenthesized assignment below), or use the /r ("result") flag as I have shown above:

    $ perl -wE ' my $remove = q{foo}; my $str = q{barfoobaz}; say $str =~ s/$remove//;'
    1
    $ perl -wE ' my $remove = q{foo}; my $str = q{barfoobaz}; ( my $x = $str ) =~ s/$remove//; say $x;'
    barbaz
    $ perl -wE ' my $remove = q{foo}; my $str = q{barfoobaz}; say $str =~ s/$remove//r;'
    barbaz

    Also note that you can use any character as quote delimiters in order to avoid "leaning toothpick syndrome" (<\/div>).
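As a quick illustration of the delimiter point, the sketch below writes the same pattern three ways. The variable names are made up for the example, and the sample line is the one from the question.

```perl
use strict;
use warnings;

my $line = '</div></div><div><div class="blue"></div></div>';

# Same pattern three ways; only the first needs to escape the slashes.
my $with_slashes = $line =~ m/<div><div class="blue"><\/div><\/div>/ ? 1 : 0;
my $with_braces  = $line =~ m{<div><div class="blue"></div></div>}   ? 1 : 0;
my $with_bangs   = $line =~ m!<div><div class="blue"></div></div>!   ? 1 : 0;

# The substitution operator accepts alternative delimiters as well.
( my $stripped = $line ) =~ s{<div><div class="blue"></div></div>}{};

print "$with_slashes $with_braces $with_bangs\n";
print "$stripped\n";
```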

    Finally, also note that there are modules for working with HTML parsing and processing, and trying to do it yourself with regular expressions is not generally recommended as you are unlikely to anticipate and handle all the edge cases.

    Hope this helps!

    The way forward always starts with a minimal test.

      Thank you for replying. It has moved me forward a little. Upon using:

      my $remove = q{<div><div class="blue"></div></div>};

      the variable then works in the if statement.

      my $str = qr{$line};

      or

      my $str = q{$line};

      gives

      (?^:</div></div><div><div class="blue"></div></div> )

      to the console, and

      ( $line = $str ) =~ s/$remove//;

      gives

      (?^:</div></div> )

      You are right; I did get a warning before but misunderstood it. Now, adding r to the substitution gives another warning:

       Useless use of non-destructive substitution (s///r) in void context at lr.pl line 76.

      So I'm still in void context, which is bad, right? And I now have this

      (?^: )

      to learn about. I also tried using

      while (<$HTML>)

      with

      $_

      and writing to a separate file, which is getting warmer, actually removing some of the correct things, but leaving behind

      (?^:</div></div> )

      I'm also still using print because say doesn't work for me; it asks for a package. If that little lot prompts no further clues to anyone I shall read on; thanks for your time on this.

        my $remove = q{<div><div class="blue"></div></div>};

        Don't use quoted string constructors to make regex patterns; use  qr// (update: to make honest-to-goodness regex objects) (see perlop, perlre, perlretut, and perlrequick). Using ordinary quoted string constructors sets you up for future puzzling bugs.
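One example of the kind of puzzling bug meant here, as a sketch only: the double-quoted string and the word-boundary pattern are my own illustration, not taken from the thread. Inside a double-quoted string, `\b` is a backspace character, while inside `qr//` it is a word boundary.

```perl
use strict;
use warnings;

my $text = 'a cat sat';

# In a double-quoted string "\b" is a backspace character, so this
# "pattern" looks for backspace, 'cat', backspace.
my $from_string = "\bcat\b";

# In qr// the same characters are a word boundary assertion.
my $from_qr = qr/\bcat\b/;

my $string_hit = $text =~ /$from_string/ ? 1 : 0;
my $qr_hit     = $text =~ /$from_qr/     ? 1 : 0;

print "string pattern matched: $string_hit\n";
print "qr// pattern matched:   $qr_hit\n";
```

A single-quoted `q{}` string keeps most backslashes intact, so it often happens to work, but `qr//` removes the need to think about which quoting rules apply.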

        my $str = q{$line};

        This is a meaningless statement; it just assigns a literal  $line to a string:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $str = q{$line};
         print qq{'$str'};
        "
        '$line'

        my $str = qr{$line};

        The problem here is that you seem to be trying to make the entire line you've just read from the file into a pattern. You then remove a piece of the pattern with a substitution:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $remove = qr{ now \s+ brown }xms;
         ;;
         my $line = qq{how now brown cow \n};
         print qq{<$line>};
         ;;
         my $str = qr{$line};
         print $str;
         ;;
         ($line = $str) =~ s/$remove//;
         print qq{<$line>};
        "
        <how now brown cow
        >
        (?^:how now brown cow
        )
        <(?^:how cow
        )>
        Do you see where the extraneous  (?^: ... ) stuff comes from?

        Useless use of non-destructive substitution (s///r) in void context

        You have to use a  s///r substitution in a statement like
            my $new_line_changed = $old_line_not_changed =~ s/$remove//gr;
        (and I would recommend use of the  /g "global" modifier also).
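A short sketch of that statement in use, with and without `/g`. The sample line with two occurrences is invented for the example, and `/r` needs Perl 5.14 or newer.

```perl
use strict;
use warnings;

my $remove = qr{<div><div class="blue"></div></div>};

# Two occurrences on one line, to show what /g adds.
my $old_line = '<p>a</p><div><div class="blue"></div></div>'
             . '<p>b</p><div><div class="blue"></div></div>';

# /r returns the modified copy and leaves $old_line alone.
my $first_only = $old_line =~ s/$remove//r;
my $all_gone   = $old_line =~ s/$remove//gr;

print "$first_only\n";
print "$all_gone\n";
```

Because the result is assigned to a variable, the substitution is no longer in void context and the warning goes away.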

        Update: Changed variable names in last code example to (hopefully!) clarify the point being made.


        Give a man a fish:  <%-{-{-{-<

        Hi, I made an error in the second example I showed above (pointed out to me by ++Laurent_R). I'll correct it in my earlier post. I committed the copy-pasta sin :-(

        When you compile a regexp using qr{} and then print it as a string, you get the output you showed here:

        $ perl -wE 'my $x = qr{ foo }; say $x'
        (?^u: foo )
        But again, that was only output in your program because I had string and match reversed in my example.

        say can be enabled with -E on the command line for one-liners, or with use feature 'say'; or use 5.010; in your program. It requires Perl 5.10 or newer.
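For a script rather than a one-liner, enabling say might look like the sketch below. The in-memory handle is only there so the effect is easy to inspect; in a real script `say` would normally write straight to STDOUT or a file handle.

```perl
use strict;
use warnings;
use feature 'say';    # or: use 5.010;

# Write to an in-memory handle so the effect of say is easy to inspect.
my $buffer = '';
open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
say {$fh} 'barbaz';    # say appends the newline that print would not
close $fh or die "Cannot close in-memory handle: $!";

print $buffer;
```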


        The way forward always starts with a minimal test.
Re: Strip specific html sequence
by koober (Novice) on Dec 10, 2017 at 14:02 UTC

    I should add that the match is not at the beginning of a line. When used in the variable, it 'matches' many random line start characters, but ignores the match itself! A typical line of HTML I want to target is:

    </div></div><div><div class="blue"></div></div>

    The desired result is:

    </div></div>

      Break it down to just the lines you want and fix those in isolation. SSCCE:

      use strict;
      use warnings;
      use Test::More tests => 2;

      my $input    = '</div></div><div><div class="blue"></div></div>';
      my $expected = '</div></div>';
      my $remove   = '</div><div><div class="blue"></div>';

      like ($input, qr/$remove/, 'Match found'); # Just for illustration
      $input =~ s/$remove//;
      is ($input, $expected, 'Strip successful');

      Searching has suggested this is silly to try without a cpan solution

      Quite right. If you start to attempt to parse HTML in this fashion you are starting down a long and slippery slope which ends in the belly of the grue. Recant now while there is still time.