Strip specific html sequence

koober has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Strip specific html sequence by haukex (Archbishop) on Dec 10, 2017 at 16:57 UTC
Please see Parsing HTML/XML with Regular Expressions for why it is indeed not a good idea to do this without a proper parser, especially look at the "spoiler" for lots of cases of perfectly valid HTML that will not be fun to parse with a regex. Here's an example with Mojo::DOM:

    use warnings;
    use strict;
    use Mojo::DOM;

    my $html = <<'ENDHTML';
    <html><head><title>Title</title></head>
    <body>
    <div><div>
    </div></div><div><div class="blue"></div></div>
    </body>
    </html>
    ENDHTML

    my $dom = Mojo::DOM->new($html);
    $dom->find('div > div.blue')
        ->each(sub{ $_->parent->remove });
    print $dom;

    __END__
    <html><head><title>Title</title></head>
    <body>
    <div><div>
    </div></div>
    </body>
    </html>

I had a quick look at "Git for Windows", and it happens to include HTML::Parser. In the above thread, tangent showed an example with that module here, and because it's a fairly old but good module you will find lots of examples with it online as well. That Git distribution also appears to contain cpan, so you could try installing Mojo::DOM.
Re^2: Strip specific html sequence by koober (Novice) on Dec 10, 2017 at 17:43 UTC
That's a lot of good news to take in; I could have looked for that first, eh? Many thanks. I get the hint and will abandon this path. I'm also late in realizing that I could follow another path. The HTML is Perl-generated anyway; this bad bit is generated by two separate lines, hence my supposed shortcut of cleaning them up afterwards. I can also investigate a look-ahead to prevent these bits being written.
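The idea of not writing the empty block in the first place could be sketched roughly as follows. The generating code is not shown in this thread, so the helper name `blue_block` and the sample items are purely hypothetical; the sketch only assumes that the opening and closing tags can be emitted together once the content is known.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original program): wrap the given
# pieces of content in the blue div pair, or return an empty string
# when there is nothing to wrap, so no empty block is ever written.
sub blue_block {
    my (@items) = @_;
    return '' unless @items;
    return '<div><div class="blue">' . join('', @items) . '</div></div>';
}

my $empty = blue_block();
my $full  = blue_block('<p>one</p>', '<p>two</p>');

print "empty: '$empty'\n";
print "full:  '$full'\n";
```

Building the content first and emitting the wrapper in one step removes the need to clean up the output afterwards.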
Re: Strip specific html sequence by poj (Abbot) on Dec 10, 2017 at 17:33 UTC
    open $HTML, '+<',...
    print $HTML $line;

You can't edit an existing file like that; use separate file handles for reading and writing:

    #!perl
    use strict;
    use warnings;

    my $libFile  = 'testfile';
    my $filename = $libFile.'.html';
    my $remove   = '<div><div class="blue"><\/div><\/div>';

    # Read the whole input file into memory
    open my $in, '<', $filename
      or die "Failed to read '$filename': $!";
    my @array = <$in>;
    close $in;

    # Count the lines on which at least one substitution was made
    my $matchspotter = 0;
    for ( @array ) {
      if ( s/$remove//g ) {
        ++$matchspotter;
      }
    }
    print "\n$matchspotter lines replaced";

    # Write the result to a new file
    $filename = $libFile.'_chg.html';
    open my $out, '>', $filename
      or die "Failed to write '$filename': $!";
    print $out $_ for @array;
    close $out;

poj
Re: Strip specific html sequence by 1nickt (Canon) on Dec 10, 2017 at 14:14 UTC
Hi, you are using the match operator, which immediately runs a match against `$_`, where you want to assign a string or a compiled regular expression to a variable:

    perl -wE '
    my $remove = q{<div><div class="blue"></div></div>};
    my $str = q{</div></div><div><div class="blue"></div></div>};
    say $str =~ s/$remove//r;'

Or (update2: ++Laurent_R pointed out that I had the quote operators reversed for string and substring in my OP):

    perl -wE '
    my $remove = qr{<div><div class="blue"></div></div>};
    my $str = q{</div></div><div><div class="blue"></div></div>};
    say $str =~ s/$remove//r;'

See http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators.

Note that if you were using warnings Perl would have told you about this:

    perl -wE ' my $remove = m/<div><div class="blue"><\/div><\/div>/;'
    Use of uninitialized value $_ in pattern match (m//) at -e line 1.

(update) As for your second question: `s///` returns the number of substitutions made rather than the changed string, so you have to either copy the string and run the substitution on the copy, or use the `/r` ("result") flag as I showed above:

    $ perl -wE ' my $remove = q{foo}; my $str = q{barfoobaz}; say $str =~ s/$remove//;'
    1
    $ perl -wE ' my $remove = q{foo}; my $str = q{barfoobaz}; ( my $x = $str ) =~ s/$remove//; say $x;'
    barbaz
    $ perl -wE ' my $remove = q{foo}; my $str = q{barfoobaz}; say $str =~ s/$remove//r;'
    barbaz

Also note that you can use almost any character as quote delimiters in order to avoid "leaning toothpick syndrome" (`<\/div>`).

Finally, note that there are modules for parsing and processing HTML, and trying to do it yourself with regular expressions is not generally recommended, as you are unlikely to anticipate and handle all the edge cases.

Hope this helps!

The way forward always starts with a minimal test.
Re^2: Strip specific html sequence by koober (Novice) on Dec 10, 2017 at 17:27 UTC
Thank you for replying. It has moved me forward a little. Upon using:

    my $remove = q{<div><div class="blue"></div></div>};

the variable then works in the if statement.

`my $str = qr{$line};` or `my $str = q{$line};` gives

    (?^:</div></div><div><div class="blue"></div></div>
    )

to the console, and `( $line = $str ) =~ s/$remove//;` gives

    (?^:</div></div>
    )

You are right; I did get a warning before but misunderstood it. Now, adding r to the substitution gives another warning:

    Useless use of non-destructive substitution (s///r) in void context at lr.pl line 76.

So I'm still in void context, which is bad, right? And I now have this

    (?^: )

to learn about.

I also tried using `while (<$HTML>)` with `$_` and writing to a separate file, which is getting warmer, actually removing some of the correct things, but leaving behind

    (?^:</div></div>
    )

I'm also still using print because say doesn't work for me; it asks for a package. If that little lot prompts no further clues to anyone I shall read on; thanks for your time on this.
Re^3: Strip specific html sequence by AnomalousMonk (Archbishop) on Dec 10, 2017 at 18:15 UTC
    my $remove = q{<div><div class="blue"></div></div>};

Don't use quoted string constructors to make regex patterns; use `qr//` (update: to make honest-to-goodness regex objects) (see perlop, perlre, perlretut, and perlrequick). Using ordinary quoted string constructors sets you up for future puzzling bugs.

    my $str = q{$line};

This is a meaningless statement; it just assigns a literal `$line` to a string:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $str = q{$line};
     print qq{'$str'};
    "
    '$line'

    my $str = qr{$line};

The problem here is that you seem to be trying to make the entire line you've just read from the file into a pattern. You then remove a piece of the pattern with a substitution:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $remove = qr{ now \s+ brown }xms;
     ;;
     my $line = qq{how now brown cow \n};
     print qq{<$line>};
     ;;
     my $str = qr{$line};
     print $str;
     ;;
     ($line = $str) =~ s/$remove//;
     print qq{<$line>};
    "
    <how now brown cow
    >
    (?^:how now brown cow
    )
    <(?^:how  cow
    )>

Do you see where the extraneous `(?^: ... )` stuff comes from?

    Useless use of non-destructive substitution (s///r) in void context

You have to use a `s///r` substitution in a statement like

    my $new_line_changed = $old_line_not_changed =~ s/$remove//gr;

(and I would recommend use of the `/g` "global" modifier also).

Update: Changed variable names in last code example to (hopefully!) clarify the point being made.

Give a man a fish: `<%-{-{-{-<`
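The kind of puzzling bug that plain strings used as patterns can cause might look like the following sketch. The sample strings are made up for illustration and do not come from the original program; the point is only that metacharacters in an interpolated string are interpreted as regex syntax unless they are quoted with `\Q...\E`.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only to show the pitfall.
my $literal = 'price (USD)';            # contains regex metacharacters
my $text    = 'Total price (USD): 10';

# Interpolated as a raw pattern, the parentheses become a capture group,
# so the pattern means "price USD" and does not match the text at all.
my $raw_hits = () = $text =~ /$literal/g;

# qr// with \Q...\E quotes the metacharacters, so the text is matched literally.
my $remove = qr/\Q$literal\E/;
my $quoted = $text =~ s/$remove//gr;    # the /r flag needs Perl 5.14 or newer

print "raw pattern matched $raw_hits time(s)\n";
print "after literal removal: '$quoted'\n";
```

The HTML fragment in this thread happens to contain no metacharacters, which is why the plain string appears to work there.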
Re^3: Strip specific html sequence by 1nickt (Canon) on Dec 10, 2017 at 18:01 UTC
Hi, I made an error in the second example I showed above (pointed out to me by ++Laurent_R). I'll correct it in my earlier post. I committed the copy-pasta sin :-(

When you compile a regexp using `qr{}` and then print it as a string, you get the output you showed here:

    $ perl -wE 'my $x = qr{ foo }; say $x'
    (?^u: foo )

But again, that was only output in your program because I had string and match reversed in my example.

say can be enabled with `-E` on the command line for one-liners, or with `use feature 'say';` or `use 5.010;` in your program. It requires Perl 5.10 or newer.

The way forward always starts with a minimal test.
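Put together in a script rather than a one-liner, the advice above might look like this minimal sketch. It reuses the sample line from this thread; the variable name `$cleaned` is only illustrative.

```perl
use strict;
use warnings;
use feature 'say';    # enables say(); requires Perl 5.10 or newer

# Compile the sequence to remove as a regex object.
my $remove = qr{<div><div class="blue"></div></div>};
my $line   = q{</div></div><div><div class="blue"></div></div>};

# Copy the line first, then run the substitution on the copy,
# so the original line is left untouched.
(my $cleaned = $line) =~ s/$remove//g;

say $cleaned;
```

Here the regex object is used only as the pattern and the string only as the target, so no `(?^: ... )` wrapper ends up in the output.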
Re: Strip specific html sequence by koober (Novice) on Dec 10, 2017 at 14:02 UTC
I should add that the match is not at the beginning of a line. When used in the variable, it 'matches' many random line start characters, but ignores the match itself! A typical line of HTML I want to target is:

    </div></div><div><div class="blue"></div></div>

The desired result is:

    </div></div>
Re^2: Strip specific html sequence by hippo (Archbishop) on Dec 10, 2017 at 16:42 UTC
Break it down to just the lines you want and fix those in isolation. SSCCE:

    use strict;
    use warnings;
    use Test::More tests => 2;

    my $input = '</div></div><div><div class="blue"></div></div>';
    my $expected = '</div></div>';
    my $remove = '</div><div><div class="blue"></div>';

    like ($input, qr/$remove/, 'Match found'); # Just for illustration
    $input =~ s/$remove//;
    is ($input, $expected, 'Strip successful');

"Searching has suggested this is silly to try without a cpan solution"

Quite right. If you start to attempt to parse HTML in this fashion you are starting down a long and slippery slope which ends in the belly of the grue. Recant now while there is still time.