Regular expression question

prenaud has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regular expression question by choroba (Cardinal) on Sep 26, 2019 at 11:09 UTC
Regular expressions can't parse HTML reliably. It's safer to parse the HTML with a proper parser and remove the element there. For example, in XML::XSH2, you can do `open :r :F html file.html ; delete //svg ; save :F html :b ;`
Re: Regular expression question by marto (Cardinal) on Sep 26, 2019 at 11:28 UTC
I'd second the suggestion of using a parser over regular expressions for this. Mojo::DOM makes this fairly painless:

    use strict;
    use warnings;
    use Mojo::DOM;

    # Either load from file, URL, whatever, assign the html to a variable
    my $html = 'HTML here';
    my $dom  = Mojo::DOM->new( $html );

    # find all svgs, remove each one found:
    $dom->find('svg')->each( sub { $_->remove } );

    # print to screen, or whatever you want to do
    print "$dom\n";
Re: Regular expression question by hippo (Archbishop) on Sep 26, 2019 at 11:22 UTC
choroba is right (++). For something as complex as an `<svg>` element and its subtree, using regex will just be a rabbit hole. There are plenty of proper parsers around, so choose one of those instead. But do have a read of the literature, e.g. Parsing HTML The Cthulhu Way, for an appreciation of doing things properly and of when you don't have to.
Re: Regular expression question by clueless newbie (Curate) on Sep 27, 2019 at 00:12 UTC
Have you looked at the Marpa::R2::HTML parser?

    use 5.014;
    use warnings;
    use Marpa::R2::HTML qw(html);

    my $with_table = <<"HTML";
    <!DOCTYPE html>
    <html>
    <body>
    <svg width="100" height="100">
    <circle cx="50" cy="50" r="40" stroke="green" stroke-width="4" fill="yellow" />
    </svg>
    </body>
    </html>
    HTML

    my $no_table = html( \$with_table, { svg => sub { return q{} } });
    say $$no_table;

yields

    <!DOCTYPE html>
    <html>
    <body>
    </body>
    </html>

It's pretty tolerant: if we drop the closing `</svg>` tag ... well, it will return the same answer.
Re: Regular expression question by FreeBeerReekingMonk (Deacon) on Sep 26, 2019 at 19:39 UTC
Quick'n'dirty, ey? Not my problem if you get in trouble, ok? Data file:

    $ cat svg.html
    <html> this <svg>Lorem Ipsum<foo>bar</foo> asd</svg>IS SPARTA <svg some=property>:{P</svg> </html>

    $ perl -0 -pi.bak -e 's{<svg[^>]*>.*?</svg>}{}sgi' svg.html ; cat svg.html
    <html> this IS SPARTA </html>
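One way the trouble mentioned above can show up, as a minimal sketch: SVG allows an `<svg>` element to contain another `<svg>`, and a non-greedy match stops at the first `</svg>` it meets. The sample markup and the variable names below are made up for illustration, and the pattern is assumed to be the non-greedy `<svg ...> ... </svg>` form used in this thread. It needs Perl 5.14 or later for the `/r` modifier.

```perl
use strict;
use warnings;

# Hypothetical input: an outer <svg> that contains a nested <svg>.
my $html = '<p>before</p><svg><svg><circle r="1"/></svg><rect/></svg><p>after</p>';

# Non-greedy match from an opening <svg ...> to the nearest </svg>.
# /r returns the modified copy and leaves $html untouched.
my $stripped = $html =~ s{<svg[^>]*>.*?</svg>}{}gsir;

# The match ends at the inner </svg>, so the tail of the outer element survives.
print "$stripped\n";    # prints: <p>before</p><rect/></svg><p>after</p>
```

A parser tracks the element tree, so it removes the outer element together with everything nested inside it.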
Re: Regular expression question by Anonymous Monk on Sep 26, 2019 at 19:11 UTC
    #!/usr/bin/perl
    # Using perl to remove svg tags from html
    # Because someone asked for an example
    # Not because this is advisable
    use strict;
    use warnings;

    my $htm = do { local $/; <DATA> };

    $htm =~ s,
        <svg      # begin svg tag
        [^>]*     # svg tag attributes
        >         # end svg tag
        .*?       # inside svg tag (BRITTLE!)
        </svg>    # close svg tag
    ,,gsx;        # replace with nothing g=global, s=single-line, x=extended

    print $htm;

    __DATA__
    <!DOCTYPE html>
    <html>
    <body>
    <svg width="100" height="100">
    <circle cx="50" cy="50" r="40" stroke="green" stroke-width="4" fill="yellow" />
    </svg>
    </body>
    </html>
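To make the "BRITTLE!" remark concrete, here is a small sketch that runs the same pattern over two inputs it handles badly. The inputs, the hash names and the `$svg_re` variable are hypothetical and not from the thread. It needs Perl 5.14 or later for the `/r` modifier.

```perl
use strict;
use warnings;

# The pattern from the script above, compiled once (no /i, as posted).
my $svg_re = qr{
    <svg        # begin svg tag
    [^>]*       # svg tag attributes
    >           # end svg tag
    .*?         # inside svg tag
    </svg>      # close svg tag
}sx;

# Hypothetical inputs chosen to trip the pattern.
my %input = (
    upper_case => '<p>a</p><SVG><circle/></SVG><p>b</p>',
    in_comment => '<p>a</p><!-- <svg> --><p>b</p><svg></svg>',
);

my %result;
for my $name (sort keys %input) {
    $result{$name} = $input{$name} =~ s/$svg_re//gr;
    print "$name: [$result{$name}]\n";
}
# in_comment: [<p>a</p><!-- ]
#   the match starts inside the comment and swallows <p>b</p>
# upper_case: [<p>a</p><SVG><circle/></SVG><p>b</p>]
#   nothing is removed, because the pattern is case-sensitive
```

Adding `/i` fixes the first problem only. The comment case needs something that understands HTML structure, which is the point the parser answers make.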