prenaud has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a question about regular expressions. I have a file containing a web page and I want to remove all the text between the <svg> tag and the </svg> tag, but I can't find a regular expression that matches it. Could you help me? Thanks in advance for your help. Patrick

Re: Regular expression question
by choroba (Cardinal) on Sep 26, 2019 at 11:09 UTC
    Regular expressions can't parse HTML reliably. It's safer to parse the HTML with a proper parser and remove the element there. For example, in XML::XSH2, you can do
    open :r :F html file.html ; delete //svg ; save :F html :b ;
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Regular expression question
by marto (Cardinal) on Sep 26, 2019 at 11:28 UTC

    I'd second the suggestion of using a parser over regular expressions for this. Mojo::DOM makes this fairly painless:

    use strict;
    use warnings;
    use Mojo::DOM;

    # Either load from file, URL, whatever, assign the html to a variable
    my $html = 'HTML here';
    my $dom  = Mojo::DOM->new( $html );

    # find all svgs, remove each one found:
    $dom->find('svg')->each( sub{ $_->remove } );

    #print to screen, or whatever you want to do
    print "$dom\n";
Re: Regular expression question
by hippo (Archbishop) on Sep 26, 2019 at 11:22 UTC

    choroba is right (++). For something as complex as an <svg> element and its subtree, using regex will just be a rabbit hole. There are plenty of proper parsers around, so choose one of those instead. But do have a read of the literature, eg: Parsing HTML The Cthulhu Way for an appreciation of doing things properly and of when you don't have to.
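
To make the "rabbit hole" concrete, here is a minimal sketch of two inputs on which a non-greedy substitution goes wrong. The helper name `strip_svg_naive` and both sample strings are made up for illustration; the pattern is just the kind of regex usually proposed for this task, not anyone's recommended solution.

```perl
use strict;
use warnings;

# Hypothetical helper: the naive "remove everything between the tags" regex.
sub strip_svg_naive {
    my ($html) = @_;
    $html =~ s{<svg[^>]*>.*?</svg>}{}sgi;
    return $html;
}

# Case 1: nested <svg> elements (SVG allows this). The non-greedy match
# stops at the first </svg>, so the tail of the outer element is left behind.
my $nested = '<p>a</p><svg><svg><circle r="1"/></svg><rect/></svg><p>b</p>';
print strip_svg_naive($nested), "\n";
# prints: <p>a</p><rect/></svg><p>b</p>

# Case 2: the text "</svg>" inside a comment. The regex cannot tell a
# comment from markup, so it stops in the middle of the comment.
my $comment = '<p>a</p><svg><!-- closes with </svg> --><circle/></svg><p>b</p>';
print strip_svg_naive($comment), "\n";
# prints: <p>a</p> --><circle/></svg><p>b</p>
```

In both cases the result is broken markup with a stray `</svg>`, and the substitution reports no error. A parser works on the element tree rather than on the raw text, so it is not affected by either input.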

Re: Regular expression question
by clueless newbie (Curate) on Sep 27, 2019 at 00:12 UTC
    Have you looked at the Marpa::R2::HTML parser?
    use 5.014;
    use warnings;
    use Marpa::R2::HTML qw(html);

    my $with_table = <<"HTML";
    <!DOCTYPE html>
    <html>
    <body>
    <svg width="100" height="100">
    <circle cx="50" cy="50" r="40" stroke="green" stroke-width="4" fill="yellow" />
    </svg>
    </body>
    </html>
    HTML

    my $no_table = html( \$with_table, { svg => sub { return q{} } });
    say $$no_table;
    yields
    <!DOCTYPE html> <html> <body> </body> </html>
    It's pretty tolerant --- if we drop the closing </svg> tag ... well, it will return the same answer.
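
For contrast, a small sketch of what a plain substitution does with the same kind of input when the closing tag is missing. The sample document and variable names are invented for this illustration, and the pattern is the usual non-greedy one, used here only as an assumption about what a regex attempt would look like.

```perl
use strict;
use warnings;

# Sample input with the closing </svg> tag dropped (made up for this sketch).
my $html = <<'HTML';
<body>
<svg width="100" height="100">
<circle cx="50" cy="50" r="40" />
</body>
HTML

# The pattern requires a literal </svg>, so without one it never matches.
(my $stripped = $html) =~ s{<svg[^>]*>.*?</svg>}{}sgi;

print $stripped eq $html ? "unchanged\n" : "changed\n";
# prints: unchanged
```

The regex fails silently here: the svg markup stays in the output and nothing warns you about it, whereas a tolerant parser can still recognise the element.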
Re: Regular expression question
by FreeBeerReekingMonk (Deacon) on Sep 26, 2019 at 19:39 UTC
    Quick'n'dirty, ey? Not my problem if you get in trouble, ok?

    Data file:

    $ cat svg.html
    <html> this <svg>Lorem Ipsum<foo>bar</foo> asd</svg>IS SPARTA <svg some=property>:{P</svg> </html>

    $ perl -0 -pi.bak -e 's{<svg[^>]*>.*?</svg>}{}sgi' svg.html ; cat svg.html
    <html> this IS SPARTA </html>
Re: Regular expression question
by Anonymous Monk on Sep 26, 2019 at 19:11 UTC
    #!/usr/bin/perl
    # Using perl to remove svg tags from html
    # Because someone asked for an example
    # Not because this is advisable
    use strict;
    use warnings;

    my $htm = do { local $/; <DATA> };

    $htm =~ s,
        <svg     # begin svg tag
        [^>]*    # svg tag attributes
        >        # end svg tag
        .*?      # inside svg tag (BRITTLE!)
        </svg>   # close svg tag
    ,,gsx;       # replace with nothing g=global, s=single-line, x=extended

    print $htm;

    __DATA__
    <!DOCTYPE html>
    <html>
    <body>
    <svg width="100" height="100">
    <circle cx="50" cy="50" r="40" stroke="green" stroke-width="4" fill="yellow" />
    </svg>
    </body>
    </html>