ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to write a regex that will do something like:

sub {
    my $content = $_[0];
    $content =~ s|\Q<img src="\E(.*)\Q" />|<span class="large-image-right-landscape"><img src="$1" /></span>|sig;
    return $content;
}

A few sample lines of $content:

<span class="large-image-right-landscape"><a href="http://www.pokerlistings.com/pop-photo?id=the-paper_25140&amp;height=625&amp;width=700" class="thickbox"><img src="the-truth-about-bad-beat-jackpots_files/the-paper-25140.jpg" alt="The Paper!"></a></span>
<span>
<strong>When you first enter a live poker room, it's hard to miss the display flashing the current size of the bad beat jackpot. But remember: a guaranteed bonus is always worth more than the hope of a jackpot.</strong>
</span>
<br><br>

Does anyone have any suggestions?

NB: There *could* be more than one image per line, and the img tag could also have any of these attributes:

style="" class="" text="" width="" height=""

TIA for any ideas - I'm at a bit of a roadblock :(

Andy

Replies are listed 'Best First'.
Re: Regex to "wrap" a <span around an image.
by JavaFan (Canon) on Nov 27, 2008 at 11:54 UTC
    I assume your question is, "given an IMAGE element (tag) in an HTML document, how do I put it inside a SPAN element". I'll just ignore the fact you want to do it with a regexp, and suggest you use one of the various HTML parsing modules on CPAN. Given an HTML parser, modifying a document should be easy.
      Hi,

      Yeah, that's pretty much the gist of it :) (it's really just so we can assign classes to "spans", so we can make them look nice =))

      Any suggestions as to which Perl modules?

      TIA

      Andy
Re: Regex to "wrap" a <span around an image.
by wfsp (Abbot) on Nov 27, 2008 at 14:35 UTC
    #!/usr/bin/perl
    use warnings;
    use strict;

    #use lib q{c:/www/lib};
    #use SW::Debug;
    use HTML::TreeBuilder;

    my $html = <<HTML;
    <a href="page.html">
    <img src="pic.jpg">
    </a>
    <span>other stuff</span>
    <img alt="pic" width="10" height="10" src="pic.jpg">
    <span>more</span>
    HTML

    my $tb = HTML::TreeBuilder->new_from_content($html);

    my @images = $tb->look_down(_tag => q{img})
      or die qq{look_down for img failed: $!\n};

    for my $img (@images){
      $img->replace_with([q{span}, {id => q{fo_big}}, $img]);
    }
    print $tb->as_HTML;

    <html>
    <head>
    </head>
    <body>
    <a href="page.html">
    <span id="fo_big">
    <img src="pic.jpg" />
    </span>
    </a>
    <span>other stuff</span>
    <span id="fo_big">
    <img alt="pic" height="10" src="pic.jpg" width="10" />
    </span>
    <span>more</span>
    </body>
    </html>
    (blew some whitespace in)

    The html and body tags are added by HTML::TreeBuilder. If you parse an HTML file they'll already be there.

    update: commented out a stray "helper" module (not used by the code)

      Hi,

      Thanks for the example code. Just getting my host to install that module, and then will give that code a test run :)

      Thanks!

      Andy
Re: Regex to "wrap" a <span around an image.
by oko1 (Deacon) on Nov 27, 2008 at 14:37 UTC

    Assuming that you're looking for every image name within the file:

    # On the commandline
    perl -0wne'print "$_\n" for /img\s+src="([^"]+)"/igsm' filename

    If you're doing this within a script:

    # Since HTML tags can be split into multiple lines, you need all the
    # content as one string
    open Fh, "<", "filename" or die "filename: $!\n";
    my $file = do { local $/; <Fh> };
    close Fh;

    for ($file =~ /img\s+src="([^"]+)"/igsm){
        print "$_\n";
    }

    Update: Whoops, I appear to have misread what was being asked, and gave a solution to something different. As JavaFan said, the right answer is to use one of the available modules.


    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
Re: Regex to "wrap" a <span around an image.
by monarch (Priest) on Nov 27, 2008 at 22:13 UTC

    I tend to stick to regexps for simple things like this and only move to the CPAN modules when processing of nested tags is required. Image tags in HTML are complete and cannot contain any other tags within. So, a regexp solution:

    my $class = "large-image-right-landscape";
    $content =~ s{
        (           # start capture
          <img\s    # opening of img tag
          [^>]*     # everything to the tag end
          >         # tag end
        )           # end capture
    }{ <span class="$class">$1</span> }sigx;

    The /sigx flags mean: let . match newlines (s; a no-op here, since the pattern contains no .), ignore case (i), perform the replacement globally on the string (g), and use the layout given above with comments interspersed (x).
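
For anyone who wants to try the pattern on the case from the question (more than one image per line, extra attributes on the tag), here is a minimal sketch using core Perl only. The sub name `wrap_images` and the sample line are made up for illustration and are not from the thread; the replacement is written without surrounding whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wraps every <img ...> tag in a
# <span> carrying the given class, using the pattern from the reply above.
sub wrap_images {
    my ($content, $class) = @_;
    $content =~ s{
        (           # start capture
          <img\s    # opening of img tag
          [^>]*     # everything up to the tag end
          >         # tag end
        )           # end capture
    }{<span class="$class">$1</span>}sigx;
    return $content;
}

# Made-up sample line: two images, the second with extra attributes.
my $sample  = '<img src="a.jpg" /> text <IMG class="x" width="10" src="b.jpg">';
my $wrapped = wrap_images($sample, 'large-image-right-landscape');
print "$wrapped\n";
```

Note that the pattern does not look at the surrounding markup, so an image that already sits inside a span (as in the sample from the question) would get a second wrapper.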

      You know that your regex will fail?

      Example HTML code snippet:

      <img src="greater.gif" alt=">">

      Yes, it's not nice to put an unescaped > there, but it's legal, and I see it quite often when dealing with JSF and the Woodstock tag library.
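
If a regex is still preferred, one possible way around this particular case is to step over quoted attribute values as whole units, so that a > inside quotes does not end the match. The sketch below is only an illustration under that assumption: the sub name `wrap_images_quoted` is hypothetical, and the pattern is still not an HTML parser (it will also match text that looks like an img tag inside comments or scripts).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wraps each <img ...> tag in a
# <span>, treating quoted attribute values as units so that a ">" inside
# quotes does not terminate the tag early.
sub wrap_images_quoted {
    my ($content, $class) = @_;
    $content =~ s{
        (                 # start capture
          <img\b          # opening of img tag
          (?:
              "[^"]*"     # a double-quoted value, which may contain >
            | '[^']*'     # a single-quoted value, which may contain >
            | [^>"']      # any other character except > and quotes
          )*
          >               # tag end
        )                 # end capture
    }{<span class="$class">$1</span>}sigx;
    return $content;
}

# The problem case from the reply above.
my $tricky = '<img src="greater.gif" alt=">">';
print wrap_images_quoted($tricky, 'large-image-right-landscape'), "\n";
```

With the simpler `[^>]*` pattern, the match for this input would stop at the > inside the alt value, leaving a stray `">` after the closing span.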


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        Ah, I never saw that before. Thanks for pointing that one out! Technically, shouldn't that be encoded as <img src="..." alt="&gt;">? I think you'll find my solution shouldn't fail unless you use data from people determined to make your life miserable.