Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Monks,

As part of a script I'm writing, I'm given a scalar value (call it $file) that contains the entire contents of an HTML file, newlines and all. Why as a scalar? Don't ask, I don't get to write that part of the code. =) Anyway, what I need to do is find one particular string in the file (an image tag), which looks like this:

<img src="/some/longish/path/image2002[about six unknown characters here]607.gif">

So I was thinking that a good way to do this would be to use tr to strip off everything except that particular string. I.e., after tr is called, $file will be the above line. Unfortunately I'm not very good at regular expressions yet. Can anyone help? I'd really appreciate it!

Replies are listed 'Best First'.
Re: RegEx/tr Help
by talexb (Chancellor) on Jun 07, 2002 at 19:57 UTC
    I'd suggest a regular expression, not tr. You probably want to do something like
    my @ImageTagList = ( $HTMLPage =~ m!<img src="/some/longish/path/image2002(\d+)607.gif">!gi );
    The g modifier looks for all occurrences, and the i modifier ignores case (ref.: p.150, Camel).

    --t. alex

    "Nyahhh (munch, munch) What's up, Doc?" --Bugs Bunny

      Hmmm...good idea! I think this code returns the missing 6ish characters though, not the entire image tag.
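
      To get the whole tag back instead of only the variable part, one option is to move the capturing parentheses so they surround the entire pattern. A minimal sketch, assuming the unknown characters are word characters (hence \w+ rather than \d+) and using a made-up sample page in place of the real $HTMLPage:

```perl
use strict;
use warnings;

# Hypothetical sample page; the real $HTMLPage comes from elsewhere.
my $HTMLPage = qq{<html>\n<body>\n<img src="/some/longish/path/image2002abc123607.gif">\n</body>\n</html>\n};

# The parentheses now surround the whole tag, so list context returns full tags.
# \w+ is an assumption about the unknown characters; \. matches a literal dot.
my @ImageTagList = ( $HTMLPage =~ m!(<img src="/some/longish/path/image2002\w+607\.gif">)!gi );

print "$_\n" for @ImageTagList;
```

      With /g in list context, every matching tag ends up in @ImageTagList, one element per match.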
Re: RegEx/tr Help
by insensate (Hermit) on Jun 07, 2002 at 20:14 UTC
    I'm not sure if you're wanting to extract the entire tag or the image name path by itself. Here's what I would do.
    for($file){ /(<img src=\"(?:\/\w+)+\/image2002\w{6}607\.gif\">)/; $tag=$1; }
    Then do whatever you'd like with $tag: push it onto an array, etc. You say about six characters; the \w{6} will match exactly six word characters, no more and no fewer. You can specify a range of counts in the {}, so {4,7} would match at least 4 but no more than 7 characters. If you're not concerned with the entire tag, just the path to the image, change the location of the outer (), which capture the match into $1, to be like so:

    /<img src=\"((?:\/\w+)+\/image2002\w{6}607\.gif)\">/;

    ...or if you just want the name of the file:

    /<img src=\"(?:\/\w+)+\/(image2002\w{6}607\.gif)\">/;
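
    The effect of {6} versus a range such as {4,7}, and of moving the capturing parentheses, can be checked with a small sketch. The two sample tags below are made up for illustration; one has six unknown characters and the other has five:

```perl
use strict;
use warnings;

# Made-up sample tags: the first has six unknown characters, the second has five.
my $six  = '<img src="/some/longish/path/image2002abcdef607.gif">';
my $five = '<img src="/some/longish/path/image2002abcde607.gif">';

# \w{6} requires exactly six word characters before 607.gif.
my $exact_six  = $six  =~ /image2002\w{6}607\.gif/ ? 1 : 0;
my $exact_five = $five =~ /image2002\w{6}607\.gif/ ? 1 : 0;

# \w{4,7} accepts anywhere from four to seven word characters.
my $range_five = $five =~ /image2002\w{4,7}607\.gif/ ? 1 : 0;

# Moving the parentheses changes what lands in $1: here, just the file name.
my ($name) = $six =~ /<img src="(?:\/\w+)+\/(image2002\w{6}607\.gif)">/;

print "six with {6}: $exact_six, five with {6}: $exact_five, five with {4,7}: $range_five\n";
print "file name: $name\n";
```

    If "about six" really means the length can vary, the range form is probably the safer choice.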

    Hope this helps... -Jason
Re: RegEx/tr Help
by boo_radley (Parson) on Jun 07, 2002 at 20:09 UTC

    You've seriously misconstrued tr's functionality. As a simple overview, tr takes two character lists and replaces each character found in the first list with the corresponding character in the second list. There's a lot more to it, but that's the basic operation.
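
    A short sketch of what tr actually does may make the point clearer. The sample string is made up; the thing to notice is that tr works on individual characters and has no notion of patterns or substrings:

```perl
use strict;
use warnings;

my $text = 'image2002.gif';

# tr maps each character in the first list to the character at the same
# position in the second list; here, lowercase letters to uppercase.
(my $upper = $text) =~ tr/a-z/A-Z/;

# With an empty second list, tr changes nothing and just returns the number
# of characters it found; here, the digits.
my $digits = ($text =~ tr/0-9//);

print "$upper\n";    # IMAGE2002.GIF
print "$digits\n";   # 4
```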

    What you may wish to try is using the match operator (m//) with the s modifier turned on, which lets . match newlines and so allows a pattern that uses . to match past the first newline.
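
    As a rough illustration of the s modifier, here is a sketch with a made-up fragment in which the tag happens to be split across two lines. Only the behaviour of . changes; character classes such as [^"] match newlines either way:

```perl
use strict;
use warnings;

# Made-up fragment in which the tag is split across two lines.
my $file = qq{<p>\n<img\nsrc="/some/path/pic.gif">\n</p>\n};

# Without /s, "." refuses to match the newline inside the tag.
my $without_s = $file =~ /<img.src="[^"]*">/  ? 1 : 0;

# With /s, "." matches the newline too.
my $with_s    = $file =~ /<img.src="[^"]*">/s ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
```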

    Alternately, there's a bevy of HTML parsing and extraction modules on CPAN which you may find of use as well.


    Update: I'm just baffled at the responses to this node. Can anyone provide a working example showing that tr is a possibility in this situation?
Re: RegEx/tr Help
by Popcorn Dave (Abbot) on Jun 07, 2002 at 23:03 UTC
    There's been a lot of good advice given here but I'm going to add my 2 cents anyway : )

    I'm doing something similar with parsing HTML on my own since I'm looking at some very different HTML pages so I've had to build my own rules, but anyway...

    For your problem I'm going to assume that you're looking for one specific part of your stream.

    ($foundfile) = $file =~ m!(<img src="/ your path here /[^>]*>)!i;

    This (untested) bit of code should match what you're looking for, since you're working on the whole stream at once rather than dumping it into an array and processing it line by line. However, if you're looking for every occurrence of an image tag, change !i to !ig and collect the matches in an array; that should do the trick.

    I think that you could probably drop the path from the pattern entirely as well, since you're just looking for <img src=" ">.
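
    Both variants can be sketched as follows. The sample page and its paths are made up for illustration, and the pattern assumes src is the first attribute in the tag:

```perl
use strict;
use warnings;

# Made-up page with two image tags; only one lives under the path of interest.
my $file = qq{<html>\n<IMG SRC="/other/logo.gif">\n<img src="/some/longish/path/image2002abc123607.gif" alt="x">\n</html>\n};

# [^>]* swallows everything up to the closing bracket of each tag;
# /g in list context returns every match, /i ignores case.
my @all_tags = $file =~ m!(<img src="[^>]*>)!ig;

# Keeping the path in the pattern narrows the result to the tag we want.
my ($foundfile) = $file =~ m!(<img src="/some/longish/path/[^>]*>)!i;

print scalar(@all_tags), " image tags found\n";
print "$foundfile\n";
```

    Assigning the match to a list, as in ($foundfile) = ..., leaves $foundfile undefined when nothing matches instead of picking up a stale $1.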

    Hope that helps!

    Some people fall from grace. I prefer a running start...