sherab has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks! I am parsing a webpage right now to make it 508 compliant. If an IMG tag has an alt element, I need to leave it alone but if it doesn't, it needs one added. Some of these IMG tags span a few lines and I have discovered that this works on tags that don't have alt ( Any options that change the document formatting like removing \n or \r are out for bureaucratic reasons )
$contents=~s!<IMG\s(.*?)>!<IMG $1 alt="text">!sig;

What I am stumped on is how to pull the equivalent of this off..
$contents=~s!<IMG\s(.*?)>!<IMG $1 alt="text">!sig if $1!~m/alt/!;

I believe there is a conditional regex needed here but all attempts at success are giving me a brutal headache.
Has anyone dealt with this kind of situation before?
Sherab

Re: Conditional regex
by ikegami (Patriarch) on May 08, 2009 at 20:18 UTC

    I am parsing a webpage right now to make it 508 compliant.

    Doesn't your parser have a means of locating IMG elements and, for each of them, adding an attribute if the attribute doesn't already exist? Looks like 3-4 lines of code.

    Update: For XML::LibXML, it would be something similar to the following snippets. I expect something similar from HTML parsers.

    for my $ele ($doc->findnodes('//img')) {
        if (!defined($ele->getAttribute('alt'))) {
            $ele->setAttribute(alt => ...);
        }
    }
    or even
    for my $ele ($doc->findnodes('//img[count(@alt)=0]')) {
        $ele->setAttribute(alt => ...);
    }
      I see your point but it's a regex question.

        Not really. Like you said so well, your real problem is

        I am parsing a webpage right now to make it 508 compliant. If an IMG tag has an alt element, I need to leave it alone but if it doesn't, it needs one added

        I see writing a regexp-based parser as your (broken) solution, not your problem. I'm not gonna make a lot of work for myself reinventing an HTML parser when I can skip that step and go straight to changing the HTML you want changed.

        You could do some nested matching to make this "work" with regular expressions, as you were trying above, but it is the wrong way to solve the problem for two reasons: it's as difficult as the related parser code, and unless you're Jeffrey Friedl, it will never work as well as the related parser code.

Re: Conditional regex
by kgish (Acolyte) on May 08, 2009 at 20:20 UTC
    Seems to me you are using the wrong approach. Wouldn't it make more sense to use something like HTML::Parser and just let it do the nasty work for you rather than struggling with complicated regular expressions?
      It would be but sadly security is very tight here (federal agency) and I don't believe that HTML::Parser is a core mod
        Did you even try to get HTML::Parser installed? There are a lot of useful modules not in core, without which Perl misses very basic usefulness, such as DBI (to name but one). Did some federal security agency check all the core modules and give them a "thumbs up"? I really doubt someone went to that effort.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Neither is any code we give you.
Re: Conditional regex
by oxone (Friar) on May 08, 2009 at 21:10 UTC
    Converting all the IMG tags without alt text to say alt="text" disregards the spirit of the 508 (accessibility) guidelines and is kind of offensive to the visually-impaired audience they are designed to help.

    Why? Hearing the screen reading software say "text" to describe every image they can't actually see isn't likely to make a user real glad you're "508 compliant".

    Any meaningful fix would need somebody 'editorial' to enter descriptive text for each different image in turn (clearly a manual task rather than a coding job).

    Appreciate you're probably just trying to code what you've been asked to, but you might want to do the decent thing and point out to your federal agency that this is really not helpful. You never know, if they agree then the task will be off your desk!

      On the flip side, you could convert all alt-less image tags to say alt="Nkuvu says missing tag here!" which would allow an editor to actually go through and find the tags that need to be manually updated. Essentially highlighting the TODO sections.

      Not that I expect this is the intent in this particular scenario, but I could see using a script to find all alt-less image tags.
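
      A rough sketch of such a finder script, using only core Perl. The sub name find_altless_imgs and the sample markup are made up for illustration, and the pattern assumes no attribute value contains a literal ">".

```perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, tag_text] for every IMG tag
# that has no alt attribute. Tags may span several lines.
sub find_altless_imgs {
    my ($html) = @_;
    my @found;
    while ( $html =~ /(<img\b[^>]*>)/gi ) {
        # Save the tag and its start offset before running another match.
        my ( $tag, $start ) = ( $1, $-[1] );
        next if $tag =~ /\salt\s*=/i;
        # Line number of the tag's start: count the newlines before it.
        my $before = substr( $html, 0, $start );
        my $line   = 1 + ( $before =~ tr/\n// );
        push @found, [ $line, $tag ];
    }
    return @found;
}

my $sample = qq{<p>\n<IMG SRC="a.gif">\n<IMG SRC="b.gif"\n  ALT="logo">\n<img src="c.gif">\n</p>\n};
printf "line %d: %s\n", @$_ for find_altless_imgs($sample);
```

      It only reports; nothing in the document is changed, so an editor can work through the list and write real descriptions.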

Re: Conditional regex
by elTriberium (Friar) on May 08, 2009 at 23:19 UTC
    Why would you need a conditional regex? Why not just use something like:
    $contents=~s!>$! alt="text">!sig if $contents =~ m/<IMG/ && $contents !~ m/alt/;
    I'm pretty sure this can be simplified, I didn't try to.
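
    Note that the test above looks at the whole of $contents, so a single alt anywhere in the page stops every tag from being changed. One possible way to make the check per tag is a negative lookahead inside the substitution. This is only a sketch: the sample markup is invented, "text" is the placeholder from the question, and it assumes no attribute value contains ">" and that tags are not written XHTML-style with a trailing "/".

```perl
use strict;
use warnings;

my $contents = qq{<IMG SRC="a.gif">\n<IMG SRC="b.gif"\n     ALT="logo">\n};

# Match an IMG tag only when no alt attribute appears before its closing ">".
# $1 keeps the tag name as written, $2 keeps the existing attributes,
# including any line breaks inside the tag.
( my $fixed = $contents ) =~
    s{(<IMG\b)(?![^>]*\salt\s*=)([^>]*)>}{$1$2 alt="text">}gi;

print $fixed;
```

    Nothing but the new attribute is inserted, so line breaks inside multi-line tags stay where they were.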
Re: Conditional regex
by John M. Dlugosz (Monsignor) on May 08, 2009 at 20:54 UTC
    My first reaction: don't attempt to parse HTML with regex's, from scratch.

    If you must for some reason, I did see code for a complete grammar somewhere...ah, XML::Easy::Syntax.

    —John

      While there exists an XML serialization of HTML (XHTML), the OP is not using it (as indicated by uppercase "IMG").
        But if he already identified a complete opening tag for the IMG, the code in there might help him break it up into attributes. Or, if he has trouble reliably finding a complete opening tag, that code might help him do that. It's not a complete parser, but a list of canned regex's doing what he asked for. And they can be further modified.
Re: Conditional regex
by linuxer (Curate) on May 08, 2009 at 21:13 UTC

    Although nowadays I'd prefer to use HTML::Parser or something similar, I think you could use a subroutine call in the replacement part (and use the /e modifier).

    sub foo {
        my $bar = shift;
        if ( $CONDITION ) {
            # modify $bar if you need to
        }
        return $bar;
    }
    $text =~ s{(<img.+?>)}{foo($1)}sige;

    update1: changed quantifier to +?
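
    A filled-in version of that idea might look like the following. It is a sketch only: add_alt is a made-up name, the empty alt="" value is a placeholder, and the pattern assumes no attribute value contains ">".

```perl
use strict;
use warnings;

# Hypothetical callback: return the tag unchanged if it already has an alt
# attribute, otherwise insert one just before the closing ">".
sub add_alt {
    my ($tag) = @_;
    return $tag if $tag =~ /\salt\s*=/i;
    $tag =~ s/>\z/ alt="">/;
    return $tag;
}

my $text = qq{<img src="a.gif">\n<IMG SRC="b.gif"\n     ALT="logo">\n};

# /e evaluates the replacement as Perl code, so each matched tag is
# passed through add_alt() and replaced by whatever it returns.
$text =~ s{(<img\b[^>]*>)}{add_alt($1)}gie;

print $text;
```

    Keeping the condition in a sub means the test for "already has alt" can grow more careful later without touching the substitution itself.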

Re: Conditional regex
by Anonymous Monk on May 09, 2009 at 01:52 UTC
      The solution I discovered is this...

      $contents =~ s/(<IMG\s.*?>)/&addalt($1)/sige;
      $contents =~ s!\>alt=\"\"! alt=\"\"\>!sig;
      print $contents;

      sub addalt {
          my $returned=shift;
          if ($returned!~m/\salt/){
              $returned=$returned.'alt=""';
          }
          return $returned;
      }
      The key I discovered was the "e" switch on the regex.