JoshuaD has asked for the wisdom of the Perl Monks concerning the following question:

I have a little regex problem:
while (<ASX>) {
    next unless m{<REF HREF=(.*)/>};
    $address = $1;
    $address =~ s/\"//g;
    print PLAYLIST $address, "\n\n";
}
Is there any way to get rid of that temporary value $address? I've tried:
next unless m{<REF HREF=\"?(.*)\"?/>};
which doesn't work because * is greedy, so I tried:
next unless m{<REF HREF=\"?(.*?)\"?/>};
Which should work, considering *? is lazy and ? is greedy, but it still doesn't. Can one of you monks shed some light on this?

edit: A typical line I'd be matching is
<REF HREF="mms://<some_address>.(wma|wmv|maybeSomethingElse)" />
But I'm not sure if it will always have quotes in it.

janitored by ybiC: Retitle from less-than-descriptive "Regex Question"

Replies are listed 'Best First'.
Re: Matching optionally quoted string
by etcshadow (Priest) on Dec 14, 2003 at 16:48 UTC
    next unless m{<REF HREF=\"?([^"]*)\"?\s*/>};
    ought to work. Two key differences: disallowing quotes in the (.*), and allowing for whitespace between the close quote and the close of the XML tag.

    Of course, all the standard caveats about trying to parse XML (or HTML, etc) with regular expressions apply. That is, it may work here, but don't expect it to always work... and you've really got to be a god with regexps to do it very well, to begin with.
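
To see what each pattern actually captures, here is a minimal sketch. The sample line, its example.invalid address, and the helper capture() are assumptions made for illustration; they are not taken from the real playlist data.

```perl
use strict;
use warnings;

# Hypothetical sample line, modelled on the "typical line" in the question.
my $line = '<REF HREF="mms://example.invalid/clip.wma" />';

# Return the first capture group of $re matched against $text, or undef.
sub capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

# The lazy attempt: "? may match nothing, and /> must follow immediately,
# so .*? has to grow past the closing quote and the space to reach "/>".
my $lazy = capture(qr{<REF HREF="?(.*?)"?/>}, $line);

# The negated class cannot cross a quote, and \s* absorbs the space.
my $negated = capture(qr{<REF HREF="?([^"]*)"?\s*/>}, $line);

print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
```

The brackets in the output make the stray quote and space in the lazy capture visible. Note that for an unquoted value the negated class would still pick up the space before "/>", since nothing stops [^"]* from matching whitespace.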


    ------------
    :Wq Not an editor command: Wq
Re: Matching optionally quoted string
by diotalevi (Canon) on Dec 14, 2003 at 19:30 UTC

    I'm perplexed that no one suggested matching the trailing quote mark. This is really rather easy. This is case-insensitive, is friendlier about white-space and matches the beginning single/double quote with the trailing one. It'll break if the HREF attribute is anywhere except at the beginning of your REF tags.

    while ( <ASX> ) {
        next unless m(<REF\s+HREF\s*=\s*(['"])(.+?)\1)si;
        print PLAYLIST $2, "\n\n";
    }
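
As a rough check of how the backreference behaves, the sketch below runs the same pattern over three made-up lines. The helper href_value() and the example.invalid addresses are assumptions for illustration only.

```perl
use strict;
use warnings;

# Extract the HREF value; \1 forces the closing quote to be the same
# character (' or ") as the opening one.
sub href_value {
    my ($text) = @_;
    return $text =~ m{<REF\s+HREF\s*=\s*(['"])(.+?)\1}si ? $2 : undef;
}

for my $line (
    '<REF HREF="mms://example.invalid/a.wma" />',
    "<ref href = 'mms://example.invalid/b.wmv'/>",
    '<REF HREF=mms://example.invalid/c.wma />',    # unquoted value
) {
    my $value = href_value($line);
    print defined $value ? "$value\n" : "(no match)\n";
}
```

Because the pattern requires an opening quote, the unquoted third line does not match at all, which matters if the quotes really are optional in the input.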
Re: Matching optionally quoted string
by mirod (Canon) on Dec 14, 2003 at 17:03 UTC

    A simple m{<REF HREF=(.*?)\s*/>} might work here. At least it will work in the case of the "typical line" that you showed.

    Of course the really clean way to do it is to use a proper HTML parser, like HTML::Parser, HTML::TokeParser::Simple or HTML::TreeBuilder. HTML::TreeBuilder offers the extract_links method which will probably be just what you are looking for. It will deal with case problems, absent (or alternate) quotes... and be generally a lot more robust than what you seem to be doing here. Have a look at Sean M Burke's book Perl & LWP for more info on processing HTML with Perl.

      No, m{<REF HREF=(.*?)\s*/>} doesn't work: it still captures the beginning and ending quotes.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Matching optionally quoted string
by CountZero (Bishop) on Dec 14, 2003 at 18:05 UTC
    This regex will work: m{<REF HREF="?(.*?)"?\s*/>}
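
A quick way to check this against both the quoted and the unquoted form is sketched here. The helper href_value() and the example.invalid addresses are hypothetical, chosen only to mirror the typical line from the question.

```perl
use strict;
use warnings;

# Capture the HREF value whether or not it is wrapped in double quotes.
# The lazy .*? stops as soon as an optional quote, optional whitespace
# and "/>" can follow it.
sub href_value {
    my ($text) = @_;
    return $text =~ m{<REF HREF="?(.*?)"?\s*/>} ? $1 : undef;
}

# Hypothetical inputs: one quoted, one unquoted.
for my $line (
    '<REF HREF="mms://example.invalid/clip.wma" />',
    '<REF HREF=mms://example.invalid/clip.wma />',
) {
    print '[', href_value($line), "]\n";
}
```

The key difference from the failing attempt in the question is the \s* before "/>": it lets the lazy group stop in front of the closing quote instead of running on to the slash.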

    Note that you do not have to escape the quotes. The only characters you have to escape are: \ | ( ) [ { ^ $ + ?

    Update: Also . and * need to be escaped. (thanks Tachyon)

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      The converse is equally important: you may escape any non-alphanumeric and be confident that it will literally match.

      { and } are a strange case: as long as you don't have a well formed {min[,[max]]} quantifier, you don't need to escape them. So if you always escape either { or } you are ok. People sometimes get in trouble by saying {,max} which is a literal match, not a quantifier.

      Update: _ seems to be a grey area. perlre seems to guarantee that \_ won't ever become a metacharacter, but quotemeta doesn't quote it.
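
For reference, a small sketch of what quotemeta and \Q...\E do with a string full of metacharacters. The string in $literal is a made-up example.

```perl
use strict;
use warnings;

# Hypothetical literal containing several regex metacharacters
# and an underscore.
my $literal = 'a.b*c_d?';

# quotemeta backslashes every character except letters, digits and "_".
my $escaped = quotemeta $literal;
print "$escaped\n";

# \Q...\E applies the same escaping inside a pattern, so the
# interpolated string is matched literally.
print "literal match\n"       if     'xa.b*c_d?y' =~ /\Q$literal\E/;
print "no accidental match\n" unless 'aXbbc_d'    =~ /\Q$literal\E/;
```

Without \Q, the second string would match, because . would match any character and the * and ? would act as quantifiers.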

      You also need to escape @, and sometimes -. For the latter, consider this:
      $ perl -we'use overload q:"":=>sub {print "in stringify"}, q:%{}:=>sub {print " in hash deref";{foo=>1}}; $x=bless[]; qr/$x\->{foo}/'
      in stringify
      sthoenna@DHX98431 ~
      $ perl -we'use overload q:"":=>sub {print "in stringify"}, q:%{}:=>sub {print " in hash deref";{foo=>1}}; $x=bless[]; qr/$x->{foo}/'
       in hash deref
Re: Matching optionally quoted string
by delirium (Chaplain) on Dec 15, 2003 at 00:11 UTC
    Use your good friend tr// and save the regex headaches:

    while (<ASX>) {
        tr/"//d if /HREF/;
        print PLAYLIST $1 if /<REF HREF=(.*)\/>/;
    }

    (untested)

Re: Matching optionally quoted string
by bl0rf (Pilgrim) on Dec 14, 2003 at 21:45 UTC
    Dear monks, so far I have seen many monks doing
    unnecessary work. Is laziness not the first virtue
    of the Perl programmer?
    I always use: $foo =~ m!href="([^"]*)!i;
    After all, what you want to match is just non-"
    characters...

Re: Matching optionally quoted string
by Aristotle (Chancellor) on Dec 15, 2003 at 03:59 UTC
    Just use an XML parser. I recommend XML::LibXML. What you want to do then becomes
    print $_->getAttribute( 'HREF' ), "\n\n"
        for XML::LibXML
            ->new()
            ->parse_fh( \*ASX )
            ->findnodes( '//REF' );    # element names are case-sensitive in XPath

    Makeshifts last the longest.