htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I'm using this code to find 3- and 4-digit numbers in a text field, then add links and bold tags around them.
unless ($type eq "project" || $type eq "npalert" || $text =~ /RESID/) {
    my $digits_4 = qr{ \b \d{4} \b }xms;
    $text =~ s{ ($digits_4) }
              {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;
    my $digits_3 = qr{ \b \d{3} \b }xms;
    $text =~ s{ ($digits_3) }
              {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;
} # end unless ($type eq "project")
My issue is sometimes the text field already contains a URL that contains a 4 digit number. I need to be able to ignore 4 digit numbers if they are within a URL, but still have the above code work for 4 digit numbers that are plain text. Help! Thanks.

Replies are listed 'Best First'.
Re: Replacing 3 and 4 digit numbers.
by Eily (Monsignor) on Apr 07, 2016 at 08:06 UTC

    ++ Discipulus for the single regex. Please note that using regexes to parse XML is not always the best idea, although there are some exceptions. I'm not so sure about your problem, and maybe considering an XML module would be a good idea.

    Anyway, what you need here is more context, either what's before your number, after, or both to decide what to do with your digits. I think the easiest is to use look-ahead assertions, because they are more flexible than look-behind, and I think easier to understand. I don't know what your input data looks like, so maybe this:

    qr{
        \b \d{3,4} \b   # 3 or 4 digits
        (?!             # not followed by
            </b></a>
        )
    }xms;

    The interesting thing about look-ahead assertions is that they peek at what is present after a certain point, but do not include it in the match. So with my regex only 1234 (for example) will match, though perl will have checked that no </b></a> is present just after.
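A minimal sketch of the look-ahead idea described above, not a tested solution for the original data. The sample text is hypothetical, and the existing link deliberately has no digits in its href, since this pattern only inspects what follows the number:

```perl
use strict;
use warnings;

# Hypothetical input: a plain number, an already linked number, and a 5-digit number.
my $text = 'Unit 1234 is near <a href="info.pl"><b>5678</b></a> and 12345.';

my $unlinked = qr{
    \b \d{3,4} \b      # 3 or 4 digits
    (?! </b></a> )     # not followed by the closing tags of an existing link
}xms;

# Copy first, then substitute, so the original text stays available.
(my $out = $text) =~
    s{ ($unlinked) }
     {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;

print "$out\n";
```

Only 1234 should be wrapped here: 5678 is rejected by the look-ahead, and 12345 has no word boundary after its first three or four digits.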

    It won't work if the numbers you want to leave alone are in links with another format (e.g. no <b>, or more nested tags). In that case you might want to start checking that the first link tag after your number is an opening one, and not a closing one. But at this point it would really start to look like regexes are not the best solution, so while you can try it, I highly recommend considering another way.
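One possible reading of "the first link tag after your number is an opening one" can be sketched with a look-ahead that refuses a number when a closing </a> is reached before any opening <a. This is an illustration under assumptions (anchors are never nested, the sample text is made up), and it still shares the general fragility of regexes on HTML:

```perl
use strict;
use warnings;

# Hypothetical input: the linked number appears both in the href and in the link text.
my $text = 'Unit 1234, <a href="u.pl?unit=5678"><i>5678</i></a>, unit 910.';

my $outside_link = qr{
    \b \d{3,4} \b
    (?!                          # reject the number if a closing tag comes first:
        (?: (?! <a \b ) . )*?    #   any characters that do not start an opening anchor
        </a>                     #   followed by a closing anchor
    )
}xms;

(my $out = $text) =~
    s{ ($outside_link) }
     {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;

print "$out\n";
```

Because the check looks for the next anchor tag of either kind, it also protects digits inside the href attribute, which the simpler </b></a> look-ahead does not. The scan can run to the end of the text for every candidate number, so it may be slow on large inputs.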

Re: Replacing 3 and 4 digit numbers.
by Discipulus (Canon) on Apr 07, 2016 at 06:50 UTC
    hello htmanning

    Not sure I have understood your question; anyway, you have two very similar statements that can probably be reduced to one:

    qr{ \b \d{3,4} \b }xms

    See perlrequick for the quantifiers introduction.
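As a small illustration of the {3,4} quantifier (the sample sentence is invented), one pattern picks up both lengths and the word boundaries keep shorter or longer runs of digits out:

```perl
use strict;
use warnings;

my $digits_3_4 = qr{ \b \d{3,4} \b }xms;

# Hypothetical input with 2-, 3-, 4- and 5-digit numbers.
my $text = 'Units 123 and 4567 are listed; 12 and 12345 are not.';

# In list context with /g, the match returns every captured number.
my @found = $text =~ m{ ($digits_3_4) }xmsg;

print "@found\n";
```

Here @found should contain only 123 and 4567.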

    In addition, if I understand the issue, you can tell the regex to match only if, for example, the numbers are not between href= and what you can consider the right end of your URLs.

    The stricter your regex is, the more accurate the results will be: you have to know (and predict!) where these digits can be candidates for the substitution.

    Only you can tell what your data is: you need to find the right regex for all your possible cases. Know your Data is an important rule. If you can post an example of a normal line and also one line with the problem, you'll probably get better answers.

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Replacing 3 and 4 digit numbers. (html hilite highlight bold)
by beech (Parson) on Apr 07, 2016 at 08:00 UTC

      You can use a similar approach with almost any XML/HTML DOM/tree module; you can skip the pretty printer if you're using a browser to view the results.

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use XML::LibXML;
      use XML::LibXML::PrettyPrint;

      my $html = q{<p> Looking for targets
      <p> Text nodes to <i> bold </i>
      <p> Inside <em> all kinds <i> of <a href="#tags"> tags </a></i></em>
      <p> Maybe even <em><i>sep</i><u>a</u><i>rat</i><u>ed</u></em> in the future
      <p> But <b targets="targets" bold="bold" tags="tags" separated="separated"> not </b> inside attributes
      };

      my $xpp = XML::LibXML::PrettyPrint->new;
      my $dom = XML::LibXML->new()->load_html( string => $html );
      print $xpp->pretty_print( $dom );
      hilite_text( $dom , 'target|bold|tags|separated' );
      print $xpp->pretty_print( $dom );

      sub hilite_text {
          my( $dom, $targets ) = @_;
          for my $text ( $dom->findnodes( '//text()' ) ){
              my( $before, $word , $after, ) = split /($targets)/, "$text";
              if( defined $word and length $word ){
                  $before = $dom->createTextNode( $before );
                  $after = $dom->createTextNode( $after );
                  my $bold = $dom->createElement('b');
                  $bold->appendText($word);
                  $text->parentNode->replaceChild( $before, $text );
                  $before->addSibling( $bold );
                  $bold->addSibling( $after );
              }
          }
          return $dom;
      }
Re: Replacing 3 and 4 digit numbers.
by AnomalousMonk (Archbishop) on Apr 07, 2016 at 14:11 UTC

    I think the advice of others about using an (X|HT)ML parser to avoid certain "URLs" (whatever that means in your particular circumstances) is wisely given and well taken. However, if you can clearly define a "URL" in the context of your data (and that's a big if!), here's a neat approach that will work. (This needs Perl version 5.10+ regex extensions.) (Untested)

    my $digits_3_4 = qr{ \b \d{3,4} \b }xms;
    my $url = qr{ <a> ... </a> }xms;
    $text =~ s{ $url (*SKIP) (*FAIL) | ($digits_3_4) }
              {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;
    Please see Special Backtracking Control Verbs for  (*SKIP) (*FAIL) in perlre.
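To make the (*SKIP) (*FAIL) idea concrete, here is a sketch with the placeholder URL pattern filled in. The definition of $url is purely an assumption for illustration (a whole <a ...>...</a> element without nested anchors, or a bare http/https address running to the next whitespace), and the sample text is invented; the real pattern has to come from the actual data:

```perl
use strict;
use warnings;
use 5.010;    # backtracking control verbs need Perl 5.10 or later

my $digits_3_4 = qr{ \b \d{3,4} \b }xms;

# ASSUMPTION: what counts as a "URL" in this hypothetical data.
my $url = qr{
      <a \b [^>]* > .*? </a>     # a complete anchor element
    | https?:// \S+              # or a bare address up to the next whitespace
}xms;

my $text = 'See <a href="map.pl?id=2016">map 2016</a> or http://example.com/1999/ '
         . 'for units 101 and 2045.';

# When $url matches, (*SKIP) marks its end and (*FAIL) rejects the match,
# so the engine resumes after the URL instead of looking inside it.
(my $out = $text) =~
    s{ $url (*SKIP) (*FAIL) | ($digits_3_4) }
     {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;

say $out;
```

With this input, 2016 and 1999 should be left untouched while 101 and 2045 are wrapped.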


    Give a man a fish:  <%-{-{-{-<

      $text =~ s{ $url (*SKIP) (*FAIL) | ($digits_3_4) }
      wow! perlre is definitely an unexplored (by me) continent! thanks AnomalousMonk

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        :D "old style" that is backwards compatible

        s{ (...) | (...) | ... }{ RepReplace($1,$2,$3); }ge;
        sub RepReplace {
            my( $link, $dig3, $num4 ) = @_;
            if( defined $link ){
                return $link; ### change nothing
            }
            ...
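A fuller, hedged version of that "old style" idea might look like the following. The elided patterns above are not known, so the $url definition, the helper name rep_replace and the sample text are all hypothetical stand-ins:

```perl
use strict;
use warnings;

# ASSUMPTION: a "URL" is a whole <a ...>...</a> element with no nested anchors.
my $url        = qr{ <a \b [^>]* > .*? </a> }xms;
my $digits_3_4 = qr{ \b \d{3,4} \b }xms;

# Hypothetical helper: returns existing links unchanged, wraps bare numbers.
sub rep_replace {
    my ( $link, $digits ) = @_;
    return $link if defined $link;    # matched an existing link: change nothing
    return qq{<a href="resident-info.pl?do_what=view&unit=$digits"><b>$digits</b></a>};
}

my $text = 'Call <a href="x.pl?n=1234">1234</a> about unit 567.';

# The link alternative is tried first, so digits inside a link are consumed
# as part of $1 and never reach the second alternative.
(my $out = $text) =~ s{ ($url) | ($digits_3_4) }{ rep_replace($1, $2) }xmsge;

print "$out\n";
```

This avoids the 5.10 control verbs at the cost of a function call per match.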
      Please see Special Backtracking Control Verbs for (*SKIP) (*FAIL) in perlre.

      Hey! You fooled me into thinking that you want htmanning to RTFM.

      Discipulus++'s answer was a real insight. I had never expected that readable words are part of perl's RE syntax. I've looked into my local perl, and lo and behold, it's supported even here, in 5.18.1 (but still marked experimental).

      The only thing that I don't like is that perlre does not state since which version this feature is supported.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        ... since which version this feature is supported.

        Special Backtracking Control Verbs first appeared in Perl 5.10.


        Give a man a fish:  <%-{-{-{-<

      Thanks so much for the replies. This works, but I have a different issue now. I need to ignore years. How can I set some parameters to ignore 2014, 2015, 2016, 2017, etc.?

        What is the context in which these digit groups appear? With what other four-digit groups that you want to process might they be confused? Can you give some brief example input and desired output? What code have you written so far? What is the "this" that works?


        Give a man a fish:  <%-{-{-{-<