PerlMonks  

Skipping special tags in regexes

by fletcher_the_dog (Friar)
on Dec 04, 2003 at 15:42 UTC ( [id://312223]=perlquestion )

fletcher_the_dog has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have files that contain some special markers, something like this:
<5b>I <5c>like <5d>tacos
I was wondering whether there is any way to make the regex engine ignore these tags without deleting them. For example, if I had a regex like this:
s/(\w+)\s+tacos/$1 yummy tacos/g;
I would want it to change the above phrase to:
<5b>I <5c>like yummy <5d>tacos
I realize I could change the regex to:
s/(\w+)\s+(<\w\w>)?tacos/$1 yummy $2tacos/g;
but my script requires quite a few complicated regex substitutions, and having optional "(<\w\w>)?"s everywhere will make the regexes very ugly and make it hard to keep track of which capturing parentheses are catching what.
I was wondering whether there is a way to make the regex engine see the string as something like this, where it would do its matching against just the first element of each sub-array:
$string = [ ['I','<5b>'], [' '], ['l','<5c>'], ['i'], ['k'], ['e'], [' '], ['t','<5d>'], ['a'], ['c'], ['o'], ['s'] ]
Another possibility I have considered and tried was erasing all the tags and then, once all the substitutions were done, doing a diff and trying to reconstruct where the tags should go. This actually works, but it was way too slow.

Replies are listed 'Best First'.
Re: Skipping special tags in regexes
by Abigail-II (Bishop) on Dec 04, 2003 at 16:09 UTC
        #!/usr/bin/perl

        use strict;
        use warnings;

        my $tag = '(?:<[^>]*>)';

        sub word {
            my $word = shift;
            qr /$tag*$word/;
        }

        while (<DATA>) {
            s/((??{ word 'tacos' }))/yummy $1/g;
            s/((??{ word 'salad' }))/green $1/g;
            print;
        }

        __DATA__
        I like tacos.
        <5b>I <5c>like <5d>tacos.
        I like a salad.
        <foo>I <foo>like <foo>a <bar><baz><foo>salad.
        <foo>I <foo>like <foo>a <bar><baz><foo>salad <bup>with <bobob>tacos.

    Output:

        I like yummy tacos.
        <5b>I <5c>like yummy <5d>tacos.
        I like a green salad.
        <foo>I <foo>like <foo>a green <bar><baz><foo>salad.
        <foo>I <foo>like <foo>a green <bar><baz><foo>salad <bup>with yummy <bobob>tacos.

    Abigail

      Is there a reason you are using the /(??{ })/ code block other than just to make a more general solution? This seems to work just fine.
          #!/usr/bin/perl

          use strict;
          use warnings;

          my $tag = '(?:<[^>]*>)';

          while (<DATA>) {
              s/($tag*tacos)/yummy $1/g;
              s/($tag*salad)/green $1/g;
              print;
          }

          __DATA__
          I like tacos.
          <5b>I <5c>like <5d>tacos.
          I like a salad.
          <foo>I <foo>like <foo>a <bar><baz><foo>salad.
          <foo>I <foo>like <foo>a <bar><baz><foo>salad <bup>with <bobob>tacos.

      Output:

          I like yummy tacos.
          <5b>I <5c>like yummy <5d>tacos.
          I like a green salad.
          <foo>I <foo>like <foo>a green <bar><baz><foo>salad.
          <foo>I <foo>like <foo>a green <bar><baz><foo>salad <bup>with yummy <bobob>tacos.

      --

      flounder

        Is there a reason you are using the /(??{ })/ code block other than just to make a more general solution?
        Because I was expecting the OP to refine his question, and come up with a slightly different definition of a "word" (perhaps it needed trailing tags as well, or not more than 2 tags, whatever). Then I only need to change the sub, and not every regex using it.

        I think my solution is more general than yours, but the effect is the same.

        Abigail
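
The point about changing only the sub can be illustrated with a minimal sketch. It assumes the refined definition of a "word" also allows tags directly after it, which is one of the hypothetical refinements mentioned above and not something the original poster asked for. The wrapper add_adjectives and the sample line are made up for the illustration; the two substitutions are written exactly as in the first reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $tag = '(?:<[^>]*>)';

# Assumed refinement: a "word" may now carry trailing tags as well as
# leading ones. This sub is the only thing that changed.
sub word {
    my $word = shift;
    return qr/$tag*$word$tag*/;
}

# Hypothetical wrapper around the two substitutions, which are unchanged.
sub add_adjectives {
    my $line = shift;
    $line =~ s/((??{ word 'tacos' }))/yummy $1/g;
    $line =~ s/((??{ word 'salad' }))/green $1/g;
    return $line;
}

print add_adjectives('<5b>I <5c>like <5d>tacos<5e>.'), "\n";
# prints: <5b>I <5c>like yummy <5d>tacos<5e>.
```

With the inlined version, the same refinement would have to be repeated in every substitution that mentions $tag.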

Re: Skipping special tags in regexes
by BrowserUk (Patriarch) on Dec 04, 2003 at 17:28 UTC

    You might consider creating your own custom tag. This example is probably not well thought through, it's basically just a tweaking of the example given at the end of perlre, but I'd never tried this before so I did what came easy:)

        #! perl -slw
        use strict;

        package CustomReTag;
        use overload;

        sub import {
            shift;
            die "No argument to customre::import allowed" if @_;
            overload::constant 'qr' => \&convert;
        }

        sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

        my %rules = ( '\\' => '\\', 'Y|' => qr/<\w\w>/ );

        sub convert {
            my $re = shift;
            $re =~ s{
                      \\ ( \\ | Y . )
                    }
                    { $rules{$1} or invalid($re,$1) }sgex;
            return $re;
        }

        package main;

        my $s = '<5b>I <5c>like <5d>tacos';

        my $re = CustomReTag::convert '(\Y|tacos)';

        $s =~ s[$re][yummy $1]g;

        print $s;

        __END__
        P:\test>junk
        <5b>I <5c>like yummy <5d>tacos

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

Re: Skipping special tags in regexes
by thospel (Hermit) on Dec 04, 2003 at 16:07 UTC
    The wanted solution is not well specified though. In your example, it would seem equally valid to return:
    <5b>I<5c>like <5d>yummy tacos
    (corresponding to a different place to put the (<\w\w>)?). You'll have to state that this doesn't matter or state a way to resolve this ambiguity.
      That is a good point. The tags are associated with the word immediately following them, so it would be nice if I could associate them with the first character of that word.
Re: Skipping special tags in regexes
by dragonchild (Archbishop) on Dec 04, 2003 at 16:01 UTC
    Build dynamic regexes. Maybe something like:
        my $regex = 's/(\w+)\s+tacos/$1 yummy tacos/g';
        $regex =~ s/(\\s\+)(\w)/$1(<\\w\\w>)$2/g;   # Handle first part of substitution
        my $match = $2;
        $regex =~ s/ $match/\$2$match/g;            # Handle second part of substitution

        eval "$regex";

    That would handle the transform from the first to the second. Ideally, you would re-evaluate your regexes and build them using some regex builder. The builder would handle the optional tags and making sure they stayed in after the substitutions.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Skipping special tags in regexes
by delirium (Chaplain) on Dec 04, 2003 at 18:25 UTC
    Another possibility is to not rely on regexes. I'm not sure what your specs for searching and replacing are, but if they are narrowed down to prefixing words with other words, as in your example, you could get away with building a custom function and passing it the line to change, the word to look for, and what to prefix it with, e.g.:

        #!/usr/bin/perl -w
        use strict;

        sub prefix {
            my ($line, $look_for, $prefix, $flip) = @_;
            return $line unless $look_for && $line =~ /$look_for/;
            my @arr = split /(<[^>]*>)/, $line;
            for my $cnt (0..$#arr) {
                if ($arr[$cnt] =~ /$look_for/) {
                    $arr[$cnt-2] .= $prefix if $cnt > 1 && !$flip;
                    $arr[$cnt+2] = $prefix . ' ' . $arr[$cnt+2] if $cnt < $#arr-1 && $flip;
                }
            }
            return join '', @arr;
        }

        my $line = "<5b>I <5c>like <5d>tacos\n";

        print prefix ($line, 'tacos', 'yummy');
        print prefix ($line, 'like', 'yummy', 1);

        __OUTPUT__
        <5b>I <5c>like yummy<5d>tacos
        <5b>I <5c>like <5d>yummy tacos
      Unfortunately, I have many regexes that are not fixed strings and require the full power of regexes.
Re: Skipping special tags in regexes
by BUU (Prior) on Dec 04, 2003 at 22:03 UTC
    My best guess would be to actually parse it somehow and store the whole thing in some sort of data structure. The simplest would be a hash of the form tag => string; then you could just iterate over the values of the hash to ignore the tags, and vice versa. How you would actually parse this string is a bit beyond me. If the tags are truly as simple as you depict here, it should be fairly simple to use a regex like /<5\w>\w+/ or something, but beyond that you would have to look at some of the parsers on CPAN.
      If the tags are associated with the word directly following them (without any intervening whitespace), then you could split the sentence on whitespace and then split off each tag from the word following it.

      It would then be trivial to build a data-structure you could use as a basis to put the tags back in after the regex has done its thing with the "untagged" sentence.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        The only problem is that each word in a string may not be unique, so you couldn't just plop things in a hash. Also, the regexes might introduce new words somewhere in the string that already existed somewhere else in it. That is why I tried diffing, but it was just too slow.
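
One way to sidestep both the uniqueness problem and a whole-string diff is to key the tags by character offset instead of by word, and to compare only the matched text with its replacement. The following is a rough sketch under stated assumptions, not a tested solution from this thread: tags look like <\w\w>, each tag belongs to the character that follows it, and every substitution is rewritten as a pattern plus a callback that receives the captured groups. The names strip_tags, common_ends and tagged_subst are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a tagged string into plain text and a list of [offset, tag] pairs;
# each offset is the plain-text position of the character the tag precedes.
sub strip_tags {
    my ($tagged) = @_;
    my $plain = '';
    my @tags;
    for my $piece (split /(<\w\w>)/, $tagged) {
        if ($piece =~ /\A<\w\w>\z/) { push @tags, [length $plain, $piece] }
        else                        { $plain .= $piece }
    }
    return ($plain, \@tags);
}

# Lengths of the common prefix and common suffix of two strings.
# The suffix is not allowed to overlap the prefix.
sub common_ends {
    my ($old, $new) = @_;
    my $max = length($old) < length($new) ? length($old) : length($new);
    my ($pre, $suf) = (0, 0);
    $pre++ while $pre < $max && substr($old, $pre, 1) eq substr($new, $pre, 1);
    $suf++ while $suf < $max - $pre
              && substr($old, -1 - $suf, 1) eq substr($new, -1 - $suf, 1);
    return ($pre, $suf);
}

# Rough equivalent of s/$re/.../g applied to the text between the tags.
# $repl is called with the captured groups and returns the replacement.
sub tagged_subst {
    my ($tagged, $re, $repl) = @_;
    my ($plain, $tags) = strip_tags($tagged);

    # Collect [start, old_length, new_text, prefix, suffix] for every match.
    my @edits;
    while ($plain =~ /$re/g) {
        my ($start, $len) = ($-[0], $+[0] - $-[0]);
        my @caps = map { defined $-[$_] ? substr($plain, $-[$_], $+[$_] - $-[$_]) : undef }
                   1 .. $#+;
        my $text = $repl->(@caps);
        push @edits, [$start, $len, $text,
                      common_ends(substr($plain, $start, $len), $text)];
    }

    # Move each tag to its position in the edited text.
    for my $t (@$tags) {
        my ($o, $delta) = ($t->[0], 0);
        for my $e (@edits) {
            my ($start, $len, $text, $pre, $suf) = @$e;
            last if $o < $start;               # tag lies before this edit
            my $grow = length($text) - $len;
            if ($o < $start + $len) {          # tag lies inside the match
                my $rel = $o - $start;
                $o = $rel >= $len - $suf ? $o + $grow      # unchanged tail: shift
                   : $rel <= $pre        ? $o              # unchanged head: stay
                   : $start + length($text) - $suf;        # rewritten middle
                last;
            }
            $delta += $grow;                   # edit lies wholly before the tag
        }
        $t->[0] = $o + $delta;
    }

    # Apply the edits right to left so earlier offsets stay valid, then
    # re-insert the tags the same way.
    substr($plain, $_->[0], $_->[1], $_->[2]) for reverse @edits;
    substr($plain, $_->[0], 0, $_->[1])       for reverse @$tags;
    return $plain;
}

print tagged_subst('<5b>I <5c>like <5d>tacos', qr/(\w+)\s+tacos/,
                   sub { "$_[0] yummy tacos" }), "\n";
# prints: <5b>I <5c>like yummy <5d>tacos
```

A tag that sat in the rewritten middle of a match is placed in front of the unchanged tail of the replacement, which is a heuristic rather than a rule taken from the question. Replacements that delete or reorder tagged words would need a more careful policy.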

Approved by Corion
Front-paged by broquaint