Re: Skipping special tags in regexes

by BUU (Prior)
on Dec 04, 2003 at 22:03 UTC


in reply to Skipping special tags in regexes

My best guess would be to actually parse it somehow and store the whole thing in some sort of data structure. The simplest would be a hash of the form tag => string; then you could just iterate over the values of the hash to ignore the tags, and vice versa. How you would actually parse this string is a bit beyond me. If the tags are truly as simple as you depict here, it should be fairly simple to just use a regex like /<5\w>\w+/ or something, but beyond that you would have to look at some of the parsers on CPAN.
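
As a rough illustration of the tag => string idea, here is a minimal sketch in Python (the same pattern carries over to a Perl hash and a global match). The tag format `<5x>` is only a guess based on the regex above, and `parse_tagged` and the sample string are made up for the example. It also assumes every tag is unique, since repeated tags would overwrite each other in the mapping.

```python
import re

# Assumed (hypothetical) tag format: "<5" + one word character + ">",
# immediately followed by the word it belongs to.
TAGGED_WORD = re.compile(r"(<5\w>)(\w+)")

def parse_tagged(text):
    """Return a dict mapping each tag to the word that follows it.

    Assumes tags are unique; a repeated tag would overwrite the earlier entry.
    """
    return {tag: word for tag, word in TAGGED_WORD.findall(text)}

sample = "<5a>quick <5b>brown <5c>fox"  # invented sample input
parsed = parse_tagged(sample)

words_only = list(parsed.values())  # iterate the values to ignore the tags
tags_only = list(parsed.keys())     # and the keys for the reverse
print(words_only)
print(tags_only)
```

Words that carry no tag are simply not matched here, so this only fits input where every word of interest is tagged.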

Re: Re: Skipping special tags in regexes
by CountZero (Bishop) on Dec 04, 2003 at 22:48 UTC
    If the tags are associated with the word directly following them (without any intervening whitespace), then you could split the sentence on whitespace and then split off each tag from the word following it.

    It would then be trivial to build a data structure you could use as a basis to put the tags back in after the regex has done its thing with the "untagged" sentence.
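
    A minimal sketch of that split-then-split-off step, written in Python (Perl's split and a capturing regex would do the same job). The tag shape, an optional `<...>` prefix glued to the word, is an assumption, and `split_tags` and `untagged` are hypothetical helper names.

```python
import re

# Assumed tag shape: an optional "<...>" prefix attached directly to the word.
TOKEN = re.compile(r"^(<[^<>]+>)?(.*)$")

def split_tags(sentence):
    """Split on whitespace, then split each token into (tag, word).

    The tag is None for words that carry no tag. Word order is preserved.
    """
    pairs = []
    for token in sentence.split():
        tag, word = TOKEN.match(token).groups()
        pairs.append((tag, word))
    return pairs

def untagged(pairs):
    """Rebuild the sentence without tags, ready for the regex to work on."""
    return " ".join(word for _tag, word in pairs)

pairs = split_tags("the <5a>quick brown <5b>fox")  # invented sample input
print(pairs)
print(untagged(pairs))
```

    Keeping the pairs in a list rather than a hash means repeated words do not collide, and the position of each tag stays available for putting it back later.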

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      The only problem is that each word in a string may not be unique, so you couldn't just plop things in a hash. Also, the regexes might introduce new words somewhere in the string that already existed elsewhere in the string. That's why I have tried using diffing, but it was just too slow.
        That's true, but I was not necessarily thinking of using a hash.

        An array-based data structure would probably be OK, and it has the added benefit of preserving the sequence of the words. This would make it a lot easier to construct the "untagged" sentence for regex purposes. Thereafter, one could split the regexed sentence on whitespace and compare this list with the array made by splitting the "original" list.

        All you have to do then is to walk the original list, adding tags to the regexed-list where necessary and skipping the newly inserted words in the regexed-list. You might still have a problem in cases where you introduce duplicate words next to one another.
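
        The walk could look roughly like this in Python. This is only a sketch under a strong assumption: the regex only inserts words and never deletes or changes an original one. `retag` is a hypothetical name, and `original_pairs` stands for an array of (tag, word) pairs built from the original sentence, with None for untagged words.

```python
def retag(original_pairs, regexed_sentence):
    """Walk both lists in order, re-attaching tags to the regexed words.

    A regexed word equal to the next unmatched original word takes that
    word's tag; any other word is treated as newly inserted and skipped.
    Assumes the regex never removes or alters an original word.
    """
    result = []
    i = 0  # index of the next original word not yet matched
    for word in regexed_sentence.split():
        if i < len(original_pairs) and word == original_pairs[i][1]:
            tag = original_pairs[i][0]
            result.append((tag or "") + word)
            i += 1
        else:
            result.append(word)  # newly inserted word, no tag
    return " ".join(result)

# Invented sample data: pairs from the original, then the regexed sentence.
original_pairs = [(None, "the"), ("<5a>", "quick"), (None, "brown"), ("<5b>", "fox")]
retagged = retag(original_pairs, "the very quick brown red fox")
print(retagged)
```

        This is a single pass, so it avoids a full diff, but it shows the duplicate problem directly: if the regex inserts a copy of a word right before the original, the tag lands on the inserted copy.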

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
