fletcher_the_dog has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem where I take a text document, do a bunch of regex substitutions, write the munged text to a new file, and then run a data mining application over the new file. The data mining application returns byte offsets for data that it has extracted which can be used to highlight information in the text. My problem is the byte offsets are for the munged text, but I need to be able to highlight stuff in the original text. I am stumped as to a way to reverse the regex substitutions in order to get the original offsets.

Replies are listed 'Best First'.
Re: Finding and hightlight information
by merlyn (Sage) on Mar 27, 2003 at 20:38 UTC
    If you could put artificial markers into the text that won't upset the data mining application and can be removed easily, you could build a table mapping original-to-munged, then datamine the munged, then use the offsets and the mapping table to point at the right places in the original.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.
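
    A minimal sketch of the mapping-table idea, assuming markers of the form <OS=n> (the format used in the reply below) where n is the offset in the original text of whatever follows the marker. The names strip_markers and to_original are made up for illustration. The mapping is exact right after a marker and only as good as the text is unchanged from there to the next one.

    ```perl
    use strict;
    use warnings;

    # Strip <OS=n> markers from marked-up text; return the clean text plus a
    # table of [offset in clean text, offset in original text] pairs.
    sub strip_markers {
        my ($marked) = @_;
        my ($clean, @map) = ('');
        for my $piece (split /(<OS=\d+>)/, $marked) {
            if ($piece =~ /\A<OS=(\d+)>\z/) {
                push @map, [ length($clean), $1 ];
            }
            else {
                $clean .= $piece;
            }
        }
        return ($clean, \@map);
    }

    # Map an offset in the clean text to the original, using the nearest
    # marker at or before it (a linear scan, good enough for a sketch).
    sub to_original {
        my ($map, $offset) = @_;
        my ($clean_at, $orig_at) = (0, 0);
        for my $entry (@$map) {
            last if $entry->[0] > $offset;
            ($clean_at, $orig_at) = @$entry;
        }
        return $orig_at + ($offset - $clean_at);
    }

    # Original was "a   dog"; the munging collapsed the whitespace run.
    my ($clean, $map) = strip_markers("a <OS=4>dog");
    print "$clean\n";
    print to_original($map, 2), "\n";    # where "dog" starts in the original
    ```

    The data mining tool would be run on the clean text, and each offset it reports would go through to_original before highlighting.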

      I have been thinking about this idea, and I can make it work by doing something like this:
      use strict;
      use warnings;

      open my $in,  '<', 'infile.txt'  or die "infile.txt: $!";
      open my $out, '>', 'outfile.txt' or die "outfile.txt: $!";

      my $total_os = 0;    # offset of the current line within the original file
      while (my $str = <$in>) {
          my $len = length $str;    # original line length, before munging
          # tag the end of each whitespace run with its original offset
          $str =~ s/(\s+)/osmarker($+[0], $1)/eg;
          # a bunch of regular expressions
          $total_os += $len;
          print {$out} $str;
      }

      sub osmarker {
          my ($end, $spaces) = @_;    # $end: where the run ends within the line
          my $os = $end + $total_os;
          return $spaces . "<OS=$os>";
      }
      The problem with inserting these markers is not the data mining tool but the regular expressions that munge the text. Some of them look for "WORD\s+WORD" and would be broken by a marker. I could fix this by defining a variable like this:
      my $space=qr/(?:<OS=\d+>|\s)/;
      and replacing all instances of "\s" with "$space". Is there an easier way of doing this? Is there a way to overload "\s"?
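
      A small check of the $space idea, as a sketch only. The sample string is invented; the pattern is the one proposed above.

      ```perl
      use strict;
      use warnings;

      # One "space" is either real whitespace or an offset marker.
      my $space = qr/(?:<OS=\d+>|\s)/;

      my $text = "WORD <OS=5>WORD";

      # The plain pattern fails because the marker sits between the words ...
      my $plain = $text =~ /WORD\s+WORD/ ? "match" : "no match";
      # ... while the marker-aware pattern still matches.
      my $aware = $text =~ /WORD${space}+WORD/ ? "match" : "no match";

      print "plain: $plain\n";
      print "aware: $aware\n";
      ```

      Every munging regex that mentions \s would need the same treatment, and any replacement text would have to carry the matched markers through rather than drop them.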
Re: Finding and hightlight information
by toma (Vicar) on Mar 28, 2003 at 04:48 UTC
    The article Embedded Markup Considered Harmful introduced me to the ideas of Project Xanadu.

    Keeping the markup in a separate file is called parallel markup. With this approach the original document remains intact.

    Even if you are as big a fan of markup as I am, it is interesting to see a different way to do it.

    It should work perfectly the first time! - toma

Re: Finding and hightlight information
by BrowserUk (Patriarch) on Mar 27, 2003 at 21:43 UTC

    Without knowing the nature of the substitutions you are making, or what the original file is, or how it is used, it's difficult to know whether this idea might fly or not, but here it is anyway.

    When I first started coding--in assembler, many moons ago--it was quite common practice to embed blocks of 16 or 32 nop's at the end of each block of code.

    These blocks of nops were called "patch areas". The idea was that if a bug was found in a program after it had been compiled, you could patch the executable rather than rebuild it, and the patch areas left room for a routine to grow in size.

    Why would you do this rather than rebuild? Well, it was considered safer to patch the executable, as there is considerably less likelihood of unwittingly making other changes. E.g. someone omits or adds a compiler switch, a #define or whatever, and having fixed one bug, you suddenly get several others showing up in completely unrelated parts of the code. I believe that the technique is still actively used in such things as satellite control software, the space shuttle etc.

    Anyway, back to the question. Depending upon the nature of the substitutions you are making, you might be able to adjust the substitutions such that (most of) the offsets within the file remain the same after substitution as before. For net deletions this is fairly easy. If the replacement text is shorter than the original, you can pad it--with spaces or nulls for example.
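
    A minimal sketch of the "pad the shorter replacement" case, assuming space padding is harmless to the data mining tool. The function name replace_same_length and the sample text are invented for illustration.

    ```perl
    use strict;
    use warnings;

    # Replace every match of $re with $replacement, padded with spaces to the
    # length of the matched text so that later offsets do not move. Dies if
    # the replacement is longer than the match, which padding cannot fix.
    sub replace_same_length {
        my ($text, $re, $replacement) = @_;
        $text =~ s{($re)}{
            my $room = length($1) - length($replacement);
            die "replacement longer than match\n" if $room < 0;
            $replacement . (' ' x $room);
        }ge;
        return $text;
    }

    my $orig   = "the   quick brown fox";
    my $munged = replace_same_length($orig, qr/quick brown/, "fast");

    # "fox" should sit at the same offset before and after.
    print index($orig, "fox"), " ", index($munged, "fox"), "\n";
    ```

    The padding changes what later regexes see, so a pattern that depends on exact spacing may need adjusting.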

    The problem comes when the replacement is longer than the original. Depending upon the nature of the original file, and what applications are used to view/manipulate it, you might get away with adding some 'patch space' to it.

    For instance: if you added 10 or 20 spaces or null bytes to the end of each line, it might give you latitude to make the substitutions and have enough play to adjust the padding at the end of the following or previous lines to compensate for the changes. Obviously this wouldn't by itself cater for all possibilities. You might need to add a few nulls to the end of each word in the original file.

    Having typed all that, I think that the effort involved in getting the padding juggling algorithm correct would probably be much more than building a lookup table to do the mapping, but there you go. Only you will know if this has any merit for your situation.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: Finding and hightlight information
by nothingmuch (Priest) on Mar 27, 2003 at 20:58 UTC
    Perhaps you can use s///g in scalar context, keeping track of where you are with pos(). You can document all the offset differences so that you know there's a three byte warp here, and a two byte hole there, or whatever.
    How exactly do you apply the regex substitutions?

    -nuffin
    zz zZ Z Z #!perl
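
    One way the offset bookkeeping could look, as a sketch rather than a finished tool. It uses @- and @+ for the match positions inside s///ge instead of pos(); the names logged_subst and unmunge_offset are hypothetical, and the replacement is assumed to be a fixed string.

    ```perl
    use strict;
    use warnings;

    # Apply one substitution and log every edit as
    # [start in munged text, length of original match, length of replacement].
    sub logged_subst {
        my ($text, $re, $replacement) = @_;
        my @edits;
        my $shift = 0;    # how far the munged text has drifted so far
        $text =~ s{$re}{
            my $old = $+[0] - $-[0];
            push @edits, [ $-[0] + $shift, $old, length $replacement ];
            $shift += length($replacement) - $old;
            $replacement;
        }ge;
        return ($text, \@edits);
    }

    # Map an offset in the munged text back to the input of logged_subst.
    # Offsets inside a replacement map to the start of the original match.
    sub unmunge_offset {
        my ($edits, $offset) = @_;
        my $shift = 0;
        for my $e (@$edits) {
            my ($start, $old, $new) = @$e;
            last if $start > $offset;
            return $start - $shift if $offset < $start + $new;
            $shift += $new - $old;
        }
        return $offset - $shift;
    }

    my ($munged, $edits) = logged_subst("a   b   c", qr/\s+/, " ");
    print "$munged\n";
    print unmunge_offset($edits, 4), "\n";    # "c" in the original text
    ```

    With several substitutions in a row, each would keep its own log, and an offset would be pushed back through the logs in reverse order, last substitution first.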
Re: Finding and hightlight information
by Anonymous Monk on Mar 28, 2003 at 02:29 UTC
    Run the same regexes over the original file, only don't
    substitute; just save the positions.
    
    Then make another run over the original file, this
    time substitute and write into the new file.
    
    -bl0rf
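
    A sketch of the two-pass idea for a single regex, with an invented sample string. It only works as-is when each regex is matched against the untouched original; once one substitution has run, the positions seen by the next regex are no longer original positions.

    ```perl
    use strict;
    use warnings;

    my $text = "a   b   c";
    my $re   = qr/\s+/;

    # Pass 1: match without substituting and save [start, end) of each hit.
    my @spans;
    while ($text =~ /$re/g) {
        push @spans, [ $-[0], $+[0] ];
    }

    # Pass 2: substitute on a copy, leaving the original intact.
    (my $munged = $text) =~ s/$re/ /g;

    print join(", ", map { "$_->[0]-$_->[1]" } @spans), "\n";
    print "$munged\n";
    ```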
    
Re: Finding and hightlight information
by aquarium (Curate) on Mar 30, 2003 at 13:16 UTC
    Tk, and by extension Perl/Tk, has symbolic marking in Text widgets, i.e. tagging. The tags move logically as you do search and replace. You'd have to use the Tk search & replace instead of Perl regexes though. When you're done searching/replacing and tagging where each replacement occurred, you'd find the char position of each tag and then write the file for input to your super data munger. It's fairly straightforward; the SAMS Teach Yourself TCL/TK book (that I know of) explains how. Chris