in reply to Finding and hightlight information

If you could put artificial markers into the text that won't upset the data mining application and can be removed easily, you could build a table mapping original-to-munged, then datamine the munged, then use the offsets and the mapping table to point at the right places in the original.
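One way to read this suggestion, as a minimal sketch only: the munging step below (collapsing whitespace runs) and the names `munge_with_map` and `to_original` are hypothetical, and the mapping is kept in a separate table rather than in markers embedded in the text. `index` stands in for the data mining tool.

```perl
use strict;
use warnings;

# Hypothetical munging step: collapse each whitespace run to one space,
# recording where every word starts in both the munged and original text.
sub munge_with_map {
    my ($orig) = @_;
    my $munged = '';
    my @map;
    while ($orig =~ /(\S+)/g) {
        $munged .= ' ' if length $munged;
        push @map, [ length($munged), $-[1] ];   # [munged offset, original offset]
        $munged .= $1;
    }
    return ($munged, \@map);
}

# Translate an offset in the munged text back to the original text,
# using the nearest table entry at or before that offset.
sub to_original {
    my ($map, $munged_os) = @_;
    my ($best) = grep { $_->[0] <= $munged_os } reverse @$map;
    return unless $best;
    return $best->[1] + ($munged_os - $best->[0]);
}

my $orig = "The   quick\n\tbrown fox";
my ($munged, $map) = munge_with_map($orig);
my $hit = index($munged, 'brown');   # stand-in for the data mining tool
my $os  = to_original($map, $hit);
print substr($orig, $os, 5), "\n";
```

The table costs one entry per word here; a real munger would need an entry wherever the two texts drift apart.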

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.


Re: •Re: Finding and hightlight information
by fletcher_the_dog (Friar) on Mar 27, 2003 at 21:27 UTC
    I have been thinking about this idea, and I can make it work by doing something like this:
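    use strict;
    use warnings;

    open INFILE, '<', 'infile.txt'  or die "Cannot read infile.txt: $!";
    open OFILE,  '>', 'outfile.txt' or die "Cannot write outfile.txt: $!";

    my $total_os = 0;    # combined length of all lines read so far
    while (<INFILE>) {
        my $str = $_;
        # $-[0] is the offset in the original line where this match starts
        # (pos() is not usable inside an s///g replacement)
        $str =~ s/(\s+)/osmarker($-[0], $1)/eg;
        # a bunch of regular expressions
        $total_os += length($_);
        print OFILE $str;
    }

    # Follow each whitespace run with a marker holding the original-file
    # offset of the text that comes after it.
    sub osmarker {
        my $os     = shift;
        my $spaces = shift;
        $os += length($spaces) + $total_os;
        return $spaces . "<OS=$os>";
    }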
    The problem with inserting these markers is not in the data mining tool but in the regular expressions that munge the text. Some of them look for "WORD\s+WORD" and would be broken by a marker. I could fix this by defining a variable like this:
    my $space=qr/(?:<OS=\d+>|\s)/;
    and replacing all instances of "\s" with "$space". Is there an easier way of doing this? Is there a way to overload "\s"?
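To make the `$space` substitution concrete, here is a minimal sketch. The rule `$pair` and the sample string are made up for illustration; only `$space` comes from the post above. Because `$space` is a single non-capturing group, a quantifier such as `+` can follow it directly and the numbering of the capture groups is unchanged.

```perl
use strict;
use warnings;

# Matches either real whitespace or one of the offset markers.
my $space = qr/(?:<OS=\d+>|\s)/;

# Hypothetical munging rule that originally read /(\w+)\s+(\w+)/.
my $pair = qr/(\w+)$space+(\w+)/;

my $marked = "hello <OS=6>world";
if ($marked =~ $pair) {
    print "$1 / $2\n";
}
```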