bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have a file with a really long string in it (it is actually XML, but for some reason it is stored on one line). What I need to do is a substring search of the file and print out the "word" that contains the substring. This "word" might be a URL, a description, etc. For coding and extraction purposes, the "word" is delimited by whitespace. So I need to back up to the beginning of the "word" and print out to the end of the "word."

Here is the code I have already but as you can see it uses an absolute substring size and I need it to be dynamic:

while (<>) {
    my $istr = lc($_);
    my $offset = index($istr, "cesi");
    print $offset."\n";
    if ($offset > -1) {
        my $str = substr($istr, $offset-20, 100);
        print $str."\n";
    }
}

Thanks in advance for any input.

Replies are listed 'Best First'.
Re: substring extraction
by BrowserUk (Patriarch) on Jan 03, 2006 at 22:11 UTC

    Your description doesn't completely tally with your code. If the file contains a single long string, then your while loop will only iterate once. However, to print all whitespace-delimited words that either match or contain a given search term, you could use:

    $string = 'this is a really long string (no really, it is!) that contains a whitespace delimited word';
    print $1 while $string =~ m[(\b\S*limit\S*\b)]gi; ## All words, case insensitive.
    delimited
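A small sketch of what the \b anchors in the pattern above change, compared with matching purely from whitespace to whitespace. The sample text is made up for illustration and is not from the original post. With \b, the match is trimmed back to a word character at each end, so surrounding punctuation such as parentheses is dropped; without \b, the whole whitespace-delimited token is returned.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only to show the difference.
my $text = 'see (delimited) and http://example.com/limits?x=1 here';

# With \b anchors: each match must start and end at a word boundary,
# so leading/trailing punctuation is not included.
my @trimmed = $text =~ m[(\b\S*limit\S*\b)]gi;

# Without \b: each match runs from whitespace to whitespace.
my @whole = $text =~ m[(\S*limit\S*)]gi;

print "trimmed: $_\n" for @trimmed;   # delimited, then the URL
print "whole:   $_\n" for @whole;     # (delimited), then the URL
```

Which form is preferable probably depends on the data: for URLs that end in punctuation such as "/" the \b version would cut that trailing character off.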

Re: substring extraction
by suaveant (Parson) on Jan 03, 2006 at 22:08 UTC
    pretty easy with a regex....
    my $string = 'cesi';
    while ($istr =~ /(\S*$string\S*)/gi) {
        print "$1\n";
    }
    Not tested, but it should work. The i modifier does case-insensitive matching, and the g modifier matches more than once, allowing the loop to catch all occurrences. Lowercasing the string ahead of time may help the speed, especially if you want the output to be lowercase (though you probably don't if you have things like URLs).
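A minimal sketch of the same idea wrapped in a reusable sub, assuming the search term should be treated as literal text rather than as a pattern. The sub name words_containing and the sample XML line are hypothetical. \Q...\E quotes any regex metacharacters in the term, which matters if it contains characters such as "." or "?".

```perl
use strict;
use warnings;

# Return every whitespace-delimited "word" in $text that contains $needle,
# matched case-insensitively, with the original case preserved.
# Call it in list context; in scalar context m//g behaves differently.
sub words_containing {
    my ($text, $needle) = @_;
    # \Q...\E makes the search term match literally.
    return $text =~ /(\S*\Q$needle\E\S*)/gi;
}

# Hypothetical one-line XML, standing in for the file contents.
my $line = '<item> <url> http://www.CESI.example/page </url> <desc> cesium clock </desc> </item>';

print "$_\n" for words_containing($line, 'cesi');
```

Because the match is not lowercased first, the URL comes back with its original capitalization.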

    If you want to know the location of the word in the source string, the special arrays @- and @+ should come in handy.
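A short sketch of how @- and @+ could be used here, with a made-up sample string. After a successful match, $-[1] holds the start offset of the first capture group and $+[1] holds the offset just past its end, so the length of the word is $+[1] - $-[1].

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = 'one cesium two CESIUM-133 three';

my @hits;
while ($text =~ /(\S*cesi\S*)/gi) {
    # Record the word, its start offset, and the offset just past its end.
    push @hits, [ $1, $-[1], $+[1] ];
}

printf "%-12s start=%d end=%d\n", @$_ for @hits;
# cesium       start=4 end=10
# CESIUM-133   start=15 end=25
```

The offsets can be fed straight back into substr, for example substr($text, $start, $end - $start), which replaces the fixed 20 and 100 in the original code.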

                    - Ant

      Perfect; that was the missing piece. I knew that I most likely had to use a regex, but that is admittedly a weak point for me. This does just what I am looking for.

      Thanks for the help and the rapid reply.