Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a large email inbox file with some useful data in it. I would like to extract the next 140 characters (excluding whitespace and newlines) after a keyword, e.g. "START".

Two years ago I did this with a character-by-character loop, testing for the word, stripping out lots of text, and then getting rid of the whitespace.

Now, however, I'm thinking that there must be a way of doing this with a regex. Could anyone give me a pointer in the right direction?

Thanks

Re: Extracting text after a keyword
by Zaxo (Archbishop) on Jul 02, 2002 at 09:10 UTC

    Putting aside the mail reading bits, the regex for that would be: /START(.{140})/ with modifiers to fit your data format (which could have used a tighter description here). If you have several keywords, you might be better served by Parse::RecDescent.

    Update: Having seen your data now, you might be better off splitting on whitespace. You may also be able to exploit the structure of five-character groups. The data you show doesn't seem to fit the description in the root node; am I missing something?
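
    On the "modifiers to fit your data format" point, one modifier that matters for multi-line data is /s. A minimal sketch, assuming a made-up $text in place of the real mailbox: without /s the dot does not match a newline, so a 140-character capture cannot span lines.

```perl
use strict;
use warnings;

# Hypothetical sample; the real mailbox text will differ.
my $text = "junk START" . ("abcde\n" x 40);

# Without /s, "." stops at the first newline, so 140 characters
# cannot be matched across lines and the match fails.
my ($plain) = $text =~ /START(.{140})/;

# With /s, "." also matches newlines, so the capture spans lines.
my ($multi) = $text =~ /START(.{140})/s;

print defined $plain ? "plain matched\n" : "plain failed\n";
print "captured ", length($multi), " characters (newlines included)\n";
```

    Note that the /s capture still counts whitespace and newlines towards the 140, so it does not by itself meet the "excluding whitespace" requirement.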

    After Compline,
    Zaxo

      Thanks. I'll have a play with those. The data format is a group of numbers/letters/punctuation. It looks something like this, but may be set out on any number of lines. In some circumstances a letter appears in place of the '/'.
      UFOFH 10504 91001 /0600 6036/ 7049/ 8055/ 9065/ 0068/ 1073/ 2075/ 3076/ 4076/ 5073/ 6079/ 7073/ 8068/ 9064/ 0060/ 1046/ 2042/ 3040/ 0037/ 1035/ 2035/ 3035/ 4031/ 5030/
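
      For data shaped like the sample above, the "split on whitespace" suggestion might look roughly like this. It is only a sketch: $data is a shortened copy of the sample, with line breaks added by hand to mimic the "any number of lines" layout.

```perl
use strict;
use warnings;

# A shortened copy of the sample from the post above, broken across
# lines to mimic data that may be set out on any number of lines.
my $data = "UFOFH 10504 91001 /0600\n6036/ 7049/\n8055/ 9065/";

# split ' ' (a single-space string) splits on any run of whitespace,
# newlines included, and ignores leading whitespace.
my @groups = split ' ', $data;

print scalar(@groups), " groups\n";
print join('', @groups), "\n";   # groups rejoined without whitespace
```

      Once the groups are in an array, taking a fixed number of them, or rejoining them and using substr, avoids counting whitespace at all.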
Re: Extracting text after a keyword
by amphiplex (Monk) on Jul 02, 2002 at 09:12 UTC
    If you can afford to read the entire file before processing it, this would be a simple solution:

    use strict;

    my $keyword = "START";
    my $length  = 20;
    my $file    = <<EOF;
    something
    START text1 text2
    ........................
    text 3 text 4
    START more text
    still more text
    EOF

    $file =~ s/(?:\s+|\n+)//gc;
    my @hits = $file =~ /$keyword(.{$length})/g;
    for (@hits) {
        print $_."\n";
    }

    ---- kurt
      Just a little note on efficiency...

      use Benchmark;

      sub a { $_ = "this is a\ntest\n"; s/(?:\s+|\n+)//gc; }
      sub b { $_ = "this is a\ntest\n"; s/\s+//gc; }
      sub c { $_ = "this is a\ntest\n"; tr/\n\r\t //d; }

      timethese(250000,{ a => \&a, b => \&b, c => \&c });

      Benchmark: timing 250000 iterations of a, b, c...
               a:  5 wallclock secs ( 4.18 usr +  0.01 sys =  4.19 CPU) @ 59665.87/s (n=250000)
               b:  1 wallclock secs ( 1.61 usr +  0.03 sys =  1.64 CPU) @ 152439.02/s (n=250000)
               c:  1 wallclock secs ( 0.66 usr +  0.02 sys =  0.68 CPU) @ 367647.06/s (n=250000)

      All of these have the same effect on the string...

      For one, \s already includes \n, so the right-hand alternative never does anything useful, but I believe it still gets tried whenever \s+ fails to match at a position. In most cases a character class would be better for that, but that is really irrelevant here, since stripping single characters is much faster with tr///d.

      Just thought I would point it out...
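
      Putting the two ideas together, here is a rough sketch of the slurp-and-strip approach using tr///d instead of the substitution. The $file contents are made up, and \Q...\E is my own addition in case the keyword ever contains regex metacharacters.

```perl
use strict;
use warnings;

my $keyword = "START";
my $length  = 20;

# Hypothetical input standing in for the slurped file.
my $file = "something START text1 text2\n....... text 3\nSTART more text\nstill more text\n";

# Delete newline, carriage return, tab and space in one pass.
# Note: unlike \s, this list does not include form feed.
(my $squeezed = $file) =~ tr/\n\r\t //d;

# \Q...\E quotes any regex metacharacters in the keyword.
my @hits = $squeezed =~ /\Q$keyword\E(.{$length})/g;

print "$_\n" for @hits;
```

      One thing to watch: after stripping, a second keyword that falls inside the captured span of the first will not be reported as a separate hit.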

                      - Ant
                      - Some of my best work - (1 2 3)

Re: Extracting text after a keyword
by Juerd (Abbot) on Jul 02, 2002 at 09:15 UTC

    I would like to extract, the next 140 characters (excluding whitespace and newlines) after a keyword eg. "START".

    my ($foo) = $mailbox =~ /START((?:\s*\S){140})/;
    This matches START followed by 140 non-whitespace characters, each of which can have any amount of whitespace in front of it.
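
    The capture still contains the whitespace that was interleaved with those 140 characters, so it needs one more pass to remove it. A minimal sketch, assuming a made-up $mailbox and a hypothetical $clean variable:

```perl
use strict;
use warnings;

# Hypothetical mailbox fragment; 150 non-whitespace characters
# follow START before the trailer text.
my $mailbox = "header START " . join("\n", ("abcde fghij") x 15) . " trailer";

# In list context a failed match returns an empty list, so the
# assignment evaluates to 0 and the "or die" branch runs.
my ($foo) = $mailbox =~ /START((?:\s*\S){140})/
    or die "keyword not found, or fewer than 140 characters follow it\n";

# Strip the whitespace that was matched along the way.
(my $clean = $foo) =~ tr/ \t\r\n//d;

print length($clean), "\n";
```

    This works without modifying or copying the whole mailbox first, which may matter for a large inbox file.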

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

Re: Extracting text after a keyword
by Abigail-II (Bishop) on Jul 02, 2002 at 12:30 UTC

    Why a regex? For extracting fixed strings, substr() is the handiest way. To find the index from where to extract, we'd use index().

    So, combining them gives us:

    my $data   = "... whatever ...";
    my $key    = "START";
    my $length = 140;
    my $text   = substr $data, index ($data, $key) + length ($key), $length;
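
    One caveat with this approach: index returns -1 when the key is not found, so a guard is worth adding. A small sketch with a made-up $data and a shortened $length; $pos is a hypothetical helper variable.

```perl
use strict;
use warnings;

my $data   = "... whatever ... START payload follows here";
my $key    = "START";
my $length = 7;   # shortened from 140 to suit the sample string

# index returns -1 when the key is absent; without this check the
# substr offset would silently become length($key) - 1.
my $pos = index $data, $key;
die "keyword '$key' not found\n" if $pos < 0;

my $text = substr $data, $pos + length($key), $length;
print "[$text]\n";
```

    Note that substr counts whitespace like any other character, so to meet the "excluding whitespace" requirement the data would need its whitespace removed first (for example with tr///d).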