Extracting text after a keyword

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extracting text after a keyword by Zaxo (Archbishop) on Jul 02, 2002 at 09:10 UTC
Putting aside the mail-reading bits, the regex for that would be: `/START(.{140})/` with modifiers to fit your data format (which could have used a tighter description here). If you have several keywords, you might be better served by Parse::RecDescent. Update: Having seen your data now, you might be better off splitting on whitespace. You may also be able to exploit the structure of five-character groups. The data you show doesn't seem to fit the description in the root node; am I missing something? After Compline, Zaxo
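A minimal sketch of the split-on-whitespace idea, assuming the keyword appears as a standalone whitespace-delimited token. The helper name `groups_after` and the sample string are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $count whitespace-separated groups
# that follow the first standalone occurrence of $keyword.
sub groups_after {
    my ($text, $keyword, $count) = @_;
    my @tokens = split ' ', $text;      # split on any run of whitespace
    for my $i (0 .. $#tokens) {
        next unless $tokens[$i] eq $keyword;
        my $last = $i + $count;
        $last = $#tokens if $last > $#tokens;   # fewer groups than requested
        return @tokens[$i + 1 .. $last];
    }
    return;                             # keyword not found: empty list
}

# Assumed sample, laid out over two lines like the real data might be.
my $data = "UFOFH 10504 91001\n/0600 6036/ 7049/";
print join(' ', groups_after($data, 'UFOFH', 3)), "\n";   # 10504 91001 /0600
```

Because the split discards line breaks, it does not matter how many lines the groups are spread over.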
Re: Re: Extracting text after a keyword by Anonymous Monk on Jul 02, 2002 at 09:16 UTC
Thanks. I'll have a play with those. The data format is a group of numbers/letters/punctuation. It looks something like this, but may be set out on any number of lines. In some circumstances the '/' is a letter instead. `UFOFH 10504 91001 /0600 6036/ 7049/ 8055/ 9065/ 0068/ 1073/ 2075/ 3076/ 4076/ 5073/ 6079/ 7073/ 8068/ 9064/ 0060/ 1046/ 2042/ 3040/ 0037/ 1035/ 2035/ 3035/ 4031/ 5030/`
Re: Extracting text after a keyword by amphiplex (Monk) on Jul 02, 2002 at 09:12 UTC
If you can afford to read the entire file before processing it, this would be a simple solution: `use strict; my $keyword = "START"; my $length = 20; my $file = <<EOF; something START text1 text2 ........................ text 3 text 4 START more text still more text EOF $file =~ s/(?:\s+|\n+)//gc; my @hits = $file =~ /$keyword(.{$length})/g; for (@hits) { print $_."\n"; }` ---- kurt
Re: Re: Extracting text after a keyword by suaveant (Parson) on Jul 02, 2002 at 13:44 UTC
Just a little note on efficiency... `use Benchmark; sub a { $_ = "this is a\ntest\n"; s/(?:\s+|\n+)//gc; } sub b { $_ = "this is a\ntest\n"; s/\s+//gc; } sub c { $_ = "this is a\ntest\n"; tr/\n\r\t //d; } timethese(250000,{ a => \&a, b => \&b, c => \&c });` `Benchmark: timing 250000 iterations of a, b, c... a: 5 wallclock secs ( 4.18 usr + 0.01 sys = 4.19 CPU) @ 59665.87/s (n=250000) b: 1 wallclock secs ( 1.61 usr + 0.03 sys = 1.64 CPU) @ 152439.02/s (n=250000) c: 1 wallclock secs ( 0.66 usr + 0.02 sys = 0.68 CPU) @ 367647.06/s (n=250000)` All of these have the same effect on the string. For one thing, \s already includes \n, so the right-hand alternative never does anything useful, but I believe it still gets tried at every position where the left part fails. In most cases a character class would be better than an alternation for that, but that is really beside the point, since stripping single characters is much faster with tr///d. Just thought I would point it out... - Ant
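The claim that all three forms leave the same string can be checked directly. This is a small sketch using the benchmark's own test string; the variable names are made up for the example, and the `/c` modifier is dropped because it has no effect on a substitution:

```perl
use strict;
use warnings;

my $input = "this is a\ntest\n";

# Apply each stripping technique to its own copy of the input.
(my $alt   = $input) =~ s/(?:\s+|\n+)//g;   # alternation
(my $class = $input) =~ s/\s+//g;           # plain \s+
(my $tr    = $input) =~ tr/\n\r\t //d;      # transliteration with delete

print "all three agree: $tr\n" if $alt eq $class and $class eq $tr;
```

Note that tr/\n\r\t //d removes only the four characters listed, whereas \s also matches others such as form feed, so the two can differ on input that contains them.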
Re: Extracting text after a keyword by Juerd (Abbot) on Jul 02, 2002 at 09:15 UTC
"I would like to extract the next 140 characters (excluding whitespace and newlines) after a keyword, e.g. "START"." `my ($foo) = $mailbox =~ /START((?:\s*\S){140})/;` This matches START followed by 140 non-whitespace characters, each of which can have any amount of whitespace in front of it. - Yes, I reinvent wheels. - Spam: Visit eurotraQ.
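The capture from that pattern still contains whatever whitespace sat between the matched characters, so a second step is needed if only the 140 characters themselves are wanted. A minimal sketch, assuming a hypothetical helper named `chars_after` and a short sample string in place of the real mailbox:

```perl
use strict;
use warnings;

# Hypothetical helper: capture $n non-whitespace characters after $keyword,
# then strip the whitespace the capture picked up along the way.
sub chars_after {
    my ($text, $keyword, $n) = @_;
    my ($chunk) = $text =~ /\Q$keyword\E((?:\s*\S){$n})/
        or return;                      # keyword missing or too few characters
    $chunk =~ s/\s+//g;
    return $chunk;
}

# Assumed sample; the real call would pass 140 instead of 5.
print chars_after("xx START ab c\nde fgh", 'START', 5), "\n";   # abcde
```

The `\Q...\E` around the keyword is an addition so that a keyword containing regex metacharacters is matched literally.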
Re: Extracting text after a keyword by Abigail-II (Bishop) on Jul 02, 2002 at 12:30 UTC
Why a regex? For extracting fixed strings, `substr()` is the handiest way. To find the index from where to extract, we'd use `index()`. So, combining them gives us: `my $data = "... whatever ..."; my $key = "START"; my $length = 140; my $text = substr $data, index ($data, $key) + length ($key), $length;`
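One thing to watch with the one-liner: `index()` returns -1 when the key is absent, so the offset becomes `length($key) - 1` and `substr()` quietly returns text from near the start of the data. A sketch with that case guarded, using a hypothetical helper name `text_after`:

```perl
use strict;
use warnings;

# Hypothetical helper: index/substr extraction with a check for a missing key.
sub text_after {
    my ($data, $key, $length) = @_;
    my $pos = index($data, $key);
    return if $pos < 0;                 # index() returns -1 when not found
    return substr($data, $pos + length($key), $length);
}

my $text = text_after("header START payload follows", 'START', 8);
print "[$text]\n";                      # [ payload]
```

Unlike the regex versions, this counts whitespace toward the length, so the data would need its whitespace stripped first if the 140 characters are meant to exclude it.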