Peeling Data with Reserved Characters and Long Lines

PerlReader has asked for the wisdom of the Perl Monks concerning the following question:

I've had difficulty peeling data from a flat file. I initially thought I could grep it from the command line, but grep urped all over the very long lines, and Perl is more flexible anyway if I need to add functions later. The information I'm seeking uses some reserved characters that need to be escaped in grep.

Here's what I'm trying to do:

1. Open a text file with a mucky-thick block of text (almost no line feeds).

2. Search for occurrences of a string that begins with -/ (dash slash) that variably contains between 5 and 20 alphanumeric characters and which ends with ?srt= (question mark srt equals).

Here are examples:

...emable-Stuff-10100-PTZ-/1280640AB018292?srt=More2stuff&ha...

1280640AB018292

...wer-Idaptx-SJ10-/35DE4715844?srt=L12_Defa43Dom..

35DE4715844

3. Write the alphanumeric strings to a text file, each occurrence on a new line.

Thanks for your accumulated (and accumulating) wisdom.

Re: Peeling Data with Reserved Characters and Long Lines by davido (Cardinal) on Mar 12, 2011 at 09:40 UTC
Something like this? `use strict; use warnings; use autodie qw/:all/; open my $outfile, '>', 'filename.txt'; while (<>) { if ( m#-/(\w+)\?srt=# ) { print $outfile $1, "\n"; } } close $outfile;`
Update: Rats, the preceding example assumes one match per line. That's not a certainty though. How about this: `use strict; use warnings; use autodie qw/:all/; open my $outfile, '>', 'filename.txt'; while ( <> ) { while( m#-/(\w+)\?srt=#g ) { print $outfile $1, "\n"; } } close $outfile;`
There's another nifty way too. If you don't really care about newlines as record separators, why not call the '?srt=' your record separator instead? In that case, it would look like this: `use strict; use warnings; use autodie qw/:all/; open my $outfile, '>', 'filename.txt'; { local $/ = '?srt='; while( <> ) { chomp; if( m#-/(\w+)$# ) { print $outfile $1, "\n"; } } } close $outfile;`
Dave
Re^2: Peeling Data with Reserved Characters and Long Lines by roboticus (Chancellor) on Mar 12, 2011 at 13:03 UTC
davido: I really liked your third method, but you didn't use the 5-20 character bit of the specification: `$ cat t.pl #!/usr/bin/perl use strict; use warnings; local $/ = '?srt='; while (<DATA>) { chomp; print "$1\n" if m[-/(\w{5,20})$]; } __DATA__ ...emable-Stuff-10100-PTZ-/1280640AB018292?srt=More2stuff&ha... ...wer-Idaptx-SJ10-/35DE4715844?srt=L12_Defa43Dom.. foo-/a?srt=bar;foo-/abcdefghijklmnopqrstuvwxyz?srt= foo-/abcde?srt=bar-/abcdef?srt=baz-/abcdefg?srt= $ perl t.pl 1280640AB018292 35DE4715844 abcde abcdef abcdefg`
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re: Peeling Data with Reserved Characters and Long Lines by Eliya (Vicar) on Mar 12, 2011 at 09:37 UTC
`use 5.010; while (<>) { while (m#-/(.*?)\?srt=#g) { say $1; } }`
(Because this is reading the input line by line, it's assuming the items to search don't cross line boundaries. In case your data isn't all that huge, you could of course also read it as one record instead, by setting `$/` to `undef`.) The `while (m#...#g)` allows more than one search item to occur on one line.
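The "read it as one record" idea mentioned above can be sketched roughly as follows. This is only an illustration: the helper names `slurp` and `extract_ids` are made up for the example, and the character class assumes "alphanumeric" means ASCII letters and digits with the 5 to 20 length limit from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return every ID found in an already-loaded string.
sub extract_ids {
    my ($text) = @_;
    # In list context, a //g match returns all captured groups at once.
    return $text =~ m{-/([A-Za-z0-9]{5,20})\?srt=}g;
}

# Hypothetical helper: read a whole file into one string ("slurp" mode).
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;               # undef record separator: one read returns everything
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Demo on the two sample fragments from the question.
my $sample = "...emable-Stuff-10100-PTZ-/1280640AB018292?srt=More2stuff&ha...\n"
           . "...wer-Idaptx-SJ10-/35DE4715844?srt=L12_Defa43Dom..";
print "$_\n" for extract_ids($sample);
# prints:
# 1280640AB018292
# 35DE4715844
```

For a real file, something like `print "$_\n" for extract_ids(slurp($path));` would apply, with `$path` supplied by the caller. Slurping trades memory for simplicity, which seems reasonable as long as the file fits comfortably in RAM.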
Re: Peeling Data with Reserved Characters and Long Lines by JavaFan (Canon) on Mar 12, 2011 at 12:48 UTC
Something like this (untested), reading the file in 1 MB chunks: `use 5.010; use autodie; my $CHUNK_SIZE = 1024 * 1024; open my $fh, "<", $filename; open my $out, ">", $outfilename; my $buffer = ""; while (read $fh, my $chunk, $CHUNK_SIZE) { $buffer .= $chunk; while ($buffer =~ m{-/([a-zA-Z0-9]{5,20})\?srt=}g) { say $out $1; substr($buffer, 0, pos($buffer)) = ""; } substr($buffer, 0, -26) = ""; }`
The final `substr` keeps the last 26 characters of the buffer, so a match that is split across two chunks is not lost.
Re: Peeling Data with Reserved Characters and Long Lines by PerlReader (Initiate) on Mar 12, 2011 at 16:00 UTC
You guys are fast. I've been working through the first examples. Starting from the suggestions, I wrote several scripts (I did variants with $& for the match and also one that uses substring), but... the "real world" files include lines that are 12,000 characters long, often with no word breaks. When I run the scripts on a test file with short lines, they work. When I try them on the dense-text file, the output comes up blank. Am I hitting a 2048-character limit? Or is it the lack of word breaks? Maybe the more recent posts will help. I'll read them and check back.
Re^2: Peeling Data with Reserved Characters and Long Lines by Eliya (Vicar) on Mar 12, 2011 at 20:18 UTC
There definitely is no 2048-character limit, and lines with 12,000 characters aren't exactly long, with machines having several gigs of RAM these days. Also, a lack of word breaks shouldn't matter either, as your match pattern is independent of them. So it must be something else... Are your real world files maybe UTF-16 encoded, or some such?
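One rough way to check the UTF-16 guess is to look at the raw bytes. The sketch below is a heuristic only, not a real encoding detector: the helper name `looks_like_utf16` and the 30% NUL-byte threshold are assumptions made for illustration, and it does not distinguish UTF-16 from UTF-32.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: guess from raw bytes whether data looks like UTF-16.
sub looks_like_utf16 {
    my ($bytes) = @_;
    return 'UTF-16LE (BOM)' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE (BOM)' if $bytes =~ /\A\xFE\xFF/;
    # Without a BOM, mostly-ASCII text stored as UTF-16 has a NUL
    # in roughly every other byte. The 0.3 cutoff is an arbitrary assumption.
    my $nuls = ($bytes =~ tr/\x00//);
    return 'probably UTF-16 (many NUL bytes)'
        if length($bytes) && $nuls / length($bytes) > 0.3;
    return '';
}

# Demo: the same text as raw UTF-16LE bytes (no BOM).
my $raw = encode('UTF-16LE', '-/35DE4715844?srt=');
print looks_like_utf16($raw), "\n";

# A byte-level pattern no longer matches, because a NUL byte now sits
# between "-" and "/". This is why the scripts produced no output.
print "byte-level match: ", ($raw =~ m{-/} ? "yes" : "no"), "\n";
```

On a real file, the first few kilobytes read from a handle opened with `binmode` would be enough input for such a check.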
Re: Peeling Data with Reserved Characters and Long Lines by PerlReader (Initiate) on Mar 13, 2011 at 00:23 UTC
That was it! Turns out they're UTF-16 encoded. Hadn't thought of that. I saved a test file in Roman and one in Latin; the scripts worked on both. I don't yet know if the specific data that has to be matched loses info if I convert to Roman/Latin, but at least I'm on a better path. Thanks.
Re^2: Peeling Data with Reserved Characters and Long Lines by Eliya (Vicar) on Mar 13, 2011 at 01:48 UTC
"I don't yet know if the specific data that has to be matched loses info if I convert to Roman/Latin"
You can tell Perl the file is encoded in UTF-16, so it will decode it properly. This way you won't lose anything. E.g. `my $infile = shift @ARGV; open my $fh, "<:encoding(UTF-16)", $infile or die $!; while (<$fh>) { ...`
(In case the file has no BOM, you might need to use `encoding(UTF-16LE)` instead of `encoding(UTF-16)`.)
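Putting the decoding layer together with the earlier match loop might look roughly like this. It is a sketch under assumptions: `ids_from_file` is a hypothetical helper name, the demo writes its own small UTF-16LE test file (no BOM) rather than using real data, and the pattern again assumes ASCII letters and digits, 5 to 20 long.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read $path through a decoding layer and return the IDs.
sub ids_from_file {
    my ($path, $encoding) = @_;
    open my $in, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    my @ids;
    while (my $line = <$in>) {
        # //g in list context yields every capture on this line (or nothing).
        push @ids, $line =~ m{-/([A-Za-z0-9]{5,20})\?srt=}g;
    }
    close $in;
    return @ids;
}

# Demo: create a small UTF-16LE file, then extract from it.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-16LE)';
print {$tmp_fh} "...emable-Stuff-10100-PTZ-/1280640AB018292?srt=More2stuff&ha...\n";
close $tmp_fh;

print "$_\n" for ids_from_file($tmp_path, 'UTF-16LE');
```

Writing the results to a text file, one per line, would then only need an output handle in place of the final `print`.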