Stripping page headers

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(Ovid) Re: Stripping page headers by Ovid (Cardinal) on Sep 27, 2000 at 21:06 UTC
The following assumptions are made:

- Header is info, newline, info, two newlines.
- Column headings are to be retained.
- Header is to be stripped and the remainder written to a file.

```perl
#!/usr/bin/perl -w
use strict;

my $file;
open DATA, "<somefile.dat" or die "Can't open somefile for reading: $!\n";
{
    local $/;    # slurp mode; $/ is restored when the block ends
    $file = <DATA>;
}
$file =~ s/^[^\n]+\n[^\n]+\n\n//;
close DATA;

open NEWFILE, ">newfile.dat" or die "Can't open newfile for writing: $!\n";
print NEWFILE $file;
close NEWFILE;
```

If the header consists of an unknown number of lines terminated by two newlines, change the regex to the following (the /s modifier lets . match newlines, so the match can span several header lines): `$file =~ s/.*?\n\n//s;`

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just go to the link and check out our stats.
(jcwren) RE: Stripping page headers by jcwren (Prior) on Sep 27, 2000 at 21:12 UTC
I think we need a little more data here. Is the page actually formatted as you describe, or is the header on one line, and each line of data on a line by itself? Something more like this:

```
User Report
All Users

User Name    Default Login Name  Shell Name  Token ID    Last Login
Bob Smith    bsmith              bash        0000123456  05/03/2000
Ralph Jones  rjones              csh         0000123444  05/04/2000
```

Is the data paged periodically, like every 66 lines, or after the initial headers, does it run continuously? Are there any other inconsistencies we should know about?

(Do you have a project specification, with delivery dates, cost estimates, environmental impact statements, and EPA approval? <G>)

--Chris
e-mail jcwren
RE: Stripping page headers by indigo (Scribe) on Sep 27, 2000 at 21:16 UTC
If you can find two regexes that match the beginning and ending lines of the header, you might try the bistable operator (the range operator `..` in scalar context, also known as the flip-flop operator). Here is a guess as to what those regexes might be: `perl -ne 'print unless /^User/ .. /^->\d/' file > file.new`
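As a rough sketch of the same idea in script form: the sample data below is made up, and the header boundaries (a line starting with "User Report" through the next blank line) are an assumption borrowed from the layout guessed elsewhere in this thread, not something the original poster confirmed. The sub name `strip_headers` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return only the lines that fall outside a header block.
# Assumption: a header starts at "User Report" and ends at the next blank line.
sub strip_headers {
    my @kept;
    for my $line (@_) {
        # In scalar context, .. turns true on a line matching the left
        # pattern and stays true through the line matching the right one.
        next if $line =~ /^User Report/ .. $line =~ /^\s*$/;
        push @kept, $line;
    }
    return @kept;
}

# Made-up sample with the header repeated, as on a paged report.
my @report = (
    "User Report\n", "All Users\n", "\n",
    "User Name    Login\n",
    "Bob Smith    bsmith\n",
    "User Report\n", "All Users\n", "\n",
    "Ralph Jones  rjones\n",
);

print strip_headers(@report);
```

Because the operator switches off again after the closing line, a header that repeats on every page should be dropped each time it appears, without counting lines.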
(Dermot) Re: Stripping page headers by Dermot (Scribe) on Sep 27, 2000 at 21:26 UTC
There are two main approaches to this problem insofar as I can see. The first is to strip out the bits of text that you don't want, leaving the bits that you do want. The second approach is to ignore the bits you don't want and use a regex to match the bits that you do want (the records). I would be inclined to strip out the header using something along the lines of:

```perl
#!/usr/bin/perl -w
use strict;

my $repfile = "sample.rep";
my $report;

undef $/;    # Allows whole file to be slurped
open REPFILE, "<$repfile" or die "Can't open file $repfile: $!\n";
$report = <REPFILE>;    # All file now in report variable
close REPFILE;
# Only do this for reasonably sized
# report files or buy some memory :)

# /m makes ^ match at the start of every line, not only at the start of the file
$report =~ s/^User Report//gm;
$report =~ s/^Other Header Stuff//gm;

print $report;
```

The second approach, building a regex to pick out the data you do want, is left as an exercise for the reader.
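For what it's worth, the second approach might look something like the sketch below. The record shape is an assumption taken from the sample layout guessed at earlier in the thread (some text, a 10-digit token ID, then a mm/dd/yyyy date at the end of the line); the real report may differ. The names `$record_re` and `keep_records` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed record shape: anything, then a 10-digit token ID,
# then a mm/dd/yyyy date at the end of the line.
my $record_re = qr/\b\d{10}\s+\d{2}\/\d{2}\/\d{4}\s*$/;

# Keep only the lines that look like records; everything else is dropped.
sub keep_records {
    return grep { $_ =~ $record_re } @_;
}

# Made-up sample data.
my @report = (
    "User Report\n",
    "All Users\n",
    "\n",
    "User Name    Login   Shell  Token ID    Last Login\n",
    "Bob Smith    bsmith  bash   0000123456  05/03/2000\n",
    "Ralph Jones  rjones  csh    0000123444  05/04/2000\n",
);

print keep_records(@report);
```

One consequence of matching records rather than stripping headers is that the column headings are dropped as well, so they would need a pattern of their own if they are to be kept.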
RE: Re: Stripping page headers by Anonymous Monk on Sep 27, 2000 at 23:40 UTC
I really like this approach, but the problem is that it doesn't appear to be doing anything. The file I get out of it is identical to the original when compared. Where could the error be? (I copied it practically verbatim.) Thanks!
(Dermot) RE: RE: Re: Stripping page headers by Dermot (Scribe) on Sep 27, 2000 at 23:53 UTC
If you're getting the same output as the input, it means the substitution is not happening. Post the s/// that you are using and the file you are running it on. One possible problem would be using ^User as the regex when there are spaces before the word User in the file, i.e. spaces between the start of the line (which is what the caret ^ anchors to) and the text. Not sure what else it could be. You could put an if around the substitution to check whether it is happening.
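A minimal sketch of that check, using made-up data in which the header lines are indented. It relies on s/// returning the number of substitutions it made (false when there were none); the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up report text with leading spaces before the header lines.
my $report = "   User Report\n   All Users\n\nBob Smith  bsmith\n";

# The strictly anchored pattern finds nothing because of the indentation.
my $strict_count = ($report =~ s/^User Report\n//gm) || 0;
warn "anchored pattern matched nothing\n" unless $strict_count;

# Allowing optional spaces or tabs after ^ lets the header line match.
my $loose_count = ($report =~ s/^[ \t]*User Report[ \t]*\n//gm) || 0;
print "removed $loose_count header line(s)\n";

print $report;
```

Testing the return value this way makes a silent non-match visible, which is usually quicker than comparing the output file against the input.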
RE: RE: RE: Re: Stripping page headers by Anonymous Monk on Sep 28, 2000 at 00:55 UTC
RE: RE: RE: Re: Stripping page headers by Anonymous Monk on Sep 28, 2000 at 00:53 UTC
RE: RE: RE: Re: Stripping page headers by Anonymous Monk on Sep 28, 2000 at 01:04 UTC
(Dermot) RE: RE: RE: RE: Re: Stripping page headers by Dermot (Scribe) on Sep 28, 2000 at 01:54 UTC
Some notes below your chosen depth have not been shown here
Re: Stripping page headers by mrmick (Curate) on Sep 27, 2000 at 21:20 UTC
If you know how many lines to ignore before hitting records, the following could work:

```perl
my $cnt = 0;
open(FILE, $filename) || die "Cannot open $filename\n$!\n";
while ($cnt < 5) {
    my $header = <FILE>;    # read and discard one header line
    $cnt++;
}
# now we can go through the records...
while (<FILE>) {
    ...
}
close(FILE);
```

Of course, if the number of lines changes, the code will have to change as well. :-(

Mick