Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need to strip the header data out of a file that was created by some reporting software. A page header like the following repeats throughout the file:

User Report
All Users

User Name    Default Login Name     Shell Name
->Token ID         Last Login
Bob Smith     bsmith                             bash
->0000123456     05/03/2000

How can I simply extract just the user records and their info and write them to another file? Thanks!

Replies are listed 'Best First'.
(Ovid) Re: Stripping page headers
by Ovid (Cardinal) on Sep 27, 2000 at 21:06 UTC
    The following assumptions are made:
    • Header is info, newline, info, two newlines
    • Column headings are to be retained.
    • Header is to be stripped and remainder written to a file
    #!/usr/bin/perl -w
    use strict;

    my $file;
    open my $in, '<', 'somefile.dat'
        or die "Can't open somefile.dat for reading: $!\n";
    {
        local $/;    # slurp mode, restored when the block exits
        $file = <$in>;
    }
    close $in;

    # Remove the two header lines and the blank line that follows them
    $file =~ s/^[^\n]+\n[^\n]+\n\n//;

    open my $out, '>', 'newfile.dat'
        or die "Can't open newfile.dat for writing: $!\n";
    print $out $file;
    close $out;
    If the header consists of an unknown number of lines terminated by two newlines, change the regex to the following (the /s modifier lets . match newlines, so the match can span several header lines):
    $file =~ s/^.*?\n\n//s;

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

(jcwren) RE: Stripping page headers
by jcwren (Prior) on Sep 27, 2000 at 21:12 UTC

    I think we need a little more data here. Is the page *actually* formatted as you describe, or is the header on one line, and each line of data on a line by itself? Something more like this:
    User Report
    All Users
    User Name      Default Login Name    Shell Name    Token ID      Last Login
    Bob Smith      bsmith                bash          0000123456    05/03/2000
    Ralph Jones    rjones                csh           0000123444    05/04/2000
    Is the data paged periodically, like every 66 lines, or after the initial headers, does it run continuously? Are there any other inconsistencies we should know about? (Do you have a project specification, with delivery dates, cost estimates, environmental impact statements, and EPA approval? <G>)

    --Chris

RE: Stripping page headers
by indigo (Scribe) on Sep 27, 2000 at 21:16 UTC
    If you can find two regexes that match the first and last lines of the header, you might try the range operator (..), which in scalar context is bistable, like a flip-flop. Here is a guess as to what those regexes might be:
    perl -ne 'print unless /^User Report/ .. /^->Token ID/' file > file.new
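    A minimal sketch of the range-operator approach in script form, assuming each page header runs from a line starting with "User Report" through the "->Token ID" column-heading line, as in the sample in the question. The sub name strip_page_headers, the second page, and the Ralph Jones record are made up for illustration, and dropping blank lines is an extra assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that are not part of a page header.
sub strip_page_headers {
    my @kept;
    for (@_) {
        # In scalar context, .. turns true when the left regex matches and
        # stays true through the line on which the right regex matches.
        next if /^User Report/ .. /^->Token ID/;
        next if /^\s*$/;    # assumption: blank separator lines are unwanted
        push @kept, $_;
    }
    return @kept;
}

# Two pages of made-up sample data, each with its own header.
my @sample = (
    "User Report\n",
    "All Users\n",
    "\n",
    "User Name    Default Login Name     Shell Name\n",
    "->Token ID         Last Login\n",
    "Bob Smith     bsmith                             bash\n",
    "->0000123456     05/03/2000\n",
    "\n",
    "User Report\n",
    "All Users\n",
    "\n",
    "User Name    Default Login Name     Shell Name\n",
    "->Token ID         Last Login\n",
    "Ralph Jones   rjones                             csh\n",
    "->0000123444     05/04/2000\n",
);

print strip_page_headers(@sample);
```

    Unlike Ovid's version, this drops the column headings along with the rest of the header. The range operator keeps its state between lines, so it assumes every header that starts also ends.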
(Dermot) Re: Stripping page headers
by Dermot (Scribe) on Sep 27, 2000 at 21:26 UTC
    There are two main approaches to this problem as far as I can see. The first is to strip out the text you don't want, leaving the text you do want. The second is to ignore what you don't want and use a regex to match what you do want (the records). I would be inclined to strip out the header using something along the lines of:
    #!/usr/bin/perl -w
    use strict;

    my $repfile = "sample.rep";
    undef $/;    # Allows whole file to be slurped
    open REPFILE, "<", $repfile or die "Can't open file $repfile: $!\n";
    my $report = <REPFILE>;    # Whole file is now in $report.
                               # Only do this for reasonably sized
                               # report files or buy some memory :)
    close REPFILE;

    # /m lets ^ match at the start of every line, not just the start of the file
    $report =~ s/^User Report//mg;
    $report =~ s/^Other Header Stuff//mg;
    print $report;
    The second approach, building a regex to extract the data you do want, is left as an exercise for the reader.
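    One possible sketch of that second approach, assuming each record is a detail line followed by a "->" line holding a numeric token ID and an MM/DD/YYYY date, and that the columns on the detail line are separated by two or more spaces. The sub name extract_records and the hash keys are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one hash reference per user record.
sub extract_records {
    my ($report) = @_;
    my @records;
    # Line 1: any line ending in a non-space character (the user details).
    # Line 2: "->", digits (token ID), whitespace, then a date.
    while ($report =~ /^([^\n]*\S)[ \t]*\n->(\d+)[ \t]+(\d\d\/\d\d\/\d{4})[ \t]*$/mg) {
        my ($detail, $token, $last_login) = ($1, $2, $3);
        # Assumption: columns are separated by at least two spaces.
        my ($name, $login, $shell) = split /\s{2,}/, $detail;
        push @records, {
            name       => $name,
            login      => $login,
            shell      => $shell,
            token      => $token,
            last_login => $last_login,
        };
    }
    return @records;
}

my $sample = <<'END';
User Report
All Users

User Name    Default Login Name     Shell Name
->Token ID         Last Login
Bob Smith     bsmith                             bash
->0000123456     05/03/2000
END

for my $r (extract_records($sample)) {
    print join("\t", @{$r}{qw(name login shell token last_login)}), "\n";
}
```

    Because the pattern requires digits after "->", the column-heading pair ("User Name" followed by "->Token ID") is skipped without any header-specific code.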
      I really like this approach, but the problem is that it doesn't appear to be doing anything. The file I get out of it is identical to the original when compared. Where could the error be? (I copied it practically verbatim.) Thanks!
        If you're getting the same output as the input, the substitution is not happening. Post the s/// you are using and the file you are running it on. One possible problem: the regex is ^User, but in the file there are spaces between the start of the line (which the caret, ^, anchors to) and the word User. I'm not sure what else it could be. You could wrap the substitution in an if to check whether it matches at all.
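        A minimal sketch of that check, assuming the header text is "User Report" and may be indented with spaces or tabs. s///g returns the number of substitutions it made (false if none), so the result can be tested directly. The sub name strip_header_lines is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: removes "User Report" lines, tolerating leading
# spaces or tabs, and reports how many lines were removed.
sub strip_header_lines {
    my ($report) = @_;
    # /m makes ^ match at every line start; /g repeats for every header.
    my $count = ($report =~ s/^[ \t]*User Report[ \t]*\n//mg) || 0;
    warn "substitution did not match anything\n" unless $count;
    return ($report, $count);
}

my ($stripped, $count) = strip_header_lines("   User Report\nAll Users\n");
print "removed $count header line(s)\n";
print $stripped;
```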
Re: Stripping page headers
by mrmick (Curate) on Sep 27, 2000 at 21:20 UTC
    If you know how many lines to ignore before hitting records, the following could work:
    my $cnt = 0;
    open(FILE, $filename) || die "Cannot open $filename\n$!\n";
    while ($cnt < 5) {
        my $header = <FILE>;    # read and discard one header line
        $cnt++;                 # without this the loop never ends
    }
    # now we can go through the records...
    while (<FILE>) {
        ...
    }
    close(FILE);
    Of course, if the number of lines changes, the code will have to as well. :-(
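    A sketch of the same idea using Perl's built-in line counter $. instead of a separate counter, assuming the header is always the first five lines, as in the sample in the question. The sub name records_after_header is made up, and the in-memory filehandle just stands in for a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every line after the first $header_lines lines.
sub records_after_header {
    my ($fh, $header_lines) = @_;
    my @records;
    while (my $line = <$fh>) {
        next if $. <= $header_lines;    # $. is the current input line number
        push @records, $line;
    }
    return @records;
}

my $sample = <<'END';
User Report
All Users

User Name    Default Login Name     Shell Name
->Token ID         Last Login
Bob Smith     bsmith                             bash
->0000123456     05/03/2000
END

# Open the string as a file so the example needs no external data.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!\n";
my @records = records_after_header($fh, 5);
close $fh;
print @records;
```

    This only handles a single header at the top of the file; a header that repeats on every page still needs one of the pattern-based approaches.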

    Mick