jping45 has asked for the wisdom of the Perl Monks concerning the following question:

I am brand new to PERL (about 2 weeks since the purchase of O'Reilly 4th Edition Learning Perl) and I am attempting to parse a file. I have figured out how to load the file into arrays, using \n\n as the separator value (parsing the file paragraph by paragraph). I would like to know if I'm going about this the correct way. I need to search the file for a paragraph that starts with "keyword" and a ':', then write that paragraph out to another file. So here's what I have so far (I get the value of KWD from STDIN):
open FILE, "<input.txt" ;
$file=<FILE> ;
@paras=split /\n\n/, $file ;
foreach (@paras) {
    @LINE=split /:/, @paras ;
    if ($LINE[1] = $KWD) {
        print "@paras" ;
    }
}
Any suggestions would be greatly appreciated...

Re: Newbie Question on parsing files
by ikegami (Patriarch) on Mar 27, 2007 at 17:09 UTC

    Bugs:

    • = is the assignment operator, not a comparison operator. Use == to compare numbers. Use eq to compare strings.
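
    To make the difference concrete, here is a minimal sketch. The variable names and values are made up for illustration; they are not from the original code.

```perl
use strict;
use warnings;

my $kwd   = 'apple';     # hypothetical keyword
my $field = 'banana';    # hypothetical field taken from a paragraph

# Assignment: copies $kwd into $copy and yields the (true) value 'apple',
# so the condition holds no matter what $copy contained before.
my $copy = $field;
my $assign_is_true = ( $copy = $kwd ) ? 1 : 0;

# String comparison: true only when the two strings are identical.
my $eq_is_true = ( $field eq $kwd ) ? 1 : 0;

# == compares as numbers, eq compares as strings.
my $num_equal = ( '10' == '10.0' ) ? 1 : 0;
my $str_equal = ( '10' eq '10.0' ) ? 1 : 0;

print "assignment: $assign_is_true, eq: $eq_is_true, copy is now '$copy'\n";
print "numeric: $num_equal, string: $str_equal\n";
```

    Note that the assignment silently overwrote $copy, which is why the original test appeared to match every paragraph.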

    Improvements that can be made:

    • Paragraph mode could be used here. Check out $/.
    • There's no need to read in the whole file into memory.
    • Using 3-arg open when possible is safer. (open(..., '<...') → open(..., '<', '...'))
    • open is rather likely to fail. Check to make sure it succeeds.
    • Using lexicals when possible is safer. (FILE → $fh)
    • Uppercase variable names are usually reserved for constants.

    Resulting code:

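    my $file_name = ...;
    my $kwd       = ...;

    open my $fh, '<', $file_name
        or die("Unable to open file \"$file_name\": $!\n");

    local $/ = "";    # paragraph mode: read one paragraph per iteration
    while (<$fh>) {
        chomp;
        my @fields = split /:/, $_;
        # Index 1 mirrors the original code; $fields[0] holds the text
        # before the first ':'.
        if ($fields[1] eq $kwd) {
            print "$_\n\n";
        }
    }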

    Update: You asked about writing it to another file:

    my $fn_in  = ...;
    my $fn_out = ...;
    my $kwd    = ...;

    open my $fh_in, '<', $fn_in
        or die("Unable to open file \"$fn_in\": $!\n");
    open my $fh_out, '>', $fn_out
        or die("Unable to create file \"$fn_out\": $!\n");

    local $/ = "";
    while (<$fh_in>) {
        chomp;
        my @fields = split /:/, $_;
        if ($fields[1] eq $kwd) {
            print $fh_out "$_\n\n";
        }
    }
Re: Newbie Question on parsing files
by saintly (Scribe) on Mar 27, 2007 at 17:22 UTC
    Congratulations on choosing Perl, the One True Language!

    There are a couple problems with the code you've written that might prevent it from working correctly.

    $file = <FILE>;
    This would only get one line from the file (when you ask for a scalar value (as opposed to a list) from a filehandle, you get back only one line by default). You might modify this to:
    local $/ = undef; $file = <FILE>;
    The '$/' special variable is normally set to '\n', telling Perl to stop reading a file when it hits a newline. By setting it to 'undef', you force Perl to give you the whole file. You could just as easily say:
    $file = join("", <FILE>);
    too.
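
    Here is a small sketch of those three ways of reading. It uses an in-memory filehandle and made-up contents so that it runs without an input.txt on disk; that substitution and the variable names are my assumptions, not part of your program.

```perl
use strict;
use warnings;

# Hypothetical file contents, read through an in-memory filehandle.
my $text = "fruit: apple\nmore about apples\n\nveg: carrot\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $first_line = <$fh>;    # scalar context: one line only
close $fh;

open $fh, '<', \$text or die "Cannot open in-memory file: $!";
# $/ is undefined inside the do block, so the read returns the whole file.
my $whole = do { local $/; <$fh> };
close $fh;

open $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $joined = join '', <$fh>;    # list context: every line, glued together
close $fh;

print "first line: $first_line";
print "slurp and join agree\n" if $whole eq $joined;
```

    Wrapping the local $/ in a do block keeps the change from leaking into the rest of the program.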

    The foreach command sets a single variable to each element of the thing it's 'each'ing through. Since you didn't specify it, it's a bit hidden right now. (It's using the special variable $_). You could make it more clear:
    foreach $this_paragraph (@paras) { .... }
    And then check the value of $this_paragraph to see if it matches your criteria. As an example:
    foreach my $this_paragraph (@paras) {
        my @split_para = split(/:/, $this_paragraph);
        if( $split_para[1] = $KWD ) {
            print $this_paragraph;
        }
    }
    However, this still has some problems. From your description it seems as if you want to search for a keyword followed by a ':', as in 'KEYWORD: some text in the paragraph'. Even if you're searching for ':KEYWORD', this code still has some problems.

    • $split_para[1] = $KWD; # this SETS $split_para[1] to $KWD, and always evaluates to TRUE as long as $KWD was true. You probably want something like '$split_para[1] eq $KWD' instead.
    • If the paragraph doesn't look like 'something:KEYWORD:something....', this test will fail. Perl's lists/arrays start counting at 0 instead of 1, so the first list element is '$split_para[0]'. You could test the first one with '$split_para[0] eq $KWD' if your intended text looks like 'KEYWORD:something ....'
    • It's not really necessary to break up the paragraph into pieces to check if it's got the keyword in it, Perl programmers are very fond of using regular expressions for this sort of thing...
    foreach my $this_paragraph (@paras) {
        if( $this_paragraph =~ /^$KWD:/s ) {
            # This works if the paragraph looks like 'KEYWORD: ...'
            print $this_paragraph;
        }
    }
    If your paragraph looks more like 'something ..... :KEYWORD' instead, then the regular expression needs a little modification:
    foreach my $this_paragraph (@paras) {
        if( $this_paragraph =~ /:$KWD\b/s ) {
            # This works if the paragraph looks like '.... :KEYWORD ...'
            print $this_paragraph;
        }
    }
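
    Since the keyword comes from STDIN, it may contain characters that mean something special in a regular expression. One possible refinement, sketched here with made-up paragraphs and keywords, is to wrap the variable in \Q...\E so it is matched literally:

```perl
use strict;
use warnings;

# Hypothetical paragraphs and keyword, for illustration only.
my @paras = (
    "fruit: apple\nmore about apples\n",
    "veg: carrot\n",
    "note:fruit is also mentioned here\n",
);
my $kwd = 'fruit';

# Paragraphs that begin with 'KEYWORD:'.
my @starts_with = grep { /^\Q$kwd\E:/ } @paras;

# Paragraphs that contain ':KEYWORD' somewhere.
my @after_colon = grep { /:\Q$kwd\E\b/ } @paras;

# Without \Q...\E, the '.' in a keyword matches any character.
my $dotted  = 'a.b';
my $loose   = ( "axb: text" =~ /^$dotted:/ )      ? 1 : 0;
my $literal = ( "axb: text" =~ /^\Q$dotted\E:/ )  ? 1 : 0;

print "starts with keyword: ", scalar(@starts_with), "\n";
print "keyword after colon: ", scalar(@after_colon), "\n";
print "unquoted match: $loose, quoted match: $literal\n";
```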

    update: The others made some good points about rewriting the way you open files. The way you chose would work, but if you get in the habit of doing it that way, it's prone to problems.

    Here are some resources that can help you:
    • perlretut - Regular Expressions tutorial
    • open - Syntax for the 'open' command. Although the examples in the Learning Perl book demonstrate it, there are 'safer' ways of opening files. As the other monks have pointed out, you should usually use scalar filehandles (open my $fh, ....) instead of the older (open FILE, ...), and you should always check the return code from 'open' to make sure it worked.
    • http://www.oreilly.com/catalog/perlbp/ - O'Reilly's 'Perl Best Practices' book. Although this doesn't teach you how to write Perl programs, it's a great book new Perl programmers should read if you plan to write Perl code professionally.
    • http://www.oreilly.com/catalog/pperl3/ - The companion book to 'Learning Perl', 'Programming Perl' is a reference to the built-in Perl commands and the tools and libraries that come with it.
    Although I probably shouldn't mention this, all the cool people write out the name of our beloved language as 'Perl', not 'PERL'.

    Some of the other examples don't seem to be looking at paragraphs (such as the first, which only prints lines).

    Good luck with Perl, and I hope the responses here haven't been too overwhelming! I would suggest experimenting by seeing what happens when you try different alternatives for each line of the program. And of course, I'm sure we all hope you'll come back and ask more questions if you need more help!
Re: Newbie Question on parsing files
by Anno (Deacon) on Mar 27, 2007 at 17:26 UTC
    Your attempt at reading the file into an array won't work. You're only reading the first line (not even a paragraph). Another mistake in the code is that, in your if-statement, you use assignment (=) where you want a string comparison (eq). You should also switch on warnings and strict mode, especially if you're not yet entirely sure what you're doing.

    Perl has a paragraph input mode which is activated by setting the variable $/ to an empty string. (See perlvar for more.) That lets you process a file by paragraphs as you read it.
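
    Paragraph mode is not quite the same as splitting on a literal "\n\n", which is what the original code did. The sketch below illustrates the difference on made-up input with an extra blank line between paragraphs; the helper read_records and the in-memory filehandle are my own scaffolding for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical input with THREE newlines between the two paragraphs.
my $text = "fruit: apple\n\n\nveg: carrot\n";

# Read all records from $text using the given input record separator.
sub read_records {
    my ($separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @para_mode = read_records('');        # paragraph mode
my @literal   = read_records("\n\n");    # literal two-newline separator

# Paragraph mode swallows the extra newline; the literal separator leaves
# it at the front of the second record, which would defeat a /^KEYWORD:/ test.
my $para_clean    = ( $para_mode[1] =~ /^veg:/ ) ? 1 : 0;
my $literal_clean = ( $literal[1]   =~ /^veg:/ ) ? 1 : 0;

print "paragraph mode starts cleanly: $para_clean\n";
print "literal separator starts cleanly: $literal_clean\n";
```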

    Here is a revised version of your code (untested):

    my $KWD = shift;
    my $file = 'input.txt';
    open my $in, '<', $file or die "Can't read '$file': $!";
    local $/ = '';
    while ( my $para = <$in> ) {
        my ( $first) = split /:/, $para;
        print $para if $first eq $KWD;
    }
    I have kept your approach using split() to isolate the part before ":", but a regular expression simplifies things:
    /^$KWD:/ and print while <$in>;
    Anno
      man you guys ROCK!!! thanks for all the input and reference links.....
Re: Newbie Question on parsing files
by GrandFather (Saint) on Mar 27, 2007 at 22:12 UTC

    One of the best pieces of general advice we can give you is use strictures: use strict; use warnings;. They often pick up issues before they become issues.

    A cute trick that Perl can do is use a different "line terminator" than the default by setting the $/ special variable. So you can:

    my @paras = do {local $/ = "\n\n"; <FILE>};

    which locally (to the do) overrides the line end terminator to be "\n\n", then uses <FILE> in list context (assigning to my @paras imposes list context) to "slurp" the file a paragraph at a time into @paras.

    You can then use grep to filter out all the paragraphs that don't start with the keyword:

    my @wanted = grep {/^\Q$KWD:\E/} @paras;

    which uses a regular expression (see perlretut for a start) to check that the paragraph starts with the required key word. Now you can print the result out:

    print OUTFILE @wanted;
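
    Putting those three steps together, a runnable sketch might look like the following. It swaps the bareword FILE and OUTFILE handles for lexical ones and uses in-memory filehandles with made-up contents so it can run anywhere; those substitutions are assumptions for the demonstration, and in real use you would open actual file names instead.

```perl
use strict;
use warnings;

# Hypothetical input and keyword.
my $input = "fruit: apple\nmore about apples\n\nveg: carrot\n\nfruit: pear\n";
my $KWD   = 'fruit';

# Slurp the input a paragraph at a time.
open my $in, '<', \$input or die "Cannot read input: $!";
my @paras = do { local $/ = "\n\n"; <$in> };
close $in;

# Keep only the paragraphs that start with 'KEYWORD:'.
my @wanted = grep { /^\Q$KWD\E:/ } @paras;

# Write the matches out; $output stands in for the output file.
my $output = '';
open my $out, '>', \$output or die "Cannot write output: $!";
print {$out} @wanted;
close $out or die "Cannot close output: $!";

print $output;
```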

    DWIM is Perl's answer to Gödel