Re: Identifying unmatched data in a database

Hello ardibehest, and welcome to the Monastery!

As LanX says, once you’ve opened a file you need to access it via the filehandle, not the filename. In addition, you should get into the habit of using the 3-argument form of open, using lexical variables instead of barewords for filehandles, calling close on the filehandle when the file is no longer being used, and always checking open and close for success or failure:

#!/usr/bin/perl
use strict;
use warnings;

use constant
{
    DATA_FILE      => 'try.txt',
    MATCHED_FILE   => 'matched.txt',
    UNMATCHED_FILE => 'unmatched.txt',
};

open(my $input,     '<', DATA_FILE)
    or die "Cannot open file '" . DATA_FILE      . "' for reading: $!";
open(my $matched,   '>', MATCHED_FILE)
    or die "Cannot open file '" . MATCHED_FILE   . "' for writing: $!";
open(my $unmatched, '>', UNMATCHED_FILE)
    or die "Cannot open file '" . UNMATCHED_FILE . "' for writing: $!";

while (my $line = <$input>)
{
    chomp $line;

    my ($left, $right) = split /=/, $line;
    my  @left_array    = split / /, $left;
    my  @right_array   = split / /, $right;

    my  $left_count    = scalar @left_array;
    my  $right_count   = scalar @right_array;

    if ($left_count == $right_count)
    {
        print $matched "$line\n";
    }
    else
    {
        my $diff = abs($left_count - $right_count);
        print $unmatched $line, "($diff) \n";
    }
}

close $unmatched or die "Cannot close file '" . UNMATCHED_FILE . "': $!";
close $matched   or die "Cannot close file '" . MATCHED_FILE   . "': $!";
close $input     or die "Cannot close file '" . DATA_FILE      . "': $!";

Some notes:

Your line my @right_array = split / /, $right: ends in a colon, not a semicolon. This is a syntax error.
Likewise, the line my $diff = abs (left_count - $right_count) lacks both a terminating semicolon and a $ sigil (prefix) on the variable $left_count.
Your use of chomp is correct, but I think it’s better practice to chomp first thing in the loop and so keep the contents of $line the same throughout the loop body. (This will be less confusing for the maintenance programmer — likely you! — trying to understand the code some time in the future.)

It’s great that you use strict and warnings. Here are some additional pragmas you will find handy:

diagnostics: Useful for understanding error and warning messages.
constant: See the above code.
autodie: A convenient alternative to explicitly testing for success on file open and close.
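
To illustrate the autodie entry, here is a minimal sketch of how it can stand in for the explicit "or die" checks. The temporary directory and the file name no_such_file.txt are hypothetical and exist only for this demonstration; the eval is there so the script can report the exception rather than terminate.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;                      # open and close now throw exceptions on failure
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical path: the directory is freshly created, so the file cannot exist.
my $dir     = tempdir(CLEANUP => 1);
my $missing = File::Spec->catfile($dir, 'no_such_file.txt');

# No "or die" is needed: autodie raises the error for us.
my $ok = eval {
    open(my $fh, '<', $missing);
    close $fh;
    1;
};
my $error = $@;

# The exception is an autodie::exception object that stringifies to a message.
print "open failed as expected: $error\n" unless $ok;
```

Without the eval, the script would simply die at the open with a message naming the file and the reason, which is usually what a short script wants.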

As you are new to Perl, be sure to check out chromatic’s book Modern Perl, which is available for free download at http://modernperlbooks.com/mt/index.html.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Identifying unmatched data in a database
by AppleFritter (Vicar) on Jun 29, 2014 at 10:08 UTC

Most esteemed prior, a humble pilgrim seeks to barge in and benefit from your wisdom.

you should get into the habit of [...] calling close on the filehandle when the file is no longer being used, and always checking open and close for success or failure

I hear your advice, but I don't understand why these are good habits.

I always regarded calling close as superfluous, unless I was either a) going to open the same file again (perhaps with different parameters, perhaps not), or b) concerned that the system itself would run out of file descriptors for open files (or perhaps c) opening pipes rather than files, but I've never done that). What does explicitly calling close -- much less at the very end of a script, with no further code following it -- accomplish?

On the same note, while it's of course always a good idea to check for errors, what would an inability to close a file signify for the script? I assume that the worst that could happen is that the file remains open; if close is not necessary to begin with, as above, this would not be a problem (since not calling close would leave the file open, anyway). Even if an explicit close is advisable, I'd expect that failure would at most warrant a warning in most situations.

But I'm not an experienced monk of Perl. Please enlighten me, brother!


Re^3: Identifying unmatched data in a database

by hippo (Archbishop) on Jun 29, 2014 at 11:16 UTC

What does explicitly calling close -- much less at the very end of a script, with no further code following it -- accomplish?

The short answer is: very little. However we all know that code has a tendency to both grow and propagate over time. It may be that later on either you or someone else will add a whole heap of extra processing onto the end of your script at which point it would be prudent to close the file first. By having the close in there anyway you (or the other programmer) can add their code after it without even thinking about what other housekeeping might be advisable.

While you are smart enough to consider limitations on the number of file descriptors, some other programmer who cargo-cults your routine into a massive loop over N different filehandles may not. So, the close does no harm and helps to avoid potential problems later.

while it's of course always a good idea to check for errors, what would an inability to close a file signify for the script?

More usually it is not the closing of the file per se which it is desirable to test, but rather the implicit flush and/or lock release. A failure of those may have serious consequences for data integrity so it is as well to inform the user of such a failure. Whether that constitutes a fatal error would depend on the context.

All just my opinion, of course.


Re^4: Identifying unmatched data in a database

by soonix (Chancellor) on Jun 29, 2014 at 20:34 UTC

the implicit flush and/or lock release

I think this is the most important part. If you work with Windows (or cross-platform), having a file open for reading means that no one (including the same process) is allowed to rename or delete that file.
Closing early is probably seldom necessary, but the lack of it makes e.g. PDF::API2::Simple fail one of its tests on Windows...


Re^3: Identifying unmatched data in a database

by Athanasius (Archbishop) on Jun 29, 2014 at 14:57 UTC

Hello AppleFritter,

I don’t have much to add to hippo’s excellent answer. But:

I always regarded calling close as superfluous, unless I was either a) going to open the same file again (perhaps with different parameters, perhaps not), ...

Well, according to close:

You don't have to close FILEHANDLE if you are immediately going to do another open on it, because open closes it for you. ... However, an explicit close on an input file resets the line counter ($.), while the implicit close done by open does not.
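
The difference described in that passage can be seen with a small sketch. It assumes a throwaway three-line file created with File::Temp; the helper count_lines and the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway three-line input file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "line $_\n" for 1 .. 3;
close $tmp or die "Cannot close '$path': $!";

# Read a handle to end-of-file and report the line counter afterwards.
sub count_lines {
    my ($fh) = @_;
    while (my $line = <$fh>) { }
    return $.;
}

open(my $fh, '<', $path) or die "Cannot open '$path': $!";
my $first = count_lines($fh);                 # 3

# Re-open the same handle WITHOUT closing it: open closes it implicitly,
# and the line counter keeps running.
open($fh, '<', $path) or die "Cannot open '$path': $!";
my $after_implicit = count_lines($fh);        # 6

# An explicit close resets the counter before the next open.
close $fh or die "Cannot close '$path': $!";
open($fh, '<', $path) or die "Cannot open '$path': $!";
my $after_explicit = count_lines($fh);        # 3
close $fh or die "Cannot close '$path': $!";

print "first read: $first, implicit close: $after_implicit, ",
      "explicit close: $after_explicit\n";
```

This matters mainly when $. is used in error messages: without the explicit close, the reported line numbers would be offset by the length of the previous read.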

But the real issue for me is (usually) not whether the file is closed, but whether errors are detected. A well-written programme should make it clear:

that an error has occurred, so that data is not silently corrupted; and
where it occurred, to focus the programmer’s attention on the real problem and facilitate the debugging process.

If a file error occurs after a successful call to open, an explicit close may be the best, or only, location in the code where it can be detected. This is so even when using autodie. For example (assuming the file “fred.txt” does not exist):

 0:50 >perl -wE "open(my $fh, '<', 'fred.txt'); close $fh;"

 0:51 >perl -Mautodie -wE "open(my $fh, '<', 'fred.txt'); close $fh;"
Can't open 'fred.txt' for reading: 'No such file or directory' at -e line 1

 0:51 >perl -wE "open(my $fh, '<', 'fred.txt'); use autodie; close $fh;"
Can't close(GLOB(0x3bbb68)) filehandle: 'Bad file descriptor' at -e line 1

 0:51 >perl -wE "open(my $fh, '<', 'fred.txt'); use autodie;"

 0:52 >

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^3: Identifying unmatched data in a database

by Laurent_R (Canon) on Jun 29, 2014 at 15:24 UTC

AppleFritter

Granted, Perl does its best to do what you mean. Among other things, this means it will flush the write buffers and close the file when the filehandle goes out of scope or when the program completes. So, most of the time, explicitly closing a filehandle seems unnecessary. But I still think it is good practice to explicitly close your filehandles (in Perl and in other languages that automatically close filehandles on exit) because:

- It makes your intent clearer to the chap who will have to maintain your code (and we all know that, six months from now, that chap may be you or me);

- The earlier a file is closed, the earlier the resources associated with it are freed;

- The earlier a file is closed, the smaller the risk of using it wrongly;

- The earlier an output filehandle is closed, the earlier the written file is in a stable form. If your program crashes violently, it might not be able to flush the write buffer to the file and close it properly before aborting. If the file was closed cleanly beforehand, everything is fine.
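
The last point can be made concrete with a short sketch. It assumes a temporary file from File::Temp and a write small enough to stay in Perl's output buffer; the variable names are illustrative, and the exact size after closing depends on the platform's line endings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($out, $path) = tempfile(UNLINK => 1);

# A small write normally stays in the output buffer, not on disk.
print {$out} "some data\n";
my $size_before = (-s $path) || 0;   # typically 0: nothing flushed yet

# close flushes the buffer; checking it catches a failed flush.
close $out or die "Cannot close '$path': $!";
my $size_after = (-s $path) || 0;    # now non-zero

print "before close: $size_before bytes, after close: $size_after bytes\n";
```

If the program were killed between the print and the close, the file on disk could still be empty, which is the situation an early explicit close avoids.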

For these reasons, I (almost) always explicitly close my files, especially the output files, as soon as I no longer need them.

Having said that, I must admit that I usually don't test the result of the close function.

Edit:

I had not seen Athanasius's answer when I wrote mine. I might not have answered if I had seen it.


Re^2: Identifying unmatched data in a database
by ardibehest (Novice) on Jun 29, 2014 at 11:07 UTC

Many thanks Athanasius for his ever so kind help and above all the comments on my perl code. The program he wrote is easy to read and above all easy to follow. I ran the program and it handled around 70,000 words in less than a minute. I understood the goof-up I had done and in the future, I will ensure that such mistakes don't happen.
