libmonk has asked for the wisdom of the Perl Monks concerning the following question:

<body>

Hello Monks,

Please don't judge me too harshly as a perl newbie but I am a bit lost in the perl forest. I am attempting I/O on a tab delimited file which looks like this:

ISBN OCLC TITLE AUTHOR CALL_NUMBER

853694893 53123369 Transdermal and topical drug delivery from theory to clinical practice / Adrian Williams. "Williams, Adrian, 1963-" RM151 .W55 2003X 2003
471056693 "Pattern classification / Richard O. Duda, Peter E. Hart and David G. Stork." "Duda, Richard O." 006.4; D844; 2001

I've cobbled together code from some old threads but the parser is not pleased. I'm sure there are some elementary mistakes. Would you please help me clean it up?

Thanks,

Libmonk

#!/usr/bin/perl
use strict;
use warnings;
use IO::File;
use LWP::Simple;

## file's structure - tab delimited
# ISBN OCLC TITLE AUTHOR CALL_NUMBER

open (DATAFILE, "input.txt") or die "There is a problem opening the datafile.";
open (NEWFILE, "output.txt") or die "There is a problem opening the output file.";

while (<NEWFILE>) {
chomp;

my @fields = split(/\t/, $line); # splits tab separated fields - replace \t with \, \^ \|, whatever you need

# field names

$isbn = $fields[0];
$ocln = $fields[1];
$title = $fields[2];
$author = $fields[3];
$call_number = $fields[4];

}

#print

foreach $line (<DATAFILE>)
{
$line =~ s/&+/&amp;/;
($isbn, $ocln, $title, $author, $call_number) = split(/\t/, $line);
#$ISBN =~ /(^\d{10,13})/;
#$ISBN = $1;
print NEWFILE "<match>\n";
print NEWFILE "<title>";
print NEWFILE $title, "</title>\n";
print NEWFILE "<isbn>";
print NEWFILE $isbn, "</isbnr>\n";
print NEWFILE "<call_number>";
print NEWFILE $call_number,"</call_number>";
print NEWFILE "</match>\n";
}

close (DATAFILE);
close (NEWFILE);

</body>

Replies are listed 'Best First'.
Re: input tab delimited file
by ELISHEVA (Prior) on Jul 14, 2009 at 16:49 UTC

    First, let's start with what you did right:

    • Provided sample data.
    • use strict; use warnings - very good.

    Now for things to clean this up:

    • Declare all your variables: @fields is declared, but nothing else is.
    • open - you might want to take a look at perlopentut if you haven't seen that already. You need to tell Perl whether the file is for input or output, like this: open(INPUTFILE, '<', "some_filename"); open(OUTPUTFILE, '>', "some_other_filename")
    • Going by your comments and file names, it looks like you have mixed up your input and output sources: you are chomp'ing data from your output file and printing data from the input file.
    • You don't actually need to read the file in line by line if you choose to use for (<DATAFILE>) - Perl will automatically slurp in the entire file and generate an array with one element for each line in your input file - so you can totally get rid of that first while loop for now. However, when you get more advanced, it is usually more efficient and scalable to read files in line by line using a while loop.
    • Formatting at PM: You can preserve the indenting of your code sample by surrounding it in <code> tags. Also you don't need <body> tags for Perl Monks posts.
    • Formatting in your source code (just in case your source code also has all lines flush left - if not, ignore this). Normally we indent a group of statements inside each curly bracket pair so we can see easily which statements are inside the loop and which are outside.
    • Checking for undefined variables: in your sample data, some of your records are missing the call number field. Data files don't always have all the data you expect, so it is usually a good idea to check first to see if any values are in fact undefined before using them.
    • HTML generation cleanup: some of your tags don't match, e.g. isbn/isbnr

    Here is an example of your for loop cleaned up a bit.

    #you can use my to declare things even in the middle of a
    #statement!
    #the for loop slurps in all the lines in DATAFILE,
    #provided you have opened it for reading.
    foreach my $line (<DATAFILE>) {
        $line =~ s/&+/&amp;/;

        # you can also use my (...) to declare many variables
        # at once
        my ($isbn, $ocln, $title, $author, $call_number)
            = split(/\t/, $line);

        # you'll likely have different defaults for cases where
        # fields are undefined
        $isbn='' unless defined($isbn);
        $ocln='' unless defined($ocln);
        $title='' unless defined($title);
        $author='' unless defined($author);
        $call_number='' unless defined($call_number);

        print NEWFILE "<match>\n";
        print NEWFILE "<title>";
        print NEWFILE $title, "</title>\n";
        print NEWFILE "<isbn>";
        print NEWFILE $isbn, "</isbnr>\n"; #typo? isbn*r*
        print NEWFILE "<call_number>";
        print NEWFILE $call_number,"</call_number>";
        print NEWFILE "</match>\n";
    }
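For when the file gets big: here is a rough sketch of the same loop reading one line at a time with while, using the three-argument open and lexical filehandles. The helper name write_matches and the sample records are made up for illustration, and the demo reads from an in-memory string so it runs without any files on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# write_matches is a hypothetical helper name, not from the thread.
# It reads tab-delimited records from $in one line at a time and
# prints one <match> element per record to $out.
sub write_matches {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;    # drop the trailing newline before splitting
        my ($isbn, $ocln, $title, $author, $call_number)
            = split /\t/, $line;

        # Default any field that split did not supply to ''.
        for ($isbn, $ocln, $title, $author, $call_number) {
            $_ = '' unless defined $_;
        }

        print {$out} "<match>\n";
        print {$out} "<title>$title</title>\n";
        print {$out} "<isbn>$isbn</isbn>\n";
        print {$out} "<call_number>$call_number</call_number>\n";
        print {$out} "</match>\n";
    }
    return;
}

# Demo on made-up in-memory data; with real files you would use
# open(my $in, '<', 'input.txt') and open(my $out, '>', 'output.txt').
my $data = "853694893\t53123369\tFirst title\tFirst author\tRM151 .W55\n"
         . "471056693\t\tSecond title\n";
my $result = '';
open(my $in,  '<', \$data)   or die "cannot open input: $!";
open(my $out, '>', \$result) or die "cannot open output: $!";
write_matches($in, $out);
close $in;
close $out;
print $result;
```

Only one line is held in memory at a time, which is why the while form scales better than for (<FH>) on large files.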

    Best, beth

      Wow, Perl is really cool when it works! Thanks for your help!!

      Here's my code:

      #!/usr/bin/perl -w
      use strict;

      open(INPUTFILE, "< input.txt") or die "cannot open file for reading $!";
      open(OUTPUTFILE, "> output.txt") or die "cannot open file for writing $!";

      foreach my $line (<INPUTFILE>)
      {
      $line =~ s/&+/&amp;/;

      # you can also use my (...) to declare many variables
      # at once
      my ($isbn, $ocln, $title, $author, $call_number, $publish_date, $lccn, $request_number)
      = split(/\t/, $line);

      # you'll likely have different defaults for cases where
      # fields are undefined
      $isbn='' unless defined($isbn);
      $ocln='' unless defined($ocln);
      $title='' unless defined($title);
      $author='' unless defined($author);
      $call_number='' unless defined($call_number);
      $publish_date='' unless defined($publish_date);
      $lccn='' unless defined($lccn);
      $request_number='' unless defined($request_number);

      print OUTPUTFILE "<match>\n";
      print OUTPUTFILE "<title>";
      print OUTPUTFILE $title, "</title>\n";
      print OUTPUTFILE "<isbn>";
      print OUTPUTFILE $isbn, "</isbn>\n";
      print OUTPUTFILE "<call_number>";
      print OUTPUTFILE $call_number,"</call_number>";
      print OUTPUTFILE "</match>\n";
      }

      close INPUTFILE;
      close OUTPUTFILE;

      Here's a sample of the output:

      $ tail output.txt
      <isbn>313228841</isbn>
      <call_number>Z8424.D69</call_number></match>
      <match>
      <title>"Blogs, Wikipedia, Second life, and Beyond : from production to produsage / Axel Bruns."</title>
      <isbn>820488674</isbn>
      <call_number>ZA4482 .B78 2008</call_number></match>
      <match>
      <title>"Living standards in the past : new perspectives on well-being in Asia and Europe / edited by Robert C. Allen, Tommy Bengtsson, and Martin Dribe."</title>
      <isbn>199280681</isbn>
      <call_number>zHD7048.L58 2005</call_number></match>

      Thanks again! Now I'm off to build on this beginning ...

      Best,

      Libmonk

        Good luck! When you come back, if you decide to post again, you will save yourself (and other monks) a lot of trouble by putting your code and data inside of <code> .... </code> tags. Check out the Markup in the Monastery page, and when you want to post some code, just type this into the text-input box:
        <code> </code>
        And then paste your code (or data) from a terminal window or editing tool into the blank space between those tags. No need to do special stuff with angle brackets, line breaks, ampersands or any of the stuff that would normally screw up an HTML display. Works like a charm.
        A few points. You will need a chomp($line), which deletes the trailing end-of-line character. Otherwise the trailing \n will wind up at the end of the last token parsed by the split on tabs. The default split (split ' ', which splits on runs of whitespace) doesn't need a chomp, because \n is itself one of the five whitespace characters it splits on (space, \t, \n, \r, \f).

        I don't know if you will need to trim trailing spaces or not. But you should consider the following code...

        #!/usr/bin/perl -w
        use strict;

        my $line = "tok1 \t \t\t tok4\n";
        chomp ($line); #try running without this!
        my @x = my ($tok1, $tok2, $tok3, $tok4) = split(/\t/,$line);
        foreach my $token (@x)
        {
            print "token = $token..\n"; #.. is there to show blanks
        }
        __END__
        prints:
        token = tok1 ..
        token =  ..
        token = ..
        token =  tok4..
        I don't know what $line =~ s/&+/&amp;/; equates to but I think this should be: chomp($line);. I hope that you've come to see the power of multiple variables to the left of the equals sign!! In many languages you have to write a bunch of stuff that essentially means something like thing 3 in the array is a "postal code". In Perl, we can just assign these variables names straight from the "get-go".

        Now we come to the question about "undef" vars resulting from split. You have a lengthy section like $isbn='' unless defined($isbn);.

        Run the above code with this line, adding $tok5:

        my @x = my ($tok1, $tok2, $tok3, $tok4, $tok5) = split(/\t/,$line);
        You will see that you get a runtime warning about an undefined var. "Use of uninitialized value $token in concatenation (.)". This happens in the print and Perl keeps going and this is normally what you would want. You get some info that your database is corrupted and Perl does the best that it can.

        split() will not generate undefs in the middle of the list; if any variable ends up undef, it will be at the end (i.e. not at position 3 or whatever). In the above, $tok5 is undef because we asked for more variables than split() returned. Let's say that you want to detect such undefs and do something on your own.
        Here is one way:

        my @x = split(/\t/,$line);
        die "I don't have enough stuff..need 5 tokens\n" if @x <5;
        my ($tok1, $tok2, $tok3, $tok4, $tok5) = @x;
        We assign whatever split() comes up with to @x. There won't be any "undef" values there. Then we check whether we have enough values to satisfy the variable assignment (@x in scalar context gives the element count); if not, do what you want. This is just an example.

        In general, if some field is completely "MIA" in the DB, is it field 3? 2? I mean, if we are expecting to get 5 things and only get 4, then who knows what is missing, and dealing with that can be very problematic! But split /\t/ will generate "" (an empty string) for a blank field, not an undef value.
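One related wrinkle worth knowing about: when the result goes into an array, split with no limit drops empty fields at the end of the line, so a record whose last columns are blank looks shorter than it is. A small sketch with a made-up record shows this, along with the negative limit that keeps the trailing empties.

```perl
use strict;
use warnings;

# Made-up record: an empty field in the middle and two empty fields at the end.
my $line = "a\t\tc\t\t";

my @default = split /\t/, $line;        # trailing empty fields are dropped
my @keep    = split /\t/, $line, -1;    # a negative limit keeps them

printf "no limit: %d fields\n", scalar @default;    # 3 fields
printf "limit -1: %d fields\n", scalar @keep;       # 5 fields
```

So a field-count check like @x < 5 is more reliable against blank trailing columns when the split is given -1 as its third argument.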

        Good luck and happy Perling! A fantastic language.

        Update: Perl has an operator that I've never seen in any other language, ||=, as in $varA ||= "some text";. This statement means: if $varA evaluates as *logical* false, then "some text" is assigned to it. In Perl, numeric 0, the string "0", undef and "" all mean logical false. In some situations this "logical OR assignment" gizmo is a very nice thing, mainly when dealing with undef or empty text strings.
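A caution on using the logical-or assignment (spelled ||=) for defaults: it also replaces values that are defined but false, such as "0". Perl 5.10 and later have a defined-or assignment, //=, which only fills in undef. The hash and its values below are made up for the comparison.

```perl
use strict;
use warnings;
use 5.010;    # the // and //= operators need Perl 5.10 or later

# Made-up record: a false-but-defined value and a truly undefined one.
my %record = (isbn => '0', author => undef);

my $isbn_or = $record{isbn};
$isbn_or ||= 'unknown';     # '0' is false, so it gets replaced

my $isbn_dor = $record{isbn};
$isbn_dor //= 'unknown';    # '0' is defined, so it is kept

my $author = $record{author};
$author //= 'unknown';      # undef, so the default is used

print "||= gave: $isbn_or\n";
print "//= gave: $isbn_dor\n";
print "author:   $author\n";
```

For data fields where 0 or an empty string is a legitimate value, //= is usually the safer choice of the two.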

Re: input tab delimited file
by ropey (Hermit) on Jul 14, 2009 at 16:24 UTC

    I'm not entirely sure what you're trying to achieve, as you seem to be reading from both files and writing back to one, but hopefully this will put you more on the right track.

    1) I think your copy and pasting has added some things you don't need nor use, this will harm your understanding. Please remove.

    use IO::File;
    use LWP::Simple;

    You really don't seem to need them for this task

    2) What exactly are you doing with your input and output files? They seem to be the wrong way round: you seem to be attempting to read from your output file (even though you haven't opened either file correctly). If you want to read from 'input.txt' and write to 'output.txt' you would do something like this.

    open(INPUTFILE, "< input.txt") or die "cannot open file for reading $!";
    open(OUTPUTFILE, "> output.txt") or die "cannot open file for writing $!";

    ..... code goes here

    close INPUTFILE;
    close OUTPUTFILE;

    Now you are in a position to read from the files, so maybe something like this

    while(my $line = <INPUTFILE>) {
        $line =~ s/&+/&amp;/; # Assuming you have a good reason for this
        # See $line has now been declared, where before it wasn't
        my ($isbn, $ocln, $title, $author, $call_number) = split(/\t/, $line); # See the variables have been declared
        # .. do something here
    }
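On the s/&+/&amp;/ line: since the output is XML-like, the substitution is presumably there to escape ampersands, but as written it only touches the first run of & on the line and collapses "&&" into a single entity. If escaping is the goal, something along these lines may be closer. The helper name xml_escape and the sample title are made up, and this is only a sketch covering &, < and >.

```perl
use strict;
use warnings;

# xml_escape is a hypothetical helper name, not from the thread.
# It escapes the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $title = 'R&D && <testing>';

# The original substitution: first run of ampersands only.
(my $once = $title) =~ s/&+/&amp;/;

print "original s///: $once\n";
print "xml_escape:    ", xml_escape($title), "\n";
```

A CPAN XML-writing module would handle attributes and quoting as well; the sub above is just the minimum for element text.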
Re: input tab delimited file
by moritz (Cardinal) on Jul 14, 2009 at 15:53 UTC
    my @fields = split(/\t/, $line);

    $line is neither declared anywhere, nor was a value assigned to it.

    Writing while (my $line = <NEWFILE>) {... should help.

    (Update: changed wording)

      ...as the strictures would have been only too quick/pleased to point out:-))

      Hmmm, maybe it's a problem with the understanding, by the OP, of the warnings & errors output by the compiler?

      A user level that continues to overstate my experience :-))