
Wow, Perl is really cool when it works! Thanks for your help!!

Here's my code:

#!/usr/bin/perl -w
use strict;

open(INPUTFILE, "< input.txt") or die "cannot open file for reading $!";
open(OUTPUTFILE, "> output.txt") or die "cannot open file for writing $!";

foreach my $line (<INPUTFILE>)
{
    $line =~ s/&+/&amp;/;

    # you can also use my (...) to declare many variables
    # at once
    my ($isbn, $ocln, $title, $author, $call_number, $publish_date, $lccn, $request_number)
        = split(/\t/, $line);

    # you'll likely have different defaults for cases where
    # fields are undefined
    $isbn='' unless defined($isbn);
    $ocln='' unless defined($ocln);
    $title='' unless defined($title);
    $author='' unless defined($author);
    $call_number='' unless defined($call_number);
    $publish_date='' unless defined($publish_date);
    $lccn='' unless defined($lccn);
    $request_number='' unless defined($request_number);

    print OUTPUTFILE "<match>\n";
    print OUTPUTFILE "<title>";
    print OUTPUTFILE $title, "</title>\n";
    print OUTPUTFILE "<isbn>";
    print OUTPUTFILE $isbn, "</isbn>\n";
    print OUTPUTFILE "<call_number>";
    print OUTPUTFILE $call_number,"</call_number>";
    print OUTPUTFILE "</match>\n";
}

close INPUTFILE;
close OUTPUTFILE;

Here's a sample of the output:

$ tail output.txt
<isbn>313228841</isbn>
<call_number>Z8424.D69</call_number></match>
<match>
<title>"Blogs, Wikipedia, Second life, and Beyond : from production to produsage / Axel Bruns."</title>
<isbn>820488674</isbn>
<call_number>ZA4482 .B78 2008</call_number></match>
<match>
<title>"Living standards in the past : new perspectives on well-being in Asia and Europe / edited by Robert C. Allen, Tommy Bengtsson, and Martin Dribe."</title>
<isbn>199280681</isbn>
<call_number>zHD7048.L58 2005</call_number></match>

Thanks again! Now I'm off to build on this beginning ...

Best,

Libmonk

Re^3: input tab delimited file
by graff (Chancellor) on Jul 15, 2009 at 01:33 UTC
    Good luck! When you come back, if you decide to post again, you will save yourself (and other monks) a lot of trouble by putting your code and data inside of <code> .... </code> tags. Check out the Markup in the Monastery page, and when you want to post some code, just type this into the text-input box:
    <code> </code>
    And then paste your code (or data) from a terminal window or editing tool into the blank space between those tags. No need to do special stuff with angle brackets, line breaks, ampersands or any of the stuff that would normally screw up an HTML display. Works like a charm.
Re^3: input tab delimited file
by Marshall (Canon) on Jul 15, 2009 at 09:36 UTC
    A few points. You will need a chomp($line), which deletes the trailing end-of-line character. Otherwise the trailing \n winds up at the end of the last token parsed by the split on tabs. A whitespace split (split ' ', $line, which is also what a bare split does on $_) doesn't need a chomp, because \n is itself one of the whitespace characters (space, \t, \n, \r, \f) and gets consumed as a separator.

    I don't know if you will need to trim trailing spaces or not. But you should consider the following code...

    #!/usr/bin/perl -w
    use strict;

    my $line = "tok1 \t \t\t tok4\n";
    chomp ($line); #try running without this!
    my @x = my ($tok1, $tok2, $tok3, $tok4) = split(/\t/,$line);
    foreach my $token (@x)
    {
        print "token = $token..\n"; #.. is there to show blanks
    }
    __END__
    prints:
    token = tok1 ..
    token =  ..
    token = ..
    token =  tok4..

    I'm not sure what $line =~ s/&+/&amp;/; is meant to do. As written it replaces only the first run of ampersands on the line, and it turns a run like && into a single &amp;. If the idea is to escape every ampersand for the XML output, it needs to be s/&/&amp;/g. Either way, you still need a chomp($line); at that point in the loop. I hope that you've come to see the power of multiple variables to the left of the equals sign!! In many languages you have to write a bunch of stuff that essentially means something like "thing 3 in the array is a postal code". In Perl, we can just give these variables names straight from the "get-go".
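    If that substitution is meant to make the fields safe to drop into XML, a minimal sketch could look like the following. The helper name xml_escape and the sample line are made up for illustration; they are not from the original script, and the sketch only covers &, < and > in element text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): escape the
# characters that are special inside XML element text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first, or it would re-escape the & in &lt; and &gt;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Made-up input line: a title and an ISBN separated by a tab.
my $line = "AT&T && <Bell> Labs\t12345\n";
chomp($line);
my ($title, $isbn) = split(/\t/, $line);

print "<title>", xml_escape($title), "</title>\n";
print "<isbn>",  xml_escape($isbn),  "</isbn>\n";
```

    Escaping each field after the split, rather than the whole line before it, keeps the escaping next to the place where the value is printed.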

    Now we come to the question of "undef" vars resulting from split. You have a lengthy section like $isbn='' unless defined($isbn);.

    Run the above code with this line, adding $tok5:

    my @x = my ($tok1, $tok2, $tok3, $tok4, $tok5) = split(/\t/,$line);
    You will see that you get a runtime warning about an undefined var. "Use of uninitialized value $token in concatenation (.)". This happens in the print and Perl keeps going and this is normally what you would want. You get some info that your database is corrupted and Perl does the best that it can.

    The split() never produces an undef in the middle of the list. Any undef shows up at the end (i.e. not at position 3 or whatever), in variables that had nothing left to receive. In the above, $tok5 is "undef" because we asked for more variables than the number of things returned by the split(). Let's say that you want to detect that situation and do something on your own.
    Here is one way:

    my @x = split(/\t/,$line);
    die "I don't have enough stuff..need 5 tokens\n" if @x <5;
    my ($tok1, $tok2, $tok3, $tok4, $tok5) = @x;

    We assign everything that split() comes up with to @x. There won't be any "undef" values in there. Then we check whether @x holds enough elements to satisfy the $var assignments (in a numeric comparison, @x gives its element count); if not, then do what you want. This is just an example.

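    In general, if some field is completely "MIA" in the DB, is it field 3? Field 2? If we are expecting to get 5 things and only get 4, then who knows which one is missing, and dealing with that can be very problematic! A blank field in the middle of the line is not a problem, though: split /\t/ generates "" (an empty string) for it, not an undef value. The one exception is blank fields at the end of the line, which split drops unless you give it a negative limit, as in split(/\t/, $line, -1).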
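    To make the behaviour at the end of the line concrete, here is a small sketch using a made-up record with five tab-separated fields, the last three of them empty. It compares a plain split with one that passes -1 as the limit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up record: 5 tab-separated fields, the last three empty.
my $line = "tok1\ttok2\t\t\t\n";
chomp($line);

my @default = split(/\t/, $line);       # trailing empty fields are dropped
my @all     = split(/\t/, $line, -1);   # a negative limit keeps them

print "default:  ", scalar(@default), " fields\n";   # 2
print "limit -1: ", scalar(@all),     " fields\n";   # 5

# With the -1 form every variable receives a defined value ("" for blanks),
# so no "uninitialized value" warnings when printing them.
my ($tok1, $tok2, $tok3, $tok4, $tok5) = @all;
print "tok5 is ", (defined $tok5 ? "defined" : "undef"), "\n";
```

    With the limit in place, a count check such as @x < 5 only fires when tabs are really missing from the line, not when the last columns happen to be blank.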

    Good luck and happy Perling! A fantastic language.

    Update: Perl has an operator that I've never seen in any other language, ||=, as in $varA ||= "some text"; This statement means that if $varA evaluates as *logical* false, then "some text" is assigned to it. In Perl, numeric 0, the string "0", undef and "" all mean logical false. In some situations this "logical OR assignment" gizmo is a very nice thing, mainly when dealing with undef or empty text strings.
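    One caveat is that ||= also overwrites a legitimate 0 or "". As a hedged sketch, the following compares it with the defined-or assignment //=, which needs Perl 5.10 or later and only replaces undef. The hash keys and the 'default' string are made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;   # //= is available from Perl 5.10 on

# Made-up sample values covering the interesting cases.
my %or  = (missing => undef, zero => 0, empty => '', set => 'x');
my %dor = %or;   # independent copy for the //= run

for my $k (sort keys %or) {
    $or{$k}  ||= 'default';   # replaces anything that is logically false
    $dor{$k} //= 'default';   # replaces only undef
    print "$k: ||= gives '$or{$k}', //= gives '$dor{$k}'\n";
}
```

    For fields where 0 or an empty string is a valid value, //= (or the longer unless defined(...) form from the original script) is the safer choice.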