UrbanHick has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I am a neophite user of Perl and I am looking to write a quick little app to parse out a simple tab seperated file that was exported from excel. It looks somthing like this:

 

NAME EMAIL OFFICE Ben Wang wang@mycorp.com Boston Warren Smith wsmith@mycorp.com Boston, New_York Emma Sun esun@mycorp.com Shenzhen, Hong_Kong Xiao Long long@mycorp.com Shenzhen George Stephenson stephenson@mycorp.com London, New_York ect...

I would like to parse up this file so It cane return the names and emails of everyone listed with a particular office. These will be used to write office specific emails. I think I should use an hash to do this and use the OFFICES as the keys. So far I have used this to load the data from :

open (EMPLOYEES, "employees.txt"); while ($record =<EMPLOYEES>){ ## I need to know how to pass it into a hash or array here.##

After that it is just a matter of I/O and some looping arguements to for the searching. Thank you for any help you can give me, this has been chewing me red for the last few hours.

-UH

Replies are listed 'Best First'.
Re: Parsing a 3-Column Tab-Deliminated File
by davorg (Chancellor) on Oct 03, 2006 at 08:48 UTC

    Sounds like you're looking for split.

    while (my $record = <EMPLOYEES>) { chomp $record; my @person = split /\t/, $record; # now do something interesting with @person }

    Actually I think I'd use a hash:

    my @cols = qw(name email office); while (my $record = <EMPLOYEES>) { chomp $record; my %person; @person{@cols} = split /\t/, $record; # now do something interesting with %person }

    Oh, and you should really be checking the return value from open and taking appropriate action.

    open (EMPLOYEES, 'employees.txt') or die "Can't open employees.txt: $!\n";
    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Parsing a 3-Column Tab-Deliminated File
by davidrw (Prior) on Oct 03, 2006 at 12:47 UTC
    Maybe overkill, but here's a DBI solution with DBD::CSV:
    use DBI; my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_sep_char=\t") or die "Cann +ot connect: " . $DBI::errstr; $dbh->{'csv_tables'}->{'employees'} = { file => 'employees.txt'}; # now you have your choice of DBI commands. This one gets a ref to a +Array of Hashrefs my $aref = $dbh->selectall_arrayref("select * from employees", {Slice= +>{}}, ); # AoH $dbh->disconnect();
    Or, to go overkill one more step :) wrap that in Class::DBI:
    package My::Employees; use base qw/Class::DBI/; __PACKAGE__->connection("DBI:CSV:f_dir=.;csv_sep_char=\t",); __PACKAGE__->db_Main->{'csv_tables'}->{'employees'} = { file => 'emplo +yees.txt'}; # i'm guessing a little here .. it's a lot easier if the + file doesn't have an extension __PACKAGE__->table('employees'); __PACKAGE__->columns(Primary => qw/NAME/ ); # is it unique?? __PACKAGE__->columns(All => qw/NAME EMAIL OFFICE/); package main; my @all = My::Employees->retrieve_all; my $first = $all[0]; warn $first->EMAIL;
      my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_sep_char=\t") or die "Cannot connect: " . $DBI::errstr;

      I had a hard time trying the DBD::CSV example, until I saw in the documentation that the default value for csv_eol was \015\012 (I'm on Unix, so \n has other value here). When I stuck a csv_eol=\n to that line everything started to work.

      Later I thought it would also be safe for Windows or Mac users to specify the csv_eol explicitly, since \n would map to the correct sequence of characters for the architecture. However, I haven't seen anyone recommending this ("Use always csv_eol=\n, unless working with files from other platforms"), maybe because there's a good reason for not doing so. Is there any?

      Update: Fixed typo spotted by planetscape.

      --
      David Serrano

        Thereby hangs a tale. I inherited DBD::CSV from Jochen Weidman back in 1999 and at that time it already had many users. Jochen had decided on using windows-style eol as the default because the primary intended audience was users of Excel. I left the default as-is in DBD::CSV for backward compatibility but in my DBD::AnyData I changed the default to be "\n" (i.e. the line ending on the platform where the script is being run) which usually works better except when people get files from elsewhere. So, in short, no, always using "\n" is not a bad idea.
Re: Parsing a 3-Column Tab-Deliminated File
by Melly (Chaplain) on Oct 03, 2006 at 09:04 UTC

    Hmm, well I'll give it a go... you say tab seperated, but it's hard to know from the sample whether these have been transformed to spaces or not, so the following should work in either case.

    Personally, although an elegant idea, I wouldn't use a hash here (it would dump you in the deep end with regard to references, etc.).

    my $office = 'boston'; #office to match on my @matched; open(EMPLOYEES, 'employees.txt'); while(<EMPLOYEES>){ chomp; if(/(.*?)\s*(\S*@\S*)\s*(.*$office.*)/i){ push @matched, "$1 $2"; } } print "Employees from $office office:\n"; foreach(@matched){ print "$_\n"; }
    Tom Melly, tom@tomandlu.co.uk
Re: Parsing a 3-Column Tab-Deliminated File
by jdporter (Paladin) on Oct 03, 2006 at 15:00 UTC
    use strict; use warnings; my $office = shift or die "Usage: $0 office\n"; my %h; my $infile = "employees.txt"; open F, "< $infile" or die "read $infile - $!\n"; while (<F>) { chomp; my( $name, $email, $offices ) = split /\t/; $offices or next; $h{$_}{$name} = $email for split /,\s*/, $offices; } close F; print "$_\t$h{$office}{$_}\n" for sort keys %{ $h{$office} };
    We're building the house of the future together.