Parsing a 3-Column Tab-Deliminated File

UrbanHick has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I am a neophite user of Perl and I am looking to write a quick little app to parse out a simple tab seperated file that was exported from excel. It looks somthing like this:

NAME   EMAIL OFFICE 
Ben Wang     wang@mycorp.com     Boston
Warren Smith     wsmith@mycorp.com     Boston, New_York
Emma Sun     esun@mycorp.com      Shenzhen, Hong_Kong
Xiao Long       long@mycorp.com      Shenzhen
George Stephenson       stephenson@mycorp.com       London, New_York
ect...
[download]

I would like to parse up this file so It cane return the names and emails of everyone listed with a particular office. These will be used to write office specific emails. I think I should use an hash to do this and use the OFFICES as the keys. So far I have used this to load the data from :

 open (EMPLOYEES, "employees.txt");
while ($record =<EMPLOYEES>){
## I need to know how to pass it into a hash or array here.##
[download]

After that it is just a matter of I/O and some looping arguements to for the searching. Thank you for any help you can give me, this has been chewing me red for the last few hours.

-UH

Comment on Parsing a 3-Column Tab-Deliminated File Select or Download Code

Replies are listed 'Best First'.
Re: Parsing a 3-Column Tab-Deliminated File by davorg (Chancellor) on Oct 03, 2006 at 08:48 UTC
Sounds like you're looking for split. `while (my $record = <EMPLOYEES>) { chomp $record; my @person = split /\t/, $record; # now do something interesting with @person }` [download] Actually I think I'd use a hash: `my @cols = qw(name email office); while (my $record = <EMPLOYEES>) { chomp $record; my %person; @person{@cols} = split /\t/, $record; # now do something interesting with %person }` [download] Oh, and you should really be checking the return value from `open` and taking appropriate action. `open (EMPLOYEES, 'employees.txt') or die "Can't open employees.txt: $!\n";` [download] -- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg	[reply] [d/l] [select]
Re: Parsing a 3-Column Tab-Deliminated File by davidrw (Prior) on Oct 03, 2006 at 12:47 UTC
Maybe overkill, but here's a DBI solution with DBD::CSV: `use DBI; my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_sep_char=\t") or die "Cann +ot connect: " . $DBI::errstr; $dbh->{'csv_tables'}->{'employees'} = { file => 'employees.txt'}; # now you have your choice of DBI commands. This one gets a ref to a +Array of Hashrefs my $aref = $dbh->selectall_arrayref("select * from employees", {Slice= +>{}}, ); # AoH $dbh->disconnect();` [download] Or, to go overkill one more step :) wrap that in Class::DBI: package My::Employees; use base qw/Class::DBI/; __PACKAGE__->connection("DBI:CSV:f_dir=.;csv_sep_char=\t",); __PACKAGE__->db_Main->{'csv_tables'}->{'employees'} = { file => 'emplo +yees.txt'}; # i'm guessing a little here .. it's a lot easier if the + file doesn't have an extension __PACKAGE__->table('employees'); __PACKAGE__->columns(Primary => qw/NAME/ ); # is it unique?? __PACKAGE__->columns(All => qw/NAME EMAIL OFFICE/); package main; my @all = My::Employees->retrieve_all; my $first = $all[0]; warn $first->EMAIL; [download]	[reply] [d/l] [select]
Re^2: Parsing a 3-Column Tab-Deliminated File by Hue-Bond (Priest) on Oct 03, 2006 at 15:23 UTC
`my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_sep_char=\t") or die "Cannot connect: " . $DBI::errstr;` I had a hard time trying the DBD::CSV example, until I saw in the documentation that the default value for `csv_eol` was `\015\012` (I'm on Unix, so `\n` has other value here). When I stuck a `csv_eol=\n` to that line everything started to work. Later I thought it would also be safe for Windows or Mac users to specify the `csv_eol` explicitly, since `\n` would map to the correct sequence of characters for the architecture. However, I haven't seen anyone recommending this ("Use always `csv_eol=\n`, unless working with files from other platforms"), maybe because there's a good reason for not doing so. Is there any? Update: Fixed typo spotted by planetscape. -- David Serrano	[reply] [d/l]
Re^3: Parsing a 3-Column Tab-Deliminated File by jZed (Prior) on Oct 03, 2006 at 15:50 UTC
Thereby hangs a tale. I inherited DBD::CSV from Jochen Weidman back in 1999 and at that time it already had many users. Jochen had decided on using windows-style eol as the default because the primary intended audience was users of Excel. I left the default as-is in DBD::CSV for backward compatibility but in my DBD::AnyData I changed the default to be "\n" (i.e. the line ending on the platform where the script is being run) which usually works better except when people get files from elsewhere. So, in short, no, always using "\n" is not a bad idea.	[reply]
Re: Parsing a 3-Column Tab-Deliminated File by Melly (Chaplain) on Oct 03, 2006 at 09:04 UTC
Hmm, well I'll give it a go... you say tab seperated, but it's hard to know from the sample whether these have been transformed to spaces or not, so the following should work in either case. Personally, although an elegant idea, I wouldn't use a hash here (it would dump you in the deep end with regard to references, etc.). `my $office = 'boston'; #office to match on my @matched; open(EMPLOYEES, 'employees.txt'); while(<EMPLOYEES>){ chomp; if(/(.?)\s(\S@\S)\s(.$office.*)/i){ push @matched, "$1 $2"; } } print "Employees from $office office:\n"; foreach(@matched){ print "$_\n"; }` [download] Tom Melly, tom@tomandlu.co.uk	[reply] [d/l]
Re: Parsing a 3-Column Tab-Deliminated File by jdporter (Paladin) on Oct 03, 2006 at 15:00 UTC
`use strict; use warnings; my $office = shift or die "Usage: $0 office\n"; my %h; my $infile = "employees.txt"; open F, "< $infile" or die "read $infile - $!\n"; while (<F>) { chomp; my( $name, $email, $offices ) = split /\t/; $offices or next; $h{$_}{$name} = $email for split /,\s*/, $offices; } close F; print "$_\t$h{$office}{$_}\n" for sort keys %{ $h{$office} };` [download] We're building the house of the future together.	[reply] [d/l]