whats the best way to query/extract fields from a tab delimited file

aussieaubs has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: whats the best way to query/extract fields from a tab delimited file by davido (Cardinal) on Oct 31, 2003 at 19:34 UTC
You could do something with split like: `my @headers = split /\t/, <DATA>; while ( my $line = <DATA> ) { chomp $line; next if not $line; # skip blank lines. my ($name, $rank, $serial_num, $iq, $hit_points) = split /\t/, $line; # Do whatever it is you want to with this info... }` [download] I used the <DATA> filehandle simply out of convenience for this example. You'll want to either use <> or open a handle to a particular external file using open. I'm not sure what you meant by "the previous HTML file", so I can't comment on that part of your question. Iterating over 2000 rows in a file to find one column in one row isn't an unreasonable thing to do unless you're going to be doing it so often that the script becomes IO bound. In which case, it would be better to move away from flat files and into the database world. If you need to keep track of what line number you're on in the file, remember that the special variable, `$.` (as described in perlvar) always knows what line number you're looking at in a file. Of course the first line is the header, as you mentioned. If the columns maintain the same dimensions from one line to the next, you can gain some speed efficiency by using unpack or substr instead of split. However, unpack and substr aren't going to allow for variable-width columns. If that's a possibility, split is the first choice. Dave "If I had my life to live over again, I'd be a plumber." -- Albert Einstein	[reply] [d/l] [select]
Re: Re: whats the best way to query/extract fields from a tab delimited file by aussieaubs (Initiate) on Oct 31, 2003 at 20:18 UTC
excellent davido, thanks again for the great info. I love it.... thanks mate aussie....	[reply]
Re: whats the best way to query/extract fields from a tab delimited file by meetraz (Hermit) on Oct 31, 2003 at 19:31 UTC
use split() and the diamond operator: `use strict; my $filename = 'test.tsv'; open (FILE, $filename); my @headers = split(/\t/,<FILE>); while (my $line = <FILE>) { my @fields = split(/\t/,$line); } close (FILE);` [download] you might also want to use chomp(). An (somewhat more complex) alternative is to use DBI and DBD::CSV.	[reply] [d/l]
Re: Re: whats the best way to query/extract fields from a tab delimited file by aussieaubs (Initiate) on Oct 31, 2003 at 19:50 UTC
wow thank you very much for the info I il look into it. another word to put on my list - split and Diamond operator	[reply]
Re: whats the best way to query/extract fields from a tab delimited file by sauoq (Abbot) on Oct 31, 2003 at 20:11 UTC
The Text::CSV module on CPAN might be useful to you. CSV stands for "character separated values" and that's exactly what you have. (Most CSV files have headers too.) If you are going to work with this sort of data a lot, I highly recommend getting the module. If you have a lot of data, you might want to look at Text::CSV_XS as well. It's essentially the same thing but written in C for speed. If all you are ever going to do is print columns, this should help get there: `#!/usr/bin/perl -w use strict; my $datafile = shift; my $want_col = shift; open my $fh, '<', $datafile or die "$datafile: $!"; my $headings = <$fh>; chomp $headings; my @columns = split /\t/, $headings; my %column; @column{ @columns } = 0 .. $#columns; die "Column '$want_col' not in $datafile" unless exists $column{ $want +_col }; while ( <$fh> ) { chomp; print join "\t", (split /\t/)[ $column{ $want_col } ], "\n"; }` [download] You can put that in a file, make it executable, and call it as `script.pl file.dat columnname`. Modify it as needed. -sauoq "My two cents aren't worth a dime.";	[reply] [d/l]
Re: whats the best way to query/extract fields from a tab delimited file by reclaw (Curate) on Oct 31, 2003 at 20:25 UTC
You could also try using AnyData `use AnyData; my $table = adTie('Tab', 'file.tsv', 'r'); # Get all column Names my @col_names = adNames($table); # or get all values in a column my @col_vals = adColumn($table, 'column_name'); # or get a column value from a row print $table->{row_name}->{column_name}; # or search all rows for a value while (my $row = each %$table) { print $row->{column_name} if $row->{row_name} eq 'this'; } undef $table;` [download] Reclaw Your mileage may very.	[reply] [d/l]
Re: whats the best way to query/extract fields from a tab delimited file by jZed (Prior) on Oct 31, 2003 at 21:15 UTC
If you know SQL, then DBD::CSV and DBD::AnyData will both let you treat the tab separated files as databases. I say separated rather than delimited because, technically, that's what the tabs do. "quote delimited, comma separated, semi-colon terminated";	[reply]
Re: Re: whats the best way to query/extract fields from a tab delimited file by aussieaubs (Initiate) on Nov 03, 2003 at 16:21 UTC
wow, thank you everyone for all your tips. These all are definately really good hints that I never thought of. You all truly are truly amazing. cheers and thanks again aussie // aubrey hughes	[reply]
Re: whats the best way to query/extract fields from a tab delimited file by Art_XIV (Hermit) on Oct 31, 2003 at 21:02 UTC
The following is admittedly ugly as sin but it allows you to choose columns and set their order -- just substitute tabs for spaces: use strict; my @cols_to_print = qw(LAST_NAME FIRST_NAME); my @col_nbrs; while (<DATA>) { my $line = $_; if ($. == 1) { #1st line of input? my @col_names = split(/\s+/, $line); for my $x (0..$#cols_to_print) { for my $y (0..$#col_names) { if ($cols_to_print[$x] eq $col_names[$y]) { push @col_nbrs, $y; last; } } } } else { my @cols = split /\s+/, $line; @cols = @cols[@col_nbrs]; #slice into wanted column nbrs print "@cols\n"; } } __DATA__ CUST_ID FIRST_NAME LAST_NAME FAV_LANG 1 Larry Wall Perl 2 Bill Joy Java 3 Richard Stallman emacslisp 4 Bill Gates tinyBasic [download] Would render: `Wall Larry Joy Bill Stallman Richard Gates Bill` [download] This helps if the column selections truly need to be dynamic. @cols_to_print could come from ARGV or other external vars. There might be a more idomatic way of handling the sequence for the first line, but darned if I figure out what it is. Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"	[reply] [d/l] [select]