aussieaubs has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, Im new to perl and I dont know how to word this question properly in 'perl terms' but here goes. I have a flat file (tab delimited) and the first line is column headings and the rows after that are data that match the column headings. I want to selectively pull out columns of data based on what column headings are needed (from the prvious html page) and then print it out. pretty straight forward i thought. i hope you can tell me what areas i need to look at to make it do this efficiently. The file I am parsing is quite large - about 2000 rows. any wisdom is greatly appreciated, thanks Aussie.
  • Comment on whats the best way to query/extract fields from a tab delimited file

Replies are listed 'Best First'.
Re: whats the best way to query/extract fields from a tab delimited file
by davido (Cardinal) on Oct 31, 2003 at 19:34 UTC
    You could do something with split like:

    my @headers = split /\t/, <DATA>; while ( my $line = <DATA> ) { chomp $line; next if not $line; # skip blank lines. my ($name, $rank, $serial_num, $iq, $hit_points) = split /\t/, $line; # Do whatever it is you want to with this info... }

    I used the <DATA> filehandle simply out of convenience for this example. You'll want to either use <> or open a handle to a particular external file using open.

    I'm not sure what you meant by "the previous HTML file", so I can't comment on that part of your question.

    Iterating over 2000 rows in a file to find one column in one row isn't an unreasonable thing to do unless you're going to be doing it so often that the script becomes IO bound. In which case, it would be better to move away from flat files and into the database world.

    If you need to keep track of what line number you're on in the file, remember that the special variable, $. (as described in perlvar) always knows what line number you're looking at in a file. Of course the first line is the header, as you mentioned.

    If the columns maintain the same dimensions from one line to the next, you can gain some speed efficiency by using unpack or substr instead of split. However, unpack and substr aren't going to allow for variable-width columns. If that's a possibility, split is the first choice.


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      excellent davido, thanks again for the great info. I love it.... thanks mate aussie....
Re: whats the best way to query/extract fields from a tab delimited file
by meetraz (Hermit) on Oct 31, 2003 at 19:31 UTC
    use split() and the diamond operator:
    use strict; my $filename = 'test.tsv'; open (FILE, $filename); my @headers = split(/\t/,<FILE>); while (my $line = <FILE>) { my @fields = split(/\t/,$line); } close (FILE);
    you might also want to use chomp().

    An (somewhat more complex) alternative is to use DBI and DBD::CSV.

      wow thank you very much for the info I il look into it. another word to put on my list - split and Diamond operator
Re: whats the best way to query/extract fields from a tab delimited file
by sauoq (Abbot) on Oct 31, 2003 at 20:11 UTC

    The Text::CSV module on CPAN might be useful to you. CSV stands for "character separated values" and that's exactly what you have. (Most CSV files have headers too.) If you are going to work with this sort of data a lot, I highly recommend getting the module. If you have a lot of data, you might want to look at Text::CSV_XS as well. It's essentially the same thing but written in C for speed.

    If all you are ever going to do is print columns, this should help get there:

    #!/usr/bin/perl -w use strict; my $datafile = shift; my $want_col = shift; open my $fh, '<', $datafile or die "$datafile: $!"; my $headings = <$fh>; chomp $headings; my @columns = split /\t/, $headings; my %column; @column{ @columns } = 0 .. $#columns; die "Column '$want_col' not in $datafile" unless exists $column{ $want +_col }; while ( <$fh> ) { chomp; print join "\t", (split /\t/)[ $column{ $want_col } ], "\n"; }

    You can put that in a file, make it executable, and call it as script.pl file.dat columnname. Modify it as needed.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: whats the best way to query/extract fields from a tab delimited file
by reclaw (Curate) on Oct 31, 2003 at 20:25 UTC

    You could also try using AnyData

    use AnyData; my $table = adTie('Tab', 'file.tsv', 'r'); # Get all column Names my @col_names = adNames($table); # or get all values in a column my @col_vals = adColumn($table, 'column_name'); # or get a column value from a row print $table->{row_name}->{column_name}; # or search all rows for a value while (my $row = each %$table) { print $row->{column_name} if $row->{row_name} eq 'this'; } undef $table;

    Reclaw

    Your mileage may very.

Re: whats the best way to query/extract fields from a tab delimited file
by jZed (Prior) on Oct 31, 2003 at 21:15 UTC
    If you know SQL, then DBD::CSV and DBD::AnyData will both let you treat the tab separated files as databases. I say separated rather than delimited because, technically, that's what the tabs do.

    "quote delimited, comma separated, semi-colon terminated";

      wow, thank you everyone for all your tips. These all are definately really good hints that I never thought of. You all truly are truly amazing. cheers and thanks again aussie //
      aubrey hughes
Re: whats the best way to query/extract fields from a tab delimited file
by Art_XIV (Hermit) on Oct 31, 2003 at 21:02 UTC

    The following is admittedly ugly as sin but it allows you to choose columns and set their order -- just substitute tabs for spaces:

    use strict; my @cols_to_print = qw(LAST_NAME FIRST_NAME); my @col_nbrs; while (<DATA>) { my $line = $_; if ($. == 1) { #1st line of input? my @col_names = split(/\s+/, $line); for my $x (0..$#cols_to_print) { for my $y (0..$#col_names) { if ($cols_to_print[$x] eq $col_names[$y]) { push @col_nbrs, $y; last; } } } } else { my @cols = split /\s+/, $line; @cols = @cols[@col_nbrs]; #slice into wanted column nbrs print "@cols\n"; } } __DATA__ CUST_ID FIRST_NAME LAST_NAME FAV_LANG 1 Larry Wall Perl 2 Bill Joy Java 3 Richard Stallman emacslisp 4 Bill Gates tinyBasic
    Would render:
    Wall Larry Joy Bill Stallman Richard Gates Bill

    This helps if the column selections truly need to be dynamic. @cols_to_print could come from ARGV or other external vars.

    There might be a more idomatic way of handling the sequence for the first line, but darned if I figure out what it is.

    Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"