in reply to searching a file, results into an array

The regex is rather simple for this, but I think split would be even better (more efficient) if you are sure the fields will always be separated by a first tab and that any line with a first tab is valid. I put them into an array of arrays, which is more efficient than a hash keyed on doc if you only ever want to read through them sequentially. Just another way to do it....

#!/usr/local/bin/perl -w
use strict;

my @documents;
while (<DATA>) {
    # uncomment following line for the regex way
    # if (/^([\S]*.DOC)\t(.*)/) {push @documents, [$1, $2]}

    # uncomment these to use the split method
    # chomp;
    # next unless (my ($doc, $title)=split /\t/, $_, 2);
    # push @documents, [$doc, $title];
}

print "I found the following docs\n\n";
foreach (@documents) {
    print "Doc: $_->[0] \t Title: $_->[1]\n";
}

__DATA__
RS0029.DOC	INTER UNIT HARNESS REQUIREMENT SPECIFICATION
RS0036.DOC	INSTRUMENT ELECTRONICS UNIT
RS0037.DOC	MECHANISM CONTROL ELECTRONICS
RS0041.DOC	IOU DESCAN MECHANISM
RS0042.DOC	IOU GENERIC MECHANISMS

Note the regex given is a bit fussier than the obvious /(.*)\t(.*)/, which would cause you grief if the title contained a tab. (If you don't know why, read up about greedy pattern matching; it is very important.)
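
To make the greedy point concrete, here is a minimal sketch. The sample line is made up for illustration (it is not from the original data) and deliberately puts a second tab inside the title:

```perl
use strict;
use warnings;

# Hypothetical record: the title itself contains a tab.
my $line = "RS0029.DOC\tINTER UNIT\tHARNESS";

# Greedy: the first (.*) grabs as much as it can, so the match
# breaks at the LAST tab and part of the title ends up in $1.
my ($greedy_doc, $greedy_title) = $line =~ /(.*)\t(.*)/;

# Fussier regex as given above: \S cannot match a tab, so the
# name stops at the FIRST tab and the title keeps its own tab.
my ($doc, $title) = $line =~ /^([\S]*.DOC)\t(.*)/;

print "greedy: [$greedy_doc] [$greedy_title]\n";
print "fussy:  [$doc] [$title]\n";
```

With the greedy pattern the doc field swallows "INTER UNIT" as well; with the fussier one the whole title, tab included, stays in the second field.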

Cheers,
R.

Re^2: searching a file, results into an array
by perlcapt (Pilgrim) on Oct 14, 2004 at 02:58 UTC
    This topic is pretty well worked out, but I want to add my two bits: I like the list of lists in this solution over the hash method, the reason being that the list retains the sequence of records. I prefer a regular expression over a split for this type of format. Reason: there may be other tabs on the line. Since there are no spaces in the filenames (in your example), I would use this:
    ($filename,$description) = ($line =~ m/(^\S+)\s+(.*)/);
    or as given in the referenced comment:
    if($line =~ m/(^\S+)\s+(.*)/) { push @documents, [$1,$2]; }

    Kind of a "me too" comment, I know.

      Hi perlcapt

      The split I was using had the third parameter (the number of parts to split into) set to two. This prevents it eating any tabs beyond the first, so any in the title are no problem. I think it has to remain the preferred option for efficiency as long as the file is all either blank lines or docs and titles separated by a tab.
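
      A quick sketch of what the limit buys you. The sample line is invented for the purpose and has a second tab in the title:

      ```perl
      use strict;
      use warnings;

      # Hypothetical record whose title contains a second tab.
      my $line = "RS0036.DOC\tINSTRUMENT\tELECTRONICS UNIT\n";
      chomp $line;

      # LIMIT of 2: split stops after the first tab, the rest stays intact.
      my ($doc, $title) = split /\t/, $line, 2;

      # No LIMIT: every tab is a separator, so the title is cut short.
      my ($doc_all, $title_all) = split /\t/, $line;

      print "limit 2:  [$doc] [$title]\n";
      print "no limit: [$doc_all] [$title_all]\n";
      ```

      With the limit the title keeps its embedded tab; without it the second variable only gets "INSTRUMENT".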

      In my regex I included the literal .DOC to improve rejection of spurious lines, though of course I am assuming no .XLS or .PPT files. I did make a couple of errors though...

      # I gave
      /^([\S]*.DOC)\t(.*)/
      # the class grouping [] for \S is of course silly and
      # I forgot to escape the . in .DOC
      # this would have been better
      /^(\S*\.DOC)\t(.*)/
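
      For anyone wondering whether the unescaped dot really matters, a small sketch. The bogus line is made up for illustration:

      ```perl
      use strict;
      use warnings;

      # Hypothetical spurious line: no literal ".DOC", just "XDOC".
      my $bogus = "RS0029XDOC\tNOT A REAL DOCUMENT";

      # Unescaped dot matches any character, so the bogus name gets through.
      my $loose = ($bogus =~ /^([\S]*.DOC)\t(.*)/) ? 1 : 0;

      # Escaped dot demands a literal ".", so the line is rejected.
      my $tight = ($bogus =~ /^(\S*\.DOC)\t(.*)/) ? 1 : 0;

      print "unescaped dot matched: $loose\n";
      print "escaped dot matched:   $tight\n";
      ```

      The first pattern accepts the line because the "." happily matches the "X"; the second one turns it away.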

      Cheers,
      R.