Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

i have the following file
blahety blah blah
RS0029.DOC	INTER UNIT HARNESS REQUIREMENT SPECIFICATION
RS0036.DOC	INSTRUMENT ELECTRONICS UNIT
RS0037.DOC	MECHANISM CONTROL ELECTRONICS
RS0041.DOC	IOU DESCAN MECHANISM
RS0042.DOC	IOU GENERIC MECHANISMS
i want to extract the filenames, (rs0036.doc) and the titles. i'd like to push the filenames into one array, and their titles into another. (or a hash if i must, but i'm not very good with them). the format is $filename."\t".$title."\n"

i think i want to search each line for a tab character, and if it has one, push the strings on each side into the arrays. hopefully then $filename3 should be the document titled $title3

thanks. sorry if this is all obvious, i haven't done much regex.

the easy bit will then be to generate a html document with a list of links to the files, with the link text as their title.

matt

Re: searching a file, results into an array
by tmoertel (Chaplain) on Oct 13, 2004 at 14:55 UTC
    Ah, this is classic split territory. Also, hashes were made for precisely this kind of thing, so let's roll that way, too. Now is as good of a time as any to give them a try, right?

    The following code shows one way of doing what you want:

    #!/usr/bin/perl
    use warnings;
    use strict;

    # read the files into a hash of ( filename => title )
    my %files;
    while (<DATA>) {
        chomp;  # get rid of line-ending
        if (my ($file, $title) = split ' ', $_, 2) {
            $files{$file} = $title;
        }
    }

    # print out the files in our hash, sorted by file name
    foreach my $file (sort keys %files) {
        my $title = $files{$file};
        print "$file = $title\n";
    }

    __DATA__
    RS0029.DOC	INTER UNIT HARNESS REQUIREMENT SPECIFICATION
    RS0036.DOC	INSTRUMENT ELECTRONICS UNIT
    RS0037.DOC	MECHANISM CONTROL ELECTRONICS
    RS0041.DOC	IOU DESCAN MECHANISM
    RS0042.DOC	IOU GENERIC MECHANISMS

    (For convenience, I put the list of files in the code's __DATA__ section, but you'll read them from a separate file.)

    The only tricky part is our split invocation, which says, "split lines on whitespace into two parts." If we're successful in splitting the current line, we get back the filename and its title, which we store in the variables $file and $title respectively. These, in turn, we store in a hash called, appropriately enough, %files.
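    To see what that split call does with awkward input, here is a minimal sketch. The sample lines are made up for illustration (they are not from the original file); the point is that ' ' skips leading whitespace, splits on any run of whitespace, and the limit of 2 keeps the rest of the line together as the title.

```perl
use strict;
use warnings;

# Hypothetical sample lines, not taken from the original file.
my @lines = (
    "RS0036.DOC\tINSTRUMENT ELECTRONICS UNIT\n",
    "  RS0041.DOC   IOU DESCAN MECHANISM\n",    # leading and repeated spaces
    "\n",                                        # blank line
);

my @pairs;
for my $line (@lines) {
    chomp $line;
    # ' ' splits on runs of whitespace and ignores leading whitespace;
    # the limit of 2 leaves everything after the first gap in $title.
    my ($file, $title) = split ' ', $line, 2;
    next unless defined $file;    # blank line: split returned nothing
    push @pairs, [$file, $title];
}

print "$_->[0] => $_->[1]\n" for @pairs;
```

    The blank line is skipped because split returns an empty list for it, which is the same reason the if test in the code above is false for empty lines.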

    The foreach loop shows how to read values out of the hash. I just print them out, but I trust that you can convert each into the appropriate hypertext link. Here's the code's output:

    RS0029.DOC = INTER UNIT HARNESS REQUIREMENT SPECIFICATION
    RS0036.DOC = INSTRUMENT ELECTRONICS UNIT
    RS0037.DOC = MECHANISM CONTROL ELECTRONICS
    RS0041.DOC = IOU DESCAN MECHANISM
    RS0042.DOC = IOU GENERIC MECHANISMS

    Cheers,
    Tom

Re: searching a file, results into an array
by Random_Walk (Prior) on Oct 13, 2004 at 15:05 UTC

    The regex is rather simple for this, but I think split would be even better (more efficient) if you are sure they will always be separated by a first tab and any line with a first tab is valid. I put them into an array of arrays, which is more efficient than a hash keyed on doc if you only ever want to read through them sequentially. Just another way to do it....

    #!/usr/local/bin/perl -w
    use strict;

    my @documents;
    while (<DATA>) {
        # uncomment following line for the regex way
        # if (/^([\S]*.DOC)\t(.*)/) {push @documents, [$1, $2]}

        # uncomment these to use the split method
        # chomp;
        # next unless (my ($doc, $title)=split /\t/, $_, 2);
        # push @documents, [$doc, $title];
    }

    print "I found the following docs\n\n";
    foreach (@documents) {
        print "Doc: $_->[0] \t Title: $_->[1]\n";
    }

    __DATA__
    RS0029.DOC	INTER UNIT HARNESS REQUIREMENT SPECIFICATION
    RS0036.DOC	INSTRUMENT ELECTRONICS UNIT
    RS0037.DOC	MECHANISM CONTROL ELECTRONICS
    RS0041.DOC	IOU DESCAN MECHANISM
    RS0042.DOC	IOU GENERIC MECHANISMS

    Note the regex given is a bit more fussy than the obvious /(.*)\t(.*)/ which would cause you grief if the title contained a tab (if you don't know why read up about greedy pattern matching, it is very important)
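    A small sketch of the greedy-matching problem, using a made-up line whose title contains a tab (an assumption for illustration; the sample file has no such line). The "fussy" pattern here is the corrected form with the dot escaped, /^(\S*\.DOC)\t(.*)/.

```perl
use strict;
use warnings;

# Hypothetical line whose title itself contains a tab.
my $line = "RS0036.DOC\tINSTRUMENT\tELECTRONICS UNIT";

# Greedy: the first (.*) grabs as much as it can, so the match
# breaks at the LAST tab on the line, not the first.
my ($greedy_file, $greedy_title) = $line =~ /(.*)\t(.*)/;

# \S cannot match a tab, so the first group must stop before the first tab.
my ($file, $title) = $line =~ /^(\S*\.DOC)\t(.*)/;

print "greedy file: [$greedy_file]\n";   # includes part of the title
print "fussy file:  [$file]\n";          # just the filename
```

    With the greedy pattern the "filename" swallows the first word of the title; with the fussier one the title keeps its embedded tab and the filename is clean.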

    Cheers,
    R.

      This topic is pretty well worked out, but I want to add my 2 bits: I like the list of lists in this solution over the hash method, because the list retains the sequence of records. I also prefer a regular expression over a split for this type of format, since there may be other tabs on the line. Since there are no spaces in the filenames (in your example), I would use this
      ($filename,$description) = ($line =~ m/(^\S+)\s+(.*)/);
      or as given in the referenced comment:
      if($line =~ m/(^\S+)\s+(.*)/) { push @documents, [$1,$2]; }

      Kind of a "me too" comment, I know.

        Hi perlcapt

        The split I was using had the third parameter (the number of parts to split into) set to two. This prevents it eating any tabs beyond the first, so any in the title are no problem. I think it has to remain the preferred option for efficiency as long as the file is all either blank lines or docs and titles separated by a tab.

        In my regex I included the literal .DOC to improve rejection of spurious lines, though of course I am assuming no .XLS or .PPT files. I did make a couple of errors though...

        # I gave
        /^([\S]*.DOC)\t(.*)/
        # the class grouping [] for \S is of course silly and
        # I forgot to escape the . in .DOC
        # this would have been better
        /^(\S*\.DOC)\t(.*)/
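        A quick sketch of what the limit changes, again with a hypothetical line that has a second tab inside the title (not from the sample file):

```perl
use strict;
use warnings;

# Hypothetical line with a second tab inside the title.
my $line = "RS0042.DOC\tIOU GENERIC\tMECHANISMS";

my @unlimited = split /\t/, $line;       # splits at every tab
my @limited   = split /\t/, $line, 2;    # splits at the first tab only

printf "no limit: %d fields\n", scalar @unlimited;
printf "limit 2:  %d fields\n", scalar @limited;

# Assigning an unlimited split to two variables silently drops
# whatever follows the second tab.
my ($doc, $title) = split /\t/, $line;
print "title without limit: [$title]\n";
```

        So the two-variable assignment without a limit does not fail loudly; it just loses the tail of the title, which is why the explicit limit is worth keeping.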

        Cheers,
        R.

Re: searching a file, results into an array
by borisz (Canon) on Oct 13, 2004 at 14:53 UTC
    my %h;
    while ( defined ($_ = <DATA>) ) {
        next if /^\s*$/;
        chomp;
        my @d = split /\t/, $_, 2;
        $h{$d[0]} = $d[1];
    }
    use Data::Dumper;
    print Dumper(\%h);
    __OUTPUT__
    $VAR1 = {
              'RS0042.DOC' => 'IOU GENERIC MECHANISMS',
              'RS0036.DOC' => 'INSTRUMENT ELECTRONICS UNIT',
              'RS0029.DOC' => 'INTER UNIT HARNESS REQUIREMENT SPECIFICATION',
              'RS0041.DOC' => 'IOU DESCAN MECHANISM',
              'RS0037.DOC' => 'MECHANISM CONTROL ELECTRONICS'
            };

    Boris
Re: searching a file, results into an array
by muntfish (Chaplain) on Oct 13, 2004 at 14:53 UTC

    Several ways of doing it:

    ($filename, $title) = split /\t/;

    or

    ($filename, $title) = /^(.*)\t(.*)$/;

    assuming $_ contains each line of the file in turn.

    No reason to be scared of hashes. In your example it's no more difficult to use a hash than to push onto separate arrays. In fact I think it's easier. Having found your filename and title:

    $docTitles{$filename} = $title;

    having declared my %docTitles; first, of course :-)

    Then you can create your HTML by doing something like:

    print "<table>\n";
    for my $doc (sort keys %docTitles) {
        print "<tr><td>$doc</td><td>$docTitles{$doc}</td></tr>\n";
    }
    print "</table>\n";

    (or whatever your markup is gonna look like)
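    Since the original question asked for a list of links with the title as the link text, here is one possible sketch. The escape_html helper is hypothetical (written here for illustration, not from any module), and the sketch assumes the .DOC files sit in the same directory as the generated page, so the bare filename works as the href.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Sample data; in practice this hash is filled by the parsing loop.
my %docTitles = (
    'RS0036.DOC' => 'INSTRUMENT ELECTRONICS UNIT',
    'RS0041.DOC' => 'IOU DESCAN MECHANISM',
);

# Build the list in a string so it can be printed or written to a file.
my $html = "<ul>\n";
for my $doc (sort keys %docTitles) {
    $html .= sprintf qq{<li><a href="%s">%s</a></li>\n},
        escape_html($doc), escape_html($docTitles{$doc});
}
$html .= "</ul>\n";

print $html;
```

    Escaping matters as soon as a title contains something like an ampersand; filenames with spaces or other unusual characters would also need URL-encoding, which this sketch leaves out.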


    s^^unp(;75N=&9I<V@`ack(u,^;s|\(.+\`|"$`$'\"$&\"\)"|ee;/m.+h/&&print$&
Re: searching a file, results into an array
by StrebenMönch (Beadle) on Oct 13, 2004 at 14:55 UTC
    I am not a Perl guru by any means, but you might want to start by looking at Text::TabFile.

    There might be better mods out there on CPAN but at least this is a start.

    *Update -- I must be really slow... I guess this is not a start but, as with all things in perl, one of many ways to do it.
    ------------------------
    StrebenMönch