qhayaal has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I need to read a file into arrays directly so that each column is a separate array, where each column in the original file is separated by some delimiter (in shell: $ cut -d : -f 1 foo ). One way is to loop over the file and use split. But this is time consuming (for a large file). Is there a smarter way of doing the same? Sorry if this is a silly question, but I didn't find a solution either in 'Learning Perl' or through Google.
-Qhayaal

Replies are listed 'Best First'.
Re: Vertical split (ala cut -d:) of a file
by Tanktalus (Canon) on Jan 30, 2005 at 15:40 UTC

    qhayaal,

    It doesn't really matter what happens under the covers; the computer simply must go through the file line by line to find two things: the text you're cutting on (":"), and the newline that signifies the start of the next line. Whether you do this via while(<$fh>) { my ($field) = split /:/; do_stuff_with($field); }, or via Text::xSV, or via my @fields = `cut -d: -f1 $file`, or even my @fields = `awk -F: '{print \$1}' $file`, the computer will go through the file, line by line, inspecting characters. (Note that with the split example, we give it list context - split is really special in that it can "see" how many fields are wanted, and will only split into one more field than that - so it will only look for one ":" in the string, already being as efficient as split can be.)
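
    For what it's worth, here's a minimal sketch of that first approach, assuming a colon-delimited file called foo.txt and that only the first three columns are wanted (the filename and column count are placeholders, not anything from the original question):

        my (@col1, @col2, @col3);
        open my $fh, '<', 'foo.txt' or die "Cannot open foo.txt: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            # a limit of 4 tells split to stop once the fields we care about are out
            my ( $f1, $f2, $f3 ) = split /:/, $line, 4;
            push @col1, $f1;
            push @col2, $f2;
            push @col3, $f3;
        }
        close $fh;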

    If you really think you need the speed, first try it with one of the above (I would recommend one of the first two). If it really is too slow (I doubt it will be), there are some optimisations you can make:

    • Instead of split, use index to find the first ':', and use substr to extract the beginning of the line: my $field = substr($_, 0, index($_, ':')); The speed difference may be enough - although I would never write code this way without first checking whether split is fast enough, since split is so much easier to use, and since we're already using an optimised split. (There's a small sketch of this at the end of this reply.)
    • Instead of reading in an entire line, play with $/. However, this is reserved for really advanced users, IMO. (I think of myself as really advanced, and I would never do this.) The purpose is to get the read operation to scan the input for ':' as your separator for the first field, and then reset $/ back to normal for slurping the rest of the line so you can set it back to ':' for the first field of the next line. This will mean that you're scanning each character only once - whether it's for ':' or for "\n". However, it is also dangerous - if you have a line that has no ':'s, you will read it in with the next line, and get a vastly different answer here. To avoid this, you can look for "\n"s in the input and split on that ... but now we're back to scanning each character twice - once for "\n" and once for ":".
    • And next, the same as the last idea, except use awk to do it. Awk can scan for : and \n both at the same time (looping through each character once). The disadvantages are that you need to spawn another process (some overhead here which may eat up any savings), and that awk programming is harder than perl programming, IMO :-}.
    • Finally, write your own input routine which reads a block of data into memory, and loop through the characters one at a time. This is only for the seriously advanced, though, since it's really easy to get this wrong.
    Now, having said all that, I want to re-iterate: TEST OUT THE SPLIT (or Text::xSV) FIRST. It's probably more than fast enough, with the least amount of effort. Most of the rest of the above suggestions will shave only a fraction of a percent off the time, if they shave anything at all, at the cost of huge amounts of programmer time and a correspondingly large chance of bugs to find and eradicate.
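
    And since I mentioned the index/substr variation above, here's a minimal sketch of it, pulling just the first column, again assuming a colon-delimited foo.txt (a placeholder name) and guarding against lines that have no ':' at all:

        my @first_col;
        open my $fh, '<', 'foo.txt' or die "Cannot open foo.txt: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            my $pos = index( $line, ':' );
            # index returns -1 when there's no ':', which would make substr
            # drop the last character, so keep the whole line in that case
            push @first_col, $pos >= 0 ? substr( $line, 0, $pos ) : $line;
        }
        close $fh;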

Re: Vertical split (ala cut -d:) of a file
by friedo (Prior) on Jan 30, 2005 at 07:44 UTC
Re: Vertical split (ala cut -d:) of a file
by bgreenlee (Friar) on Jan 30, 2005 at 08:48 UTC
    As friedo said, Text::xSV will most definitely do the job, but if the separator character(s) never appear in the actual data (i.e. they are only used as separators), I don't know that you're going to get much faster performance than doing a line-by-line split. It shouldn't be that time consuming, even on a large file (unless you're running on ancient hardware).

    -b

      Thanks friedo and bgreenlee for the replies. I will check out Text::xSV. I was hoping there would be something that's analogous to split itself. Like:
      (@col_1, @col_2, @col_3) = quasi_split /:/, foo
      *sigh* The problem is I have *lots* of files, each with hundreds of lines. If there is no such feature, maybe I can risk making a feature request? I don't know the internals, so I am not sure if I would be blasted for such a request...

        Text::xSV will almost certainly be slower than a straight split. “Hundreds of lines” alone doesn't even come close to stressing Perl, though. How many files do you have? Do you only need specific fields? Adjacent or disparate ones?

        Makeshifts last the longest.

Re: Vertical split (ala cut -d:) of a file
by wazoox (Prior) on Jan 30, 2005 at 15:20 UTC
    If you'd rather use only standard tools, unpack may be a better option than split.
    my (@col1, @col2, @col3);
    my $i = 0;
    while (<FOO>) {
        ( $col1[$i], $col2[$i], $col3[$i] ) = unpack( "A1 A4 A8", $_ );
        $i++;
    }
    You may also dynamically manage the number of columns by putting all of the arrays in a hash.
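
    For instance, here's a rough sketch of that hash-of-arrays idea, assuming the same FOO filehandle and fixed-width template as above (the hash keys are just column indexes I've chosen for illustration):

        my %columns;
        while ( my $line = <FOO> ) {
            my @fields = unpack( "A1 A4 A8", $line );
            # push each field onto the array for its column number
            push @{ $columns{$_} }, $fields[$_] for 0 .. $#fields;
        }
        # $columns{0} now holds a reference to the array for the first column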

      You forgot to mention that this will only work for files with fixed-width fields. Should that actually be the case, though, then indeed unpack is the fastest option.

      Makeshifts last the longest.

Re: Vertical split (ala cut -d:) of a file
by belg4mit (Prior) on Jan 30, 2005 at 12:50 UTC
Re: Vertical split (ala cut -d:) of a file
by bsb (Priest) on Jan 30, 2005 at 22:49 UTC