Perl Apprentice has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, hope you can help?

I'm processing a large, pipe-delimited billing data file. The first field always contains a tag/descriptor, possibly followed by an identifier and a value.

I have experimented and found that index and substr were the best way to strip out the first field. Split was too slow.

An identifier is the number after the underscore, as in "tagname_12".

The value comes after the tag name and can contain numbers, letters, etc.

Each time, I have to strip the field down to the tag name and store the identifier and the value, if present.
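A minimal sketch of the index/substr approach described above, assuming the first field looks like "TAG[_ID][ VALUE]" and that the value is separated from the tag by whitespace (my reading of the sample, not something the data format guarantees). The helper name parse_first_field is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the first pipe-delimited field of a record
# into (tag, identifier, value). Assumed layout: "TAG[_ID][ VALUE]|...".
sub parse_first_field {
    my ($line) = @_;

    # index/substr cut out the first field without scanning the whole record.
    my $end   = index($line, '|');
    my $field = $end >= 0 ? substr($line, 0, $end) : $line;

    # One anchored match: tag name, optional _digits, optional value.
    my ($tag, $id, $value) =
        $field =~ /^\s*(\S+?)(?:_(\d+))?(?:\s+(.*?))?\s*$/
        or return;

    # $id and $value are undef when the field carries no identifier/value.
    return ($tag, $id, $value);
}

my ($tag, $id, $value) = parse_first_field('START_1 123|rest|of|record');
print "tag=$tag id=$id value=$value\n";
```

Only the first field is ever copied, so the cost per record should not depend on how many fields follow it.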

The file I'm experimenting with is about 1 GB; once split was introduced, the process became very slow.

Sample of tags:-

START_1 123| FILE 2222| XXXX| AAAA| NEW | END_1|
Anyway, are there any alternatives to split? Any advice will be welcome! Cheers

update (broquaint): added <code> tags to sample

Replies are listed 'Best First'.
Re: Alternatives to split?
by Abigail-II (Bishop) on Sep 03, 2003 at 09:41 UTC
    Anyway any alternatives for split?

    Didn't you already answer your own question?

    I have experimented and found the use of index and substr was the best way to strip out the first field.

    If index and substr works for you, and is fast enough, why not use it?

    Abigail

Re: Alternatives to split?
by davido (Cardinal) on Sep 03, 2003 at 16:38 UTC
    I may be way out in left field here, but I have a suggestion that may help you to avoid abandoning the Perlish use of split. I'm surprised it hasn't been offered yet. ...maybe it's because I AM in left field. ;)

    If your data is truly pipe delimited, perhaps you should read it in using the pipe as the record separator. Before reading this long data file, set $/ = "|";. Then put your file read into a loop, and be sure to either process each record individually or read the records into an array instead of a scalar. After you're done reading the file, just to retain your sanity, change $/ back to "\n".

    That way the file read itself splits your data on "|". Now you can process each pipe-delimited record on its own as a much smaller, easier-to-use chunk.

    Read the file in pre-split. Come to think of it, this does abandon the Perlish split, in favor of an even more Perlish approach (IMVHO).
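The record-separator idea can be sketched roughly as follows. The sample string and the in-memory filehandle are stand-ins for the real billing file, and local is used in place of manually resetting $/ to "\n" afterwards.

```perl
use strict;
use warnings;

# Stand-in for the real billing file: an in-memory filehandle over sample data.
my $data = "START_1 123|FILE 2222|XXXX|END_1|";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @fields;
{
    # local restores the previous value of $/ when this block exits.
    local $/ = '|';
    while (my $field = <$fh>) {
        chomp $field;    # chomp strips the current $/, here the trailing "|"
        push @fields, $field;
    }
}
close $fh;

print "$_\n" for @fields;
```

One caveat: with $/ set to "|", a newline is no longer a separator, so if the file also has line breaks between records they will stay embedded in the fields and need separate handling.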

    I hope this helps.

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Re: Alternatives to split?
by Perl Apprentice (Initiate) on Sep 03, 2003 at 09:50 UTC
    Really, I just have a need for speed. When simply processing the file, getting some file pointers from tag matches and seeking around the input file, the process flies. As soon as substr and especially split come into play, the process slows down.

    Is this expected?
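One way to check this on your own data is the core Benchmark module. The sketch below is only an illustration under assumed data: the record contents and the 50-field width are made up, and the three candidate names are hypothetical labels.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up record: a tag field followed by 50 filler fields.
my @rest = ('some field data') x 50;
my $line = join '|', 'START_1 123', @rest;

my %candidates = (
    # Assigning to an array forces split to process every field.
    split_all     => sub { my @f = split /\|/, $line; $f[0] },
    # A limit of 2 stops splitting after the first delimiter.
    split_limit_2 => sub { my ($first) = split /\|/, $line, 2; $first },
    # index/substr copy only the first field.
    index_substr  => sub { substr $line, 0, index($line, '|') },
);

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, \%candidates);
```

The full split has to create a scalar for every field in the record, so it would be reasonable to expect its cost to grow with the number of fields, while the other two touch only the first one. Actual numbers depend on the data and the Perl build.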

      I really don't understand what you want to do.
      Do you just want to extract the part up to the first delimiter?
      Try:
      my ($tag) = split(/\|/, $_, 2);
      With a limit of 2, this splits only at the first | rather than at every one.

      Update: Thanks to wirrwarr for correcting me

        You have to escape the "|", otherwise you'll split at each character.
        my ($tag)= split(/\|/,$_,2);
        daniel.
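To make the difference concrete, here is a small sketch comparing the escaped and unescaped patterns on a made-up record (the sample line and variable names are illustrative only).

```perl
use strict;
use warnings;

my $line = 'START_1 123|FILE 2222|XXXX';

# Escaped pipe: split at the first literal "|" only (limit 2).
my ($tag) = split /\|/, $line, 2;

# Unescaped pipe: /|/ is an alternation of two empty patterns, so it
# matches between every character and the first "field" is one character.
my ($oops) = split /|/, $line, 2;

print "escaped:   '$tag'\n";     # 'START_1 123'
print "unescaped: '$oops'\n";    # 'S'
```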