jc.smith3 has asked for the wisdom of the Perl Monks concerning the following question:

UPDATED EXAMPLE... Below is a program that will split a string into an array. Then it scans ARRAY ELEMENT 2 for a '?' then a word 'ekey='. If it finds that stuff, then it will strip out the text after 'ekey=' and add it as a new array element. Finally, it converts the array back to a string.

Problem: I have millions of input records of web data of 550 columns, variable and often null fields, as well as very long fields. I need to scan 10 columns throughout each record looking for 4 different 'scanwords'. If found I always remove the data from the original field, and add the text after the scanword as a new element. Obviously this is very slow.

I have seen something like this on the interwebs: $a=split(/\s+/, $line))[ 3 ] which I assume grabs just a certain column. Could I grab all 10 columns with similar syntax? Would it be faster? How would I update the fields in the original record?

HELP!

#!/usr/bin/perl use strict; use warnings; my $scanword; my $wk1; my $wk2; my $wk3; my $wk4; my $wk5; my $wk6; my @elements; my $i; my $n1; my $inrec1='0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate +=20150210;7;8;9'; my $inrec2='0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=2015021 +0;3;4;5;6;7;8;9'; my $outrec; ##### sub special() { $wk1=''; $wk2=''; $wk3=''; $wk4=''; $wk5=''; $wk6=''; #reset + vars $wk1=$elements[$i]; #field + number to scan $a=index($wk1,"?"); #look +for ? if ( $a != -1 ) { #? was + found #Look for $scanword $wk2=index($wk1,"$scanword"); #find +start of scanword. I.E. "ekey=" $wk5=index($wk1,"=",$wk2); #find +start of "=" $wk3=index($wk1,"&",$wk2); #find +start of next & if ( $wk2 != -1 ) { #found + scanword? if ( $wk3 == -1) { $wk3 = length ($wk1); } #defau +lt to length of string if ampersand not found $wk4 = substr($wk1,$wk2,$wk3-$wk2); #wk1 i +s the field, wk2 is start of scanword, wk3 is the end position $wk6 = substr($wk1,$wk5+1,$wk3-$wk5-1); #wk1 i +s the field, wk5+1 is byte after = in the scanword, wk3 is the end po +sition print STDOUT "array element = $elements[$i] \n"; print STDOUT "Found ? in the field at offset $a \n"; print STDOUT "Found $scanword in the field at offset $wk2 \n" +; print STDOUT "Found end in the field at offset $wk3 \n" +; print STDOUT "Field wk4 = $wk4 \n"; print STDOUT "Field wk6 = $wk6 \n"; $b=$wk3-$wk5-1; #lengt +h to blank substr($elements[$i],$wk5+1,$wk3-$wk5-1) = ' ' x $b; #move +blanks to array element } } } ##### print STDOUT "inrec1=$inrec1 \n"; #print + inrec @elements = split(';', $inrec1, -1); #split + by semicolon, -1 means to keep trailing fields if empty $i=2; #eleme +nt offset $scanword="ekey="; #text +to scan for special(); #call +routine $n1=$wk6; #what +was stripped out push (@elements, $n1); #add t +o end of array $outrec = join(";",@elements); #conve +rt to output record print STDOUT "outrec=$outrec \n"; #print + output record print STDOUT "inrec2=$inrec2 \n"; #print + inrec @elements = split(';', $inrec2, -1); #split + by semicolon, -1 means to keep trailing fields if empty $i=2; #eleme +nt offset $scanword="ekey="; #text +to scan for special(); #call +routine $n1=$wk6; #what +was stripped out push (@elements, $n1); #add t +o end of array $outrec = join(";",@elements); #conve +rt to output record print STDOUT "outrec=$outrec \n"; #print + output record exit; OUTPUT EXAMPLES: ekey is not in elements[2] so do nothing inrec1=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=2015 +0210;7;8;9 outrec=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=2015 +0210;7;8;9; ekey is in elements[2] so blank it in input record and add move follow +ing text BOZOTHECLOWN to a new element in array inrec2=0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4 +;5;6;7;8;9 array element = 2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate +=20150210 Found ? in the field at offset 1 Found ekey= in the field at offset 21 Found end in the field at offset 38 Field wk4 = ekey=BOZOTHECLOWN Field wk6 = BOZOTHECLOWN outrec=0;1;2?link=misc/redirect/ekey= &dmsdate=20150210;3;4 +;5;6;7;8;9;BOZOTHECLOWN

Replies are listed 'Best First'.
Re: split, manipulate, join
by aaron_baugher (Curate) on Apr 10, 2015 at 22:48 UTC

    It's hard to give specific advice without seeing a sample of your input data, but it may be hard for you to show us a sample if your lines are 550 columns wide. Short answer: yes, it's quite likely that there is a faster solution using split and/or regex than what you're doing now with multiple index and substr commands.

    One example to get you started: If you want the six columns numbered 10, 20, 30, 32, 34, and 38 (with the first column numbered 0), and your columns are separated by whitespace, you could do this to get them in an array:

    while(my $line = <$file>){ chomp $line; # split the line on whitespace, then take certain indexed columns my @columns = (split /\s+/, $line)[10,20,30,32,34,38]; do_stuff_with_those_columns(@columns); }

    On the other hand, if you know what you want out of each column, you may be able to skip the step of splitting into an array and go straight to extracting what you need. It just depends on the data. Give us at least a couple lines if you can.

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

      I really can't give any data as it is customer sensitive. Assume I have to search 10 columns for key1= or key2= or key3= or key4=. There is no way to predict what will be after those 4 words. But I have to strip out whatever follows up to a & or reach end of the field. I have to blank the original data where it comes from (just what I am pulling, not whole column) and then append new columns with what I stripped out. I like the idea of the slice syntax. But then how would I go back and update those same columns?
        > I really can't give any data as it is customer sensitive

        please invent non-sensitiv data for your use case.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        Can you make up some random but representative data?

        For example, it sounds a lot like you're parsing a URL for GET parameters, so your data with fake customer data with might be:
        http://www.example.com/random_funky_characters_here.html?name=JoeSmith&address=123UniversalAve&CreditCardNumber=1234567890&kneecapstate=broken

        my $inrec='0;1;2?foo=duh;3;4;5;6;7;8;9;;;;;;';

        You've already given some input data, presumably faked to the necessary degree. All you needed to do was to give a couple more fake input record instances and their corresponding output records to make the needed transformation(s) much more clear. No "customer sensitive" data need apply.


        Give a man a fish:  <%-(-(-(-<

        You can strip out the values while moving them. From what you describe, I'd ignore the columns (unless it's possible for those key2= patterns to appear in other columns where you want to ignore them). So you could use something like this to get the value for key1, strip it out, and append it to the end of the line:

        open my $in, '<', 'infile' or die $!; # open input file open my $out, '>', 'outfile' or die $!; # open output file while(my $line = <$in>){ chomp $line; # remove newline if( $line =~ s/key1=([^&\s]+)// ){ # capture value in $1 while # replacing with empty string $line .= " key1=$1"; # append key and value to line } print $out "$line\n"; # print line to output file } close $in; close $out;

        That will do one find/replace/copy, so you can repeat the if loop while changing the key?= to do multiple ones.

        Aaron B.
        Available for small or large Perl jobs and *nix system administration; see my home node.

Re: split, manipulate, join
by choroba (Cardinal) on Apr 11, 2015 at 00:05 UTC
    The following subroutine seems to return the same result as yours, and is about 2 times faster. You didn't provide much testing data, so I'm not sure it will work correctly for the real input.
    sub not_so_special { my $s = shift; my @cols = split /;/, $s, -1; my $e; $cols[2] =~ s/=(.+)/'=' . (' ' x length($e = $1))/e; return join ';', @cols, $e }

    Update
    Or, maybe, more readable and a tiny bit faster:

    sub new { my @cols = split /;/, shift, -1; $cols[2] =~ s/=(.+)/=/; $cols[2] .= ' ' x length $1; return join ';', @cols, $1 }

    Update 2
    This seems even faster (but uglier):

    use Syntax::Construct qw{ /r }; sub newer { shift =~ s/^([^;]*;[^;]*;)(.*?=)([^;]*)(;.*)/"$1$2" . (' ' x lengt +h $3) . "$4;$3"/er }
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: split, manipulate, join
by choroba (Cardinal) on Apr 10, 2015 at 22:33 UTC
    Please, also include sample input and output, enclosed in <code>...</code> tags.

    To grab several columns from split, you can use the slice syntax:

    my @columns = (split /[?=]/)[3 .. 6, 9];
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: split, manipulate, join
by AnomalousMonk (Archbishop) on Apr 10, 2015 at 23:35 UTC

    I'm a bit confused about just what you want to delete and what you want to keep, but here's an approach. I've no idea about its speed. It needs Perl version 5.10+. Note that field offsets are zero-based.

    c:\@Work\Perl\monks>perl -wMstrict -le "use 5.010; ;; my $inrec = '0;1;2?foo=DUH;3;4;5;6?bar=BLAB DAB;7;8;9;;;;;;'; print qq{begin: '$inrec'}; ;; my $sep = ';'; my $data = qr{ [^\Q$sep\E] }xms; my $field = qr{ $data* \Q$sep\E }xms; ;; my %targets = qw(foo 2 bar 6); my @words = sort { $targets{$a} <=> $targets{$b} } keys %targets ; ;; for my $word (@words) { $inrec =~ s{ \A (?:$field){$targets{$word}} [^?]+ \? \Q$word\E = \K ($data*) } {}xms; $inrec .= qq{$1$sep}; print qq{'$inrec'}; } ;; print qq{final: '$inrec'}; " begin: '0;1;2?foo=DUH;3;4;5;6?bar=BLAB DAB;7;8;9;;;;;;' '0;1;2?foo=;3;4;5;6?bar=BLAB DAB;7;8;9;;;;;;DUH;' '0;1;2?foo=;3;4;5;6?bar=;7;8;9;;;;;;DUH;BLAB DAB;' final: '0;1;2?foo=;3;4;5;6?bar=;7;8;9;;;;;;DUH;BLAB DAB;'

    Update: Changed code example to make more consistent use of the  $sep scalar.


    Give a man a fish:  <%-(-(-(-<

Re: split, manipulate, join
by james28909 (Deacon) on Apr 12, 2015 at 04:12 UTC
    This will capture everything after "ekey=" all the way up to new line
    (undef, $captured) = split(/ekey=(.+)/, $_);

      I think I would rather explicitly extract whatever it was I was interested in rather than relying on a somewhat obscure feature of the split built-in:

      c:\@Work\Perl\monks>perl -wMstrict -le "$_ = qq{xxx ekey=foo bar \n xyzzy}; ;; my (undef, $captured) = split(/ekey=(.+)/, $_); print qq{captured '$captured'}; ;; my ($bagged) = $_ =~ m{ ekey= ([^\n]+) }xms; print qq{ bagged '$bagged'}; " captured 'foo bar ' bagged 'foo bar '


      Give a man a fish:  <%-(-(-(-<