split, manipulate, join

jc.smith3 has asked for the wisdom of the Perl Monks concerning the following question:

UPDATED EXAMPLE... Below is a program that will split a string into an array. Then it scans ARRAY ELEMENT 2 for a '?' then a word 'ekey='. If it finds that stuff, then it will strip out the text after 'ekey=' and add it as a new array element. Finally, it converts the array back to a string.

Problem: I have millions of input records of web data of 550 columns, variable and often null fields, as well as very long fields. I need to scan 10 columns throughout each record looking for 4 different 'scanwords'. If found I always remove the data from the original field, and add the text after the scanword as a new element. Obviously this is very slow.

I have seen something like this on the interwebs: $a=split(/\s+/, $line))[ 3 ] which I assume grabs just a certain column. Could I grab all 10 columns with similar syntax? Would it be faster? How would I update the fields in the original record?

HELP!

#!/usr/bin/perl
use strict;
use warnings;

my $scanword;
my $wk1;
my $wk2;
my $wk3;
my $wk4;
my $wk5;
my $wk6;
my @elements;
my $i;
my $n1;
my $inrec1='0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate
+=20150210;7;8;9';
my $inrec2='0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=2015021
+0;3;4;5;6;7;8;9';
my $outrec;

#####
sub special()
{

   $wk1=''; $wk2=''; $wk3=''; $wk4=''; $wk5=''; $wk6='';        #reset
+ vars
   $wk1=$elements[$i];                                          #field
+ number to scan
   $a=index($wk1,"?");                                          #look 
+for ?
   if ( $a != -1 ) {                                            #? was
+ found
      #Look for $scanword
      $wk2=index($wk1,"$scanword");                             #find 
+start of scanword.  I.E. "ekey="
      $wk5=index($wk1,"=",$wk2);                                #find 
+start of "="
      $wk3=index($wk1,"&",$wk2);                                #find 
+start of next &
      if ( $wk2 != -1 ) {                                       #found
+ scanword?
         if ( $wk3 == -1) { $wk3 = length ($wk1); }             #defau
+lt to length of string if ampersand not found
         $wk4 = substr($wk1,$wk2,$wk3-$wk2);                    #wk1 i
+s the field, wk2 is start of scanword, wk3 is the end position
         $wk6 = substr($wk1,$wk5+1,$wk3-$wk5-1);                #wk1 i
+s the field, wk5+1 is byte after = in the scanword, wk3 is the end po
+sition

         print STDOUT "array element = $elements[$i] \n";
         print STDOUT "Found ?         in the field at offset $a \n";
         print STDOUT "Found $scanword in the field at offset $wk2 \n"
+;
         print STDOUT "Found end       in the field at offset $wk3 \n"
+;
         print STDOUT "Field wk4 = $wk4 \n";
         print STDOUT "Field wk6 = $wk6 \n";

         $b=$wk3-$wk5-1;                                        #lengt
+h to blank
         substr($elements[$i],$wk5+1,$wk3-$wk5-1) = ' ' x $b;   #move 
+blanks to array element
      }
   }
}
#####
print STDOUT "inrec1=$inrec1 \n";                               #print
+ inrec
@elements = split(';', $inrec1, -1);                            #split
+ by semicolon, -1 means to keep trailing fields if empty
$i=2;                                                           #eleme
+nt offset
$scanword="ekey=";                                              #text 
+to scan for
special();                                                      #call 
+routine
$n1=$wk6;                                                       #what 
+was stripped out
push (@elements, $n1);                                          #add t
+o end of array
$outrec = join(";",@elements);                                  #conve
+rt to output record
print STDOUT "outrec=$outrec  \n";                              #print
+ output record

print STDOUT "inrec2=$inrec2 \n";                               #print
+ inrec
@elements = split(';', $inrec2, -1);                            #split
+ by semicolon, -1 means to keep trailing fields if empty
$i=2;                                                           #eleme
+nt offset
$scanword="ekey=";                                              #text 
+to scan for
special();                                                      #call 
+routine
$n1=$wk6;                                                       #what 
+was stripped out
push (@elements, $n1);                                          #add t
+o end of array
$outrec = join(";",@elements);                                  #conve
+rt to output record
print STDOUT "outrec=$outrec  \n";                              #print
+ output record
exit;

OUTPUT EXAMPLES:

ekey is not in elements[2] so do nothing
inrec1=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=2015
+0210;7;8;9
outrec=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=2015
+0210;7;8;9;

ekey is in elements[2] so blank it in input record and add move follow
+ing text BOZOTHECLOWN to a new element in array

inrec2=0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4
+;5;6;7;8;9
        array element = 2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate
+=20150210
        Found ?         in the field at offset 1
        Found ekey= in the field at offset 21
        Found end       in the field at offset 38
        Field wk4 = ekey=BOZOTHECLOWN
        Field wk6 = BOZOTHECLOWN
outrec=0;1;2?link=misc/redirect/ekey=            &dmsdate=20150210;3;4
+;5;6;7;8;9;BOZOTHECLOWN
[download]

Comment on split, manipulate, join Download Code

Replies are listed 'Best First'.
Re: split, manipulate, join by aaron_baugher (Curate) on Apr 10, 2015 at 22:48 UTC
It's hard to give specific advice without seeing a sample of your input data, but it may be hard for you to show us a sample if your lines are 550 columns wide. Short answer: yes, it's quite likely that there is a faster solution using split and/or regex than what you're doing now with multiple index and substr commands. One example to get you started: If you want the six columns numbered 10, 20, 30, 32, 34, and 38 (with the first column numbered 0), and your columns are separated by whitespace, you could do this to get them in an array: `while(my $line = <$file>){ chomp $line; # split the line on whitespace, then take certain indexed columns my @columns = (split /\s+/, $line)[10,20,30,32,34,38]; do_stuff_with_those_columns(@columns); }` [download] On the other hand, if you know what you want out of each column, you may be able to skip the step of splitting into an array and go straight to extracting what you need. It just depends on the data. Give us at least a couple lines if you can. Aaron B. Available for small or large Perl jobs and *nix system administration; see my home node.	[reply] [d/l]
Re^2: split, manipulate, join by jc.smith3 (Initiate) on Apr 10, 2015 at 22:55 UTC
I really can't give any data as it is customer sensitive. Assume I have to search 10 columns for key1= or key2= or key3= or key4=. There is no way to predict what will be after those 4 words. But I have to strip out whatever follows up to a & or reach end of the field. I have to blank the original data where it comes from (just what I am pulling, not whole column) and then append new columns with what I stripped out. I like the idea of the slice syntax. But then how would I go back and update those same columns?	[reply]
Re^3: split, manipulate, join by LanX (Saint) on Apr 10, 2015 at 23:07 UTC
> I really can't give any data as it is customer sensitive please invent non-sensitiv data for your use case. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^3: split, manipulate, join by SuicideJunkie (Vicar) on Apr 10, 2015 at 23:09 UTC
Can you make up some random but representative data? For example, it sounds a lot like you're parsing a URL for GET parameters, so your data with fake customer data with might be: `http://www.example.com/random_funky_characters_here.html?name=JoeSmith&address=123UniversalAve&CreditCardNumber=1234567890&kneecapstate=broken`	[reply] [d/l]
Re^3: split, manipulate, join by AnomalousMonk (Archbishop) on Apr 11, 2015 at 00:38 UTC
`my $inrec='0;1;2?foo=duh;3;4;5;6;7;8;9;;;;;;';` You've already given some input data, presumably faked to the necessary degree. All you needed to do was to give a couple more fake input record instances and their corresponding output records to make the needed transformation(s) much more clear. No "customer sensitive" data need apply. Give a man a fish: `<%-(-(-(-<`	[reply] [d/l] [select]
Re^4: split, manipulate, join by jc.smith3 (Initiate) on Apr 11, 2015 at 01:32 UTC
Re^3: split, manipulate, join by aaron_baugher (Curate) on Apr 10, 2015 at 23:26 UTC
You can strip out the values while moving them. From what you describe, I'd ignore the columns (unless it's possible for those key2= patterns to appear in other columns where you want to ignore them). So you could use something like this to get the value for key1, strip it out, and append it to the end of the line: `open my $in, '<', 'infile' or die $!; # open input file open my $out, '>', 'outfile' or die $!; # open output file while(my $line = <$in>){ chomp $line; # remove newline if( $line =~ s/key1=([^&\s]+)// ){ # capture value in $1 while # replacing with empty string $line .= " key1=$1"; # append key and value to line } print $out "$line\n"; # print line to output file } close $in; close $out;` [download] That will do one find/replace/copy, so you can repeat the if loop while changing the key?= to do multiple ones. Aaron B. Available for small or large Perl jobs and *nix system administration; see my home node.	[reply] [d/l]
Re: split, manipulate, join by choroba (Cardinal) on Apr 11, 2015 at 00:05 UTC
The following subroutine seems to return the same result as yours, and is about 2 times faster. You didn't provide much testing data, so I'm not sure it will work correctly for the real input. `sub not_so_special { my $s = shift; my @cols = split /;/, $s, -1; my $e; $cols[2] =~ s/=(.+)/'=' . (' ' x length($e = $1))/e; return join ';', @cols, $e }` [download] Update Or, maybe, more readable and a tiny bit faster: `sub new { my @cols = split /;/, shift, -1; $cols[2] =~ s/=(.+)/=/; $cols[2] .= ' ' x length $1; return join ';', @cols, $1 }` [download] Update 2 This seems even faster (but uglier): `use Syntax::Construct qw{ /r }; sub newer { shift =~ s/^([^;];[^;];)(.?=)([^;])(;.*)/"$1$2" . (' ' x lengt +h $3) . "$4;$3"/er }` [download] لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l] [select]
Re: split, manipulate, join by choroba (Cardinal) on Apr 10, 2015 at 22:33 UTC
Please, also include sample input and output, enclosed in `<code>...</code>` tags. To grab several columns from split, you can use the slice syntax: `my @columns = (split /[?=]/)[3 .. 6, 9];` [download] لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l] [select]
Re: split, manipulate, join by AnomalousMonk (Archbishop) on Apr 10, 2015 at 23:35 UTC
I'm a bit confused about just what you want to delete and what you want to keep, but here's an approach. I've no idea about its speed. It needs Perl version 5.10+. Note that field offsets are zero-based. c:\@Work\Perl\monks>perl -wMstrict -le "use 5.010; ;; my $inrec = '0;1;2?foo=DUH;3;4;5;6?bar=BLAB DAB;7;8;9;;;;;;'; print qq{begin: '$inrec'}; ;; my $sep = ';'; my $data = qr{ [^\Q$sep\E] }xms; my $field = qr{ $data* \Q$sep\E }xms; ;; my %targets = qw(foo 2 bar 6); my @words = sort { $targets{$a} <=> $targets{$b} } keys %targets ; ;; for my $word (@words) { $inrec =~ s{ \A (?:$field){$targets{$word}} [^?]+ \? \Q$word\E = \K ($data) } {}xms; $inrec .= qq{$1$sep}; print qq{'$inrec'}; } ;; print qq{final: '$inrec'}; " begin: '0;1;2?foo=DUH;3;4;5;6?bar=BLAB DAB;7;8;9;;;;;;' '0;1;2?foo=;3;4;5;6?bar=BLAB DAB;7;8;9;;;;;;DUH;' '0;1;2?foo=;3;4;5;6?bar=;7;8;9;;;;;;DUH;BLAB DAB;' final: '0;1;2?foo=;3;4;5;6?bar=;7;8;9;;;;;;DUH;BLAB DAB;' [download] Update:* Changed code example to make more consistent use of the `$sep` scalar. Give a man a fish: `<%-(-(-(-<`	[reply] [d/l] [select]
Re: split, manipulate, join by james28909 (Deacon) on Apr 12, 2015 at 04:12 UTC
This will capture everything after "ekey=" all the way up to new line `(undef, $captured) = split(/ekey=(.+)/, $_);` [download]	[reply] [d/l]
Re^2: split, manipulate, join by AnomalousMonk (Archbishop) on Apr 12, 2015 at 07:14 UTC
I think I would rather explicitly extract whatever it was I was interested in rather than relying on a somewhat obscure feature of the split built-in: `c:\@Work\Perl\monks>perl -wMstrict -le "$_ = qq{xxx ekey=foo bar \n xyzzy}; ;; my (undef, $captured) = split(/ekey=(.+)/, $_); print qq{captured '$captured'}; ;; my ($bagged) = $_ =~ m{ ekey= ([^\n]+) }xms; print qq{ bagged '$bagged'}; " captured 'foo bar ' bagged 'foo bar '` [download] Give a man a fish: `<%-(-(-(-<`	[reply] [d/l] [select]