
UPDATED EXAMPLE: Below is a program that splits a string into an array. It then scans array element 2 for a '?' and for the scanword 'ekey='. If both are found, it blanks out the text after 'ekey=' in that element and appends the text as a new array element. Finally, it joins the array back into a string.

Problem: I have millions of input records of web data, each with 550 columns. Fields are variable-length and often null, and some are very long. I need to scan 10 columns in each record for 4 different 'scanwords'. When a scanword is found, I always remove the data from the original field and add the text after the scanword as a new element. Done this way, it is very slow.

I have seen something like this on the interwebs: $a = (split(/\s+/, $line))[3], which I assume grabs just one column. Could I grab all 10 columns with similar syntax? Would it be faster? How would I update the fields in the original record?
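For what it's worth, here is a minimal sketch of the slice syntax asked about above. The column numbers (2 and 6) are only illustrative, and the substitution simply drops the value after 'ekey=' rather than blanking it. A list slice reads several columns at once, but split still has to walk the whole line, so whether it is faster would need benchmarking on real records. To write changes back, the full array has to be kept; a foreach loop over an array slice aliases the original elements, so edits land in the array.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample record taken from the post; the chosen columns are hypothetical.
my $line = '0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4;5;6;7;8;9';
my @cols = (2, 6);

# List slice: read only the selected columns from the split result.
my ($c2, $c6) = (split /;/, $line, -1)[@cols];
print "col 2 = $c2\n";
print "col 6 = $c6\n";

# Updating needs the whole array. Looping over an array slice aliases
# the elements, so the substitution changes @fields itself.
my @fields = split /;/, $line, -1;
for my $field (@fields[@cols]) {
    $field =~ s/(ekey=)[^&]*/$1/;    # keep the scanword, drop its value
}
my $outrec = join ';', @fields;
print "outrec = $outrec\n";
```

The same loop works for 10 columns by putting all 10 indexes in @cols.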

HELP!

#!/usr/bin/perl
use strict;
use warnings;

my $scanword;
my $wk1;
my $wk2;
my $wk3;
my $wk4;
my $wk5;
my $wk6;
my @elements;
my $i;
my $n1;
my $inrec1='0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;7;8;9';
my $inrec2='0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4;5;6;7;8;9';
my $outrec;

#####
sub special()
{

   $wk1=''; $wk2=''; $wk3=''; $wk4=''; $wk5=''; $wk6='';        #reset vars
   $wk1=$elements[$i];                                          #field number to scan
   $a=index($wk1,"?");                                          #look for ?
   if ( $a != -1 ) {                                            #? was found
      #Look for $scanword
      $wk2=index($wk1,"$scanword");                             #find start of scanword.  I.E. "ekey="
      $wk5=index($wk1,"=",$wk2);                                #find start of "="
      $wk3=index($wk1,"&",$wk2);                                #find start of next &
      if ( $wk2 != -1 ) {                                       #found scanword?
         if ( $wk3 == -1) { $wk3 = length ($wk1); }             #default to length of string if ampersand not found
         $wk4 = substr($wk1,$wk2,$wk3-$wk2);                    #wk1 is the field, wk2 is start of scanword, wk3 is the end position
         $wk6 = substr($wk1,$wk5+1,$wk3-$wk5-1);                #wk1 is the field, wk5+1 is byte after = in the scanword, wk3 is the end position

         print STDOUT "array element = $elements[$i] \n";
         print STDOUT "Found ?         in the field at offset $a \n";
         print STDOUT "Found $scanword in the field at offset $wk2 \n";
         print STDOUT "Found end       in the field at offset $wk3 \n";
         print STDOUT "Field wk4 = $wk4 \n";
         print STDOUT "Field wk6 = $wk6 \n";

         $b=$wk3-$wk5-1;                                        #length to blank
         substr($elements[$i],$wk5+1,$wk3-$wk5-1) = ' ' x $b;   #move blanks to array element
      }
   }
}
#####
print STDOUT "inrec1=$inrec1 \n";                               #print inrec
@elements = split(';', $inrec1, -1);                            #split by semicolon, -1 means to keep trailing fields if empty
$i=2;                                                           #element offset
$scanword="ekey=";                                              #text to scan for
special();                                                      #call routine
$n1=$wk6;                                                       #what was stripped out
push (@elements, $n1);                                          #add to end of array
$outrec = join(";",@elements);                                  #convert to output record
print STDOUT "outrec=$outrec  \n";                              #print output record

print STDOUT "inrec2=$inrec2 \n";                               #print inrec
@elements = split(';', $inrec2, -1);                            #split by semicolon, -1 means to keep trailing fields if empty
$i=2;                                                           #element offset
$scanword="ekey=";                                              #text to scan for
special();                                                      #call routine
$n1=$wk6;                                                       #what was stripped out
push (@elements, $n1);                                          #add to end of array
$outrec = join(";",@elements);                                  #convert to output record
print STDOUT "outrec=$outrec  \n";                              #print output record
exit;

OUTPUT EXAMPLES:

ekey is not in elements[2] so do nothing
inrec1=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;7;8;9
outrec=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;7;8;9;

ekey is in elements[2], so blank it in the input record and move the following text, BOZOTHECLOWN, to a new element in the array

inrec2=0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4;5;6;7;8;9
        array element = 2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210
        Found ?         in the field at offset 1
        Found ekey= in the field at offset 21
        Found end       in the field at offset 38
        Field wk4 = ekey=BOZOTHECLOWN
        Field wk6 = BOZOTHECLOWN
outrec=0;1;2?link=misc/redirect/ekey=            &dmsdate=20150210;3;4;5;6;7;8;9;BOZOTHECLOWN
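
The behaviour shown in the output above could also be sketched with one precompiled regex that covers every scanword, applied to each column of interest. This is only a sketch under assumptions, not a measured improvement: 'skey=' and the column list (2, 6) are made up for illustration, the helper name extract_record is hypothetical, and it appends one new element per scanned column (empty when nothing is found). As in the original routine, a field is examined only if it contains a '?', and the value is blanked with spaces rather than removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical configuration: only 'ekey=' comes from the post.
my @scanwords = ('ekey=', 'skey=');
my @scancols  = (2, 6);

# Build one alternation of all scanwords, compiled once outside the loop.
my $alt = join '|', map { quotemeta } @scanwords;
my $re  = qr/(?:$alt)([^&]*)/;       # capture the value up to '&' or end

sub extract_record {
    my ($inrec) = @_;
    my @elements = split /;/, $inrec, -1;    # -1 keeps trailing empty fields
    for my $i (@scancols) {
        next unless defined $elements[$i];   # record shorter than expected
        my $found = '';
        if (index($elements[$i], '?') >= 0 && $elements[$i] =~ $re) {
            my ($start, $len) = ($-[1], $+[1] - $-[1]);
            $found = $1;
            # Overwrite the value with blanks, keeping the field length.
            substr($elements[$i], $start, $len) = ' ' x $len;
        }
        push @elements, $found;              # one new element per column
    }
    return join ';', @elements;
}

my $inrec2 = '0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4;5;6;7;8;9';
print extract_record($inrec2), "\n";
```

With several scanwords in one field, the leftmost match wins and only that one is extracted; handling more than one hit per field would need a loop with the /g modifier.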

In reply to split, manipulate, join by jc.smith3
