I am new at using perl. I have a CSV that I am importing with Text::CSV_XS. Currently, I am able to clean up much of the poorly formed data generated by the venerable database output tool (thanks perl people). I would like to extract a piece of data from one field that has multiple fields in it. The original field is a long description that usually contains a #F123456, #123456, #123-F123456, #123-123456, or #12AB-123456 in it. This data floats around from left to right and there should be whitespace before the #. Also, the end of the data is either whitespace, or the end of the field. I could extract the data before or after I clean up the field. Here is a couple of examples of the original and modified data.

#Before: ,"TRAY,HINGED,PLSTC,20 CAV #F32473", #After: ,"TRAY HINGED PLSTC 20 CAV #F32473", #Before: ,"BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX", #After: ,"BOX HSC 35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX", #Before: ,"PAD, FOAM, 24 X 24 X 1/4 #16193 + 112 SHEETS PER ROLL, ORDER IN FULL ROLLS", #After: ,"PAD FOAM 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL ORDER IN FULL ROLL +S", #Before: ,"PKG LIST,ASST ARM,RAD,300 #F37784", #After: ,"PKG LIST ASST ARM RAD 300 #F37784", #Before: ,"PAD, TOP CAP RE17-30048 #F30121 + CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4", #After: ,"PAD TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY 22-7/8 X 21-1/8 X + 4-3/4","11.72000","281.28",

Here is some of the code I am using to process the file. I don't know if I can share all of it without quite a bit of obfuscation. This field would be "$row->5"

while (my $row = $aCSV->getline ($ORIG_CSV)) { #Test row and determine if there is data. If the first element is +empty then that is the last line. #This should prevent errors, and drop the extraneous data at the e +nd of the file. if ($row->[0] =~ /\S/ ) { # Remove extra newline from field 5, 6 and 11 unless preceded + by a carriage return, fields start counting at Zero. # Field 6 and 11 replaces with space for readabililty of the +data. $row->[4] =~ s/ (?<!\x0D) \x0A //gx; $row->[5] =~ s/ (?<!\x0D) \x0A / /gx; $row->[10] =~ s/ (?<!\x0D) \x0A / /gx; # cleanup or truncate fields. We only want data that is usabl +e. # these are replacements so if the string within \Q \E doens' +t exist it isn't replaced # They are specific for each field. REMEMBER Fields start cou +nting at ZERO $row->[5] =~ s/ \Q,\E / /gx; # replace comma in field with s +pace. This makes CSV easier to import. $row->[9] =~ s/ \QRevision : \E //gx; $row->[10] =~ s/ \Q,\E / /gx; # replace comma in field with s +pace. This makes CSV easier to import. $row->[12] =~ s/ \QProduction Item: \E//gx; $row->[13] =~ s/ \x23 //gx; # hexidecimal 0x23 is "#" and +this messes with the regex in one step, but two works. $row->[13] =~ s/ \QWO : \E //gx; # remove the remainder junk + "WO : " $row->[14] =~ s/ \QCutlist color: \E //gx; # Some fields have extra whitespace. Lets clean that up becau +se we want to print some of this data. $row->[5] =~ s/\h+/ /g; $row->[10] =~ s/\h+/ /g; # Something to test on $row->[5] /#[^ ]+/ # Testing print some fields. REMEMBER 1 aka Zero print "$row->[5]\t$row->[9]\t$row->[12]\t$row->[13]\n"; push @rows, $row; } else { last; } }

Since I am new at this please tell me if I should do something different for clarity. Also, if there is a more efficient way to do this then clarity probably wins out, for later troubleshooting. Thanks!


In reply to Extract data from CSV field. by JobC

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.