A couple of solutions for you. It is possible to put an extra qualifier on the split regex. In the first example below, I split on whitespace, but only if that whitespace is preceded by a digit or the / character. This is done with a positive lookbehind assertion, (?<=\d|\/). So a name like "District of Columbia" keeps its spaces, because no split happens on them.

In the second example below, I used the same lookbehind trick to remove whitespace, but only when it is preceded by a letter. Then I did a plain split on the result.

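Note that the chomp is not necessary in the second case. The default split breaks on runs of whitespace (\s+), where whitespace is any of [space, \n, \r, \f, \t], and it drops trailing empty fields. Since \n is in that set, it is removed. In the first example a chomp() is needed because the split condition was modified: the trailing \n follows a letter rather than a digit or /, so it is not a split point and would stay attached to the last token.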
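To see the chomp difference in isolation, here is a small sketch. The sample line is one row from the data; the variable names are just mine for illustration.

```perl
use strict;
use warnings;

# One row from the data, with its trailing newline still attached (no chomp).
my $line = ">cds:ADM95824 A/Finland/661/2009 2009/10/26 HA\n";

# Default-style split: breaks on runs of whitespace and drops the
# trailing empty field, so the "\n" never ends up in a token.
my @default = split ' ', $line;

# Qualified split: the "\n" follows the letter A, not a digit or "/",
# so it is not a split point and stays glued to the last token.
my @custom = split /(?<=\d|\/)\s+/, $line;

print "default split: last token is ", length($default[-1]), " chars\n";  # 2
print "custom split:  last token is ", length($custom[-1]),  " chars\n";  # 3, "HA" plus newline
```

Both splits give four tokens; only the last token differs, which is why the first example needs the chomp.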

The seek statement just "rewinds" the DATA file handle. The DATA file handle starts out positioned at the first byte after the __DATA__ line. $begin is used to remember that byte position so that I can go back to it. If I had done seek DATA,0,0; instead, that would have moved the file pointer to the very start of the script, right before the "hashbang" line. If for some reason you would like a Perl program to read itself, that is one way!
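The same tell/seek pattern can be tried without a __DATA__ section. This is only a sketch: it assumes an in-memory file handle as a stand-in for DATA, and the "header line" plays the role of the script text that sits before __DATA__.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the script file: one header line, then the rows.
my $text = "header line\nrow 1\nrow 2\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my $header = <$fh>;       # skip the header, much as DATA starts after __DATA__
my $begin  = tell($fh);   # remember where the rows start

my @first_pass = <$fh>;   # read through to end of file

seek $fh, $begin, 0;      # rewind to the remembered position
my @second_pass = <$fh>;  # the same rows again

seek $fh, 0, 0;           # rewind to byte 0 instead
my @whole = <$fh>;        # header plus rows, i.e. "the program reading itself"

print "first pass:  ", scalar(@first_pass),  " lines\n";  # 2
print "second pass: ", scalar(@second_pass), " lines\n";  # 2
print "from byte 0: ", scalar(@whole),       " lines\n";  # 3
```

Seeking to $begin replays only the rows, while seeking to 0 also picks up everything in front of them.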

#!/usr/bin/perl -w
use strict;

my $begin = tell(DATA);  #to rewind DATA later on

while (<DATA>)
{
   chomp;
   # (?<=\d|\/) is a positive lookbehind assertion:
   # a digit or / must precede the \s+ in order to split
   # upon it. Note chomp is necessary because the
   # trailing \n will not be removed because there is
   # no digit in HA.
   my @tokens = split(/(?<=\d|\/)\s+/, $_);
   print join("\n",@tokens),"\n";
}
=prints like:
>cds:ADD23250
A/District of Columbia/INS17/2009
2009/10/26
HA
=cut

seek DATA,$begin,0;  #rewinds DATA back to beginning

while (<DATA>)
{
   s/(?<=[a-zA-Z])\s+//g;  #remove spaces if preceded by letter
   my @tokens = split;
   print join("\n",@tokens),"\n";
}
=prints like:
>cds:ADD23250
A/DistrictofColumbia/INS17/2009
2009/10/26
HA
=cut

__DATA__
>cds:ADD75048 A/Brussels/INS71/2009 2009/10/30 HA
>cds:ADF58353 A/Germany-MV/HGW4/2009 2009/12/ HA
>cds:ADF58351 A/Germany-MV/HGW6/2009 2009/12/ HA
>cds:ADU76781 A/England/94780010/2009 2009/10/22 HA
>cds:AEA30293 A/Netherlands/2223b/2009 2009/11/18 HA
>cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA
>cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA
>cds:ADD74978 A/San Diego/INS54/2009 2009/10/12 HA
>cds:ADF27925 A/Texas/JMS407/2010 2010/01/11 HA
>cds:ADM95824 A/Finland/661/2009 2009/10/26 HA
>cds:ADD97035 A/Wisconsin/629-D00036/2009 2009/09/15 HA

In reply to Re: How to substitute something from only between two specified charecters by Marshall
in thread How to substitute something from only between two specified charecters by ZWcarp
