droberts2014 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a series of data as below. What's the best way to split this apart into different columns, that is, date, country, and then data? If there are multiple entries, then just make more columns. Thanks for the help!!
30.6.89 CH 2454/89-7<cr> 30.6.89 CH 2454/89-7 25.1.94ch209/94-6;8.12.94ch3714/94-1 25.1.94 ch 209/94-6 ; 8.12.94 ch 3714/94-1 8.4.94 ch 1047/94-0 22.4.94 ch 1255/94-7 9.5.94 CH 1441/94-4 19.10.94 ch 3138/94-2 ; 16.2.95 CH 445/95-3 8.6.95 ch 1676/95-5 22.11.95 CH 3300/95 ; 28.6.96 CH 1621/96 18.12.95 ch 3562/95 8.3.96 ch 612/96 6.9.96 JP 229081/95 (?) 6.9.95 JP 229081/95 20.5.97 USA (prov.) 60/047; 168 20.5.97 ch(pct) pct ib97/00575 20.5.97 pct(ch) pct/ib97/00575 6.6.97 ch 1373/97 4.7.97 de 19728671.2 27.8.97 ch 2001/97 9.9.97 CH 2123/97 9.9.97 ch 2110/97 ; 1.4.98 ch 778/98 13.5.98 us 09/078; 173 8.10.97 ch 2355/97 27.11.97 CH 2743/97 8.12.97 ch 2825/97 3.2.98 ch 248/98 1.4.98 ch 778/98 1.4.98 ch 1998 0778/98 7.4.98 ch 822/98 8.5.98 ch 1038/98 28.5.98 us 09/085; 593 21.8.98 CH 1718/98 9.9.98 ch 1841/98 3.9.98 ep 98116634.1 26.10.98 ch 2154/98 22.10.98 pct pct/ib98/01700 26.10.98 ch 2155/98 4.11.98 us 09/185; 536 ; 9.11.98 INDe 3309/del/98 8.12.98 ch 1998 2437/98 8.12.98 ch 2437/98 22.1.99 ch 123/99 26.3.99 ch 579/99 31.3.99 ch 606/99 19.5.99 ch 939/99 14.7.99 ch 1999 1303/99 22.7.99 ch 1342/99 6.9.99 pct pct/ib99/01510 06.09.99 pct pct/ib99/01510 4.10.99 PCT PCT/IB99/01618 ; 11.10.99 PCT PCT/IB99/01660 18.10.99 ch 1894/99 5.11.99 us 60/163; 563 10.11.99 ch 2058/99 29.3.00 us 09/537; 357 17.12.99 EP 99125212.3 8.2.00 CH 2000 0248/00 20.3.00 ch 0523/00 16.5.00 ep 00110429.8 5.6.00 US 09/586; 754 15.6.00 pct pct/ib00/00804 10.7.00 ch 2000 1354/00 13.9.00 PCT PCT/IB00/01303 ; 13.9.00 US 60/232144 13.9.00 PCT PCT/IB00/01303 ; 13.9.00 US 60/232; 144 17.11.00 pct pct/ib00/01693 14.9.00 ch 2000 1785/00 20.10.00 ch 2000 2060/00 5.12.00 usa 09/729; 241 5.12.00 us 09/729; 241 6.12.00 PCT PCT/IB00/01806 20.12.00 pct pct/ib00/01934 1.2.01 pct pct/ib01/00126 16.2.01 us 09/784; 121 7.3.01 pct pct/ib01/00318 22.3.01 us 60/278; 046 ; 2.4.01 us 09/825; 526 27.3.01 pct pct/ib01/00524 12.4.01 pct pct/ib01/00631 22.5.01 pct pct/ib01/00902 11.6.01 pct pct/ib01/01018 ; 14.6.01 pct 
pct/ib01/01047 27.8.01 pct pct/ib01/01541 ; 10.9.01 US 60/318; 695 7.09.01 pct pct/ib01/01630 7.9.01 pct pct/ib01/01630 18.10.01 us 09/982; 648 22.11.01 pct pct/ib01/02210 5.12.01 pct pct/ib01/02394 13.12.01 pct pct/ib01/02520 10.1.02 pct pct/ib02/00155 18.1.02 pct pct/ib02/00157 22.1.02 PCT pct/ib02/00174 29.1.02 pct pct/ib02/00304 8.3.02 pct pct/ib02/00730 pct 13.3.02 PCT/IB02/00767 13.3.02 IB PCT/IB02/00767 28.3.02 US 60/369; 115 pct 13.5.02 pct/ib02/01661 16.5.2002 Pct pct/ib02/01763 29.5.02 pct pct/ib02/01961 pct 19.03.03 pct/ib03/01079 21.4.03 us 10/421; 216 21.04.2003 us 10/421216 7.5.03 EP 03010264.4 2.6.03 pct pct/ib03/02415 10.10.03 EP 03 022 775.5 23.10.03 PCT/IB03/04726 us 31.10.03 60/516; 548 ; PCT 31.10.03 PCT/IB03/04867

Replies are listed 'Best First'.
Re: Split of text
by GrandFather (Saint) on Apr 09, 2014 at 04:27 UTC

    How many of those lines of data do we actually need as test cases? Is there something special about the lines that we should be noticing, such that we need more than two sample lines?

    Maybe it's hidden by the sample data, but I don't see the code you have tried nor a description of how that code fails. In fact I can't even spot the example expected output. See "I know what I mean. Why don't you?"

    Perl is the programming world's equivalent of English
Re: Split of text
by NetWallah (Canon) on Apr 09, 2014 at 05:59 UTC
    You could try it using the split function, but since you posted the data without <code> tags, it seems to be hiding HTML, so you would probably be better off using an HTML parser like HTML::Parser. On second thought, HTML::DOM may make more sense for a well-structured source.

    If you show us your Perl code and attempts at parsing, with the results expected and achieved, and articulate the specific issues you encounter, we will be happy to help you get closer to the desired results.

            What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
                  -Larry Wall, 1992

Re: Split of text
by hdb (Monsignor) on Apr 09, 2014 at 06:21 UTC

    Here is a little code segment to get you started:

    while (<DATA>) {
        my $date;
        $date = $1 if s/(\d+\.\d+\.\d+)\s*//;
        my $ccy;
        $ccy = $1 if s/([a-z]+)\s*//i;
        print "Date=$date, Currency=$ccy, Rest=$_\n" if $date && $ccy;
    }

    assuming you have your data in the DATA segment. Processing the remaining fields is left as an exercise. You might want to consider splitting on semicola...
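    A minimal sketch of the semicolon idea, under some assumptions that are not confirmed by the original poster: a record starts with a date and is followed by a country code, and a semicolon only separates records when a date comes right after it. The second assumption matters because numbers such as "09/185; 536" contain a semicolon themselves. The helper names split_records and parse_record are made up for this illustration.

```perl
use strict;
use warnings;

# Split a line into records at semicolons that are followed by a date,
# so that a semicolon inside a number ("09/185; 536") is left alone.
sub split_records {
    my ($line) = @_;
    return grep { length } split /\s*;\s*(?=\d{1,2}\.\d{1,2}\.\d{2,4})/, $line;
}

# Pull date, country and the remaining data out of one record.
# Returns an empty list when the record does not start with "date country".
sub parse_record {
    my ($record) = @_;
    return unless $record =~ /^\s*(\d{1,2}\.\d{1,2}\.\d{2,4})\s*([a-z]+)\s*(.*?)\s*$/i;
    return ($1, uc $2, $3);
}

for my $line ('25.1.94ch209/94-6;8.12.94ch3714/94-1',
              '4.11.98 us 09/185; 536 ; 9.11.98 INDe 3309/del/98') {
    for my $record (split_records($line)) {
        # Skip anything that does not fit the assumed layout.
        my ($date, $country, $rest) = parse_record($record)
            or next;
        print join("\t", $date, $country, $rest), "\n";
    }
}
```

    Records that put the country before the date (such as "pct 13.5.02 pct/ib02/01661") are skipped by this sketch rather than guessed at.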

Re: Split of text
by CountZero (Bishop) on Apr 09, 2014 at 12:09 UTC
    I see you have different formats of data:
    30.6.89 CH 2454/89-7<cr>
    25.1.94ch209/94-6;8.12.94ch3714/94-1
    25.1.94 ch 209/94-6 ; 8.12.94 ch 3714/94-1
    4.10.99 PCT PCT/IB99/01618 ; 11.10.99 PCT PCT/IB99/01660
    20.5.97 USA (prov.) 60/047; 168
    20.5.97 ch(pct) pct ib97/00575
    us 31.10.03 60/516; 548 ; PCT 31.10.03 PCT/IB03/04867 pct 19.03.03 pct/ib03/01079
    And I may have missed some others.

    Are these indeed all different formats, or are they perhaps typos? Some lines seem to have two records on one line. Or did you forget to add a newline between the records?

    Is the whitespace simple spaces (one or more), or is it perhaps tabs?

    It would help us all if you could show us the desired result for each of the above lines.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Split of text
by locked_user sundialsvc4 (Abbot) on Apr 09, 2014 at 17:44 UTC

    Well, the short answer would be, “very carefully!” Because, even in this snippet of data, I see inconsistencies. Some lines appear to begin with pct while others do not. The last line of your example is very different.

    It will be crucial that you design your program to be suspicious. It should aggressively test every assumption that it makes, so that it will die (on its own ... descriptively ...) when it encounters any line of data that does not perfectly meet those assumptions. This is because, in the real world, programs such as this one are the only way for anyone to know whether there are any inconsistencies in the input data. (Yes, you are effectively “debugging” that upstream program, and yes, on a very regular basis you will find bugs in it.) You need to design these programs so that, if they run to completion, you have a very strong indicator that all of the data ... and there could of course be many megabytes of it per run ... is okay. And that, therefore, the results produced are probably reliable.

    Put such tests into the program from the very start, until you are absolutely sure all is well. Then, and only then ... leave them in!
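
    As a rough illustration of that "suspicious" style, here is a sketch that accepts only one layout ("date country data", separated by whitespace) and dies with a descriptive message on anything else. The function name checked_record, the list of country codes, and the date checks are assumptions made for this example; the real rules would have to come from whoever owns the data.

```perl
use strict;
use warnings;

# Hypothetical whitelist; extend it only after inspecting the real data.
my %known_country = map { $_ => 1 } qw(CH JP US USA DE EP PCT);

# Validate one "date country data" record; die descriptively on any surprise.
sub checked_record {
    my ($record, $line_no) = @_;
    my ($d, $m, $y, $country, $rest) =
        $record =~ /^(\d{1,2})\.(\d{1,2})\.(\d{2}|\d{4})\s+([A-Za-z]+)\s+(\S.*)$/
        or die "line $line_no: unexpected layout: '$record'\n";
    die "line $line_no: impossible date $d.$m.$y\n"
        if $d < 1 || $d > 31 || $m < 1 || $m > 12;
    die "line $line_no: unknown country '$country'\n"
        unless $known_country{ uc $country };
    return { date => "$d.$m.$y", country => uc $country, data => $rest };
}

my $ok = checked_record('30.6.89 CH 2454/89-7', 1);
print "$ok->{date} $ok->{country} $ok->{data}\n";

# A record in a different layout is rejected instead of silently mis-parsed.
eval { checked_record('us 31.10.03 60/516; 548', 2); 1 }
    or print "rejected: $@";
```

    In a real run you would let the die propagate rather than trap it with eval, so that the program stops at the first line it does not understand.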

      It will be crucial that you design your program ...

      You offer much wise advice. Unfortunately, I think droberts2014 isn't interested in designing anything. I think droberts2014 thought it would be worth spending thirty seconds of time to plunk a great wadge of (probably unusable) 'data' down in the middle of the site, slap "if there are multiple entries then just make more colums Thanks for the ehelp!!" on it, and sit back and see what happened. (Anyone notice any cross-posting? Wouldn't be surprised...)