Split of text

droberts2014 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a series of data as below. What's the best way to split this apart into different columns, that is, date, country and then data? if there are multiple entries then just make more colums Thanks for the ehelp!!



30.6.89 CH 2454/89-7<cr>

30.6.89 CH 2454/89-7

25.1.94ch209/94-6;8.12.94ch3714/94-1

25.1.94 ch 209/94-6 ; 8.12.94 ch 3714/94-1

8.4.94 ch 1047/94-0

22.4.94 ch 1255/94-7

9.5.94 CH 1441/94-4

19.10.94 ch 3138/94-2 ; 16.2.95 CH 445/95-3

8.6.95 ch 1676/95-5

22.11.95 CH 3300/95 ; 28.6.96 CH 1621/96

18.12.95 ch 3562/95

8.3.96 ch 612/96

6.9.96 JP 229081/95 (?)

6.9.95 JP 229081/95

20.5.97 USA (prov.) 60/047; 168

20.5.97 ch(pct) pct ib97/00575

20.5.97 pct(ch) pct/ib97/00575

6.6.97 ch 1373/97

4.7.97 de 19728671.2

27.8.97 ch 2001/97

9.9.97 CH 2123/97

9.9.97 ch 2110/97 ; 1.4.98 ch 778/98

13.5.98 us 09/078; 173

8.10.97 ch 2355/97

27.11.97 CH 2743/97

8.12.97 ch 2825/97

3.2.98 ch 248/98

1.4.98 ch 778/98

1.4.98 ch 1998 0778/98

7.4.98 ch 822/98

8.5.98 ch 1038/98

28.5.98 us 09/085; 593

21.8.98 CH 1718/98

9.9.98 ch 1841/98

3.9.98 ep 98116634.1

26.10.98 ch 2154/98

22.10.98 pct pct/ib98/01700

26.10.98 ch 2155/98

4.11.98 us 09/185; 536 ; 9.11.98 INDe 3309/del/98

8.12.98 ch 1998 2437/98

8.12.98 ch 2437/98

22.1.99 ch 123/99

26.3.99 ch 579/99

31.3.99 ch 606/99

19.5.99 ch 939/99

14.7.99 ch 1999 1303/99

22.7.99 ch 1342/99

6.9.99 pct pct/ib99/01510

06.09.99 pct pct/ib99/01510

4.10.99 PCT PCT/IB99/01618 ; 11.10.99 PCT PCT/IB99/01660

18.10.99 ch 1894/99

5.11.99 us 60/163; 563

10.11.99 ch 2058/99

29.3.00 us 09/537; 357

17.12.99 EP 99125212.3

8.2.00 CH 2000 0248/00

20.3.00 ch 0523/00

16.5.00 ep 00110429.8

5.6.00 US 09/586; 754

15.6.00 pct pct/ib00/00804

10.7.00 ch 2000 1354/00

13.9.00 PCT PCT/IB00/01303 ; 13.9.00 US 60/232144

13.9.00 PCT PCT/IB00/01303 ; 13.9.00 US 60/232; 144

17.11.00 pct pct/ib00/01693

14.9.00 ch 2000 1785/00

20.10.00 ch 2000 2060/00

5.12.00 usa 09/729; 241

5.12.00 us 09/729; 241

6.12.00 PCT PCT/IB00/01806

20.12.00 pct pct/ib00/01934

1.2.01 pct pct/ib01/00126

16.2.01 us 09/784; 121

7.3.01 pct pct/ib01/00318

22.3.01 us 60/278; 046 ; 2.4.01 us 09/825; 526

27.3.01 pct pct/ib01/00524

12.4.01 pct pct/ib01/00631

22.5.01 pct pct/ib01/00902

11.6.01 pct pct/ib01/01018 ; 14.6.01 pct pct/ib01/01047

27.8.01 pct pct/ib01/01541 ; 10.9.01 US 60/318; 695

7.09.01 pct pct/ib01/01630

7.9.01 pct pct/ib01/01630

18.10.01 us 09/982; 648

22.11.01 pct pct/ib01/02210

5.12.01 pct pct/ib01/02394

13.12.01 pct pct/ib01/02520

10.1.02 pct pct/ib02/00155

18.1.02 pct pct/ib02/00157

22.1.02 PCT pct/ib02/00174

29.1.02 pct pct/ib02/00304

8.3.02 pct pct/ib02/00730

pct 13.3.02 PCT/IB02/00767

13.3.02 IB PCT/IB02/00767

28.3.02 US 60/369; 115

pct 13.5.02 pct/ib02/01661

16.5.2002 Pct pct/ib02/01763

29.5.02 pct pct/ib02/01961

pct 19.03.03 pct/ib03/01079

21.4.03 us 10/421; 216

21.04.2003 us 10/421216

7.5.03 EP 03010264.4

2.6.03 pct pct/ib03/02415

10.10.03 EP 03 022 775.5

23.10.03 PCT/IB03/04726

us 31.10.03 60/516; 548 ; PCT 31.10.03 PCT/IB03/04867

Replies are listed 'Best First'.
Re: Split of text by GrandFather (Saint) on Apr 09, 2014 at 04:27 UTC
How many of those lines of data do we actually need as test cases? Is there something special we should be noticing about the lines, such that we need more than two sample lines? Maybe it's hidden by the sample data, but I don't see the code you have tried, nor a description of how that code fails. In fact, I can't even spot the example expected output. See I know what I mean. Why don't you?

Perl is the programming world's equivalent of English
Re: Split of text by NetWallah (Canon) on Apr 09, 2014 at 05:59 UTC
You could try it using the split function, but since you posted data without <code> tags, it seems to be hiding HTML, so you would probably be better off using an HTML parser like HTML::Parser. On second thoughts, HTML::DOM may make more sense for a well-structured source. If you show us your Perl code and attempts at parsing, with the results expected and achieved, and articulate the specific issues you encounter, we will be happy to help you get closer to the desired results.

What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against? -Larry Wall, 1992
Re: Split of text by hdb (Monsignor) on Apr 09, 2014 at 06:21 UTC
Here is a little code segment to get you started: `while(<DATA>){ chomp; my $date; $date = $1 if s/(\d+\.\d+\.\d+)\s//; my $country; $country = $1 if s/([a-z]+)\s//i; print "Date=$date, Country=$country, Rest=$_\n" if $date && $country; }` assuming you have your data in the DATA segment. Processing the remaining fields is left as an exercise. You might want to consider splitting on semicolons...
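The starter above leaves the semicolon handling open. Below is a minimal sketch of one way to do it, not a definitive solution. It assumes that a semicolon separates two records only when a date (optionally preceded by a country code) follows it, so that numbers such as `60/047; 168` stay in one piece, and that tags such as `<cr>` are markup residue that can be dropped. The names `split_records` and `parse_record` and the date pattern are illustrative and inferred from the sample data only.

```perl
use strict;
use warnings;

# Date like 30.6.89, 06.09.99 or 16.5.2002 (pattern inferred from the sample data).
my $DATE = qr/\d{1,2}\.\d{1,2}\.\d{2,4}/;

# Split one input line into records. A semicolon counts as a record separator
# only when a date (optionally preceded by a country code) follows it.
sub split_records {
    my ($line) = @_;
    $line =~ s/<[^>]*>//g;    # assumption: tags such as <cr> are residue
    return grep { length } split /\s*;\s*(?=(?:[a-z]+\s+)?$DATE)/i, $line;
}

# Parse one record into date, country and number. Both orders seen in the
# sample are accepted: "date country number" and "country date number".
# Returns nothing for a record that matches neither shape.
sub parse_record {
    my ($rec) = @_;
    if ($rec =~ /^\s*($DATE)\s*([a-z]+)\s*(.*?)\s*$/i) {
        return { date => $1, country => uc($2), number => $3 };
    }
    if ($rec =~ /^\s*([a-z]+)\s+($DATE)\s*(.*?)\s*$/i) {
        return { date => $2, country => uc($1), number => $3 };
    }
    return;
}

for my $line ('25.1.94ch209/94-6;8.12.94ch3714/94-1',
              'us 31.10.03 60/516; 548 ; PCT 31.10.03 PCT/IB03/04867') {
    for my $r (map { parse_record($_) } split_records($line)) {
        print join("\t", @$r{qw(date country number)}), "\n";
    }
}
```

Lines with no country code at all, such as `23.10.03 PCT/IB03/04726`, are still misread by this sketch (the leading `PCT` is taken as the country), so the output needs checking against whatever result is actually wanted.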
Re: Split of text by CountZero (Bishop) on Apr 09, 2014 at 12:09 UTC
I see you have different formats of data: `30.6.89 CH 2454/89-7<cr><o:p></o:p> 25.1.94ch209/94-6;8.12.94ch3714/94-1<o:p></o:p> 25.1.94 ch 209/94-6 ; 8.12.94 ch 3714/94-1<o:p></o:p> 4.10.99 PCT PCT/IB99/01618 ; 11.10.99 PCT PCT/IB99/01660<o:p></o:p> 20.5.97 USA (prov.) 60/047; 168<o:p></o:p> 20.5.97 ch(pct) pct ib97/00575<o:p></o:p> us 31.10.03 60/516; 548 ; PCT 31.10.03 PCT/IB03/04867 pct 19.03.03 pct/ib03/01079<o:p></o:p>` And I may have missed some others. Are these indeed all different formats, or are they perhaps typos? Some lines seem to have two records on one line. Or did you forget to add a newline between the records? Is the whitespace a simple space (one or more), or is it perhaps tabs? It would help us all if you could show us the desired result for each of the above lines.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
Re: Split of text by locked_user sundialsvc4 (Abbot) on Apr 09, 2014 at 17:44 UTC
Well, the short answer would be, “very carefully!” Because, even in this snippet of data, I see inconsistencies. Some lines appear to begin with `pct` while others do not. The last line of your example is very different. It will be crucial that you design your program to be suspicious. It should aggressively test every assumption that it makes, so that it will die (on its own ... descriptively ...) when it encounters any line of data that does not perfectly meet those assumptions. This is because, in the real world, programs such as this one are the only way for anyone to know whether there are any inconsistencies in the input data. (Yes, you are effectively “debugging” that upstream program, and yes, on a very regular basis you will find bugs in it.) You need to design these programs so that, if they run to completion, you have a very strong indicator that all of the data ... and there could of course be many megabytes of it per run ... is okay, and that the results produced are therefore probably reliable. Put such tests into the program from the very start, until you are absolutely sure all is well. Then, and only then ... leave them in!
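As a rough illustration of that "suspicious" style, here is a minimal sketch in which every line must match one strict shape or the program dies with the line number and the offending text. The shape itself (date, single space, two to four letter country code, single space, number) is only an assumption made for the example, and `parse_or_die` is a hypothetical name; the real rules have to come from whoever owns the data.

```perl
use strict;
use warnings;

# Assumed shape of a well-formed line: date, country code, application number,
# separated by single spaces. Anything else is treated as an error.
my $STRICT = qr/^(\d{1,2}\.\d{1,2}\.\d{2,4}) ([A-Za-z]{2,4}) (\S.*)$/;

sub parse_or_die {
    my ($line, $lineno) = @_;

    # Die descriptively as soon as a line does not meet the assumed shape.
    my ($date, $country, $number) = $line =~ $STRICT
        or die "line $lineno: unexpected format: '$line'\n";

    # A cheap plausibility check on the date parts (day and month ranges only).
    my ($day, $month) = split /\./, $date;
    die "line $lineno: impossible date '$date'\n"
        if $day < 1 || $day > 31 || $month < 1 || $month > 12;

    return ($date, uc($country), $number);
}

my $n = 0;
for my $line ('8.4.94 ch 1047/94-0', '22.4.94 ch 1255/94-7') {
    my ($date, $country, $number) = parse_or_die($line, ++$n);
    print "$date\t$country\t$number\n";
}
```

Fed the full sample, a check like this would stop at the first line that deviates (for instance one that starts with `pct`), which is the point: each failure is a question to settle about the data before loosening the pattern.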
Re^2: Split of text by AnomalousMonk (Archbishop) on Apr 09, 2014 at 18:38 UTC
"It will be crucial that you design your program ..."

You offer much wise advice. Unfortunately, I think droberts2014 isn't interested in designing anything. I think droberts2014 thought it would be worth spending thirty seconds of time to plunk a great wadge of (probably unusable) 'data' down in the middle of the site, slap "if there are multiple entries then just make more colums Thanks for the ehelp!!" on it, and sit back and see what happened. (Anyone notice any cross-posting? Wouldn't be surprised...)