Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I parse a file of the following structure:
Titel Text (A12-3)
3-123.7 Just another text 3-123.8 Some more text A12.34 Another item B56.78 Yet another item
Another Titel Text (B23-9)
1-22a.b Just another text 2-3cd.e Some more text W12.34 Another item Z56.78 Yet another item
The lines with parentheses at the end are to become the titles; the other lines should be linked to these titles this way:
Titel Text (A12-3);3-123.7 Just another text
Titel Text (A12-3);3-123.8 Some more text
Titel Text (A12-3);A12.34 Another item
Titel Text (A12-3);B56.78 Yet another item
As you can see, each of the "other lines" should begin with a certain pattern, but sometimes several items are run together on a single line. I tried to break these lines with a newline character in the following way (thanks to toolic and Marschall, since I used some fragments from their earlier advice), but the script seems to ignore the added newline.
use strict;
use warnings;

my $outcome;
my $previous;
while(<DATA>) {
    $outcome = "";
    chomp;
    $_=~ s/\s(\d\-\d\w{2}(\.\w+)?)/\n$1/g;
    $_=~ s/\s([A-Z]\d{2}(\.\d+)?)/\n$1/g;
    if (/\s?\d\-\d\w{2}(\.\w+)?.+|\s?[A-Z]\d{2}(\.\d+)?.+|\(\w+\-\d+\)$/) {
        if (/\(\w+\-\d+\)$/ ) {
            $previous = $_;
        }
        else {
            $outcome = "$previous;$_";
        }
        $outcome=~s/^\s+$//g;
        print "$outcome\n";
    }
}
__DATA__
Titel Text (A12-3)
3-123.7 Just another text 3-123.8 Some more text A12.34 Another item B56.78 Yet another item
Another Titel Text (B23-9)
1-22a.b Just another text 2-3cd.e Some more text W12.34 Another item Z56.78 Yet another item
Some trash
Some trash
Where do I make the mistake(s)? I work on Win32 with ActivePerl distribution. Thank you very much in advance! VE

Replies are listed 'Best First'.
Re: Insert newline
by thewebsi (Scribe) on Sep 14, 2011 at 17:41 UTC

    I see what you are trying to do here. The while ( <DATA> ) loop iterates through the lines of the data file, and inside the loop you are creating new lines that you want to iterate through as if they were additional lines in the data file. Unfortunately, that would only happen if they were added to the actual data file: inserting "\n" into $_ leaves it a single string, and the loop body still handles it once. To do what you want, you can add another loop to iterate through the new lines you've added:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $title;
    while ( <DATA> ) {
        chomp;
        if ( /\(\w+\-\d+\)$/ ) {
            $title = $_;
        }
        else {
            s/(?:^|\s)(\d\-\d\w{2}(?:\.\w+)?|[A-Z]\d{2}(?:\.\d+)?)/\n$1/g;
            foreach my $line ( split ( "\n" ) ) {
                next if $line !~ /^(\d\-\d\w{2}(?:\.\w+)?|[A-Z]\d{2}(?:\.\d+)?)/;
                print "$title;$line\n";
            }
        }
    }
    __DATA__
    Titel Text (A12-3)
    3-123.7 Just another text 3-123.8 Some more text A12.34 Another item B56.78 Yet another item
    Another Titel Text (B23-9)
    1-22a.b Just another text 2-3cd.e Some more text W12.34 Another item Z56.78 Yet another item
    Some trash
    Some trash

    Output:

    Titel Text (A12-3);3-123.7 Just another text
    Titel Text (A12-3);3-123.8 Some more text
    Titel Text (A12-3);A12.34 Another item
    Titel Text (A12-3);B56.78 Yet another item
    Another Titel Text (B23-9);1-22a.b Just another text
    Another Titel Text (B23-9);2-3cd.e Some more text
    Another Titel Text (B23-9);W12.34 Another item
    Another Titel Text (B23-9);Z56.78 Yet another item
      Thank you so much!!!
      I'll test this on the "big file" now.
      Many many thanks!
      VE
Re: Insert newline
by wwe (Friar) on Sep 14, 2011 at 15:37 UTC
    I'm not sure I understand the input format correctly.

    Input: You have a line which contains a title, followed by one or more lines which contain one or more item numbers per line. A title always starts on a new line, consists of a string, and ends with an ID (a character with a number) enclosed in parentheses. Items always start with an ID, which consists of (capital) characters and numbers, followed by a space, followed by a string.

    Output: A title should start on a new line; it should be followed by a semicolon, followed by an item. This should be repeated for all items until a new title starts. The output for your data should be:

    Titel Text (A12-3);3-123.7 Just another text
    Titel Text (A12-3);3-123.8 Some more text
    Titel Text (A12-3);A12.34 Another item
    Titel Text (A12-3);B56.78 Yet another item
    Another Titel Text (B23-9);1-22a.b Just another text
    Another Titel Text (B23-9);2-3cd.e Some more text
    Another Titel Text (B23-9);W12.34 Another item
    Another Titel Text (B23-9);Z56.78 Yet another item
      Yes, the input format is correct as you wrote. Additionally there are some lines in other formats in this huge file which will be excluded through the first "if". There can be more than one title line, the script always takes the last one (resp. the next one to the first item which belongs to the title) - this is provided through the second "if".
      Unfortunately, some items do not start on a new line but follow the previous item on the same line. That is why I try to insert newlines there (the first two regexes). This seems to be somehow ignored later: though these items appear on new lines in the output, they are not preceded by "title followed by semicolon" as needed.
      Cannot solve this, need your help.
      Thanks in advance. VE
Re: Insert newline
by Not_a_Number (Prior) on Sep 14, 2011 at 18:50 UTC

    This seems to work with your sample data (some additional test cases added):

    use strict;
    use warnings;
    use 5.010;

    my $pat1 = qr |(\d-\d\w{2}\.\w+)|;    # eg 3-123.7 2-3cd.e
    my $pat2 = qr |([A-Z]\d{2}\.\d+)|;    # eg A12.34 Z56.78

    my $title;

    while ( <DATA> ) {
        chomp;
        if ( /(.+)(\(.+\))$/ ) {    # Better regex for 'title' lines??
            $title = "$1$2;";
        }
        else {
            next unless /$pat1|$pat2/;
            my @items = grep length, split /$pat1|$pat2/;
            say $title, splice @items, 0, 2 while @items;
        }
    }
    __DATA__
    Titel Text (A12-3)
    3-123.7 Just another (small) text 3-123.8 Some more text A12.34 Another item B56.78 Yet another item
    Another Titel Text (B23-9)
    Some trash here
    12-22a **This is trash***
    1-22a.b Just another text 2-3cd.e Some more text W12.34 Another item Z56.78 Yet another item Z56.78 And another!! Z56.7a And another!!!
    Some trash

    Update: Cat walked on keyboard as I was posting. Please advise if you detect paw marks.

      Thank you guys for your great help! I tested your advice on the "big file". More importantly, I seem to understand some of your code :-)
      The patterns
      /^(\d\-\d\w{2}(?:\.\w+)?|[A-Z]\d{2}(?:\.\d+)?)/
      must have the second part
      (?:\.\w+)? resp. (?:\.\d+)?)
      since some item IDs are just M32 or 6-317.
      It seems that I cannot use grep length ... and splice ... with the "?"-part, since the item text gets cut into pieces. Perhaps I am missing something (I am a novice in Perl); I only learned the grep length construction now, from reading your code.
      There are some "trash lines" with parentheses, so the title line can be distinguished only by having its parenthesised part at the very end.
      So far the code from thewebsi seems to work best with the file. There are still some trash lines in the output, but comparatively few, and they can be filtered out by data content.
      Thank you all again - for the code and for the class hour!
      VE

        The patterns (...) must have the second part

        (?:\.\w+)? resp. (?:\.\d+)?)

        That's easy, just add the 'second part':

        my $pat1 = qr '(\d-\d\w{2}(?:\.\w+)?)';
        my $pat2 = qr '([A-Z]\d{2}(?:\.\d+)?)';

        Concerning grep length, its purpose is simply to filter out empty (and undef) items from the list created by splitting the line on the /$pat1|$pat2/ regex.
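
        A minimal sketch of what split and the filter are doing there, assuming the two patterns with the optional 'second part'. The helper name split_items and the sample line are made up for illustration, and the filter is spelled out as defined && length so that it stays quiet under warnings on older perls:

```perl
use strict;
use warnings;

# The two ID patterns, including the optional 'second part'.
my $pat1 = qr/(\d-\d\w{2}(?:\.\w+)?)/;
my $pat2 = qr/([A-Z]\d{2}(?:\.\d+)?)/;

# Hypothetical helper: split a line on the ID patterns and drop the
# empty and undef fields that split produces along the way.
sub split_items {
    my ($line) = @_;
    # Because both patterns capture, split also returns the IDs themselves.
    # The capture group that did not take part in a match comes back as undef.
    return grep { defined && length } split /$pat1|$pat2/, $line;
}

# Illustrative input: two items run together on one line.
my @fields = split_items('3-123.7 Just another text A12.34 Another item');
print "[$_]\n" for @fields;
```

        The surviving fields alternate between an ID and the text that follows it, which is why the original code can take them off two at a time with splice.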

        hth, dave

        Update: Concerning the title line, you say that it 'can be distinguished as it has parenthesis part at the end only'. My tentative regex provides for this. But you also say 'There are some "trash lines" with parenthesis'. Well, what if these "trash lines" actually end with a parenthesised item, eg:

        Rabbit rabbit rabbit (rabbit!)

        What I meant by 'Better regex' was something to replace the second .+ by something that matches the ID code(?) of your titles. The only two examples you give are B23-9 and A12-3, so perhaps /[A-Z]\d{2}-\d/ would work. Otherwise, adjust accordingly.
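
        As a rough illustration of the stricter title test, here is a sketch that assumes every title ID looks like the two samples (one capital letter, two digits, a hyphen, one digit). The helper name is_title is hypothetical; adjust the pattern if the real IDs differ:

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the line ends with a parenthesised
# ID shaped like the two samples in the thread (A12-3, B23-9).
sub is_title {
    my ($line) = @_;
    return $line =~ /\([A-Z]\d{2}-\d\)$/ ? 1 : 0;
}

for my $line ( 'Titel Text (A12-3)', 'Rabbit rabbit rabbit (rabbit!)' ) {
    printf "%-32s %s\n", $line, is_title($line) ? 'title' : 'not a title';
}
```

        With the loose /\(.+\)$/ test both lines would count as titles; with the ID-shaped pattern only the first one does.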

Re: Insert newline
by pvaldes (Chaplain) on Sep 14, 2011 at 18:38 UTC

    maybe something like this

    # open file as usual
    # foreach line of the file
    # DON'T CHOMP, (still)
    if (/\(/) {
        # I have found a "(", this must be a title, so I:
        chomp;
        push @titles, $_;
        # now this is the more problematic part cause you need
        # to extend your search to several lines
        # but you are processing line by line, the idea is that
        # we have a title, now we stop and parse text.

        # 1- pick up all between ) and ( and save this to a $var
        # ie: with \).*?\(
        # unfinished yet, take a look to substr

        # 2 -split the different items in this text-block
        @texts = split /\n/, $var;
        pop @texts;    # the last value is the next title,
                       # so we can discard it safely

        # we enter a inner loop for all items in the array
        foreach (@texts) { print $titles[-1] . '; ' . $_ }
    } else {
        next    # this is not a title, we forget this line
    }    # end of the if-else block
    }    # end of the foreach-block

    This code is not complete, only a rough draft showing the idea. You have two problems here: 1) chase the titles one by one (this should be very easy), and 2) enter an "inner parse mode" after a title has been found: stop the hunting and process all text until the next title line.

    Update: you need to empty the @texts array after each foreach pass, so insert a @texts = (); line just after # DON'T CHOMP
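
    One way the two-step idea (find a title, then loop over the items that belong to it) could be made runnable. This is only a sketch under assumptions: instead of extracting the text between ")" and "(" as the draft suggests, it keeps reading line by line and splits each non-title line with a lookahead; the ID pattern is borrowed from the other replies, and the name link_items is made up for illustration:

```perl
use strict;
use warnings;

# Assumed item-ID pattern, taken from the other replies in this thread.
my $id = qr/(?:\d-\d\w{2}(?:\.\w+)?|[A-Z]\d{2}(?:\.\d+)?)/;

# Hypothetical helper: takes the whole input as one string and returns
# a list of "title;item" records.
sub link_items {
    my ($text) = @_;
    my ( $title, @out );
    for my $line ( split /\n/, $text ) {
        # Step 1: chase the titles (lines ending in a parenthesised ID).
        if ( $line =~ /\(\w+-\d+\)$/ ) {
            $title = $line;
            next;
        }
        next unless defined $title;    # ignore anything before the first title

        # Step 2: inner loop over the items on this line. The zero-width
        # lookahead splits on the whitespace *before* each ID, so the ID
        # itself stays attached to its item.
        for my $item ( split /\s+(?=$id\s)/, $line ) {
            push @out, "$title;$item" if $item =~ /^$id\s/;
        }
    }
    return @out;
}

print "$_\n" for link_items(<<'END');
Titel Text (A12-3)
3-123.7 Just another text 3-123.8 Some more text
Some trash
END
```

    Lines that do not start with an ID after splitting, such as the trash line in the sample, are simply skipped.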