Re: Try this

Thank you all for your efforts. I see I should have been clearer about what I want. Here goes.

The script I'm writing will process about 180,000 lines of a text file, each of which is the title of a work of music. The data is a mess: there's no consistency at all. My job is to put it into a consistent format.

To take out the instrument type, I used this regex:

    if ( $work =~ / (For [^,^\(^#^-]+)/ )
...

...to grab anything between the word "For" and a comma, open parenthesis, pound sign, or hyphen. No problem. But I saw that some works have the text "Transcribed, For" OR "Arranged, For". In those cases, I want to grab the word "Transcribed" OR "Arranged" as well.
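As a side note on the character class, here is a minimal sketch of what the extra carets do. Only the first `^` inside `[...]` negates the class; later ones are literal characters. The variable names `$as_posted` and `$trimmed` are made up for this illustration.

```perl
use strict;
use warnings;

# Character class exactly as posted: the first ^ negates, the later ones are
# literal, so a '^' in the data would also stop the match.
my $as_posted = qr/\A[^,^\(^#^-]\z/;

# Same class with the extra carets removed: excludes only , ( # and -
my $trimmed = qr/\A[^,(#-]\z/;

for my $char ( ',', '(', '#', '-', '^', 'a' ) {
    printf "%-2s as_posted=%s trimmed=%s\n",
        $char,
        ( $char =~ $as_posted ? 'match' : 'no' ),
        ( $char =~ $trimmed   ? 'match' : 'no' );
}
```

The two classes differ only on a literal `^`, so for titles that never contain a caret they should behave the same.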

state-o-dis-array's solution would seem to work, but doesn't. It actually doesn't even catch a simple example like the first below:

INPUT:
Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices

DESIRED POST-REGEX OUTPUT:
Tirana Alla Spagnola (Rossinizzatta), (Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song

(NOTE that I'm saving the instrument type in another variable, but that's not the problem.) To answer abualiga's question, I need the '?' (optional, zero-or-one) quantifier because not every line will have "Transcribed" or "Arranged" in it. I realize I could use a few different regexes connected with the OR operator '||', but I have so many different cases that it starts getting very long and tedious. It may come to that, though.

To answer ww and space_monk, I think I need to have the '\b' word boundaries in there so I can use the '?'.
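For what it's worth, here is a small sketch of how `?` binds, under the assumption that the goal is to make the whole "Transcribed, " or "Arranged, " prefix optional. The quantifier applies to the single atom or group just before it, so a non-capturing group `(?:...)` around the prefix is what makes the whole word optional, with or without `\b`. The names `$atom_only` and `$whole_group` are invented for the example, and `//` needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# '?' binds to the single atom before it: here only the final 'd' is optional.
my $atom_only = qr/Transcribed?, For/;

# Grouping makes the whole prefix (word, comma, whitespace) optional.
# $1 is the modifier (undef when absent), $2 is the instrument text.
my $whole_group = qr/(?:(Transcribed|Arranged),\s+)?For\s+([^,(#-]+)/;

for my $work (
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
    'Some Title, For Piano',    # made-up line with no modifier
) {
    if ( $work =~ $whole_group ) {
        printf "modifier=%s instrument=%s\n", $1 // '(none)', $2;
    }
}
```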

Re^2: Try this by AnomalousMonk (Archbishop) on Nov 02, 2012 at 01:14 UTC
Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?

Notes:

I use `\x23` instead of a `'#'` character in the `[^-,^(\x23]` character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the `'^'` (caret) characters were doing in this set as originally posted, so I left one in there just for good luck!)

The input text I use does not have accented characters. I can't display these easily on my console and so cannot test them. ~~Because I do not use accented characters in my test input text, the regexes are untested with such characters.~~ (Update: See Update 1 below.)

As to the input text: Note that the word `'Arranged'` in the third record (i.e., line) has no preceding comma: `'Welsh Song Arranged'`. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.

I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together how you want, with commas, whitespace, whatever.

Sorry for any wrap-around in the code listing.

    >perl -wMstrict -le
    "my @input = (
       'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
       'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
       'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
       );
     ;;
     my $not_source = qr{ [^-,^(\x23] }xms;
     my $kruft      = qr{ \s* ,? \s* }xms;
     my $ar_tr      = qr{ Arranged | Transcribed }xms;
     my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
     my $rx_title   = qr{ (?! $at_for) . }xms;
     ;;
     for (@input) {
       print qq{[[$_]]};
       my ($title, $source) =
           m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms
           ;
       print qq{:$title: :$source:};
       }
    "
    [[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
    :Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
    [[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
    :A La Chapelle Sixtine: :S 360 (Lw G26):
    [[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
    :All Through The Night, Traditional Welsh Song: ::

Updates:

1. I have since tried this code as a regular source file with accented characters, and it seems to work.
2. Actually, `m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms` works just as well (for the limited test set) and is probably a bit faster.
Re^2: Try again :-) by space_monk (Chaplain) on Nov 02, 2012 at 09:56 UTC
Okay:

    #!/bin/perl
    my $file='regex.txt';
    open my $fh, "<", $file or die "could not open $file: $!";
    while (<$fh>) {
        chomp;
        if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
            print "$1$4\n";
            print STDERR "# Instrument: $3 T/A:$2\n";
        }
    }

Using your input, `./program.pl 2>/dev/null` output is:

    Tirana Alla Spagnola (Rossinizzatta)(Péchés De Vieillesse, Book 1), Qr Iv/30
    À La Chapelle Sixtine, S 360 (Lw G26)
    All Through The Night, Traditional Welsh Song
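The first output line above loses the ", " that the desired output keeps before "(Péchés". One possible variation, sketched here under the assumption that only the three sample shapes need to be handled, is to delete the clause with `s///` and then tidy up any leftover commas. The helper name `strip_for_clause` and the two cleanup passes are illustrative choices, not something from the original posts, and real data will likely need more cases.

```perl
use strict;
use warnings;

# Hypothetical helper: removes the "For ..." clause (plus an optional
# "Transcribed, " / "Arranged, " prefix) from a title.
# Returns (cleaned title, instrument, modifier); the last two are undef
# when no "For" clause is found.
sub strip_for_clause {
    my ($work) = @_;
    my ( $modifier, $instrument );
    if ( $work =~ s/(?:(Transcribed|Arranged),\s+)?\bFor\s+([^,(#-]+)// ) {
        ( $modifier, $instrument ) = ( $1, $2 );
        $instrument =~ s/\s+\z//;    # trim the space left before a '('
    }
    $work =~ s/,\s*,/,/g;     # collapse a ", ," left behind mid-line
    $work =~ s/[\s,]+\z//;    # drop trailing whitespace and commas
    return ( $work, $instrument, $modifier );
}

my @samples = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30',
    'À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);

for my $line (@samples) {
    my ($title) = strip_for_clause($line);
    print "$title\n";
}
```

On the three sample lines this should reproduce the DESIRED POST-REGEX OUTPUT from the original post, including the ", " before the parenthesis on the first line.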