Thank you all for your efforts. I see I should have been clearer about what I want. Here goes.
The script I'm writing will process about 180,000 lines of a text file, each of which is the title of a work of
music. The data is a mess: there's no consistency at all. My job is to put it into a consistent
format.
To take out the instrument type, I used this regex:
if ( $work =~ / (For [^,^\(^#^-]+)/ )
...
...to grab anything between the word "For" and a comma, open parenthesis, pound sign, or hyphen. No problem. But
I saw that some works have the text "Transcribed, For" OR "Arranged, For". In those cases, I want to grab
the word "Transcribed" OR "Arranged" as well.
state-o-dis-array's solution would seem to work, but doesn't. It actually doesn't even catch a simple example
like the first below:
INPUT:
Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices
DESIRED POST-REGEX OUTPUT:
Tirana Alla Spagnola (Rossinizzatta), (Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song
(NOTE that I'm saving the instrument type in another variable, but that's not the problem.)
To answer abualiga's question, I need the '?' quantifier (zero or one, i.e. optional) because not every line will have
"Transcribed" or "Arranged" in it. I realize I could use a few different regexes connected with the OR operator '||',
but I have so many different cases that it starts getting very long and tedious. It may come to that, though.
To answer ww and space_monk, I think I need to have the '\b' word boundaries in there so I can use the '?'.
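For illustration, here is a minimal sketch of the two roles '?' can play. The pattern and the sub name parse_instrument are assumptions made up for this example, not code from any reply. After a group, '?' makes the whole group optional; after another quantifier (as in '+?'), it makes that quantifier non-greedy. The optional group does not depend on '\b'; the word boundaries here only prevent matches in the middle of a word.

```perl
use strict;
use warnings;

# Illustrative pattern (an assumption, not taken from the thread).
# Inside a character class only the first '^' negates; any later '^'
# would be a literal caret, so it is written once here.
my $instrument_rx = qr{
    (?: \b (Transcribed|Arranged) , \s+ )?   # '?' after a group: the group may be absent
    \b (For \s [^-,(\x23]+?)                 # '+?': '?' after a quantifier makes it non-greedy
    \s* (?= [-,(\x23] | \z )                 # stop before a delimiter or at end of string
}x;

# Hypothetical helper: returns (marker, instrument phrase), or an empty list
# when the line has no "For ..." part. The marker is '' when absent.
sub parse_instrument {
    my ($work) = @_;
    my ($kind, $instrument) = $work =~ $instrument_rx
        or return;
    return ( $kind // '', $instrument );
}

for my $work (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
) {
    my ($kind, $instrument) = parse_instrument($work);
    print "[$kind] [$instrument]\n";
}
```

Because the regex engine reports the leftmost match, a line containing "Transcribed, For" matches starting at "Transcribed" rather than at "For", so the marker is captured whenever it is present.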
>perl -wMstrict -le
"my @input = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
    );
 ;;
 my $not_source = qr{ [^-,^(\x23] }xms;
 my $kruft      = qr{ \s* ,? \s* }xms;
 my $ar_tr      = qr{ Arranged | Transcribed }xms;
 my $at_for     = qr{
     $kruft $ar_tr? $kruft For $not_source+ $kruft
     }xms;
 my $rx_title   = qr{ (?! $at_for) . }xms;
 ;;
 for (@input) {
     print qq{[[$_]]};
     my ($title, $source) =
         m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms
         ;
     print qq{:$title: :$source:};
     }
"
[[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
:Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
[[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
:A La Chapelle Sixtine: :S 360 (Lw G26):
[[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
:All Through The Night, Traditional Welsh Song: ::
Updates:
- I have since tried this code as a regular source file with accented characters, and it seems to work.
- Actually, m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set), and is probably a bit faster.
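As a rough sketch of both updates together, the simplified match can be put in a regular source file that contains accented characters. The sub name split_title is a made-up helper for this example, and the file is assumed to be saved as UTF-8. The character class is written without the second '^' of the one-liner, since inside a class only the first '^' negates and a later one is a literal caret.

```perl
use strict;
use warnings;
use utf8;                                # assumption: this source file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';      # emit the accented characters as UTF-8

my $not_source = qr{ [^-,(\x23] }xms;    # anything but hyphen, comma, open paren, pound sign
my $kruft      = qr{ \s* ,? \s* }xms;    # optional comma with surrounding whitespace
my $ar_tr      = qr{ Arranged | Transcribed }xms;
my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;

# Hypothetical helper: returns (title, source), or an empty list if there
# is no "For ..." part in the line.
sub split_title {
    my ($line) = @_;
    return $line =~ m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms;
}

for my $line (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30',
    'À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
) {
    my ($title, $source) = split_title($line);
    print ":$title: :$source:\n";
}
```

The non-greedy (.*?) stops at the first position where $at_for can match, which is what the negative lookahead in $rx_title enforced character by character.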
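#!/bin/perl
use strict;
use warnings;

my $file = 'regex.txt';
open my $fh, "<", $file
    or die "could not open $file: $!";
while (<$fh>) {
    chomp;
    # $1 = title, $2 = optional Transcribed/Arranged, $3 = instrument, $4 = remainder
    if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/ ) {
        my $kind = $2 // '';    # $2 is undef when neither word is present
        print "$1$4\n";
        print STDERR "# Instrument: $3 T/A:$kind\n";
    }
}
close $fh;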
Using your input, the output of ./program.pl 2>/dev/null is:
Tirana Alla Spagnola (Rossinizzatta)(Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song
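Note that the first output line above drops the ", " between the title and the parenthetical, which the desired output keeps. One possible adjustment, offered only as a sketch: capture the title and the remainder separately and rejoin them with ", " when a remainder exists. The sub name clean_title and the exact pattern are assumptions made for this example, and it has only been checked against the three sample lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the cleaned title,
# the instrument phrase, and 'Transcribed'/'Arranged' ('' if absent).
# Lines without a "For ..." part are returned unchanged.
sub clean_title {
    my ($work) = @_;
    my ($title, $kind, $instrument, $rest) = $work =~ m{
        \A (.*?)                               # title, as short as possible
        [\s,]*                                 # separator before the marker
        (?: (Transcribed|Arranged) , \s* )?    # optional marker
        \b For \s+ ( [^-,(\x23]+ )             # instrument phrase
        [\s,]*                                 # separator after it
        (.*) \z                                # whatever follows
    }xs or return ( $work, '', '' );

    $kind //= '';                  # undef when the optional group did not match
    $instrument =~ s/\s+\z//;      # drop trailing whitespace before a '('
    my $clean = length $rest ? "$title, $rest" : $title;
    return ( $clean, $instrument, $kind );
}

while ( my $line = <DATA> ) {
    chomp $line;
    my ($clean, $instrument, $kind) = clean_title($line);
    print "$clean\n";
    print STDERR "# Instrument: $instrument T/A:$kind\n";
}

__DATA__
Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30
A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices
```

Rejoining with a fixed ", " normalizes the separator instead of trying to preserve whatever punctuation surrounded the removed text, which seems a reasonable trade-off for data this inconsistent.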