Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?
Notes:
- I use \x23 instead of a '#' character in the [^-,^(\x23] character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the '^' (caret) characters were doing in this set as originally posted, so I left one in there just for good luck!)
- The input text I use does not have accented characters. I can't display these easily on my console, so the regexes are untested with such characters. (Update: See Update 1 below.)
- As to the input text: Note that the word 'Arranged' in the third record (i.e., line) has no preceding comma: 'Welsh Song Arranged'. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.
- I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together however you want, with commas, whitespace, whatever.
- Sorry for any wrap-around in the code listing.
>perl -wMstrict -le
"my @input = (
'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);
;;
my $not_source = qr{ [^-,^(\x23] }xms;
my $kruft = qr{ \s* ,? \s* }xms;
my $ar_tr = qr{ Arranged | Transcribed }xms;
my $at_for = qr{
$kruft $ar_tr? $kruft For $not_source+ $kruft
}xms;
my $rx_title = qr{ (?! $at_for) . }xms;
;;
for (@input) {
print qq{[[$_]]};
my ($title, $source) =
m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms
;
print qq{:$title: :$source:};
}
"
[[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
:Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
[[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
:A La Chapelle Sixtine: :S 360 (Lw G26):
[[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
:All Through The Night, Traditional Welsh Song: ::
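For anyone who would rather run this from a source file than from the command line, here is a minimal sketch of the same idea. The patterns are copied from the one-liner; the helper names split_record and stitch are my own inventions for illustration, and the comma-and-space join is just one assumed way of stitching the fields back together.

```perl
use strict;
use warnings;

# Patterns copied from the one-liner; \x23 is the '#' character.
my $not_source = qr{ [^-,^(\x23] }xms;
my $kruft      = qr{ \s* ,? \s* }xms;
my $ar_tr      = qr{ Arranged | Transcribed }xms;
my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
my $rx_title   = qr{ (?! $at_for) . }xms;

# Hypothetical helper: returns (title, source) in list context,
# or an empty list if the record does not match.
sub split_record {
    my ($record) = @_;
    return $record =~ m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;
}

# Hypothetical helper: joins the non-empty fields with a comma and a space.
sub stitch {
    my @fields = grep { defined($_) && length($_) } @_;
    return join ', ', @fields;
}

my @input = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);

for my $record (@input) {
    my $line = stitch(split_record($record));
    print "$line\n";
}
```

The third record has an empty source field, so stitch leaves it out rather than emitting a trailing comma.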
Updates:
- I have since tried this code as a regular source file with accented characters, and it seems to work.
- Actually, m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set), and is probably a bit faster.
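The "works just as well" part of the second update can be checked with a small script like the sketch below, which runs both patterns over the three test records and reports whether the captures agree. The names $rx_tempered, $rx_lazy and $mismatches are made up for this illustration. It says nothing about speed; that would need a separate measurement, for example with the core Benchmark module.

```perl
use strict;
use warnings;

# Building blocks copied from the one-liner; \x23 is the '#' character.
my $not_source = qr{ [^-,^(\x23] }xms;
my $kruft      = qr{ \s* ,? \s* }xms;
my $ar_tr      = qr{ Arranged | Transcribed }xms;
my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
my $rx_title   = qr{ (?! $at_for) . }xms;

# The two whole-record patterns under comparison (names are hypothetical).
my $rx_tempered = qr{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;
my $rx_lazy     = qr{ \A \s* (.*?)        $at_for (.*?) \s* \z }xms;

my @input = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);

my $mismatches = 0;
for my $record (@input) {
    # In list context each match returns its two captures.
    my @tempered = $record =~ $rx_tempered;
    my @lazy     = $record =~ $rx_lazy;

    # Compare the capture lists field by field via a NUL-joined string.
    my $same = join("\0", @tempered) eq join("\0", @lazy);
    $mismatches++ unless $same;
    printf "%-9s %s\n", ($same ? 'same' : 'DIFFERENT'), $record;
}
print "mismatches: $mismatches\n";
```

One difference to keep in mind with other input: (.*?) can match an empty title, whereas ($rx_title+) requires at least one character.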