PerlMonks  

Regex Parsing Chars in a Line

by kel (Sexton)
on Nov 24, 2019 at 20:30 UTC ( [id://11109157]=perlquestion )

kel has asked for the wisdom of the Perl Monks concerning the following question:

Greeting Monks!

It's been a while since my last post, and my skills are starting to rust.... However: I need to retitle a large collection, and most of my Perl scripts work remarkably well for most situations. But I am being vexed by a particular situation. I use hyphens as *field* separators in parsing. Normally books have a simple author-title field with a single hyphen. Not a problem. Some books have an author-series-title field. I use split to rearrange when necessary.

Now some go hyphen-happy, such as 'A A Milne - Winnie-The-Pooh and Silver-Bear vol5-12 - Xi Press - Peking (1998)'. For my purposes I wish only to keep the first and last parenthesis.

#Reduce excessive hyphens (to underscores)
$name =~ s/^(.+\-.+)(\-)+(.+\-.+$)/$1\_$3/g ;

This does work, but it often requires multiple runs.
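A note on why repeated runs are needed: the pattern is anchored with ^, so /g can match at most once per call, and the greedy first group means each call converts only one hyphen (the second-to-last one). The sketch below, using a simplified form of the pattern and a hypothetical helper name reduce_hyphens, simply repeats the substitution until it stops matching and counts the passes.

```perl
use strict;
use warnings;

# Hypothetical helper: repeat the single-hyphen substitution until no
# hyphen is left between the first and the last one.
sub reduce_hyphens {
    my ($name) = @_;
    my $passes = 0;
    # Each successful pass turns exactly one middle hyphen into an
    # underscore, so the loop ends once only two hyphens remain.
    $passes++ while $name =~ s/^(.+-.+)-(.+-.+)$/${1}_$2/;
    return ($name, $passes);
}

my ($fixed, $passes) = reduce_hyphens('a-b-c-d-e');
print "$fixed after $passes passes\n";
```

A title with n hyphens needs n-2 passes this way, which is why a single run of the original substitution is rarely enough.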

What is peculiar is that I do have a filter

$name =~ s/(\[.+)(\-)(.+\])/$1_$3/ig;

That removes underscores in brackets, but does not work on first attempt, and is much later in the code. I have tried look-behinds and look-aheads without much success with this type of problem.

(such as $name =~ s/(\()(.+)(?!\))/$1$2\)/g ; #to close parenthesis at end)

Which basically is:

Permitting a word or character at first and last instance, and filtering all middle instances.

Many thanks in advance!

K

Replies are listed 'Best First'.
Re: Regex Parsing Chars in a Line
by tybalt89 (Monsignor) on Nov 25, 2019 at 03:35 UTC

    Did you mean "first and last hyphen" ? If so:

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11109157
    use warnings;

    my $line = 'A A Milne - Winnie-The-Pooh and Silver-Bear vol5-12 - Xi Press - Peking (1998)';
    print "$line\n";
    $line =~ s/-\K.*(?=-)/ $& =~ tr'-'_'r /e;
    print "$line\n";

    Outputs:

    A A Milne - Winnie-The-Pooh and Silver-Bear vol5-12 - Xi Press - Peking (1998)
    A A Milne - Winnie_The_Pooh and Silver_Bear vol5_12 _ Xi Press - Peking (1998)

    If this is not what you want, how about providing several test cases along with expected output.

      This seems to be what I am looking for: a simple (if complex) bit of code to keep only the first and last hyphens in a book or multimedia title. I am unfamiliar with the \K operator, and a quick google comes up with nothing. I do not normally use tr, but I see here that it should be in my toolbox. Many thanks!
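For readers who have not met \K or tr///r, here is the same substitution wrapped in a sub with each piece commented. The helper name keep_outer_hyphens and the sample titles are only illustrative; \K needs Perl 5.10 or later and the /r flag on tr needs 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the reply above.
sub keep_outer_hyphens {
    my ($title) = @_;
    # -\K    match the first hyphen, then "keep" it: everything matched so
    #        far is left out of the text that gets replaced
    # .*     greedily take the rest of the line, backtracking until ...
    # (?=-)  ... the next character is a hyphen (the last one); the
    #        lookahead does not consume it
    # /e     evaluate the replacement as code: tr with /r returns a copy
    #        of the matched text ($&) with hyphens turned into underscores
    $title =~ s/-\K.*(?=-)/ $& =~ tr{-}{_}r /e;
    return $title;
}

print keep_outer_hyphens('Author - Series-One - Part-2 - Title'), "\n";
print keep_outer_hyphens('Author - Title'), "\n";   # one hyphen: no match, unchanged
```

Because the pattern needs a second hyphen somewhere after the first, a plain author-title name is left untouched.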
Re: Regex Parsing Chars in a Line
by swl (Parson) on Nov 24, 2019 at 20:53 UTC

    The . metacharacter matches spaces as well. From your input data it would appear that any hyphen surrounded by spaces is what you want to separate on? For example qr/\s\-\s/

    You might also look at using Text::CSV_XS for the parsing. See https://metacpan.org/pod/Text::CSV_XS#sep for how to use multiple characters.

      By .+ I am targeting adjacent to anything, rather than spaces. And converting all but the first and last to underscores. I see a typo in my description that unnecessarily obfuscates....

        Maybe you could avoid the regex approach and use split and join?

        # untested
        my @parts = split /-/, $input;
        my $first = shift @parts;
        my $last = pop @parts;
        my $reassembled;
        # then
        $reassembled = $first . '-' . join ('_', @parts) . '-' . $last;
        # or something like this
        $reassembled = join '_', @parts;
        $reassembled = join '-', ($first, $reassembled, $last);

        The above assumes no CSV type quoting issues with embedded separators, for which a proper CSV parser like Text::CSV is needed.
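A slightly more defensive sketch of the split and join idea follows. The helper name rejoin_fields is made up for illustration; the guard for names with fewer than two hyphens is an addition, since without it a single-hyphen name would come back with a doubled hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first and last hyphen, underscore the rest.
sub rejoin_fields {
    my ($input) = @_;
    # A limit of -1 keeps trailing empty fields, so a trailing hyphen survives.
    my @parts = split /-/, $input, -1;
    # Zero or one hyphen: there is no "middle" to rewrite.
    return $input if @parts < 3;
    my $first = shift @parts;
    my $last  = pop @parts;
    return join '-', $first, join('_', @parts), $last;
}

print rejoin_fields('Author - Series-One - Part-2 - Title'), "\n";
print rejoin_fields('Author - Title'), "\n";
```

This gives the same result as the \K substitution for the Milne example, without relying on regex backtracking.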

        I see a typo in my description ...

        If you have not already done so (I don't see any Update), you might consider updating the errant node to clarify the mistake. Please see How do I change/delete my post?


        Give a man a fish:  <%-{-{-{-<

Re: Regex Parsing Chars in a Line
by rsFalse (Chaplain) on Nov 24, 2019 at 21:23 UTC
    Hi.
    So what is your question? What do you want to achieve and what is your data? Can you elaborate more? You say about parentheses in one place, then about hyphens in another.

    I'll try to rewrite last of your regex in more readable form with '/x' modifier:
    $name =~ s/
        (\()      # $1
        (.+)      # $2
        (?!\))
        /$1$2\)/gx;
      The parenthesis replacement with the lookahead works. I thought that the solution to the hyphen issue should work with lookaheads/lookbehinds, but it didn't. I apologize for the confusion, but that example was only to show that the basic mechanism worked on Perl v26 in Win. Oddly, some operators seem to be recalcitrant on Win, as opposed to Linux. Even with stripping permissions/ownership of target files. I am looking for a clean method for parsing hyphens, and my current method often requires a second run.
Re: Regex Parsing Chars in a Line
by AnomalousMonk (Archbishop) on Nov 26, 2019 at 08:30 UTC

    Having a field separator character that may appear unescaped within a field seems like a bad idea. If you can discriminate the existing, genuine field separators well enough to convert non-field separator hyphens to underscores for disambiguation, it should be possible instead to convert the true field separators to unambiguous characters as your very first step and maybe preserve a bit more of your sanity:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $rec = 'A A Milne - Winnie-The-Pooh and Silver-Bear vol5-12 - Xi Press - Peking (1998)';
     print qq{'$rec'};
     ;;
     my $rx_old_sep = qr{ \s+ - \s+ }xms;
     my $new_sep = '|';
     ;;
     $rec =~ s{ $rx_old_sep }{$new_sep}xmsg;
     print qq{'$rec'};
     "
    'A A Milne - Winnie-The-Pooh and Silver-Bear vol5-12 - Xi Press - Peking (1998)'
    'A A Milne|Winnie-The-Pooh and Silver-Bear vol5-12|Xi Press|Peking (1998)'
    The "fixed" file could then be written to disk to await further processing at your leisure.

    Or, again assuming existing field separators are sufficiently unambiguous, just split each record to an array as the very first step and do all processing on the array elements.
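The split-to-array variant might look like the sketch below. It assumes, as above, that true separators are hyphens with whitespace on both sides; the helper name split_record is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into fields on " - " style
# separators, leaving unspaced hyphens inside a field alone.
sub split_record {
    my ($rec) = @_;
    return split /\s+-\s+/, $rec;
}

my @fields = split_record(
    'A A Milne - Winnie-The-Pooh and Silver-Bear vol5-12 - Xi Press - Peking (1998)'
);
print scalar(@fields), " fields\n";
print "  [$_]\n" for @fields;
```

Each later filter can then work on a single element (author, title, and so on) instead of the whole line.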


    Give a man a fish:  <%-{-{-{-<

      The problem is actually that in reality there are often enough extra hyphens *with* surrounding spaces. And it is common enough to see title fields with a space on only one side of the hyphen, and whether the space is present at all can vary. The use of the pipe pulls at me as something to consider, though I would need to test it in both Win and Linux.

      The fun part is hyphenated names. Since these are normally unspaced I can capture them and replace with spaces. The method I currently use is to split with a hyphen and to check the size of the first var. I will try to do this with a regex with /\w{1,7}\-\w.+\-/x check and see if that works.

      My main concern is the lack of infinite lookbehinds, though the newly discovered \K operator helps (thanks to this thread! It seems to work wonders on scripts I have adapted to it). Are there any modules that add extra capability to the Perl regex? It seems Python has one, but I am far too much a noob in that language to do any productive scripting there yet! (And I really do hate the idea of immutable strings...)

        In the OP you wrote:

        I use hyphens as *field* separators in parsing.
        I still don't understand if the files you are processing are produced by someone else in an insane format over which you have no control, or if you are generating these files yourself. If the former, you have my deepest sympathy; been there, done that. If the latter, I beg you either to use a reasonable separator character or to use Tux's excellent (and fast!) Text::CSV_XS module, which can both parse and generate CSV files (since this is what you seem to be trying to do). (And CSV really means Character Separated Values, so don't get hung up on commas.) There's also a Pure Perl Text::CSV_PP non-XS module; see Text::CSV for details.

        ... pipe ... I would need to test it in both Win and Linux.

        The only thing to remember about pipe is that it is a regex metacharacter, so it must be suitably escaped in any split or  qr// m// s/// pattern. I am aware of no differences between Windoze and *nix Perls as regards regex behavior or CSV file access, and such concerns are ameliorated if you use a module like Text::CSV_XS.
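To make the escaping point concrete, here is a small sketch with made-up sample data. An unescaped | in a pattern is an alternation between two empty patterns, so it matches the empty string and split breaks the record into single characters.

```perl
use strict;
use warnings;

my $sep = '|';
my $rec = 'A A Milne|Winnie-The-Pooh|Xi Press|Peking (1998)';

# Unescaped: alternation of two empty patterns, matches between every
# pair of characters, so each character becomes its own field.
my @wrong = split /|/, $rec;

# Escaped with a backslash: splits on the literal pipe.
my @by_escape = split /\|/, $rec;

# \Q...\E quotes whatever is in $sep, handy if the separator is configurable.
my @by_quote = split /\Q$sep\E/, $rec;

print scalar(@wrong), " fields without escaping\n";
print scalar(@by_escape), " fields with \\|\n";
print scalar(@by_quote), " fields with \\Q\$sep\\E\n";
```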

        My main concern is the lack of infinite lookbehinds ...

        I believe support for generalized variable-width (not infinite; nothing's infinite!) lookbehinds was added with Perl version 5.30 or thereabouts. You'll need to check this...
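        Whatever a given Perl version allows inside a lookbehind, \K often covers the same need, because the text before \K may be of any length. A minimal sketch, with an invented task (zero-padding a volume number) and a hypothetical helper name pad_volume: an unbounded lookbehind such as (?<=vol\s*) is, as far as I know, rejected at compile time, while the \K form below works on any Perl from 5.10 on.

```perl
use strict;
use warnings;

# Hypothetical helper: zero-pad the number that follows "vol".
sub pad_volume {
    my ($title) = @_;
    # \bvol\s* is matched but "kept" by \K, so only the digits are replaced;
    # the capture group still works as usual after \K.
    $title =~ s/\bvol\s*\K(\d+)/sprintf '%02d', $1/ie;
    return $title;
}

print pad_volume('Silver-Bear vol5-12'), "\n";
print pad_volume('Some Series Vol  7'), "\n";
```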

        Are there any modules that add extra capability to the Perl regex?

        I don't believe that regex operators can be overloaded as can general Perl operators. I have a vague recollection of having read somewhere on PM that it's possible to replace the entire Perl RE with another; this was described in terms of "It's possible, but..." and it was a big but!


        Give a man a fish:  <%-{-{-{-<

Re: Regex Parsing Chars in a Line
by BillKSmith (Monsignor) on Nov 25, 2019 at 14:53 UTC
    If you accept swl's suggestion that the separator should be qr/ - /, and only use the first two fields, the Milne entry would not be an exception. 'Series' fields would still be a problem. Neither your text nor your code tell us anything about how you might recognize them. You have not even provided a single example. Of course, you could avoid this problem by using the last field instead of the second. Then Milne would be a problem again, not because of the extra hyphens, but because it does not fit the format.
    Bill
      The actual code is real simple but an actual script would be deadly if run in a wrong directory. I have neglected to write safety features in them. But in essence it is simply: Parsing a directory of media files. In a loop, running each file through dozens of filters to reformat names and titles, and to remove any unnecessary desiderata. I use hyphens as my main field separators, so only need two fields: Author and Title, but can accommodate a third. For example it is important in Fiction to keep the Author field first, as that is used to further parse into categories, which is often parsed from the Title (and extra Series/Subtitle) fields. The opposite is true in Nonfiction, where the author is often optional (as the publisher may be more relevant). Plus, there are *many* scripts and functions, each used as needed. Some will create underscores, and encapsulate dates, others will remove them.
