timesink has asked for the wisdom of the Perl Monks concerning the following question:

Greetings wise monks,

I want to write a script to sort fansubs. For that purpose I need to find out which group made the file, the name of the file, etc.

My problem is the following:
I start by trying to find out which group released the file.

Let's say the file is:
[Ureshii]_Amatsuki_-_07_[VORBIS-H264][BCCEA15E].mkv
which I put into $teststring (for the moment, of course ;) )

The result has to be "Ureshii"

I now have the following code:
$teststring =~ /([A-Z]*)]/i;
print $teststring."\n";
print $1."\n";
Works just fine.
The real problem starts if I have more than one group (more than one way of naming the releases).

The file could also be:
Ureshii Amatsuki - 07 [VORBIS-H264][BCCEA15E].mkv (or something like that)

There are various possibilities and I can't find a way to create a regex that retrieves the first word (the group).
I tried variants of the code, reasoning that the regex tries to find something matching my pattern: first a character in [A-Z] (with /i to include [a-z] and keep the regex shorter and easier to understand), then a run of unknown length of further characters (also in [A-Z]). Then it finds a space, a "]" or a "_", and the search stops (or should :P).

Does any of you know how I can do that, or know a site that explains regexes? I visited selfhtml http://de.selfhtml.org/perl/sprache/regexpr.htm (a German site about all kinds of languages) but after 3-5 hours of trying to "fix" my regex I feel kinda... helpless :(

Any hint would really help me, so I hope someone finds the time to give me some advice :)

Bye for now,
timesink

-----------------------
Edit says: "problem solved"


thanks apl for the page :D forgot about the documentation there :(
selfhtml is a nice site but (of course) this one is much more "complete"
so with this site & the help of another programmer I updated my regex to:
$teststring =~ /([A-Z]*)[_\][:space:]]/i
I have to include every new "end"-sign but I think that's ok :)

I tried to set up something like this before but it didn't work because I had a syntax error in it.

Around three hours of 'work' and then something like this xD But that's programming, huh ;)?

Now I'll try to include the other info like used Codec/Checksum etc but I think with the solution above that I will be able to solve this "remaining" problem on my own.
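
For the remaining fields, one possible starting point is sketched below. It assumes the codec tag and the checksum always appear as "[CODEC][CHECKSUM]" directly before the extension, and that the checksum is 8 hex digits, as in the sample file name. The sub name extract_tags is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pulls the codec tag and the checksum out of the
# trailing "[CODEC][CHECKSUM].ext" part of a release name.
# Assumption: the checksum is exactly 8 hex digits, as in the sample.
sub extract_tags {
    my ($name) = @_;
    if ($name =~ /\[([^\]]+)\]\[([0-9A-Fa-f]{8})\]\.\w+$/) {
        return (codec => $1, checksum => $2);
    }
    return;    # nothing returned when the name does not fit the pattern
}

my %tags = extract_tags('[Ureshii]_Amatsuki_-_07_[VORBIS-H264][BCCEA15E].mkv');
print "$tags{codec} $tags{checksum}\n";
```

Names that use a different layout would need their own pattern.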

Thanks to everyone who helped me, I'm really grateful to you all :)

Bye for now,
timesink

Re: Problem with RegEx & various "endings"
by dragonchild (Archbishop) on Jul 01, 2008 at 13:54 UTC
    It sounds like you need to normalize first, then extract. So, maybe the first thing is to $string =~ s/_/ //g; to convert all the underscores into spaces.

    Another strategy could be to get rid of all the things you can easily identify. So, you know where the file extension (.mkv) is, so get rid of it. The BCCEA15E, the VORBIS, and the "- 07" can all also go. Then, you're left with a much smaller thing to work with.
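
    A minimal sketch of the "normalize first, then extract" idea, assuming names shaped like the two samples in the question. The sub name group_from_name and the exact clean-up steps are illustrative guesses, not a tested recipe for every release style.

```perl
use strict;
use warnings;

# Hypothetical helper: strip what is easy to identify, then take the
# first word of whatever is left.
sub group_from_name {
    my ($name) = @_;
    $name =~ s/\.\w+$//;            # drop the file extension (.mkv)
    $name =~ s/_/ /g;               # underscores become spaces
    $name =~ s/^\[([^\]]+)\]/$1/;   # unwrap a leading [Group] tag
    $name =~ s/\[[^\]]*\]//g;       # drop remaining [...] tags (codec, checksum)
    $name =~ s/\s+-\s+\d+\s*$//;    # drop a trailing " - 07" episode marker
    my ($group) = $name =~ /^\s*(\S+)/;   # first word that is left
    return $group;
}

print group_from_name('[Ureshii]_Amatsuki_-_07_[VORBIS-H264][BCCEA15E].mkv'), "\n";
print group_from_name('Ureshii Amatsuki - 07 [VORBIS-H264][BCCEA15E].mkv'), "\n";
```

    Both sample names reduce to "Ureshii Amatsuki" before the last step, so the same extraction works for either convention.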


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Problem with RegEx & various "endings"
by apl (Monsignor) on Jul 01, 2008 at 13:55 UTC
      thanks for the page :D forgot about the documentation there :(
      selfhtml is a nice site but (of course) this one is much more "complete"
      so with this site & the help of another programmer I updated my regex to:
      $teststring =~ /([A-Z]*)[_\][:space:]]/i
      I have to include every new "end"-sign but I think that's ok :)

      I tried to set up something like this before but it didn't work because I had a syntax error in it.

      Around three hours of 'work' and then something like this xD But that's programming, huh ;)?

      Now I'll try to include the other info like used Codec/Checksum etc but I think with the solution above that I will be able to solve this "remaining" problem on my own.

      Thanks to everyone who helped me, I'm really grateful to you all :)

      Bye for now,
      timesink
        Actually, if it really is _so simple_ and the rule is "use the first word", you don't need to specify the stop characters; the greediness of the '+' will do the job for you:
        $teststring =~ /(\w+)/ ...
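
        One caveat worth checking before relying on this: \w also matches the underscore, so a name that joins its words with "_" and has no leading bracket would yield more than the group. The sketch below compares \w+ with a narrower class; the sub names are made up here, and [A-Za-z0-9] is only an assumption about what a group name may contain.

```perl
use strict;
use warnings;

# First run of \w characters (letters, digits and underscore).
sub first_word {
    my ($name) = @_;
    my ($word) = $name =~ /(\w+)/;
    return $word;
}

# Variant without the underscore, so "_" also ends the word.
sub first_alnum {
    my ($name) = @_;
    my ($word) = $name =~ /([A-Za-z0-9]+)/;
    return $word;
}

print first_word('[Ureshii]_Amatsuki_-_07.mkv'), "\n";   # Ureshii
print first_word('Ureshii_Amatsuki_-_07.mkv'), "\n";     # Ureshii_Amatsuki_
print first_alnum('Ureshii_Amatsuki_-_07.mkv'), "\n";    # Ureshii
```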
Re: Problem with RegEx & various "endings"
by waldner (Beadle) on Jul 01, 2008 at 14:30 UTC
    Maybe I don't understand correctly...but what about
    $teststring =~ /^[[]?([^] _]+)/;
    This puts the first word of the string in $1, where "word" means the first sequence of characters ended by "]", " " or "_". Add other delimiters if needed. For the "]" case, it also allows an optional leading "[".
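
    To compare this pattern against the sample names, a small harness along these lines may help. The wrapper name leading_group and the third test name are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern above: optional leading "[",
# then everything up to the first "]", space or underscore.
sub leading_group {
    my ($name) = @_;
    return $name =~ /^[[]?([^] _]+)/ ? $1 : undef;
}

for my $name (
    '[Ureshii]_Amatsuki_-_07_[VORBIS-H264][BCCEA15E].mkv',
    'Ureshii Amatsuki - 07 [VORBIS-H264][BCCEA15E].mkv',
    'Ureshii_Amatsuki_-_07.mkv',
) {
    my $group = leading_group($name);
    print defined $group ? $group : '(no match)', "\n";
}
```

    All three names print "Ureshii"; a name that starts with a delimiter other than "[" would print "(no match)".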
      thanks for your solution :)
      With the help of apl & another programmer I came up with a similar solution but I will compare both and see which one is better (probably not mine :()

      And I'm really sorry if I didn't ask clearly enough :(
Re: Problem with RegEx & various "endings"
by pjotrik (Friar) on Jul 01, 2008 at 14:00 UTC
    I wouldn't try to fit everything into one regexp; try creating a series of regexes like:
    my $group;
    if ($teststring =~ /^\[(\w+)]/) {
        $group = $1;
    } elsif ($teststring =~ /^(\w+)/) {
        $group = $1;
    ...
    (the regexes may not be perfect, I'm just writing them without testing)
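
    The same idea could be packaged as a sub that tries one naming convention after another, roughly as below. This is only a sketch: find_group is a made-up name, and the character classes differ slightly from the \w+ above ([^\]]+ accepts anything inside the brackets, [A-Za-z0-9]+ stops at an underscore), which is an assumption about how groups name themselves.

```perl
use strict;
use warnings;

# Hypothetical helper: one pattern per naming convention, first hit wins.
sub find_group {
    my ($name) = @_;
    return $1 if $name =~ /^\[([^\]]+)\]/;     # [Group]_Title...
    return $1 if $name =~ /^([A-Za-z0-9]+)/;   # Group Title... or Group_Title...
    return;                                    # unknown convention
}

print find_group('[Ureshii]_Amatsuki_-_07_[VORBIS-H264][BCCEA15E].mkv'), "\n";
print find_group('Ureshii Amatsuki - 07 [VORBIS-H264][BCCEA15E].mkv'), "\n";
```

    A new release style then only needs one more line in the sub, instead of a change to an ever-growing single regex.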