PoorLuzer has asked for the wisdom of the Perl Monks concerning the following question:

I have been thinking about a regular expression that can transform a list like this:

1. 10.Things.I.Hate.About.You[1999]DvDrip[Eng]-Ray 699.68 MB
2. 100.Feet.2008.DvDRip-FxM 701.14 MB
3. 11 - 14 1 286.22 MB
4. 13_going_on_30(2004)[Brizzly] 700.23 MB
...
1 523. Waz 699.93 MB
1 524. We.Own.the.Night[2007]DvDrip[Eng]-Ray 700.87 MB
1 525. Webs [2003]DVDRip[Xvid AC3[5.1]-RoCK&BlueLadyRG 1 347.70 MB

into:

10.Things.I.Hate.About.You[1999]DvDrip[Eng]-Ray,699.68 MB
100.Feet.2008.DvDRip-FxM,701.14
11 - 14,1286.22
13_going_on_30(2004)[Brizzly],700.23
...
Waz,699.93
We.Own.the.Night[2007]DvDrip[Eng]-Ray,700.87
Webs [2003]DVDRip[Xvid AC3[5.1]-RoCK&BlueLadyRG,1347.70

Assumption : The filesize is never > 9999.99MB

So far I have a partially working regex:

^[^\.]+\. (.+?) (?:([0-9])(?: ))?([0-9]+\.[0-9]{2}) MB.*$

that maps to

$1:$2$3

to complete the transformation.
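
For reference, here is a minimal Perl sketch of that substitution applied to a few of the sample lines above. The helper name convert_line is hypothetical and only there to make the example self-contained; an unmatched optional group is treated as an empty string.

```perl
use strict;
use warnings;

# Sample lines copied from the list in the question.
my @lines = (
    '1. 10.Things.I.Hate.About.You[1999]DvDrip[Eng]-Ray 699.68 MB',
    '3. 11 - 14 1 286.22 MB',
    '1 523. Waz 699.93 MB',
    '1 525. Webs [2003]DVDRip[Xvid AC3[5.1]-RoCK&BlueLadyRG 1 347.70 MB',
);

# convert_line is a hypothetical helper: it applies the regex from the
# question and joins the captures as $1:$2$3.
sub convert_line {
    my ($line) = @_;
    if ($line =~ /^[^\.]+\. (.+?) (?:([0-9])(?: ))?([0-9]+\.[0-9]{2}) MB.*$/) {
        # $2 is undefined when the size has no separate leading digit.
        my $lead = defined $2 ? $2 : '';
        return join ':', $1, $lead . $3;
    }
    return;    # no match
}

print convert_line($_), "\n" for @lines;
```

Each sample line should come out as the name, a colon, and the size with the thousands digit joined back on.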

I used the colon because no desktop OS would allow that in a filename, so I am safe.

I built the regex without any formal method (i.e., by intuition), and that very same intuition tells me this regex is horrifically complicated and slow!

I wish RegExBuddy had an online version or something similar.

How do I build a better RegEx for the same? Hints, tips...

Is there any free/open tool that will allow me to profile my regex (except writing a Perl script)?

Re: Regex hackery
by markkawika (Monk) on Jun 12, 2009 at 18:33 UTC
    That regex isn't too bad. Your assumption about : not being allowed in a filename is completely wrong. About the only character not allowed in a filename is a directory separator, such as / on Unix.

    But apart from that, I have a few minor suggestions on your regex.

    1. Inside a character class [] a period does not need to be escaped.
    2. You should use /x. It makes your regex easier to read.
    3. \d is usually preferred to [0-9]. It makes your regex more portable.
    4. You have an unnecessary set of parens in your regex: (?: ).
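
    As a quick sanity check, the sketch below compares the original pattern with one that applies points 1, 3 and 4 (without /x, so literal spaces still count). The helper name captures and the two sample lines are only illustrative, and the comparison assumes plain ASCII input.

```perl
use strict;
use warnings;

# The pattern from the question, and the same pattern with the period
# unescaped in the class, \d for [0-9], and the redundant (?: ) removed.
my $original   = qr/^[^\.]+\. (.+?) (?:([0-9])(?: ))?([0-9]+\.[0-9]{2}) MB.*$/;
my $simplified = qr/^[^.]+\. (.+?) (?:(\d) )?(\d+\.\d{2}) MB.*$/;

# captures is a hypothetical helper: it returns the capture groups joined
# with '|', using an empty string for a group that did not take part.
sub captures {
    my ($re, $line) = @_;
    my @groups = $line =~ $re;
    return join '|', map { defined $_ ? $_ : '' } @groups;
}

for my $line ('3. 11 - 14 1 286.22 MB', '1 523. Waz 699.93 MB') {
    my $same = captures($original, $line) eq captures($simplified, $line);
    print "$line => ", ($same ? 'same captures' : 'different captures'), "\n";
}
```

    On these samples both patterns should yield identical captures, so the changes are purely about readability.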

    Rewritten, it would read:

    /
        ^
        [^.]+ \. \s       # Ignore the line numbers
        (.+?) \s          # Capture the file name
        (?:
            (\d) \s       # Capture the optional leading size digit
        ) ?
        (
            \d+ \. \d{2}  # Capture the rest of the size
        )
        \s MB
        .* $
    /x
    After this, your file name is $1, and your size is, as you stated, $2$3.
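
    A minimal sketch of how the /x form might be used follows. It keeps an explicit \s between the file name and the size (the one-line original has a literal space there) so that the name carries no trailing space. The helper name parse_line is hypothetical, and the comments inside the pattern avoid dollar signs and slashes, since Perl would otherwise interpolate them or end the pattern early.

```perl
use strict;
use warnings;

my $re = qr/
    ^ [^.]+ \. \s        # skip the line number and its trailing space
    (.+?) \s             # capture 1: file name, then the separating space
    (?: (\d) \s )?       # capture 2: optional leading size digit
    ( \d+ \. \d{2} )     # capture 3: rest of the size
    \s MB .* $
/x;

# parse_line is a hypothetical helper returning (name, size),
# or an empty list when the line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ $re;
    return ($1, (defined $2 ? $2 : '') . $3);
}

my ($name, $size) =
    parse_line('1 525. Webs [2003]DVDRip[Xvid AC3[5.1]-RoCK&BlueLadyRG 1 347.70 MB');
print "$name,$size\n";
```

    Storing the pattern with qr// keeps the commented layout in one place and lets it be reused for every line of the list.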

    And yes, there is an ambiguity, where if the line was:

    1. go 2 123.45 MB
    The regex would parse "go" as the file name and "2123.45" as the file size. There's no way around this given the format of the input.
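
    To make the ambiguity concrete, here is a small sketch (the helper name split_line is hypothetical). A file called "go 2" of 123.45 MB and a file called "go" of 2123.45 MB would both be listed in ways the pattern cannot tell apart, so it always picks the second reading.

```perl
use strict;
use warnings;

# Same logic as the pattern in the question, written without /x,
# so the literal spaces are significant.
my $re = qr/^[^.]+\. (.+?) (?:(\d) )?(\d+\.\d{2}) MB/;

# split_line is a hypothetical helper returning (name, size).
sub split_line {
    my ($line) = @_;
    return unless $line =~ $re;
    return ($1, (defined $2 ? $2 : '') . $3);
}

for my $line ('1. go 2 123.45 MB', '1. go 2123.45 MB') {
    my ($name, $size) = split_line($line);
    print "$line => name='$name' size=$size\n";
}
```

    Both lines should give the name "go" with size 2123.45, so a file that is really named "go 2" cannot be recovered from the first line.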

      Your assumption about : not being allowed in a filename is completely wrong. About the only character not allowed in a filename is a directory separator

      You're wrong about the OP being completely wrong. In Windows, the colon is the device indicator. On old Macs, the colon is the directory separator.

Re: Regex hackery
by ikegami (Patriarch) on Jun 12, 2009 at 18:03 UTC

    Note that you will have ambiguities

    3. go 2 286.22 MB
    go 2,286.22 -or- go,2286.22

    I used the colon because no desktop OS would allow that in a filename, so I am safe.

    I guess you only tried the single one that doesn't allow it (Windows). All unixy systems allow colons in file names, including Linux and Macs.

    (Doh! I meant to call my Mac-using roommate to confirm that the Mac desktop allows you to use colons before posting this, but it seems I submitted the post without even realising it. Feel free to correct me.)

      $ uname -s
      Darwin
      $ touch test:file
      $ ls test:file
      test:file
      Yep, file names with colons are okay on macosx.
        I asked about the GUI. I don't expect it to be any different, but the OP specifically talked about desktops.
Re: Regex hackery
by planetscape (Chancellor) on Jun 13, 2009 at 12:27 UTC
      Excellent stuff!