Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,
I am writing a regression tool and need to detect errors in IDs written within strings. The following examples should make this clearer:

The required syntax is comma-separated IDs.
Valid IDs can be expected in any of these forms:
123
123456,34567889
12345,CWG123456,1234
(i.e. a numeric string of 3 or more digits, or an alphanumeric string starting with CWG)
The classes of errors I have observed in the strings fall into the following scenarios:
1) "Blah Blah 10-20 m can be taken into consideration 1234556"
2) 123,30,40
3) http://www.takeithere.com/123456789/fig1,987643,34467889
My current pattern removes all the metacharacters found in the incoming string, so 10-20 is joined into 1020, which is long enough to be considered an ID. Can you please suggest a way to accept only IDs in the required format, and otherwise display an error specifically for the entries that do not have the proper syntax?

Replies are listed 'Best First'.
Re: Finding specific alphanumeric IDs from the string
by Grimy (Pilgrim) on May 21, 2012 at 08:53 UTC
    Removing meta-characters isn't a good idea. If I understood your problem correctly, as soon as there's a non-comma, non-alphanumeric character, the ID is incorrect and you should stop there. Example code:
    for (split ',', $ids) {
        # Each field must be 3+ digits, optionally prefixed with CWG
        die "$_: Invalid ID\n" unless /^(CWG)?(\d{3,})$/;
        print "ID $2\n";
    }
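A variant of the same idea, as a rough sketch only: since the question asks for an error on each badly formed entry, the loop could collect the bad tokens rather than die on the first one. The helper name `check_ids` is made up for illustration, and the sketch still assumes the comma is the separator.

```perl
use strict;
use warnings;

# Hypothetical helper: split a comma-separated field and sort each token
# into valid IDs or errors, instead of dying on the first bad token.
sub check_ids {
    my ($field) = @_;
    my (@valid, @errors);
    for my $token (split /,/, $field) {
        # Valid: 3 or more digits, optionally prefixed with CWG
        if ($token =~ /\A(?:CWG)?\d{3,}\z/) {
            push @valid, $token;
        }
        else {
            push @errors, $token;
        }
    }
    return (\@valid, \@errors);
}

my ($valid, $errors) = check_ids('123,30,40');
print "valid: @$valid\n";
print "invalid: @$errors\n";
```

For the input `123,30,40` this reports 123 as valid and 30 and 40 as invalid.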

      Hi,
      Thanks for replying, but the comma is not a reliable separator to use with split.
      IDs might be embedded in free text or surrounded by spaces, as shown in one of the error classes I listed. These are the syntax errors I need to catch and flag.
      Can you help me in this scenario?

        local $/; # Slurp
        for (<DATA> =~ /(?:CWG)?\d{3,}/g) {
            print "ID: $_\n";
        }
        __DATA__
        123
        123456,34567889
        12345,CWG123456,1234
        "Blah Blah 10-20 m can be taken into consideration 1234556"
        123,30,40
        http://www.takeithere.com/123456789/fig1,987643,34467889

        Outputs:
        ID: 123
        ID: 123456
        ID: 34567889
        ID: 12345
        ID: CWG123456
        ID: 1234
        ID: 1234556
        ID: 123
        ID: 123456789
        ID: 987643
        ID: 34467889

        But is there a precise criterion that you can use to distinguish IDs from other numbers? What if it said "Blah Blah 100m can be taken into consideration 1234556"? 100 isn't an ID, but it would still be matched.
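One possible criterion, purely as an assumption that the original poster has not confirmed: count a number as an ID only when a comma or the string boundary sits directly on both sides of it. A minimal sketch with lookaround assertions follows; the helper name `delimited_ids` is hypothetical, and the input is assumed to be chomped.

```perl
use strict;
use warnings;

# Assumed criterion (not confirmed by the original poster): an ID counts only
# if a comma or the string boundary sits directly on both sides of it.
my $id_re = qr/(?<![^,])((?:CWG)?\d{3,})(?![^,])/;

# Hypothetical helper: return every ID in $text that meets the criterion.
sub delimited_ids {
    my ($text) = @_;
    return $text =~ /$id_re/g;
}

for my $text (
    '12345,CWG123456,1234',
    'Blah Blah 100m can be taken into consideration 1234556',
    'http://www.takeithere.com/123456789/fig1,987643,34467889',
) {
    my @ids = delimited_ids($text);
    print "$text => ", (@ids ? "@ids" : 'no IDs'), "\n";
}
```

Under this rule neither 100 nor 1234556 in the free-text line is picked up, because both are bounded by spaces rather than commas, and in the URL line only 987643 and 34467889 qualify. Whether that is too strict depends on the real data.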
Re: Finding specific alphanumeric IDs from the string
by Anonymous Monk on May 23, 2012 at 00:36 UTC

    If you know that the identifier always begins with CWG and thereafter consists of zero or more alphanumerics, write a regex that looks for "(comma?) .. CWG .. alphanumeric." Use the "/g" modifier to allow the regex to match more than once in the same string. If you know that CWG never occurs at the start of the string, the comma is merely part of the leading string that you are looking for. Do not try to remove characters first, since this disrupts the structure of the string and introduces the possibility of errors.
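A minimal sketch of that "(comma?) .. CWG .. alphanumeric" pattern, assuming ASCII alphanumerics; the helper name `cwg_ids` is made up for illustration. Note that it only finds the CWG-prefixed identifiers, not the purely numeric ones.

```perl
use strict;
use warnings;

# Hypothetical helper: find every CWG-prefixed identifier in a string.
# An optional leading comma is consumed, then CWG plus zero or more
# ASCII alphanumerics is captured; /g repeats the match along the string.
sub cwg_ids {
    my ($text) = @_;
    return $text =~ /,?(CWG[A-Za-z0-9]*)/g;
}

my @found = cwg_ids('12345,CWG123456,1234,CWG77a');
print "$_\n" for @found;
```

For the sample string this prints CWG123456 and CWG77a, and the original string is left untouched.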

      Hello,
      Thanks for replying, but I am still trying to find a common pattern in my data myself... I have started wondering whether there is some reverse way of attacking this problem, given that the data is so unpredictable.
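One reading of a "reverse" approach, offered only as a sketch: rather than hunting for IDs inside unpredictable text, describe what a fully valid line looks like and flag every line that does not match it as a whole. The helper name `line_is_valid` is hypothetical, and the ID rule (3 or more digits, optional CWG prefix) is taken from the examples in the question.

```perl
use strict;
use warnings;

# One ID: 3 or more digits, optionally prefixed with CWG (rule assumed
# from the examples in the question).
my $id = qr/(?:CWG)?\d{3,}/;

# A valid line is one ID followed by zero or more ",ID" groups, nothing else.
my $list = qr/\A$id(?:,$id)*\z/;

# Hypothetical helper: 1 if the whole line is a well-formed ID list, else 0.
sub line_is_valid {
    my ($line) = @_;
    return $line =~ $list ? 1 : 0;
}

for my $line (
    '12345,CWG123456,1234',
    '123,30,40',
    'Blah Blah 10-20 m can be taken into consideration 1234556',
) {
    print line_is_valid($line) ? "OK:    $line\n" : "ERROR: $line\n";
}
```

Only the first sample line passes; the other two are reported as errors without any characters being removed first. The input is assumed to be chomped, since `\z` does not allow a trailing newline.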