Thanks guys,

I apologize for not explaining this better.

Here's what the actual data looks like:

"OSTG ThinkGeek Slashdot ITMJ Linux.com NewsForge freshmeat Newsletters PriceGrabber Jobs Broadband ; SourceForge.net Search Advanced Log In - Create Account SF.net Home About Supporters Site;News Create;Project Subscribe Newsletter Compile;Farm Projects Software;Map Create;Project New;Releases Top;Projects New;Projects Help;Wanted My Page Summary Projects Tracker Tasks Donations Preferences Help Get;Support Documentation Site;Updates Priority;Support Site;Status ; ; Provide feedback on this page ; Recently changed page ; Site Status SF.net Projects Phatsoft TMR Summary ; ; Phatsoft TMR ; Donate to project ; Stats - Activity: 82.51% RSS Advanced Summary Admin Home;Page Forums Tracker Bugs Support;Requests Patches Feature;Requests Mail Screenshots News Files ; ; TMR is a lightweight reminder utility that works with Windows on scheduled tasks. Like the yellow stickies, the program pops up with a message at a set time, shuts down or restarts your computer, opens a file, or starts an application. ; ; Download Phatsoft TMR ; ; ; Project Admins : lucamartinetti Operating System : All 32-bit MS Windows (95/98/NT/2000/XP) License : GNU General Public License (GPL) Category : (None Listed) Need Support? : See the support instructions provided by this project ; ; Latest News ; ; 1.3.0.2 - BIG FIXES - Update reccomended! ; 2003-09-22 TMR on TechTV's 'Call for Help' ; 2003-03-17 News archive » ; ; Public Areas ; ; Bugs : (11 open / 22 total) Bug Tracking System Support Requests : (12 open / 17 total) Tech Support Tracking System Patches : (0 open / 0 total) Patch Tracking System Feature Requests : (18 open / 26 total) Feature Request Tracking System Public Forums : (11 messages in 2 forums) Mailing Lists : (1 total) ; ; Project Details ; ; Project Admins : lucamartinetti Developers : 2 Development Status : 5 - Production/Stable Intended Audience : End Users/Desktop License : GNU General Public License (GPL) Operating System : All 32-bit MS Windows (95/98/NT/2000/XP) Programming Language : C++ Translations : English User Interface : Win32 (MS Windows) Project UNIX name : tmr Registered : 2003-01-25 03:25 Activity Percentile (last week) : 82.51 View project activity statistics View list of RSS feeds available for this project ; ; ; About SourceForge.net About OSTG Privacy Statement Terms of Use Advertise Get Support RSS Powered by the SourceForge® collaborative development environment from VA Software ©Copyright 2006 - OSTG Open Source Technology Group, All Rights Reserved"

This is basically a SourceForge project summary page with all the HTML stripped out. I want to extract as many project attributes as possible, such as Programming Language, Translations, Developers, Activity, Topic and so on.

Here's where the problem comes in: some projects have certain attributes and other projects don't. For example, not all projects have the "User Interface" attribute or the "Donors" attribute. That means I can't depend on the order in which these attributes appear in the text.

So, because the order is inconsistent from one project to another, I can't do something like:

if ( $txt =~ /Programming Language : (.*) License :/i ) { $result = $1; }

and have to rely on something else. Is there any way to build a pattern that matches any label and then write something like:

if ( $txt =~ /Programming Language : (.*) $LABEL_PATTERN/i ) { $result = $1; }

? That way I could make a list of all possible labels and look for each one individually.
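
To make the idea concrete, here is a rough sketch of what I have in mind. The label list, the %attr hash and the cut-down sample text are just for illustration, and I'm assuming a value always ends at the next known label followed by " :", at a " ;" separator, or at the end of the text. I don't know yet whether that holds for every project page.

```perl
use strict;
use warnings;

# Illustrative label list; it would need every label the pages can contain.
my @labels = (
    'Project Admins', 'Developers', 'Development Status', 'Intended Audience',
    'License', 'Operating System', 'Programming Language', 'Translations',
    'User Interface', 'Project UNIX name', 'Registered',
);

# Longest labels first, so a longer label wins over a shorter one it starts with.
my $alt = join '|', map { quotemeta } sort { length $b <=> length $a } @labels;
my $LABEL_PATTERN = qr/(?:$alt)/i;

# Cut-down sample in the same shape as the stripped page text.
my $txt = 'Project Details ; ; Project Admins : lucamartinetti Developers : 2 '
        . 'Development Status : 5 - Production/Stable License : GNU General '
        . 'Public License (GPL) Programming Language : C++ Translations : '
        . 'English User Interface : Win32 (MS Windows) Project UNIX name : tmr';

my %attr;
# Non-greedy value, ended by a lookahead that does not consume the next label,
# so the following iteration can match that label in turn.
while ( $txt =~ /($LABEL_PATTERN) : (.*?)(?= $LABEL_PATTERN :| ;|$)/g ) {
    my ( $label, $value ) = ( $1, $2 );
    # Keep only the first occurrence of each label.
    $attr{$label} = $value unless exists $attr{$label};
}

print "$_ => $attr{$_}\n" for sort keys %attr;
```

Because the lookahead only accepts a label that is followed by " :", the word "License" inside "GNU General Public License (GPL)" does not cut the value short, and a missing attribute simply never shows up in %attr, so the order of the labels stops mattering.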


In reply to Re^4: RegEx question by mariuspopovici
in thread RegEx question by mariuspopovici
