mazdajai has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to pull the craigslist RSS feeds and dump them into a multidimensional array for further parsing, without using a parser module. I am having difficulty indexing them in the array. Code:
#!/usr/bin/env perl
use 5.012;
use LWP::Simple;
use Data::Dumper;

my @feeds = (
    'http://newyork.craigslist.org/search/fua?query=window%20fan&&format=rss',
    'http://newyork.craigslist.org/search/fua?sort=rel&query=wire+shelf&format=rss'
);
my (@search,@listings);
my $s = 0;
my $l = 1;
my $u = 2;
foreach my $feed (@feeds) {
    $_ = get $feed;
    die "Couldn't download $feed" unless defined;
    $search[$s] = [ m/<title>njdirector\\w+\s\|\s(.+?)<\/title>/g ];
    $search[$s][$l] = [ m/<title><\!\[CDATA\[(.+?)\]\]><\/title>/g ];
    $search[$s][$l][$u] = [ m/<link>(http:\/\/newyork.craigslist.org\/.+?)<\/link>/g ];
    $s++;
    $l++;
    $u++;
}
say search;
say Dumper \@search;
Expected output:
[
  'furniture search "wire shelf"',
  [
    'Moving - rugs-bike-art-ikea-barstools (Williamsburg) &#x0024;50',
    [
      'http://newyork.craigslist.org/brk/fuo/5118292497.html',
    ]
  ]
]
Current output:
$VAR1 = [
  [
    'furniture search "window fan"',
    [
      'West Elm Parsons Square Table (Upper West Side) &#x0024;220',
      'Moving - rugs-bike-art-ikea-barstools (Williamsburg) &#x0024;50',
      [
        'http://newyork.craigslist.org/mnh/fuo/5122290773.html',
        'http://newyork.craigslist.org/brk/fuo/5118292497.html',
        'http://newyork.craigslist.org/que/fuo/5122167358.html',
        'http://newyork.craigslist.org/que/fuo/5121397241.html',
        'http://newyork.craigslist.org/que/fuo/5121521426.html',
        'http://newyork.craigslist.org/que/fuo/5121228197.html',
        'http://newyork.craigslist.org/que/fuo/5120454542.html',
        'http://newyork.craigslist.org/que/fuo/5120553519.html',
        'http://newyork.craigslist.org/wch/fuo/5120380233.html',
        'http://newyork.craigslist.org/brx/fud/5074875686.html',
        'http://newyork.craigslist.org/que/fuo/5119921410.html',
        'http://newyork.craigslist.org/mnh/fuo/5119643145.html',
        'http://newyork.craigslist.org/mnh/fuo/5119539455.html',
        'http://newyork.craigslist.org/mnh/fuo/5119482741.html',
        'http://newyork.craigslist.org/que/fuo/5119098116.html',
        'http://newyork.craigslist.org/que/fuo/5117747060.html',
        'http://newyork.craigslist.org/que/fuo/5116931245.html',
        'http://newyork.craigslist.org/que/fuo/5116011925.html',
        'http://newyork.craigslist.org/que/fuo/5115456123.html',
        'http://newyork.craigslist.org/que/fuo/5115029455.html',
        'http://newyork.craigslist.org/que/fuo/5113653309.html',
        'http://newyork.craigslist.org/que/fuo/5112399069.html',
        'http://newyork.craigslist.org/que/fuo/5111956635.html',
        'http://newyork.craigslist.org/mnh/fuo/5111276788.html',
        'http://newyork.craigslist.org/que/fuo/5110999938.html'
      ],
      'Dresser + AC unit for Sale-Both in Excellent Condition (Astoria)',
      'Table, Fan, Trash bin, Window blinds, Bed Frame, Twin Mattress (ELMHURST)',
      'MOVING SALE - Dresser, AC Unit, TV (Astoria)',
      'MOVING SALE - Dresser, AC Unit, TV (Astoria)',
      'Table, Fan, Trash bin, Window blinds, Bed Frame, Twin Mattress (ELMHURST)',
      'Carved Dining Room Chairs -wood (4) (Rye) &#x0024;125',
      'Handyman Furniture Assembly + Tv Mount &amp; Shelves - A/C (Manhattan Bronx Harlem Queens brooklyn) &#x0024;1',
      'Table, Fan, Trash bin, Window blinds, Bed Frame, Twin Mattress (ELMHURST)',
      'Black Iron Coat Rack (Upper West Side) &#x0024;50',
      'MOVE OUT SALE PICK UP ANYTIME TODAY!!! (Midtown East) &#x0024;100',
      'high quality ecru stationary, complete household furniture set (Midtown East) &#x0024;183',
      'Table, Fan, Trash bin, Window blinds, Bed Frame, Twin Mattress (ELMHURST)',
      'Table, Fan, Light, Trash bin, Window blinds, Bed Frame, Twin Mattress (ELMHURST)',
      'Table, Fan, Light, Trash bin, Window blinds, Bed Frame, Twin Mattress (ELMHURST)',
      'Table, Fan, Light, Trash bin, Window blinds, Bed Frame, Twin Mattress (ELMHURST)',
      'Table, Fan, Light, Grocery cart, Trash bin, Window blinds (ELMHURST)',
      'Table, Fan, Light, Grocery cart, Trash bin, Window blinds (ELMHURST)',
      'Bed Frame, Twin Mattress, Microwave, AC, Study Table, Fan, Light (ELMHURST)',
      'Bed Frame, Twin Mattress, Microwave, AC, Table, Fan, Light (ELMHURST)',
      'Bed Frame, Twin Mattress, Microwave, AC, Table, Fan, Light (New york city)',
      'SofaBed / Bed Frames / Curtains / Dresser / Fan / Window Shades (Midtown East) &#x0024;10',
      'Bed Frame, Twin Mattress, Microwave, AC, Table, Fan, Light (Elmhurst)'
    ]
  ],
  [
    'furniture search "wire shelf"',
    undef,
    [
      'Moving - rugs-bike-art-ikea-barstools (Williamsburg) &#x0024;50',
      'Ethan Allen Media Center (Forest Hills) &#x0024;2900',
      'Metal Wire Rack (Murray Hill) &#x0024;15',
      [
        'http://newyork.craigslist.org/brk/fuo/5118292497.html',
        'http://newyork.craigslist.org/que/fuo/5122101733.html',
        'http://newyork.craigslist.org/mnh/fuo/5121901736.html',
        'http://newyork.craigslist.org/mnh/fuo/5121610909.html',
        'http://newyork.craigslist.org/brk/fuo/5121582440.html',
        'http://newyork.craigslist.org/fct/fuo/5108510077.html',
        'http://newyork.craigslist.org/fct/fuo/5114664267.html',
        'http://newyork.craigslist.org/mnh/fuo/5120558162.html',
        'http://newyork.craigslist.org/mnh/fuo/5110714008.html',
        'http://newyork.craigslist.org/jsy/fuo/5113964211.html',
        'http://newyork.craigslist.org/lgi/fuo/5119989644.html',
        'http://newyork.craigslist.org/jsy/fuo/5119396119.html',
        'http://newyork.craigslist.org/mnh/fuo/5101671234.html',
        'http://newyork.craigslist.org/wch/fuo/5104111597.html',
        'http://newyork.craigslist.org/mnh/fud/5114135009.html',
        'http://newyork.craigslist.org/brk/fuo/5118718185.html',
        'http://newyork.craigslist.org/mnh/fuo/5118701467.html',
        'http://newyork.craigslist.org/mnh/fuo/5118465075.html',
        'http://newyork.craigslist.org/que/fuo/5107986299.html',
        'http://newyork.craigslist.org/que/fuo/5117445116.html',
        'http://newyork.craigslist.org/brk/fuo/5093360327.html',
        'http://newyork.craigslist.org/brk/fuo/5106235500.html',
        'http://newyork.craigslist.org/mnh/fuo/5095039134.html',
        'http://newyork.craigslist.org/mnh/fuo/5094808060.html',
        'http://newyork.craigslist.org/brk/fuo/5100382731.html'
      ],
      'Closet Maid Shelf Track Wardrobe (Bushwick) &#x0024;80',
      'ETHAN ALLEN ROBINSON TV MEDIA CENTER BOOKCASE *** (STAMFORD, CT) &#x0024;3595',
      'JUST REDUCED! Lillian August Reclaimed Wood Primitive Storage Cabinet (Wilton) &#x0024;325',
      'Omega Chrome Wire Shelf Truck (Flatiron) &#x0024;50',
      'Beautiful Pottery Barn Lyla Bar (Harlem / Morningside) &#x0024;450',
      'Wire Shelving with adjustable shelves (Jersey City) &#x0024;60',
      'SCROLL HEADBOARD - NEW IN BOX (Farmingdale) &#x0024;50',
      'RARE 300cm Pax DISCONTINUED Stordal Sliding Doors Wardrobe / Room Divi &#x0024;650',
      '4-tier standing shelf **Excellent Condition (Upper East Side) &#x0024;25',
      'Cheap Mirrored Ikea Wardrobes- $150 (Scarsdale) &#x0024;150',
      '4 assorted bookcases, $15 &amp; up (East Village)',
      'TV Stand (Williamsburg) &#x0024;60',
      'Industrial wire Shelf (Gramercy) &#x0024;60',
      'Heavy Duty Wire Shelving Unit (Gramercy) &#x0024;50',
      'Ethan Allen Media Center (Forest Hills) &#x0024;2900',
      '4 Tier/Shelf Metal Shoe Rack/Shelving Unit Storage Organizer (Elmhurst) &#x0024;15',
      'used 6 shelf plastic container - $10 (Bensonhurst) ((Bensonhurst)) &#x0024;10',
      '1575 vintage FISHER PRICE CONTRUX mixed Lot Parts &#x0024;150',
      '3 - Tier Vintage Storage Shelf (Upper East Side) &#x0024;40',
      'Black Sturdy 2-tier INTERMETRO Shelf (Upper East Side) &#x0024;50',
      'Black wire shelf 6ft x 3ft x 14in &#x0024;35'
    ]
  ]
];

Re: Parse HTML into multidimensional array
by GotToBTru (Prior) on Jul 14, 2015 at 20:14 UTC

    First off, parsing HTML using regexes is pretty much universally regarded as a bad idea. It can work, but it's usually easier to use one of the HTML modules, especially if the HTML is likely to change in the future.

    Having said that, look at the full HTML that get returns, and then at your regexes. Your expected output suggests you wanted only the first match, but the regexes carry the /g modifier, which means the entire input is searched and every match is returned. That may be why you are getting much more than you expected.
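    The effect of /g in list context can be seen on a small string. This is only a sketch: the feed fragment and the variable names are made up for illustration and are not real craigslist data.

```perl
use strict;
use warnings;

# Hypothetical two-item feed fragment; not real craigslist data.
my $xml = join '',
    '<item><title><![CDATA[Box fan]]></title></item>',
    '<item><title><![CDATA[Wire shelf]]></title></item>';

my $re = qr{<title><!\[CDATA\[(.+?)\]\]></title>};

# List context without /g: only the captures of the first match.
my @first = $xml =~ $re;

# List context with /g: the captures of every match in the string.
my @all = $xml =~ /$re/g;

print scalar(@first), " match without /g: @first\n";
print scalar(@all), " matches with /g: ", join(', ', @all), "\n";
```

    Without /g the list holds one title; with /g it holds every title in the input, which matches the long lists in the posted output.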

    Dum Spiro Spero
      Looking at the source HTML, do you think it is impossible to pull the content into the correct index? The purpose is to achieve this without HTML::TreeBuilder so that I can learn Perl without relying on modules. I know it sounds silly, but a lot of the time I don't understand what is going on in the background when I use a module.
        so I can learn Perl without relying [on] modules
        I would suggest that learning how to use HTML::TreeBuilder and in particular HTML::TreeBuilder::XPath would be a far more fruitful experience than entangling yourself in regular expressions. Learning powerful and robust modules will allow you to do more with Perl, not less.

        Impossible? Absolutely not, and easier with Perl than with any other language. But .. that's not saying much. I can sympathize with the desire to really learn Perl, but it will do you good to start to recognize that learning to use CPAN is part of learning to use Perl.

        If you're insistent on using regexes, start with the full html returned by your get command, and build the regex incrementally. A site like this can help you with that. And good luck! You can post questions here if you get stuck, but be prepared to hear "why aren't you using a module for this?" every time!
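        One way to build it up step by step is to match each item block first and only then look for the title and link inside that block, so the two stay paired. The sketch below assumes the feed wraps every listing in <item>...</item>; the sample text, the URLs ending in 0000000001 and 0000000002, and the [ query, [ [title, link], ... ] ] layout are all hypothetical choices for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical feed text standing in for what get() would return.
my $xml = <<'XML';
<channel>
<title>craigslist newyork | furniture search "wire shelf"</title>
<link>http://newyork.craigslist.org/search/fua</link>
</channel>
<item>
<title><![CDATA[Metal Wire Rack (Murray Hill)]]></title>
<link>http://newyork.craigslist.org/mnh/fuo/0000000001.html</link>
</item>
<item>
<title><![CDATA[TV Stand (Williamsburg)]]></title>
<link>http://newyork.craigslist.org/brk/fuo/0000000002.html</link>
</item>
XML

# Step 1: the search description from the channel title (no /g, first match only).
my ($query) = $xml =~ m{<title>craigslist \w+ \| (.+?)</title>};

# Step 2: isolate each <item>...</item> block; /s lets . cross newlines.
my @items = $xml =~ m{<item>(.*?)</item>}gs;

# Step 3: take title and link from one block at a time, so they stay paired.
my @listings;
for my $item (@items) {
    my ($title) = $item =~ m{<title><!\[CDATA\[(.+?)\]\]></title>};
    my ($link)  = $item =~ m{<link>(.+?)</link>};
    push @listings, [ $title, $link ];
}

# One entry per feed: [ query, [ [title, link], ... ] ]
my @search = ( [ $query, \@listings ] );

$Data::Dumper::Indent = 1;
print Dumper \@search;
```

        Each step can be checked with Dumper before the next one is added, which is easier than debugging three /g matches against the whole feed at once.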

        Dum Spiro Spero
Re: Parse HTML into multidimensional array
by thomas895 (Deacon) on Jul 15, 2015 at 05:24 UTC

    You already have an RSS feed of the stuff you want, why not use something like XML::RSS to parse it easily and reliably?
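    A minimal sketch of what that could look like, assuming XML::RSS is installed from CPAN (it is not a core module, so the sketch loads it at run time and skips the parsing if it is missing). The feed text and item URLs are hypothetical; a real script would fetch the text with get from LWP::Simple.

```perl
use strict;
use warnings;

# Hypothetical RSS 2.0 text; not real craigslist data.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>furniture search "wire shelf"</title>
<link>http://newyork.craigslist.org/search/fua</link>
<description>sample feed</description>
<item>
<title><![CDATA[Metal Wire Rack (Murray Hill)]]></title>
<link>http://newyork.craigslist.org/mnh/fuo/0000000001.html</link>
</item>
<item>
<title><![CDATA[TV Stand (Williamsburg)]]></title>
<link>http://newyork.craigslist.org/brk/fuo/0000000002.html</link>
</item>
</channel>
</rss>
XML

my @listings;

# Load XML::RSS only if it is available, so the script still runs without it.
if ( eval { require XML::RSS; 1 } ) {
    my $rss = XML::RSS->new;
    $rss->parse($xml);

    # Each parsed item is a hash reference with keys such as title and link.
    for my $item ( @{ $rss->{items} } ) {
        push @listings, [ $item->{title}, $item->{link} ];
    }
    print "$_->[0] => $_->[1]\n" for @listings;
}
else {
    print "XML::RSS is not installed, so nothing was parsed\n";
}
```

    The module handles the CDATA wrappers and keeps each title with its own link, so no index bookkeeping is needed.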

    As for your aversion to modules: don't. Modules are part of why Perl is so useful. You don't have to understand all of what a module does, just use one to get your work done.
    Then, if you have time left, you can peruse the source code.

    -Thomas
    "Excuse me for butting in, but I'm interrupt-driven..."
      Thanks, everyone, for the input and comments. This is what I have so far, with the help of a friend, and I will definitely check out XML::RSS.
      use LWP::Simple;    # provides get()

      my $Feeds = [
          'http://newyork.craigslist.org/search/fua?query=window%20fan&&format=rss',
          'http://newyork.craigslist.org/search/fua?sort=rel&max_price=50&query=wire+shelf&format=rss'
      ];
      my $Results;
      for my $Index (0 .. $#$Feeds) {
          $_ = get $Feeds->[$Index];
          s/\n//g;    # join the feed into one line so .+? can span former line breaks
          $Results->{$Index} = {
              'Query'       => (/<title>craigslist newyork \| furniture search "(.+?)"<\/title>/i)[0],
              'Title'       => [/<title><!\[CDATA\[(.+?)\]\]><\/title>/gi],
              'Link'        => [/<link>(http:.+?)<\/link>/gi],
              'Description' => [/<description><!\[CDATA\[(.+?)\]\]><\/description>/gi]
          };
      }
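      Since Title and Link end up as separate lists, reading a listing back means pairing them by position. A small sketch of that, using made-up data in the same shape as the loop above builds (the titles, URLs, and the @rows name are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical data in the same shape as $Results above.
my $Results = {
    0 => {
        Query => 'window fan',
        Title => [ 'Box fan', 'Window fan, like new' ],
        Link  => [
            'http://newyork.craigslist.org/que/fuo/0000000001.html',
            'http://newyork.craigslist.org/mnh/fuo/0000000002.html',
        ],
    },
};

my @rows;
for my $index ( sort { $a <=> $b } keys %$Results ) {
    my $feed = $Results->{$index};

    # Title and Link are parallel lists, so pair them up by position.
    for my $i ( 0 .. $#{ $feed->{Title} } ) {
        push @rows, "$feed->{Query}: $feed->{Title}[$i] <$feed->{Link}[$i]>";
    }
}
print "$_\n" for @rows;
```

      This only works while the two lists line up. If the Link pattern also matches a channel-level link in the feed, the Link list would be one element longer and every pair would be off by one, so it is worth comparing the list lengths first.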