
Hi, I need to extract certain pieces of information from a website. There's a <p> tag with 5 <span> tags inside it. One of the spans has a class, so that one is no problem, but the other 4 are just <span>info</span>. This is how the markup looks on the website (I'm inspecting it with Firebug):

<p class="itinerari-info">
<span>
<b> Itinerario </b>
Danimarca, fiordi norvegesi, Germania
</span>
<span class="DepartureDateTitle">
<b>Data partenza</b>
17&nbsp;giugno&nbsp;2012
</span>
<span>
<b> Nave </b>
Costa Fortuna
</span>
<span>
<b> N.ro giorni crociera &nbsp; </b>
7
</span>
<span>
<b> Porto di partenza </b>
Copenhagen
</span>
</p>

My Perl knowledge is limited to the first nine chapters of "Learning Perl" (and that doesn't mean I understand everything, especially subroutines). I don't have any other programming skills.

This is the code I have so far:

#!/usr/bin/perl -w
use LWP::Simple;
use HTML::TreeBuilder;
use strict;
my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
my $page = get($base.$url) or die "Could not fetch $base$url\n";
my $p = HTML::TreeBuilder->new_from_content( $page );

my @trips= $p->look_down(_tag=>'p',class=>'itinerari-info')->as_text;
foreach my $trip (@trips){
   print $trip;
}

And this is the output:

Itinerario Danimarca, fiordi norvegesi, Germania Data partenza 17�giugno�2012 Nave Costa Fortuna N.ro giorni crociera � 7 Porto di partenza Copenhagen Documenti di viaggio Passaporto�o�Carta d'identit� valida per l'espatrio Possono essere disponibili le seguenti tariffe

So this prints all the information as one string, with some strange characters mixed in, though I can use a regex to fix those. The main issue is that I need each piece of information as its own separate string (for example, so I could print a title before each value). I can see that each span starts with a <b>label</b> tag, but I can't figure out how to use those to do what I want. Like I said, my experience is close to zero. I've been trying different things with arrays and hashes, and right now I just want to burn the computer. If the Monks could help me I would greatly appreciate it. Thank you so much!
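
One possible way to get each value separately, as a minimal sketch rather than a definitive solution. It assumes the live page matches the snippet quoted above, and it parses an inline copy of that snippet instead of fetching the site. The helper names `clean` and `extract_trips` are made up for illustration. Two points matter: `look_down` returns every match in list context, but only the first match when `->as_text` is chained directly onto it; and the strange characters are most likely the non-breaking spaces (`&nbsp;`) and accented letters being printed without an output encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Emit non-ASCII characters (accented letters etc.) as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

# Sample markup copied from the post; a real script would pass the fetched page.
my $html = <<'HTML';
<p class="itinerari-info">
<span><b> Itinerario </b> Danimarca, fiordi norvegesi, Germania</span>
<span class="DepartureDateTitle"><b>Data partenza</b> 17&nbsp;giugno&nbsp;2012</span>
<span><b> Nave </b> Costa Fortuna</span>
<span><b> N.ro giorni crociera &nbsp; </b> 7</span>
<span><b> Porto di partenza </b> Copenhagen</span>
</p>
HTML

# Hypothetical helper: turn non-breaking spaces into plain spaces,
# collapse runs of whitespace, and trim both ends.
sub clean {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Hypothetical helper: return one array ref per <p class="itinerari-info">,
# each holding [label, value] pairs in document order.
sub extract_trips {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @trips;
    # List context: look_down returns all matching elements.
    for my $p ($tree->look_down(_tag => 'p', class => 'itinerari-info')) {
        my @fields;
        for my $span ($p->look_down(_tag => 'span')) {
            my $bold = $span->look_down(_tag => 'b') or next;
            my $label = clean($bold->as_text);
            $bold->delete;    # drop the label so only the value is left
            push @fields, [ $label, clean($span->as_text) ];
        }
        push @trips, \@fields;
    }
    $tree->delete;            # free the parse tree
    return @trips;
}

for my $trip (extract_trips($html)) {
    print "$_->[0]: $_->[1]\n" for @$trip;
    print "\n";
}
```

Each trip comes back as a list of label/value pairs, so a title can be printed before each value, or the pairs can be loaded into a hash if the order does not matter.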

In reply to Parsing HTML by marcoss
