HTML::Element newline character

by usr345 (Sexton)
on Jul 10, 2011 at 12:15 UTC (id://913596, perlquestion)

usr345 has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing this HTML:
<td valign="top">9506 PASTURE DRIVE OOLTEWAH, TN 37363</td>

But when I run this for the parsed HTML::Element:

print $tds[1]->as_trimmed_text();

it gives me:

9506 PASTURE DRIVE OOLTEWAH, TN 37363

That is, without the newline character after DRIVE, so I can't reliably get the city name. How can I preserve newlines with HTML::Element or HTML::TreeBuilder?

Replies are listed 'Best First'.
Re: HTML::Element newline character
by wfsp (Abbot) on Jul 10, 2011 at 14:43 UTC
    From the docs:
    $h->as_trimmed_text
    This is just like as_text(...) except that leading and trailing whitespace is deleted, and any internal whitespace is collapsed.
    Here, "collapsed" means replaced with one space (just as a browser would). So, as_text may do what you want. I would be worried about how consistent the text part of the HTML is, though. I think I might consider another approach to "reliably get the city name".
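
To make "collapsed" concrete, here is a minimal sketch in plain Perl. It only imitates the documented behaviour with two substitutions; it is not the module's actual implementation, and the sample string (including where the newline sits) is an assumption based on the question.

```perl
use strict;
use warnings;

# Hypothetical raw text: leading/trailing spaces and an internal newline.
my $raw = "  9506 PASTURE DRIVE \nOOLTEWAH, TN 37363  ";

# Collapse every run of whitespace (newlines included) to one space ...
( my $collapsed = $raw ) =~ s/\s+/ /g;

# ... then drop the single space left at either end.
$collapsed =~ s/^ | $//g;

print "[$collapsed]\n";
```

Once the newline has been folded into a space like this, nothing in the string marks where the street ends and the city begins.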

      I meant something different: not the whitespace, but the newline character. The original HTML looks like this (I'll mark the spaces and newlines):

      <td valign="top">9506 PASTURE DRIVE\s\n<---- OOLTEWAH, TN 37363</td>

      So the street number and street name are separated from the city, state, and zip by a newline. But both as_text and as_trimmed_text cut this newline. Because some city names consist of two words, it will be painful to parse them.

      Is it possible to preserve the newline?

        Not the whitespace, but the newline character.

        Newline is whitespace

        But both as_text and as_trimmed_text cut this newline. ... Is it possible to preserve the newline?

        No, they don't. The whitespace is already gone before you call either of those methods. All you had to do was

        $ perldoc HTML::TreeBuilder |grep -i space
        Do not represent the text content of elements. This saves space if
        $root->ignore_ignorable_whitespace(value)
        whitespace text nodes in the tree. Default is true. (In fact, I'd be
        $root->no_space_compacting(value)
        This determines whether TreeBuilder compacts all whitespace strings
        contiguous whitespace in the document is turned into a single space.
        But that's not done if no_space_compacting is set to 1.
        Setting no_space_compacting to 1 might be useful if you want to read
        Redirects to HTML::Element:: delete_ignorable_whitespace

        $ perldoc HTML::Element |grep -i space
        $h->delete_ignorable_whitespace()
        whitespace. You should not use this if $h under a 'pre' element.
        "\t", or some number of spaces, if you specify it).
        whitespace is deleted, and any internal whitespace is collapsed.
        This will not remove hard spaces, unicode spaces, or any other non
        ASCII white space unless you supplye the extra characters as a string
        Tabs are expanded to however many spaces it takes to get to the next 8th

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use HTML::TreeBuilder;
        use Test::More qw' no_plan ';

        Main(@ARGV);
        exit(0);

        sub Main {
            is( OneT('<html><body></body></html>'),
                undef, 'no tag means undef not empty string' );
            is( OneT('<html><title></title><body></body></html>'),
                '', 'no content' );
            is( OneT('<html><title> </title><body></body></html>'),
                ' ', 'space' );
            is( OneT(qq'<html><title>a\nb</title><body></body></html>'),
                "a\nb", 'a newline b' );
        } ## end sub Main

        sub OneT {
            my ( $html, $expect, $name ) = @_;
            my $tree = HTML::TreeBuilder->new();
            $tree->no_space_compacting(1);
            $tree->parse($html);
            return eval { $tree->look_down(qw' _tag title')->as_text };
        } ## end sub OneT
        __END__
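
Applied to the cell from the question, the same idea might look like the sketch below. The sample HTML (with the newline placed after DRIVE, as described earlier in the thread), the variable names, and the "CITY, ST ZIP" pattern are assumptions for illustration only; real data may need a more forgiving pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Assumed sample input: a space and a newline follow DRIVE.
my $html = qq{<table><tr><td valign="top">9506 PASTURE DRIVE \n}
         . qq{OOLTEWAH, TN 37363</td></tr></table>};

my $tree = HTML::TreeBuilder->new;
$tree->no_space_compacting(1);    # set before parsing, or the newline is lost
$tree->parse($html);
$tree->eof;

# Use as_text here; as_trimmed_text would collapse the newline again.
my $td   = $tree->look_down( _tag => 'td' );
my $text = $td->as_text;

# Split once on the newline, then trim each half.
my ( $street, $locality ) = split /\n/, $text, 2;
die "no newline found in cell text\n" unless defined $locality;
for ( $street, $locality ) { s/^\s+//; s/\s+$//; }

# Hypothetical "CITY, ST ZIP" pattern; multi-word cities stay intact.
my ( $city, $state, $zip ) = $locality =~ /^(.+?),\s*([A-Z]{2})\s+(\d{5})/;

print "street: $street\n";
print "city:   $city\n";

$tree->delete;                    # free the tree when done
```

Because the split happens on the newline rather than on spaces, a two-word city name ends up whole in $city.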

        As the replies you've already had hint, but perhaps don't make explicit enough, white space (including line breaks, tabs, spaces, etc.) is special in HTML in that it is largely ignored. In general any amount of adjacent white space in HTML can be replaced with a single space. HTML is not an appropriate way to store information that depends on white space for interpretation!

        Where does the HTML you are trying to process come from? If you must use HTML, it would be better to structure the data in a table; otherwise, use a format suited to managing such data, such as CSV.

        True laziness is hard work
