PerlMonks  

Re^2: HTML::HTML5::Parser weirdness

by djh (Novice)
on Feb 23, 2020 at 21:06 UTC ( #11113352=note )


in reply to Re: HTML::HTML5::Parser weirdness
in thread HTML::HTML5::Parser weirdness

Hi Toby. Very pleased to see you here. The file is just like twenty or thirty others scraped by a cron job, and I've checked its permissions and content several times. I'll try using parse_string etc. if I put together an SSCCE.

I think Corion's post indicates that the problem isn't with the particular file, although it does seem that particular file is triggering the problem. But the identical result he got with a non-existent file is a strong suggestion that the problem lies elsewhere. In particular finding out where those funky <head/> and <body/> strings come from is my main focus at present.

Replies are listed 'Best First'.
Re^3: HTML::HTML5::Parser weirdness
by tobyink (Canon) on Feb 24, 2020 at 00:36 UTC

    The document is being parsed as having no contents in the head and no contents in the body. Head and body elements are still created though, because in the HTML5 model, all HTML documents have a head and a body. You're using XML::LibXML to output the document, and XML::LibXML will typically output an empty element like <blah/>. So that's why you're seeing those in the output. I wouldn't expect that they're in the input.

    The problem is that it's not seeing anything at all in the head and body in the input. Probably because of a parsing error too extreme to recover from. But I'd need to see the file to be sure.

Re^3: HTML::HTML5::Parser weirdness
by Your Mother (Bishop) on Feb 23, 2020 at 21:30 UTC
    In particular finding out where those funky <head/> and <body/> strings come from is my main focus at present.

    I would suggest that is a waste of time. It’s almost certainly indicative of the lack of head and body content. Those are merely how tags/elements/nodes without content are rendered. They are exactly equivalent to this style <head></head>. The problem lies elsewhere.
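
To illustrate the point both replies make, here is a minimal sketch using XML::LibXML on its own, without HTML::HTML5::Parser. The two input strings are made up for the example and are not taken from the original poster's file. It parses one document written with <head/> and one written with <head></head>, then serializes both, which should show that the short form is just how XML::LibXML writes out an element with no children.

```perl
use strict;
use warnings;
use XML::LibXML;

# Two hypothetical inputs that differ only in how the empty elements are written.
my $short = XML::LibXML->load_xml(string => '<html><head/><body/></html>');
my $long  = XML::LibXML->load_xml(string => '<html><head></head><body></body></html>');

# Serialize just the root element, so the XML declaration is left out.
my $short_out = $short->documentElement->toString;
my $long_out  = $long->documentElement->toString;

print "$short_out\n";
print "$long_out\n";
print "same serialization: ", ($short_out eq $long_out ? "yes" : "no"), "\n";

# Neither way of writing it gives the head element any child nodes.
my ($head) = $long->documentElement->getChildrenByTagName('head');
print "head has children: ", ($head->hasChildNodes ? "yes" : "no"), "\n";
```

If the two serializations match, the <head/> and <body/> strings say nothing about how the input was written, only that the parser ended up with empty elements.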
