http://qs321.pair.com?node_id=11113349


in reply to HTML::HTML5::Parser weirdness

As per Corion's post, first make sure you can actually read the file: it exists, has the correct permissions, etc. Try reading the file with a normal Perl open, readline, etc., then pass the resulting string to parse_string on the parser object rather than using parse_file.
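A minimal sketch of that suggestion, in case it helps. The sub names slurp_html and parse_html are made up for illustration, and reading in raw mode is an assumption about what the parser should be given; only open, readline, and the parser's new and parse_string methods come from the advice above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into a string, failing loudly at
# each step so a missing or unreadable file cannot pass for an empty document.
sub slurp_html {
    my ($path) = @_;
    die "No such file: $path\n"  unless -e $path;
    die "Not readable: $path\n"  unless -r $path;
    die "File is empty: $path\n" unless -s $path;
    # Assumption: hand the parser the raw bytes, undecoded.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    local $/;                     # slurp mode: readline returns the whole file
    my $html = <$fh>;
    close $fh or die "Cannot close $path: $!\n";
    return $html;
}

# Hypothetical helper: parse a string instead of a file name.
# The module is loaded lazily so the file checks above work without it.
sub parse_html {
    my ($html) = @_;
    require HTML::HTML5::Parser;
    return HTML::HTML5::Parser->new->parse_string($html);
}

# Run only when invoked as a script with a file name argument.
if (!caller() && @ARGV) {
    my $html = slurp_html($ARGV[0]);
    printf "Read %d bytes from %s\n", length $html, $ARGV[0];
    print parse_html($html)->toString;   # an XML::LibXML document
}
```

If the byte count looks right but the output is still an empty head and body, the file access is fine and the parser itself is the place to look.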

If that still doesn't work, try emailing the file to the author of that module, tobyink @ cpan.org. He can be very helpful sometimes. :)

Re^2: HTML::HTML5::Parser weirdness
by djh (Novice) on Feb 23, 2020 at 21:06 UTC

    Hi Toby. Very pleased to see you here. The file is just like twenty or thirty others scraped by a cron job, and I've checked its permissions and content several times. I'll try using parse_string etc. if I put together an SSCCE.

    I think Corion's post indicates that the problem isn't with the particular file, although it does seem that particular file is triggering the problem. But the identical result he got with a non-existent file is a strong suggestion that the problem lies elsewhere. In particular finding out where those funky <head/> and <body/> strings come from is my main focus at present.

      The document is being parsed as having no contents in the head and no contents in the body. Head and body elements are still created though, because in the HTML5 model, all HTML documents have a head and a body. You're using XML::LibXML to output the document, and XML::LibXML will typically output an empty element like <blah/>. So that's why you're seeing those in the output. I wouldn't expect that they're in the input.

      The problem is that it's not seeing anything at all in the head and body in the input. Probably because of a parsing error too extreme to recover from. But I'd need to see the file to be sure.

      In particular finding out where those funky <head/> and <body/> strings come from is my main focus at present.

      I would suggest that is a waste of time. It’s almost certainly indicative of the lack of head and body content. That is merely how elements without content are rendered: <head/> is exactly equivalent to <head></head>. The problem lies elsewhere.
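To illustrate the point about empty elements, here is a small sketch that builds a document with an empty head and body directly in XML::LibXML and serializes it. The sub name empty_skeleton_xml is hypothetical, and this does not involve HTML::HTML5::Parser at all; it only shows the serializer's behaviour.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: build <html> with an empty <head> and <body>,
# then return the document serialized as XML.
sub empty_skeleton_xml {
    my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $html = $doc->createElement('html');
    $doc->setDocumentElement($html);
    $html->appendChild($doc->createElement($_)) for qw(head body);
    return $doc->toString;
}

# The empty elements should come out in self-closing form.
print empty_skeleton_xml();
```

Nothing here ever reads a file, yet the self-closing head and body still appear, which is consistent with them being an artifact of serialization rather than something present in the input.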

Re^2: HTML::HTML5::Parser weirdness
by djh (Novice) on Mar 04, 2020 at 17:45 UTC

    tobyink wrote:

    "If that still doesn't work, try emailing the file to the author of that module, tobyink @ cpan.org. He can be very helpful sometimes. :)"

    Hi Toby,

    I sent an email to you enclosing the files on 2020-02-27 and I sent another as a reminder this morning. But I haven't received an acknowledgment or anything else in reply. Did you receive them?

    Thanks, Dave

      As I said previously, I wrote to tobyink @ cpan.org as he invited but received no reply. I then posted here to enquire and have heard nothing from him since. I can see that he's been visiting this site and posting in other threads in the meantime, so I'm not sure what to infer. I still have the problem of his module refusing to process a particular HTML file. Does anybody else have any suggestions as to how to proceed?

        In a previous post you said:

        While I appreciate the benefits of SSCCE, I think the effort I would need to construct one in this case outweighs the benefits. But I may do so if I'm still stuck after a while.

        As it's now "after a while" and you're still apparently stuck, maybe now is the time to expend that effort and produce the SSCCE?