PerlMonks  

Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig)

by haukex (Archbishop)
on Oct 17, 2017 at 11:25 UTC ( [id://1201490]=note )


in reply to Re: Parsing HTML/XML with Regular Expressions (XML::Twig)
in thread Parsing HTML/XML with Regular Expressions

Thanks very much for the contribution! Regarding the DATA and &nbsp; issues, see my reply here - although I assume you meant $twig->parse(*DATA) instead of $twig->parse(<DATA>)? With the updated example in the root node, your code works!

And yes, I assumed someone might take up the challenge of actually using a regex - but of course then I'd have to try to break it with more test cases ;-)

Re^3: Parsing HTML/XML with Regular Expressions (XML::Twig)
by Discipulus (Canon) on Oct 17, 2017 at 19:45 UTC
    You presumed right about the DATA filehandle.

    The xmltwig.org site and the docs specify parse $string or \*OPEN_FILEHANDLE among twig's methods.

    So you are right: I had to pass a handle, not an iterator (?) like <DATA>.

    I dunno when I picked up this bad habit, but if you look at this and this other one and this other too, and probably many others of mine, $twig->parse(<DATA>) works!!

    So $twig->parse(<DATA>) does not work with your example, but I can confirm that passing the filehandle as $twig->parse(\*DATA) or even $twig->parse(*DATA) works as expected.

    Could it be that the wrong form works (at least sometimes) because of XML::Twig's ability to parse streams of XML?

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      In those three examples you linked to, right before you say <DATA> you do $/='';, which enables "paragraph mode": it's as if the input record separator $/ were /\n\n+/.

      So you are right: I had to pass an handle not an iterator (?) like <DATA>

      <DATA> is the equivalent of readline(DATA), and since readline is being called in list context, it'll read all the records from the handle and return a list of them. So as long as your __DATA__ section doesn't contain any empty lines, paragraph mode returns a single record holding the whole section, which is essentially the same as a slurp. This is probably why the "wrong form" still works.
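
To make the record counting above concrete, here is a minimal sketch that does not use XML::Twig at all. It reads two made-up XML snippets through in-memory filehandles, and the helper read_records and both sample strings are assumptions for illustration only. It shows how many records a list-context readline returns in line mode versus paragraph mode, with and without an empty line in the input.

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from an in-memory handle
# in list context, using the given input record separator.
sub read_records {
    my ($text, $sep) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = $sep;        # restored automatically when the sub returns
    my @records = <$fh>;    # list context: returns all records at once
    close $fh;
    return @records;
}

# Made-up sample data: the same document without and with an empty line.
my $compact = "<root>\n<a>1</a>\n<b>2</b>\n</root>\n";
my $spaced  = "<root>\n<a>1</a>\n\n<b>2</b>\n</root>\n";

my @compact_lines = read_records($compact, "\n");  # default line mode
my @compact_paras = read_records($compact, '');    # paragraph mode
my @spaced_paras  = read_records($spaced,  '');    # paragraph mode

printf "compact, line mode:      %d record(s)\n", scalar @compact_lines;
printf "compact, paragraph mode: %d record(s)\n", scalar @compact_paras;
printf "spaced,  paragraph mode: %d record(s)\n", scalar @spaced_paras;
```

With no empty line, paragraph mode yields a single record, so a call like some_sub(<DATA>) passes the whole text as one argument. Once an empty line appears, the same call passes several arguments, and a sub that expects one string only sees the first chunk in that position.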
