PerlMonks  

Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig)

by holli (Abbot)
on Oct 17, 2017 at 09:58 UTC ( [id://1201488]=note )


in reply to Re: Parsing HTML/XML with Regular Expressions (XML::Twig)
in thread Parsing HTML/XML with Regular Expressions

some working regex solution
That's certainly possible. It was possible to produce a regex that parses all of Perl, why not one for HTML?


holli

You can lead your users to water, but alas, you cannot drown them.

Replies are listed 'Best First'.
Re^3: Parsing HTML/XML with Regular Expressions (regex)
by RonW (Parson) on Oct 19, 2017 at 00:05 UTC
    It was possible to produce a regex that parses all of Perl, why not one for HTML?

    There is a regex to parse XML (and therefore XHTML): XML Shallow Parsing

    That regex produces a list of strings that need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. You will need to keep track of <div> nesting to find the end of the contained text.

    # Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
    # (Warning: Messy hack. Read at your own risk.)
    my $nest = 0;
    my $out = '';
    my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
    for (@elements) {
        if (/^<div/) {
            $nest++ if ($nest > 0); # only increment if inside an interesting <div>
            next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
            next unless (/id\h*=\h*['"](\w+)['"]/);
            $out .= ", $1=";
            $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
            next;
        }
        $nest--, next if (/^<\/div/);
        next if (/^[<]/); # skip other mark-up
        $out .= $_ if ($nest > 0);
    }
    $out =~ s/^, //;
    say "$out\n";
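For anyone who wants to experiment with the nesting idea without pulling in the full REX pattern, here is a minimal self-contained sketch. Everything in it is an assumption for illustration: `$TOKEN` is a much simplified stand-in for `$XML_SPE` (it does not handle comments, CDATA, or processing instructions), and `extract_data_divs` is a hypothetical helper name. It follows the same counting approach as the code above, except that the counter is never allowed to drop below zero.

```perl
use strict;
use warnings;

# Simplified stand-in for $XML_SPE (an assumption, NOT the REX pattern):
# either a tag whose quoted attribute values may contain '>', or a text run.
my $TOKEN = qr/<(?:"[^"]*"|'[^']*'|[^'">])*>|[^<]+/;

# Return "id=text" pairs for every <div class="data" id="..."> in $xml.
sub extract_data_divs {
    my ($xml) = @_;
    my $nest = 0;    # depth inside the current interesting <div>; 0 = outside
    my @pairs;
    for my $tok ($xml =~ /$TOKEN/g) {
        if ($tok =~ /^<div\b/) {
            if ($nest > 0) { $nest++; next; }    # nested <div>: just count it
            next unless $tok =~ /\bclass\s*=\s*['"]data['"]/;
            next unless $tok =~ /\bid\s*=\s*['"](\w+)['"]/;
            push @pairs, "$1=";
            $nest = 1;                           # entered an interesting <div>
        }
        elsif ($tok =~ m{^</div\b}) {
            $nest-- if $nest > 0;                # never drop below zero
        }
        elsif ($tok !~ /^</ && $nest > 0) {
            $pairs[-1] .= $tok;                  # text inside an interesting <div>
        }
    }
    return join ', ', @pairs;
}

print extract_data_divs('<div><div class="data" id="One">Monday</div></div>'), "\n";
```

Note that the attribute checks still run against the whole tag, so an attribute value that itself contains something like `id='Foo'` will fool this sketch just as it would fool any plain pattern match on the tag text.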

    Update: Changed title to indicate (regex)

      Interesting post, thank you! I tested it, and apart from having to strip non-word characters out of the values, it mostly works. It doesn't pick up the id of the Sunday Saturday entry, and it also picks up the values "bbbdddeeeggg", but overall it's a very interesting start. Regexes are a fine tool for lexing, and adding some logic around them to keep track of nested tags etc. basically amounts to building a simple parser.

        I tried it and got no output. Did you fix something in my code?

        I did add a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements.

        For example, the following:

        </div >

        became </div\n>

        In the case of the Sunday div:

        <div title=" class='data' id='Foo'>Bar" id="Seven" class="data">&#xA0;Sunda&#121;</div>

        became:

        <div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data"> &#xA0;Sunda&#121; </div>

        So, I added tr/\n/ / for (@elements); to get rid of the embedded newlines. Still no output (other than the dump of the elements list).
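A small sketch of why that one-liner changes the list itself: inside `for`, `$_` is an alias to each array element, so `tr///` edits the element in place. The sample strings and the helper name `flatten_newlines` are made up for illustration; the helper works on a copy so the caller's list is left alone.

```perl
use strict;
use warnings;

# Replace embedded newlines with spaces in every element; returns a new list.
sub flatten_newlines {
    my @copy = @_;
    tr/\n/ / for @copy;    # $_ aliases each element, so @copy is modified
    return @copy;
}

# Hypothetical elements shaped like the examples quoted above.
my @elements = ("</div\n>", "<div title=\"x\"\nid=\"Seven\" class=\"data\">");
print "$_\n" for flatten_newlines(@elements);
```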

        I did encounter an unexpected error: Variable "$XML_SPE" is not imported at extractor.pl line 46. So, I changed:

        my @elements = $xml =~ /$XML_SPE/g;

        to:

        my @elements = $xml =~ /$::XML_SPE/g;
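For context, a sketch of two ways to reach a package variable under `use strict`. The exact cause of the message above isn't shown in this thread, so this only illustrates the general mechanism: `$::XML_SPE` is shorthand for `$main::XML_SPE`, and a fully qualified name is always allowed, while the short name needs an `our` declaration in the current scope. The pattern below is a hypothetical stand-in, not the REX regex, and the sub names are made up.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for the REX pattern, stored in $main::XML_SPE.
    # The short name is only declared inside this block.
    our $XML_SPE = qr/<[^>]*>|[^<]+/;
}

# Option 1: use the fully qualified name.
sub split_qualified {
    my ($xml) = @_;
    return $xml =~ /$::XML_SPE/g;
}

# Option 2: redeclare the package variable with "our" in the current scope.
sub split_with_our {
    my ($xml) = @_;
    our $XML_SPE;
    return $xml =~ /$XML_SPE/g;
}

print join('|', split_qualified('<p>hi</p>')), "\n";
```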

        I don't have time to debug my code now. Will try later.

        Current code:

        And the output:

        Thanks. Also, you've got me curious. I still haven't tested it, but I'm guessing the interesting title attribute for the Sunday division is part of the problem. You said it didn't pick up the id. I would have thought my code would have picked up id='Foo'. As for the bbbdddeeeggg, I'm thinking my code had trouble finding the correct </div>.

        I will try it and look at the list of elements generated by the shallow parsing regex.

Re^3: Parsing HTML/XML with Regular Expressions (XML::Twig)
by soonix (Canon) on Oct 17, 2017 at 11:45 UTC
    I am not sure whether such a regex would fit even into the 18-exabyte limit of most modern file systems …
    :-)
      Perl is a bit more complex to parse than HTML, don't you think?


      holli

