Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig)

Re^3: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 00:05 UTC
It was possible to produce a regex that parses all of Perl, why not one for HTML? There is a regex to parse XML (and therefore XHTML): XML Shallow Parsing. That regex produces a list of strings that will need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. You will need to keep track of `<div>` nesting to find the end of the contained text.

    # Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
    # (Warning: Messy hack. Read at your own risk.)

    my $nest = 0;
    my $out = '';
    my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
    for (@elements) {
        if (/^<div/) {
            $nest++ if ($nest > 0); # only increment if inside an interesting <div>
            next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
            next unless (/id\h*=\h*['"](\w+)['"]/);
            $out .= ", $1=";
            $nest = 1 if ($nest == 0); # if this is the outermost interesting <div>
            next;
        }
        $nest--, next if (/^<\/div/);
        next if (/^[<]/); # skip other mark-up
        $out .= $_ if ($nest > 0);
    }
    $out =~ s/^, //;
    say "$out\n";

Update: Changed title to indicate (regex)
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 19, 2017 at 16:27 UTC
Interesting post, thank you! I tested it and, except that I had to strip non-word characters out of the values, it mostly works: it doesn't pick up the `id` of the ~~`Sunday`~~ `Saturday` entry, and it also picks up the values "`bbbdddeeeggg`", but overall it's a very interesting start. Regexes are a fine tool for lexing, and by adding some logic around them to keep track of the nested tags etc., it's basically like building a simple parser.
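To make the "lexing plus a bit of state" idea concrete, here is a minimal sketch. It is not the code from the thread: `$TOKEN` is a deliberately simplified stand-in for the REX `$XML_SPE` pattern (it assumes no `>` ever appears inside an attribute value), and `extract_data_divs` is a hypothetical helper name. It also assumes well-nested `<div>` elements.

```perl
use strict;
use warnings;

# Hypothetical, simplified tokenizer: NOT the REX $XML_SPE pattern.
# Splits input into tags (<...>) and runs of text, and assumes that
# no '>' appears inside an attribute value.
my $TOKEN = qr/<[^>]*>|[^<]+/;

# Collect the text inside every outermost <div class="data" id="...">,
# keyed by id, counting nested <div>s to find the matching </div>.
sub extract_data_divs {
    my ($xml) = @_;
    my $depth = 0;      # 0 means "not inside an interesting <div>"
    my $id;             # id of the <div> currently being collected
    my %text;
    for my $tok ($xml =~ /$TOKEN/g) {
        if ($tok =~ /^<div\b/) {
            if ($depth > 0) { $depth++; next }          # nested <div>
            next unless $tok =~ /\bclass\s*=\s*['"]data['"]/;
            next unless $tok =~ /\bid\s*=\s*['"](\w+)['"]/;
            ($id, $depth) = ($1, 1);
            $text{$id} = '';
        }
        elsif ($tok =~ m{^</div\b}) {
            $depth-- if $depth > 0;
        }
        elsif ($tok !~ /^</ && $depth > 0) {
            $text{$id} .= $tok;                         # plain text token
        }
    }
    return \%text;
}
```

Calling `extract_data_divs($xml)` returns a hash reference mapping each id to the concatenated text of that `<div>`, including text inside nested elements. Keeping the regex as a pure tokenizer and putting the nesting logic in ordinary code is the design choice being illustrated; the tokenizer itself could be swapped for the full shallow-parsing expression.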
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 23:50 UTC
I tried it and got no output. Did you fix something in my code?

I did add a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements. For example, `</div >` became `</div\n>`. In the case of the Sunday div, `<div title=" class='data' id='Foo'>Bar" id="Seven" class="data"> Sunday</div>` became `<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data">  Sunday </div>`.

So, I added `tr/\n/ / for (@elements);` to get rid of the embedded newlines. Still no output (other than the dump of the elements list).

I did encounter an unexpected error: `Variable "$XML_SPE" is not imported at extractor.pl line 46.` So, I changed `my @elements = $xml =~ /$XML_SPE/g;` to `my @elements = $xml =~ /$::XML_SPE/g;`.

I don't have time to debug my code now. Will try later.

Current code: (collapsed, 6 kB). Output: (collapsed, 3 kB).
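The two workarounds in this post can be sketched together as follows. This is only an illustration under stated assumptions: the full script is not shown, so how `$XML_SPE` was originally defined is unknown, and the pattern below is a short stand-in rather than the real REX expression. `shallow_tokens` is a hypothetical helper name, and unlike the `tr` statement above it only rewrites newlines inside mark-up tokens, leaving text tokens alone.

```perl
use strict;
use warnings;

# Stand-in pattern for illustration only; the real $XML_SPE from the
# REX paper is far longer. Declaring it with 'our' makes the
# unqualified name legal under 'use strict'.
our $XML_SPE = qr/<[^>]*>|[^<]+/;

# Split into tokens, then turn newlines embedded in mark-up into
# spaces, leaving text tokens untouched.
sub shallow_tokens {
    my ($xml) = @_;
    # In package main this names the same variable as $::XML_SPE.
    my @elements = $xml =~ /$XML_SPE/g;
    for (@elements) {
        tr/\n/ / if /^</;       # $_ aliases the array element
    }
    return @elements;
}
```

With `our`, the package variable can be written either as `$XML_SPE` or as `$::XML_SPE` (short for `$main::XML_SPE`), so the explicit qualification becomes optional rather than required.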
Re^6: Parsing HTML/XML with Regular Expressions (regex) by haukex (Archbishop) on Oct 20, 2017 at 09:03 UTC
Re^7: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 23, 2017 at 22:33 UTC
Re^7: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 20, 2017 at 21:29 UTC
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 22:13 UTC
Thanks. Also, you've got me curious. I still haven't tested it, but I'm guessing the interesting title attribute for the Sunday division is part of the problem. You said it didn't pick up the id. I would have thought my code would have picked up `id='Foo'`. About the `bbbdddeeeggg`, I'm thinking my code had trouble finding the correct `</div>`. I will try it and look at the list of elements generated by the shallow parsing regex.
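The suspicion about the title attribute can be checked in isolation. The sketch below compares an unanchored search for `id=` with a small attribute scanner that walks the tag left to right using `\G`. `tag_attributes` is a hypothetical helper, not something from the thread, and it assumes every attribute value is quoted and the tag is well formed.

```perl
use strict;
use warnings;

# Parse the attributes of a single start tag into a hash.
# Hypothetical helper; assumes every value is quoted and well formed.
sub tag_attributes {
    my ($tag) = @_;
    my %attr;
    $tag =~ /\G<[\w:.-]+/gc or return {};       # skip the element name
    # Consume one name="value" or name='value' pair per iteration,
    # always continuing from where the previous match ended.
    while ($tag =~ /\G\s+([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/gc) {
        $attr{$1} = defined $2 ? $2 : $3;
    }
    return \%attr;
}

my $tag = q{<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">};

# A bare search finds the first "id=" anywhere in the tag text,
# including inside the quoted title value.
my ($naive) = $tag =~ /id\s*=\s*['"](\w+)['"]/;

my $attr = tag_attributes($tag);
print "naive: $naive, parsed: $attr->{id}\n";
# prints "naive: Foo, parsed: Seven"
```

Because the scanner consumes each quoted value as a unit, text that merely looks like an attribute inside `title="..."` is never examined on its own.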
Re^3: Parsing HTML/XML with Regular Expressions (XML::Twig) by soonix (Canon) on Oct 17, 2017 at 11:45 UTC
I am not sure whether such a regex would fit even into the 18-exabyte limit of most modern file systems … :-)
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by holli (Abbot) on Oct 17, 2017 at 17:27 UTC
Perl is a bit more complex to parse than HTML, don't you think?
holli
You can lead your users to water, but alas, you cannot drown them.
Re^5: Parsing HTML/XML with Regular Expressions (XML::Twig) by soonix (Canon) on Oct 18, 2017 at 06:22 UTC
Of course, my comment is an exaggeration, but at least the combination HTML5 + CSS3 is Turing-complete.


	PerlMonks