http://qs321.pair.com?node_id=11117402

corfuitl has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I am trying to implement a function to split a sentence with XML tags in 3 parts. To be more clear, I want to parse a sentence with XML tags so that I will not have the wrapping tags. Input sentences are:

<d id="43">Text </d> here <a id="33"/>
<b id="33"/> Text <d id="43">text</d> here
<d id="43">text here</d>
<d id="43">text here</d> <d id="44">text here</d>

Output should be

start: "", middle: "<d id="43">Text </d> here", end: " <a id="33"/>"
start: "<b id="33"/> ", middle: "Text <d id="43">text</d> here", end: ""
start: "<d id="43">", middle: "text here", end: "</d>"
start: "", middle: "<d id="43">text here</d> <d id="44">text here</d>", end: ""

I started my code, but I don't think it's efficient. Any suggestions? The a, b and c tags are self-closing, while d is always paired.

my $segment = $_;
my $start  = "";
my $end    = "";
my $middle = "";

# Move leading self-closing tags and whitespace into $start
while ($segment =~ /^(<[abc] id=\".*?\"\/>)/ || $segment =~ /^(\s+)/) {
    $start .= $1;
    $segment =~ s/^\Q$1\E//;
}

# Move trailing whitespace and self-closing tags into $end
while ($segment =~ /(\s+)$/ || $segment =~ /(<[abc] id=\".*?\"\/>)$/) {
    $end = "$1$end";
    $segment =~ s/\Q$1\E$//;
}

while ($segment =~ /^(<d id=\".*?\">).*?(<\/d>)/) {
    ----
}

print "start: \"$start\", middle: \"$middle\", end: \"$end\"\n";

Replies are listed 'Best First'.
Re: Getting start and end xml tags
by Corion (Patriarch) on May 28, 2020 at 12:41 UTC

      Hi

      Thank you for your time. I don't have any experience with these modules, which is why I haven't used them in my code :(

        It sounds as if now is a good time to familiarize yourself with one of these modules.

        When I inevitably hit the "omg I gotta parse some XML"* stage, I found an event-based parser easier to understand, but you may not. It was a long time ago, but I think the module I found easiest to work with and understand was XML::Simple (oh, but don't use that either!) (SAX-based parsing; see Wikipedia), which is event-based ("found a start tag!", etc.). I can't really remember, though, since it's been a very long time (circa 2004**). I think the book I bought at the time was Perl and XML.

        * I am a Perl programmer, Jeb! Not a Java monkey!

        ** Yes, everyone hits this stage as they're trucking along. It's a good sign. But please listen to those wiser than you (not me, I mean the others telling you to not parse XML with regexes)
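
To make the "found a start tag!" idea above concrete, here is a minimal sketch of event-based parsing. It assumes the XML::Parser module is installed (the reply above does not name it), and the wrapping root element and the @events array are hypothetical names used only for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Collect one line per parser event so the order of callbacks is visible.
my @events;

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every opening tag (and for self-closing tags).
        Start => sub {
            my ($expat, $name, %attr) = @_;
            my $attrs = join '', map { sprintf ' %s=%s', $_, $attr{$_} } sort keys %attr;
            push @events, "start <$name>$attrs";
        },
        # Called for every closing tag (and right after a self-closing tag).
        End => sub {
            my ($expat, $name) = @_;
            push @events, "end <$name>";
        },
        # Called for character data; one text run may arrive in several pieces.
        Char => sub {
            my ($expat, $text) = @_;
            push @events, "text '$text'";
        },
    },
);

# The sentence fragment has no single root element, so wrap it in a dummy one.
$parser->parse('<root><d id="43">Text </d> here <a id="33"/></root>');

print "$_\n" for @events;
```

The parser does the tokenizing, and the handlers only decide what to do with each event, which is why this style tends to be easy to follow once the callbacks are in place.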

Re: Getting start and end xml tags
by marto (Cardinal) on May 28, 2020 at 12:45 UTC

    Similar to your previous questions (e.g. Match text from txt to html), have you considered using a proper parser? What is the source/generator of this XML?

      Hi

      Thank you for your reply. It's not similar to my previous question.

        Your previous questions about working with HTML/XML have had solutions provided using proper parsing modules. The post I linked to was about working with sentences split up by HTML tags; here you have XML. How is this not similar?

        Unless you are dealing with very trivial, very static (X|HT|SG)ML markup the answer to "How do I do X with my FOOML" is pretty much invariably going to entail starting off "Use a FOOML parsing module, then . . .". See Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks for a recent, good pathological example why.

        Even if you think you've really got something trivial you're then betting against yourself that that format will never change and have baked an unspoken brittleness into your code.

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

Re: Getting start and end xml tags
by haukex (Archbishop) on May 28, 2020 at 14:20 UTC
    I am trying to implement a function to split a sentence with XML tags in 3 parts. To be more clear, I want to parse a sentence with XML tags so that I will not have the wrapping tags.

    Sorry, I don't understand how this matches up with the examples you gave, in particular why '<d id="43">text here</d>' gets split but the other <d> tags don't. I suspect this is an XY Problem, so perhaps you could explain why you want to split them in the peculiar way, and how you plan on using these split up values. I have a suspicion that a solution that splits up all the tags (optionally reassembling them later) will be easier for you.

    In any case, do not use regular expressions to parse XML/HTML.
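
As an illustration of the parser-based approach recommended throughout this thread, the following is a rough sketch rather than a definitive solution. It assumes XML::LibXML is installed, that each sentence becomes well-formed once wrapped in a dummy root element, and that the splitting rule inferred from the examples is the intended one: peel self-closing a/b/c tags and whitespace off both ends, then unwrap the remainder if it is a single paired element. The helper name split_sentence and the root wrapper are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: split one sentence fragment into (start, middle, end).
sub split_sentence {
    my ($sentence) = @_;

    # The fragments have no single root element, so wrap them in a dummy one.
    my $doc = XML::LibXML->load_xml(string => "<root>$sentence</root>");

    my @tokens;    # each token: [kind, serialized string, node (elements only)]
    for my $node ($doc->documentElement->childNodes) {
        if ($node->isa('XML::LibXML::Element')) {
            my $self_closing = $node->nodeName =~ /\A[abc]\z/ && !$node->hasChildNodes;
            push @tokens, [$self_closing ? 'self' : 'elem', $node->toString, $node];
        }
        else {
            # Plain text (not markup): separate leading/trailing whitespace.
            my ($lead, $core, $trail) = $node->toString =~ /\A(\s*)(.*?)(\s*)\z/s;
            push @tokens, ['ws',   $lead]  if length $lead;
            push @tokens, ['text', $core]  if length $core;
            push @tokens, ['ws',   $trail] if length $trail;
        }
    }

    # Peel self-closing tags and whitespace off the front and the back.
    my ($start, $end) = ('', '');
    $start .= (shift @tokens)->[1]
        while @tokens && $tokens[0][0] =~ /\A(?:ws|self)\z/;
    $end = (pop @tokens)->[1] . $end
        while @tokens && $tokens[-1][0] =~ /\A(?:ws|self)\z/;

    # If exactly one paired element with content is left, unwrap it.
    if (@tokens == 1 && $tokens[0][0] eq 'elem' && $tokens[0][2]->hasChildNodes) {
        my $node  = $tokens[0][2];
        my $whole = $node->toString;
        my $inner = join '', map { $_->toString } $node->childNodes;
        my $close = '</' . $node->nodeName . '>';
        my $open  = substr $whole, 0, length($whole) - length($inner) - length($close);
        return ($start . $open, $inner, $close . $end);
    }

    return ($start, join('', map { $_->[1] } @tokens), $end);
}

my @sentences = (
    '<d id="43">Text </d> here <a id="33"/>',
    '<b id="33"/> Text <d id="43">text</d> here',
    '<d id="43">text here</d>',
    '<d id="43">text here</d> <d id="44">text here</d>',
);

for my $sentence (@sentences) {
    my ($start, $middle, $end) = split_sentence($sentence);
    print qq{start: "$start", middle: "$middle", end: "$end"\n};
}
```

The only regular expressions left operate on tag names and on the whitespace of plain text nodes. The markup itself is tokenized by the parser, so a change in attribute order or quoting should not break the split.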