Getting start and end xml tags

corfuitl has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to implement a function that splits a sentence containing XML tags into three parts. More precisely, I want to separate the tags that wrap the sentence at its start and end from the content in between. The input sentences are:

<d id="43">Text </d> here <a id="33"/>
<b id="33"/> Text <d id="43">text</d> here 
<d id="43">text here</d>
<d id="43">text here</d> <d id="44">text here</d>

Output should be

start: "", middle: "<d id="43">Text </d> here", end: " <a id="33"/>"
start: "<b id="33"/> ", middle: "Text <d id="43">text</d> here", end: ""
start: "<d id="43">", middle: "text here", end: "</d>"
start: "", middle: "<d id="43">text here</d> <d id="44">text here</d>", end: ""

I have started on the code below, but I don't think it's efficient. Any suggestions? The a, b and c tags are always self-closing, while d is always paired.

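    my $segment = $_;
    my $start   = "";
    my $end     = "";
    my $middle  = "";

    # Move leading self-closing tags (a, b, c) and whitespace into $start.
    # [abc] replaces [a|b|c], which would also match a literal "|".
    while ($segment =~ /^(<[abc] id="[^"]*"\/>)/ || $segment =~ /^(\s+)/) {
        $start .= $1;
        $segment =~ s/^\Q$1\E//;
    }

    # Move trailing whitespace and self-closing tags into $end.
    # [^"]* replaces .*? so the match cannot run across several tags.
    while ($segment =~ /(\s+)$/ || $segment =~ /(<[abc] id="[^"]*"\/>)$/) {
        $end = "$1$end";
        $segment =~ s/\Q$1\E$//;
    }

    # One possible completion (this part was left unfinished in the post):
    # strip a <d> pair only when it encloses the whole remaining segment
    # and contains no further <d> tags.
    if ($segment =~ /^(<d id="[^"]*">)((?:(?!<\/?d\b).)*)(<\/d>)$/s) {
        $start .= $1;
        $middle = $2;
        $end    = "$3$end";
    }
    else {
        $middle = $segment;
    }

    print "start: \"$start\", middle: \"$middle\", end: \"$end\"\n";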


Replies are listed 'Best First'.
Re: Getting start and end xml tags by Corion (Patriarch) on May 28, 2020 at 12:41 UTC
Have you looked at XML::LibXML::SAX or XML::Parser or XML::Twig? All can give you the start, "middle" and end of an XML tag.
Re^2: Getting start and end xml tags by corfuitl (Sexton) on May 28, 2020 at 13:00 UTC
Hi, thank you for your time. I don't have experience with these modules, which is why I haven't used them in my code :(
Re^3: Getting start and end xml tags by Corion (Patriarch) on May 28, 2020 at 13:05 UTC
It sounds as if now is a good time to familiarize yourself with one of these modules.
Re^3: Getting start and end xml tags by perlfan (Vicar) on May 28, 2020 at 13:35 UTC
When I inevitably hit the "omg, I gotta parse some XML"* stage, I found it easier to understand an event-based parser, but you may not. It was a long time ago (circa 2004*), so I can't really remember, but I think the module I found easiest to work with and understand was XML::Simple (oh, but don't use that either!) (SAX-based parsing), which is event-based ("found a start tag!", etc.). I think the book I bought at the time was Perl and XML. I am a Perl programmer, Jeb! Not a Java monkey! ** Yes, everyone hits this stage as they're trucking along. It's a good sign. But please listen to those wiser than you (not me; I mean the others telling you not to parse XML with regexes).
Re^4: Getting start and end xml tags by haukex (Archbishop) on May 28, 2020 at 13:42 UTC
Re^5: Getting start and end xml tags by perlfan (Vicar) on May 28, 2020 at 14:07 UTC
Some notes below your chosen depth have not been shown here
Re: Getting start and end xml tags by marto (Cardinal) on May 28, 2020 at 12:45 UTC
Similar to your previous questions (e.g. Match text from txt to html), have you considered using a proper parser? What is the source/generator of this XML?
Re^2: Getting start and end xml tags by corfuitl (Sexton) on May 28, 2020 at 12:58 UTC
Hi, thank you for your reply. It's not similar to my previous question.
Re^3: Getting start and end xml tags by marto (Cardinal) on May 28, 2020 at 13:14 UTC
Your previous questions about working with HTML/XML have had solutions provided using proper parsing modules. The post I linked to was about working with sentences split up by HTML tags; here you have XML. How is this not similar?
Re^3: Getting start and end xml tags by Fletch (Bishop) on May 28, 2020 at 13:20 UTC
Unless you are dealing with very trivial, very static (X\|HT\|SG)ML markup, the answer to "How do I do X with my FOOML" is pretty much invariably going to start off with "Use a FOOML parsing module, then . . .". See Why a regex really isn't good enough for HTML and XML, even for "simple" tasks for a recent, good pathological example of why. Even if you think you've really got something trivial, you're then betting against yourself that the format will never change, and you have baked an unspoken brittleness into your code. The cake is a lie. The cake is a lie. The cake is a lie.
Re: Getting start and end xml tags by haukex (Archbishop) on May 28, 2020 at 14:20 UTC
"I am trying to implement a function to split a sentence with XML tags in 3 parts. To be more clear, I want to parse a sentence with XML tags so that I will not have the wrapping tags."
Sorry, I don't understand how this matches up with the examples you gave, in particular why `'<d id="43">text here</d>'` gets split but the other `<d>` tags don't. I suspect this is an XY Problem, so perhaps you could explain why you want to split them in this peculiar way, and how you plan on using these split-up values. I have a suspicion that a solution that splits up all the tags (optionally reassembling them later) will be easier for you. In any case, do not use regular expressions to parse XML/HTML.
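The "split up all the tags, then reassemble" idea can be sketched roughly as follows. This is only an illustration under strong assumptions: the markup is limited to the shapes shown in the question (self-closing a, b, c tags and paired d tags, with no comments, CDATA, or ">" inside attribute values), and the names tokenize, wraps_everything and split_sentence are made up for this sketch. It uses a regex purely as a tokenizer to stay within core Perl, so it does not follow the advice above to use a real parser; for anything beyond these shapes, one of the parser modules named in this thread is the safer route.

```perl
use strict;
use warnings;

# Token shapes, assumed from the question: <a/>, <b/>, <c/> are
# self-closing, <d>...</d> is always paired.
my $SELF_CLOSING = qr{^<[abc]\b[^<>]*/>$};
my $D_OPEN       = qr{^<d\b[^<>]*(?<!/)>$};
my $D_CLOSE      = qr{^</d\s*>$};

# Split a sentence into tags, whitespace runs, and runs of other text.
sub tokenize {
    my ($sentence) = @_;
    return grep { length } split /(<[^<>]+>|\s+)/, $sentence;
}

# True if the first token is a <d> whose matching </d> is the last token.
sub wraps_everything {
    my @tokens = @_;
    return 0 unless @tokens >= 2
        && $tokens[0]  =~ $D_OPEN
        && $tokens[-1] =~ $D_CLOSE;
    my $depth = 0;
    for my $i (0 .. $#tokens) {
        $depth++ if $tokens[$i] =~ $D_OPEN;
        $depth-- if $tokens[$i] =~ $D_CLOSE;
        # The opening <d> closed before the end: it does not wrap everything.
        return 0 if $depth == 0 && $i < $#tokens;
    }
    return $depth == 0;
}

# Return (start, middle, end) for one sentence.
sub split_sentence {
    my ($sentence) = @_;
    my @tokens = tokenize($sentence);
    my ($start, $end) = ('', '');

    # Peel leading whitespace and self-closing tags into $start.
    while (@tokens && ($tokens[0] =~ /^\s+$/ || $tokens[0] =~ $SELF_CLOSING)) {
        $start .= shift @tokens;
    }
    # Peel trailing whitespace and self-closing tags into $end.
    while (@tokens && ($tokens[-1] =~ /^\s+$/ || $tokens[-1] =~ $SELF_CLOSING)) {
        $end = pop(@tokens) . $end;
    }
    # Peel <d> pairs that enclose everything that is left.
    while (wraps_everything(@tokens)) {
        $start .= shift @tokens;
        $end    = pop(@tokens) . $end;
    }
    return ($start, join('', @tokens), $end);
}

for my $sentence (
    '<d id="43">Text </d> here <a id="33"/>',
    '<d id="43">text here</d>',
) {
    my ($start, $middle, $end) = split_sentence($sentence);
    print "start: \"$start\", middle: \"$middle\", end: \"$end\"\n";
}
```

Because the tokens are kept verbatim, joining start, middle and end always gives back the original sentence, which makes the split easy to sanity-check. Whether whitespace and self-closing tags should be peeled again after a wrapping d pair is removed is not clear from the question, so the sketch does not do that.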

