PerlMonks  

Trouble in text manipulation

by thirilog (Acolyte)
on Aug 05, 2010 at 07:28 UTC ( [id://853040]=perlquestion )

thirilog has asked for the wisdom of the Perl Monks concerning the following question:

Hi Gurus, Am trying to club the <text> entries together those having same y & page attribute values from the below input. I am handling <font> blocks with two and with three <text> entries separately:
<font...><text..><text..></font> & <font...><text..><text..><text..></font>
With <font..> blocks that have two <text> entries it is not working. Could you please tell me what the defect is?
Input a.xml
-----------
<font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
<text x="198" y="200" width="32" height="12" page="vii">Part I</text>
<text x="242" y="200" width="75" height="12" page="vii">Introduction</text>
<text x="329" y="200" width="7" height="12" page="vii">2</text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="183" y="221" width="47" height="9" page="vii">Chapter 1</text>
</font>
<font size="10" face="IJCIOP+Frutiger-Light" color="#231F20">
<text x="242" y="220" width="121" height="10" page="vii">Managers and Management</text>
<text x="373" y="220" width="6" height="10" page="vii">2</text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="198" y="234" width="32" height="9" page="vii">History</text>
<text x="195" y="246" width="35" height="9" page="vii">Module</text>
</font>
<font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
<text x="194" y="292" width="36" height="12" page="vii">Part II</text>
<text x="242" y="292" width="54" height="12" page="vii">Planning</text>
<text x="308" y="292" width="15" height="12" page="vii">56</text>
</font>
------------------
Code
------------------
use warnings;
use strict;
undef $/;
open(A3a,"a.xml") or die "$!";
open(B3a, ">a5.xml") or die("Sorry!");
my $tab_space = 16;
my ($xa, $ya, $wida, $heiga, $paga, $inxa, $xaa, $yaa, $widaa, $heigaa, $pagaa, $inxaa);
my ($content2, $wid_new1);
$content2 = <A3a>;
$content2 =~s/\n//gi;
while($content2 =~m/<font size="(.*?)" face="(.*?)" color="(.*?)"><text x="(.*?)" y="(.*?)" width="(.*?)" height="(.*?)" page="(.*?)">(.*?)<\/text><text x="(.*?)" y="(.*?)" width="(.*?)" height="(.*?)" page="(.*?)">(.*?)<\/text><\/font>/msgi){
    $xa = $4; $ya = $5; $wida = $6; $heiga = $7; $paga = $8; $inxa = $9;
    $xaa = $10; $yaa = $11; $widaa = $12; $heigaa = $13; $pagaa = $14; $inxaa = $15;
    if ($ya == $yaa && $paga == $pagaa){
        $wid_new1 = $wida + $widaa + $tab_space;
        $content2 =~s/<font size=\"(.*?)\" face=\"(.*?)\" color=\"(.*?)\"><text x=\"(.*?)\" y=\"(.*?)\" width=\"(.*?)\" height=\"(.*?)\" page=\"(.*?)\">(.*?)<\/text><text x=\"(.*?)\" y=\"(.*?)\" width=\"(.*?)\" height=\"(.*?)\" page=\"(.*?)\">(.*?)<\/text><\/font>/<font>\n<text x=\"$xa\" y=\"$ya\" width=\"$wid_new1\" height=\"$heiga\" page=\"$paga\">$inxa~~~$inxaa<\/text>\n/msgi;
    }
    else {
        $content2 =~s/<font size=\"(.*?)\" face=\"(.*?)\" color=\"(.*?)\"><text x=\"(.*?)\" y=\"(.*?)\" width=\"(.*?)\" height=\"(.*?)\" page=\"(.*?)\">(.*?)<\/text><text x=\"(.*?)\" y=\"(.*?)\" width=\"(.*?)\" height=\"(.*?)\" page=\"(.*?)\">(.*?)<\/text><\/font>/<font>\n<text x=\"$xa\" y=\"$ya\" width=\"$wida\" height=\"$heiga\" page=\"$paga\">$inxa<\/text>\n<text x=\"$xaa\" y=\"$yaa\" width=\"$widaa\" height=\"$heigaa\" page=\"$pagaa\">$inxaa<\/text>\n<\/font>\n/msgi;
    }
    print B3a $content2;
}
close (A3a);
close (B3a);

but it's working fine for three <text..> entries present in a <font..> block
use warnings;
undef $/;
open(A3,"a.xml") or die "$!";
open(B3, ">a4.xml") or die("Sorry!");
my $tab_space = 16;
my ($xx, $yy, $wid, $heig, $pag, $inx, $xx1, $yy1, $wid1, $heig1, $pag1, $inx1, $inx2, $xx2, $yy2, $wid2, $heig2, $pag2, $xx3, $yy3, $wid3, $heig3, $pag3, $inx3, $size, $face, $color);
my ($content1, $wid1_new, $wid_new1);
$content1 = <A3>;
$content1 =~s/\n//gi;
while($content1 =~m/<font size="(.*?)" face="(.*?)" color="(.*?)"><text x="(.*?)" y="(.*?)" width="(.*?)" height="(.*?)" page="(.*?)">(.*?)<\/text><text x="(.*?)" y="(.*?)" width="(.*?)" height="(.*?)" page="(.*?)">(.*?)<\/text><text x="(.*?)" y="(.*?)" width="(.*?)" height="(.*?)" page="(.*?)">(.*?)<\/text><\/font>/gi){
    $size = $1; $face = $2; $color = $3;
    $xx1 = $4; $yy1 = $5; $wid1 = $6; $heig1 = $7; $pag1 = $8; $inx1 = $9;
    $xx2 = $10; $yy2 = $11; $wid2 = $12; $heig2 = $13; $pag2 = $14; $inx2 = $15;
    $xx3 = $16; $yy3 = $17; $wid3 = $18; $heig3 = $19; $pag3 = $20; $inx3 = $21;
    if ($yy1 == $yy2 && $yy2 == $yy3 && $pag1 == $pag2 && $pag2 == $pag3){
        $wid1_new = $wid1 + $wid2 + $wid3 + $tab_space;
        print B3 "<text x=\"$xx1\" y=\"$yy1\" width=\"$wid1_new\" height=\"$heig1\" page=\"$pag1\">$inx1^^^$inx2%%%$inx3<\/text>\n";
    }
    else {
        print B3 "<font size=\"$size\" face=\"$face\" color=\"$color\">\n<text x=\"$xx1\" y=\"$yy1\" width=\"$wid1\" height=\"$heig1\" page=\"$pag1\">$inx1<\/text>\n";
        print B3 "<text x=\"$xx2\" y=\"$yy2\" width=\"$wid2\" height=\"$heig2\" page=\"$pag2\">$inx2<\/text>\n";
        print B3 "<text x=\"$xx3\" y=\"$yy3\" width=\"$wid3\" height=\"$heig3\" page=\"$pag3\">$inx3<\/text>\n</font>\n";
    }
}
close (A3);
close (B3);
Could someone please help me?
Thanks in advance,
Thirilog

Replies are listed 'Best First'.
Re: Trouble in text manipulation
by davido (Cardinal) on Aug 05, 2010 at 08:07 UTC

    The best help you're going to receive is for someone to suggest that you use a parsing module instead of a monster regular expression to handle this task. You're not manipulating text, you're manipulating what looks to me a lot like HTML. Use an HTML parser. Regular Expression based solutions to parsing HTML are fragile, and difficult to maintain.

    It may be possible to 'fix' the issue you're having using some variation of the regular expression you've constructed, but it will break again as soon as the input takes on some characteristic you weren't anticipating. A good parser won't balk at textual nuances that could trip up regular-expression solutions.
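
    To make the fragility concrete, here is a small sketch using only core Perl. It is not the script from the question: the attributes are trimmed down and the variable names are made up for illustration. It shows that a pattern written for exactly two <text> children also matches a block with three, because the lazy (.*?) is free to run across tag boundaries until the rest of the pattern fits.

```perl
use strict;
use warnings;

# A <font> block with THREE <text> children, already flattened to one line
# (the scripts in the question do this with s/\n//g). Attributes are
# trimmed for brevity; this is illustrative data, not the real input.
my $block = '<font size="12"><text y="200">Part I</text>'
          . '<text y="200">Introduction</text>'
          . '<text y="200">2</text></font>';

# A pattern intended for exactly TWO <text> children.
my $two_texts = qr{<font size="(.*?)"><text y="(.*?)">(.*?)</text><text y="(.*?)">(.*?)</text></font>};

my @captures = $block =~ $two_texts;

print "matched: ", (@captures ? "yes" : "no"), "\n";
# The last capture swallows the second AND third elements, markup included.
print "last capture: $captures[4]\n" if @captures;
```

    The pattern matches, and the final capture contains markup from the third element rather than just the text of the second one. A parser sidesteps this kind of surprise because it works on elements, not on character runs.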


    Dave

      Thanks Dave!
Re: Trouble in text manipulation
by murugu (Curate) on Aug 05, 2010 at 11:15 UTC
    Thirilog,

    From the input (a.xml) it seems that you are handling an XML file. It's better to use XML modules to work with XML than regular expressions.

    Kindly take a look at XML::Twig. You can use XPath expressions to access the elements/attributes as you wish.
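
    A minimal sketch of what that could look like, assuming XML::Twig is installed. The <doc> root element, the trimmed sample data, and the handler name merge_same_line are assumptions made for illustration. The handler key 'font' is one of XML::Twig's XPath-like trigger expressions; the handler runs once per completed <font> element.

```perl
use strict;
use warnings;
use XML::Twig;

# Two <font> blocks from the question, wrapped in a hypothetical <doc>
# root so that the input is well-formed XML.
my $xml = <<'XML';
<doc>
<font size="10" face="IJCIOP+Frutiger-Light" color="#231F20">
<text x="242" y="220" width="121" height="10" page="vii">Managers and Management</text>
<text x="373" y="220" width="6" height="10" page="vii">2</text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="198" y="234" width="32" height="9" page="vii">History</text>
<text x="195" y="246" width="35" height="9" page="vii">Module</text>
</font>
</doc>
XML

my $twig = XML::Twig->new(
    twig_handlers => { font => \&merge_same_line },
);
$twig->parse($xml);

# Called for each <font>: merge its <text> children into the first one,
# but only when they all share the same y and page attributes.
sub merge_same_line {
    my ($t, $font) = @_;
    my @texts = $font->children('text');
    return if @texts < 2;

    my ($first, @rest) = @texts;
    for my $other (@rest) {
        return if $other->att('y')    ne $first->att('y')
               || $other->att('page') ne $first->att('page');
    }

    $first->set_text(join ' ', map { $_->text } @texts);
    $_->delete for @rest;
}

my $result = $twig->sprint;
print $result, "\n";
```

    With this sample, the first block collapses to a single <text> element and the second block (different y values) is left alone.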

    Regards,
    Murugesan Kandasamy
    use perl for(;;);

Re: Trouble in text manipulation
by graff (Chancellor) on Aug 06, 2010 at 04:46 UTC
    I did try to use the two versions of code in the OP, but neither of them produced any output at all for the sample data you provided. And I'm not about to figure out why -- the time would be better spent starting over with a better approach.

    I think this bit from the OP is the "goal", but I'm not quite sure what you mean by this:

    Am trying to club the <text> entries together those having same y & page attribute values from the below input.

    Do you mean: if there are two (or three, or more) "text" elements within one "font" element, and they all have the same attribute values for "y" and "page", you want them to be collapsed together into a single "text" element? If that's what you mean, that's an intriguing task, which I was able to solve using XML::LibXML. (Other monks could probably do it more neatly -- "I am just an egg" when it comes to DOM manipulation...)
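
    One detail worth noting when comparing those attributes: "page" holds values such as "vii", which are strings, while both scripts in the OP compare them with the numeric == operator. A small core-Perl sketch of the difference (the second page value "xii" is made up for illustration):

```perl
use strict;
use warnings;

my ($page_a, $page_b) = ('vii', 'xii');   # two DIFFERENT pages

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Numeric comparison: both strings numify to 0, so they compare equal.
my $numeric_equal = ($page_a == $page_b) ? 1 : 0;

# String comparison: the values differ, as expected.
my $string_equal = ($page_a eq $page_b) ? 1 : 0;

print "== says equal: $numeric_equal\n";
print "eq says equal: $string_equal\n";
print "warnings raised: ", scalar(@warnings), "\n";
```

    With "use warnings" in effect, the == comparison also raises "isn't numeric" warnings, which is a useful hint that eq is the operator to reach for here.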

    In order to get that data to work with XML::LibXML (or any XML parser), I needed to add a "root" element around the set of "font" elements. Here's the code with the parsable version of the data attached:

    When I ran that, it printed the following results to STDOUT -- but of course I don't know if this is what you really want...
    <?xml version="1.0"?>
    <doc>
    <font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
    <text x="198" y="200" width="32" height="12" page="vii">Part I Introduction 2 </text>
    </font>
    <font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
    <text x="183" y="221" width="47" height="9" page="vii">Chapter 1 </text>
    </font>
    <font size="10" face="IJCIOP+Frutiger-Light" color="#231F20">
    <text x="242" y="220" width="121" height="10" page="vii">Managers and Management 2 </text>
    </font>
    <font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
    <text x="198" y="234" width="32" height="9" page="vii">History</text>
    <text x="195" y="246" width="35" height="9" page="vii">Module</text>
    </font>
    <font size="12" face="IJCINN+AvantGarde-Bold" color="#1BADEB">
    <text x="194" y="292" width="36" height="12" page="vii">Part II Planning 56 </text>
    </font>
    </doc>
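
    The script attached to this reply is not included in this copy of the thread. What follows is not that script, only a minimal sketch of the same idea, assuming XML::LibXML is installed. The <doc> root name and the trimmed sample data are assumptions. The sketch also sums the width attributes of the merged elements plus a 16-unit gap, mirroring $tab_space in the OP's scripts; whether that is the computation actually wanted is a guess.

```perl
use strict;
use warnings;
use XML::LibXML;

# Two <font> blocks from the question, wrapped in a <doc> root
# (the root name is arbitrary; any single enclosing element will do).
my $xml = <<'XML';
<doc>
<font size="10" face="IJCIOP+Frutiger-Light" color="#231F20">
<text x="242" y="220" width="121" height="10" page="vii">Managers and Management</text>
<text x="373" y="220" width="6" height="10" page="vii">2</text>
</font>
<font size="9" face="IJCINN+AvantGarde-Bold" color="#231F20">
<text x="198" y="234" width="32" height="9" page="vii">History</text>
<text x="195" y="246" width="35" height="9" page="vii">Module</text>
</font>
</doc>
XML

my $tab_space = 16;   # gap added to the merged width, as in the OP's scripts
my $doc = XML::LibXML->load_xml(string => $xml);

for my $font ($doc->findnodes('//font')) {
    my ($first, @rest) = $font->findnodes('./text');
    next unless @rest;   # nothing to merge with fewer than two <text> children

    # Skip the block unless every <text> shares y and page with the first.
    # String comparison (ne) is used because page holds values like "vii".
    next if grep {
           $_->getAttribute('y')    ne $first->getAttribute('y')
        || $_->getAttribute('page') ne $first->getAttribute('page')
    } @rest;

    my $merged = join ' ', map { $_->textContent } $first, @rest;
    my $width  = $tab_space;
    $width += $_->getAttribute('width') for $first, @rest;

    # Rewrite the first <text> in place and drop the others.
    $first->removeChildNodes;
    $first->appendText($merged);
    $first->setAttribute('width', $width);
    $font->removeChild($_) for @rest;
}

my $result = $doc->toString;
print $result;
```

    Because the loop takes however many <text> children a <font> has, the two-entry and three-entry cases no longer need separate scripts.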
      You are great. Hats off, graff!
      Your code covers my requirement exactly. I am going to use it straight away after adding a small computation: in addition, I need to sum the x-axis values of the "text" elements that get collapsed.
      As I am new to Perl modules, your logic is a bit hard for me to follow, but I will adapt the code to my requirement.
      The hardest part is this: even though I have installed the XML::LibXML module, it shows an error saying it can't locate the loadable object for module XML::LibXML:
      Can't locate loadable object for module XML::LibXML in @INC (@INC contains: C:/Perl/lib C:/Perl/site/lib .) at C:/Perl/lib/DynaLoader.pm line 153
      BEGIN failed--compilation aborted at C:/Perl/lib/XML/LibXML.pm line 153.
      Compilation failed in require at tocgen_v3.pl line 4.
      BEGIN failed--compilation aborted at tocgen_v3.pl line 4.
      I don't know what the problem is; I will keep trying...
      And regarding the need for a root element in the input XML: I already have one; I just forgot to include it in the sample.
      Many thanks to all
        Looks like a problem with how XML::LibXML was installed on your machine, maybe involving a dependency that is external to the module -- like maybe the libxml2 library is missing, or was installed in a "non-standard" path?

        libxml2 (a C library, separate from the cpan modules) needs to be installed first, and when you (re)install the cpan XML::LibXML modules, that installation process needs to know where to find the libxml2 files.

        It could be that your current XML::LibXML installation was done "by force", which would put all the XML/LibXML/*.pm files where they belong, even though there's no actual linkage to the C library (hence making the modules unusable). Good luck with (re)installing...
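
        A quick way to check the state of an installation is to try loading the module at run time and report what happens. This is only a sketch; the $report variable is made up for illustration. A "Can't locate loadable object" message here would point to the situation described above: the .pm files are present but the compiled part of the module is not.

```perl
use strict;
use warnings;

# Load the module at run time so a failure can be reported
# instead of aborting compilation.
my $loaded = eval { require XML::LibXML; 1 };

my $report;
if ($loaded) {
    # LIBXML_DOTTED_VERSION is the libxml2 version the module was built for.
    $report = sprintf "XML::LibXML %s loaded, built for libxml2 %s",
        XML::LibXML->VERSION, XML::LibXML::LIBXML_DOTTED_VERSION();
}
else {
    $report = "XML::LibXML failed to load: $@";
}

print "$report\n";
```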

Node Type: perlquestion [id://853040]
Approved by wfsp