http://qs321.pair.com?node_id=230601

wanadlan has asked for the wisdom of the Perl Monks concerning the following question:

  • Comment on How can I extract text from XML document and after that put the extracted text to original place?

Replies are listed 'Best First'.
Re: How can I extract text from XML document and after that put the extracted text to original place?
by rdfield (Priest) on Jan 28, 2003 at 15:14 UTC
    in other word, the full source code
    Not a chance. However, you might want to check CPAN. A quick search of the Monastery might turn up a few suggestions too. Have you looked in the Tutorials? Perhaps a book like XML and Perl, recently reviewed here by davorg, might be of use.

    rdfield

Re: How can I extract text from XML document and after that put the extracted text to original place?
by davorg (Chancellor) on Jan 28, 2003 at 15:25 UTC

    This sounds like a perfect use for a SAX filter.

    Process your file an event at a time. When you get a text event, run it through your spell checker. Write either the original event or the corrected text into a new file.
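
    A minimal sketch of such a filter, assuming XML::SAX and XML::SAX::Writer are installed from CPAN. The SpellFilter package name and the %fix correction table are hypothetical stand-ins for a real spell checker; only the characters() override matters here. Every other event passes through XML::SAX::Base unchanged, so the tags stay where they were.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package SpellFilter;
use base 'XML::SAX::Base';

# Hypothetical stand-in for a real spell checker: a fixed correction table.
my %fix = (tex => 'text', banans => 'bananas');

# Called for every text event; all other events use the inherited pass-through.
sub characters {
    my ($self, $event) = @_;
    my %copy = %$event;    # work on a copy of the event hash
    $copy{Data} =~ s/(\w+)/exists $fix{$1} ? $fix{$1} : $1/ge;
    $self->SUPER::characters(\%copy);    # forward the corrected text downstream
}

package main;
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

my $output = '';
my $writer = XML::SAX::Writer->new(Output => \$output);           # serializes events back to XML
my $filter = SpellFilter->new(Handler => $writer);                # filter sits in front of the writer
my $parser = XML::SAX::ParserFactory->parser(Handler => $filter); # parser feeds the filter

$parser->parse_string('<doc><p>some tex with <i>banans</i></p></doc>');
print $output, "\n";
```

    One caveat: a SAX parser may deliver a single run of text as several characters events, so a word that straddles two events would be missed by this sketch. Buffering the text until the next start or end tag would avoid that.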

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

(jeffa) Re: How can I extract text from XML document and after that put the extracted text to original place?
by jeffa (Bishop) on Jan 28, 2003 at 19:05 UTC
    As the others have already kindly explained, please Read the Kind Manual. However, i feel inclined to throw a 'MU' at this problem. Why would one even want to spell check an XML document? XML is generally used to hold data that is later going to be transformed into something else. Why not run the spell checker before you even create the XML or during the transformation or, in the case of plain text, after the transformation is finished?

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      What if your XML is a document that will later be converted to HTML, PDF and text? Then it makes sense to spell-check the XML instead of one of the target formats. Granted, your XML editor probably has a spell checker already, but if you use a pure text editor that has no spell checking capability (ed? ;--) to create short XML documents, or if you receive them from other authors who do not spell check them, then it might make sense to spell check the XML as a separate step.

Re: How can I extract text from XML document and after that put the extracted text to original place?
by boo_radley (Parson) on Jan 28, 2003 at 15:53 UTC
    I can provide you with source code for a reasonable price. msg me if you're interested.
XML::Simple for looping through an XML structure
by Coruscate (Sexton) on Jan 28, 2003 at 18:24 UTC

    You might want to look at the handy dandy XML::Simple module. Look at the XMLin() and XMLout() functions. XMLin() allows you to read in an XML document. Loop through the data structure that is returned from XMLin(), run your spell check on the data within it, then write the final results back to the XML document via XMLout().

    As for exporting the tags and the text between the tags to two separate files and then putting them back together, just say 'NO'. On a large XML file, this would be extremely slow and you'd be doing much more work than necessary.


          C:\>shutdown -s
          >> Could not shut down computer:
          >> Microsoft is logged in remotely.
        

      XML::Simple would probably not work here as it is designed for data-oriented XML and would not properly handle XML documents that include <p>some <i>mixed content</i> like this</p>.

      As for this method being a problem for very large files, in that case the bottleneck would not be the processing time but more likely the time spent using the spell checker interactively. If that's really a problem (a huge file with very few spelling mistakes) you can always do it chunk by chunk using... say... XML::Twig ;--)

Re: How can I extract text from XML document and after that put the extracted text to original place?
by Anonymous Monk on Jan 29, 2003 at 11:20 UTC
    I only want the method that I described, because I have already built my own spell checker (it also uses ispell). I want to extract the text from the XML file into a single file because I want the extracted text to be shown in a 'textarea' box in my application, which reads that text file. That way users who don't know XML can check their XML documents without seeing the XML tags. It will make checking their XML documents easy for them.

    Moreover, I do not want to write a new spell checker; that would waste my time. I must finish this application as soon as possible. Please.
Re: How can I extract text from XML document and after that put the extracted text to original place?
by wanadlan (Initiate) on Jan 28, 2003 at 17:07 UTC
    My problem is not the spell checker but extracting the text from the XML document and putting it back in its original place in the XML.
    e.g.:
    1. There is a spelling error in "tex": <xmltag>tex</xmltag>.
    2. Extract "tex" to one file for spell checking, and extract <xmltag> to another file.
       content of file 1: tex
       content of file 2: <xmltag> </xmltag>
    3. Check the content of file 1: tex --> text
    4. After spell checking:
       content of file 1: text
       content of file 2: <xmltag> </xmltag>
    5. Combine the contents of these two files to produce a new XML file that contains: <xmltag>text</xmltag>

    I hope you all can consider this problem. Thank you.
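
    The extract, check and recombine steps listed above can be sketched in core Perl alone. This is only an illustration under loud assumptions: it splits on tags with a regex, so it is safe only for simple XML with no comments, CDATA sections, processing instructions, or '>' inside attribute values; a real parser is the robust route. The names extract_text and merge_text are made up for this example, and arrays stand in for the two files.

```perl
use strict;
use warnings;

# Split the document into a tag "skeleton" with numbered markers and a list of texts.
sub extract_text {
    my ($xml) = @_;
    my @texts;
    my $skeleton = '';
    # The capturing group makes split return the tags as well as the text between them.
    for my $piece (split /(<[^>]*>)/, $xml) {
        if ($piece =~ /^</ or $piece !~ /\S/) {
            $skeleton .= $piece;    # tag or pure whitespace: stays in place
        }
        else {
            push @texts, $piece;    # real text: goes to the text list
            # \x01 and \x02 cannot occur in XML 1.0, so the marker cannot clash with content.
            $skeleton .= "\x01" . $#texts . "\x02";
        }
    }
    return ($skeleton, \@texts);
}

# Put each (possibly corrected) text back where its marker sits.
sub merge_text {
    my ($skeleton, $texts) = @_;
    (my $xml = $skeleton) =~ s/\x01(\d+)\x02/$texts->[$1]/g;
    return $xml;
}

my ($skeleton, $texts) = extract_text('<xmltag>tex</xmltag>');
s/\btex\b/text/ for @$texts;    # stand-in for the spell-checking step
print merge_text($skeleton, $texts), "\n";    # prints <xmltag>text</xmltag>
```

    The extracted texts are still XML-escaped (for example &amp;), so whatever edits them must leave entities intact. Writing @$texts and $skeleton to two files, and reading them back, is left out of the sketch.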
      Well, i finally convinced myself that i have already written something very similar to this: (jeffa) Re: XML Search and Replace. That combined with my Lingua::Ispell review yielded the following:
      use strict;
      use warnings;
      use XML::Parser;
      use XML::Writer;
      use Lingua::Ispell qw(spellcheck);

      # change me to the output of 'which ispell'
      $Lingua::Ispell::path = '/path/to/ispell';

      my $writer = XML::Writer->new();
      my $parser = XML::Parser->new(
          Handlers => {
              Init  => \&handle_Init,
              Start => \&handle_Start,
              Char  => \&handle_Char,
              End   => \&handle_End,
              Final => \&handle_Final,
          }
      );
      $parser->parse(*DATA);

      sub handle_Init {
          $writer->xmlDecl('UTF-8');
          $writer->doctype('xml');
      }

      sub handle_Start {
          my($self,$name,%atts) = @_;
          $writer->startTag($name,%atts);
      }

      sub handle_Char {
          my($self,$text) = @_;
          for my $r (spellcheck($text)) {
              if ($r->{type} eq 'miss') {
                  # \Q...\E makes the misspelled term match literally, not as a regex
                  $text =~ s/\Q$r->{term}\E/$r->{misses}->[0]/;
              }
          }
          $writer->characters($text);
      }

      sub handle_End {
          my($self,$name) = @_;
          $writer->endTag($name);
      }

      sub handle_Final {
          $writer->end();
      }

      __DATA__
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE xml>
      <xml>
        <stuff class="spelled wrong">
          <item>ys we ave no banans</item>
          <item>els in my hooverkraft</item>
        </stuff>
        <stuff class="spelled right">
          <item>yes we have no bananas</item>
          <item>eels in my hovercraft</item>
        </stuff>
      </xml>
      The important part is the handle_Char() subroutine. Right now, it simply replaces the misspelled item with the first 'miss' ispell coughs up. You will need to add an interface that allows a user to choose which miss they really want. That should be fairly simple - print the list of misses out for the user along with each miss's index in $r->{misses} and have them enter the index number. Also note that my script uses the built-in DATA filehandle for input and stdout for output -- you will want to change these. Good luck, and remember that this is Just One Way To Do It -- there are many more. :)
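
      A rough sketch of that choose-by-index interface, in core Perl only. The choose_miss name and its filehandle arguments are made up for this example; $r is assumed to have the same shape as one element of the list returned by spellcheck(), that is, a hash with 'term' and 'misses' keys. Filehandles are passed in so the demo at the bottom can run without a keyboard.

```perl
use strict;
use warnings;

# Show the candidate corrections for one misspelled term and return the chosen one.
# Falls back to the original term on end of input or an invalid answer.
sub choose_miss {
    my ($r, $in, $out) = @_;
    my @misses = @{ $r->{misses} };

    print {$out} "Misspelled: $r->{term}\n";
    print {$out} "  [$_] $misses[$_]\n" for 0 .. $#misses;
    print {$out} "Index of replacement (Enter keeps the original): ";

    my $answer = <$in>;
    return $r->{term} unless defined $answer;    # end of input
    chomp $answer;
    return $r->{term} unless $answer =~ /^\d+$/ && $answer <= $#misses;
    return $misses[$answer];
}

# Demo with canned input instead of a keyboard: the "user" types 0.
my $typed = "0\n";
open my $in, '<', \$typed or die "cannot open in-memory input: $!";
my $choice = choose_miss(
    { type => 'miss', term => 'banans', misses => [ 'bananas', 'banns' ] },
    $in, \*STDOUT,
);
print "\nchosen: $choice\n";
```

      Inside handle_Char() the call would look something like choose_miss($r, \*STDIN, \*STDERR), with the result used in place of $r->{misses}->[0]. Sending the prompts to STDERR keeps them out of the XML that the script writes to stdout.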

      jeffa

Re: How can I extract text from XML document and after that put the extracted text to original place?
by wanadlan (Initiate) on Jan 28, 2003 at 16:23 UTC
    I'm sorry. I hope you can all help me. I don't mean the full source code; I just want you all to help me solve this problem and give me tips on how to do it. Please.