
Unable to get more than one line

by winefm (Initiate)
on Jan 02, 2003 at 14:09 UTC (#223771)

winefm has asked for the wisdom of the Perl Monks concerning the following question:

I have some questions about pattern matching in Perl. I am unable to match more than one line of an XML file. The file is as below:
    <profile>
      <emp>
        <name>Mahesh</name>
        <age>24</age>
        <address>New york</address>
        <desig>Developer</desig>
      </emp>
    </profile>
I need to extract the data matched by <emp>.*</emp>, so I used code like this:
    open(F, "<prof.xml");
    while(<F>) {
        $_ =~m/(<emp>.*<\/emp>)/m);
        print "$1\n";
    }
but I am unable to extract the data. Please help me.

update (broquaint): removed HTML tags and added <code> tags

Replies are listed 'Best First'.
Re: Unable to get more than one line
by gjb (Vicar) on Jan 02, 2003 at 14:19 UTC

    Don't, don't, don't do XML (or HTML) parsing/data extraction with regular expressions. XML has a tree structure and in general can't be described by a regular expression.

    Use a tool such as XML::Simple or XML::Parser to handle such a job.
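To make the tree-structure point concrete, here is a minimal sketch using only core Perl. The nested input is a hypothetical example, not data from the question; it shows that neither a greedy nor a non-greedy version of the question's pattern returns the balanced outer element once elements nest.

```perl
use strict;
use warnings;

# Hypothetical input: an <emp> that contains another <emp>, then a sibling.
my $xml = '<emp><name>A</name><emp><name>B</name></emp></emp><emp><name>C</name></emp>';

# Greedy: runs from the first <emp> to the LAST </emp>, swallowing everything.
my ($greedy) = $xml =~ m{(<emp>.*</emp>)}s;

# Non-greedy: stops at the FIRST </emp>, cutting the outer element short.
my ($lazy) = $xml =~ m{(<emp>.*?</emp>)}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

A parser keeps track of which closing tag belongs to which opening tag, which is the bookkeeping a simple pattern like this one cannot do.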

    Just my 2 cents, -gjb-

      I would add one caveat - if you need to handle many many files, and your file format is fixed, regular expressions will be much faster than parsing the XML. I got burned by going the virtuous route on a set of 60K Reuters wire stories - there was an order of magnitude speed difference between regular expressions and XML::Parser.

      I have found this to be a tradeoff with many XML tools - the right way to do it tends also to be slow, resource intensive, or both. XSLT comes to mind. It is frustrating, but hopefully a temporary growing pain.

        As far as speed goes, I think you'll find that XML::LibXML is the fastest XML parser on the block and somewhat preferable to XML::Parser.

        -- vek --

        Hear, hear.

        The few times I have had to deal with XML, I find that people tend to pay lip service to it, and manage to emit badly formed XML far more often than they get it right. Lone & characters in text being the worst offense. In order to use XML parsing tools, you first have to run a cleanup script over the received data so that the tools don't curl up and die.
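A cleanup pass of the kind described might look like the sketch below. The helper name escape_lone_ampersands and the sample string are assumptions made for illustration. It only handles the lone & case and assumes anything shaped like &name; or &#38; is already a valid reference.

```perl
use strict;
use warnings;

# Hypothetical helper: escape any '&' that does not start an entity or
# character reference such as &amp; &#38; or &#x26;
sub escape_lone_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

my $dirty = '<name>Smith & Sons</name><note>AT&amp;T &#38; co</note>';
my $clean = escape_lone_ampersands($dirty);
print "$clean\n";
```

This does not look inside CDATA sections or check that a named entity is actually declared, so treat it as a first filter rather than a validator.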

        Furthermore, the XML in question is usually being emitted from an old program that has been modified to produce XML today, when in the past it was producing plain old data. By extension, it means that XML you get to deal with has a rigid structure, not at all free-form as the spec might make you think.

        I would hazard a bet and say that the majority of XML is used to get one system to speak to another system. I would guess that the number of instances where one system has to deal with incoming XML from multiple sources is quite small in comparison.

        If you are in the position of getting data from one system to another you usually have control over how and when the format is changed. When you have that much control over the environment, simple methods suffice.

        For instance, to paraphrase some old code I have, you can get a lot of mileage out of Perl's wonderful ... operator (not to be confused with ..).

        #! /usr/bin/perl -w
        use strict;

        my @stuff = grep { /<emp>/ ... /<\/emp>/ } <DATA>;

        __DATA__
        <profile>
        <emp>
        <name>Mahesh</name>
        <age>24</age>
        <address>New york</address>
        <desig>Developer</desig>
        </emp>
        </profile>
        <junk>
        <morejunk />
        </junk>
        <profile>
        <emp>
        <name>Mahesh2</name>
        <age>242</age>
        <address>New york2</address>
        <desig>Developer2</desig>
        </emp>
        </profile>
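For readers unsure how ... differs from .., the sketch below runs both over the same lines. The sample lines are made up for illustration. When the opening and closing tags sit on separate lines, as in the data above, the two operators behave the same; the difference only shows when a record opens and closes on a single line.

```perl
use strict;
use warnings;

my @lines = (
    "<emp>one</emp>\n",   # a record opened and closed on one line
    "<junk/>\n",
    "<emp>\n",
    "two\n",
    "</emp>\n",
    "<tail/>\n",
);

# '..' tests the closing pattern on the same line that opened the range.
my @two_dots   = grep { /<emp>/ ..  /<\/emp>/ } @lines;

# '...' waits until the next line before testing the closing pattern.
my @three_dots = grep { /<emp>/ ... /<\/emp>/ } @lines;

printf "..  kept %d lines\n", scalar @two_dots;
printf "... kept %d lines\n", scalar @three_dots;
```

With this input, .. keeps four lines and ... keeps five: the three-dot form does not notice the </emp> on the first line, so the range stays open across the <junk/> line until the next </emp>. Which operator is right depends on how the XML is laid out.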

        You might ask what happens when a new element is added. Well, surprise! you will be obliged to modify your script that parses XML too, if you want to do anything with it.

        Don't get me wrong, I am a big fan of XML, but I think it suffers from too much hype. People seem to be happy to use it even when simpler methods exist.


        print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: Unable to get more than one line
by broquaint (Abbot) on Jan 02, 2003 at 14:21 UTC
    This is because you're only matching a line at a time, whereas you want to match the whole file, e.g.
    ## die() if we can't open the file
    open(F, "prof.xml") or die("ack: $!");
    ## join together all the lines into a single string
    my $xml = join '', <F>; # see. also C<local $/>
    ## assign $data to the capture in the regex
    ## also note the use of the 's' modifier (see. man perlre)
    my($data) = $xml =~ m{(<emp>.*</emp>)}s;
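One thing to watch if the file ever holds more than one <emp>: a greedy .* spans from the first <emp> to the last </emp>. The sketch below is a variation on the code above, not a replacement for it. It uses a non-greedy .*? with /g to capture each record separately. The in-memory filehandle, the lexical $fh and the two-record sample are assumptions so that the example runs without a prof.xml on disk.

```perl
use strict;
use warnings;

# Stand-in for prof.xml, read through an in-memory filehandle.
my $text = <<'XML';
<profile>
<emp>
<name>Mahesh</name>
</emp>
<emp>
<name>Mahesh2</name>
</emp>
</profile>
XML

open(my $fh, '<', \$text) or die("ack: $!");
my $xml = do { local $/; <$fh> };   # slurp: $/ is undef inside the block only
close($fh);

# Non-greedy .*? with /g returns one capture per <emp> element.
my @records = $xml =~ m{(<emp>.*?</emp>)}sg;

# Pull the name out of each captured record.
my @names;
for my $record (@records) {
    my ($name) = $record =~ m{<name>(.*?)</name>};
    push @names, $name;
}
print "$_\n" for @names;
```

This still assumes that <emp> elements never nest and that the layout is fixed, which is the situation where a regex approach is defensible at all.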
    But if you're working with XML you'll be wanting an XML parser to make your life easier. Firstly there's the basic XML::Parser, but that'll probably be a little clunky for your needs so you may want to use XML::Simple instead.
    HTH

    _________
    broquaint

Re: Unable to get more than one line
by hiseldl (Priest) on Jan 02, 2003 at 14:27 UTC

    If you want to write a short script, check out the XML::Simple module. If you are writing an application take a look at XML::Parser or XML::Twig.

    Here is a snippet that may be helpful:

    #!/usr/bin/perl.exe -w
    use strict;
    use XML::Simple;
    use Data::Dumper;

    my $ref = XMLin("./data.xml");
    print Dumper($ref);
    print $ref->{emp}->{name},$/;
    print $ref->{emp}->{age},$/;
    print $ref->{emp}->{address},$/
    ...and just for completeness here is the 'data.xml' file that I used:
    <profile>
      <emp>
        <name>Mahesh</name>
        <age>24</age>
        <address>New york</address>
        <desig>Developer</desig>
      </emp>
    </profile>
    ...and what my results looked like:
    $ ./script.pl
    $VAR1 = {
              'emp' => {
                         'desig' => 'Developer',
                         'address' => 'New york',
                         'age' => '24',
                         'name' => 'Mahesh'
                       }
            };
    Mahesh
    24
    New york

    --
    hiseldl
    What time is it? It's Camel Time!

Re: Unable to get more than one line
by Sifmole (Chaplain) on Jan 02, 2003 at 14:22 UTC
    Your problem is that while (<F>) reads only a single line per iteration, and your <emp> and </emp> do not occur on the same line.

    To do what you are trying to do here, you could put local $/; before the while to make Perl slurp the whole file as one line. You will also need to change the regex to use the s option (your original also has an extra paren at the end):

    $_ =~m/(<emp>.*<\/emp>)/s;
    The /s says (simplified) to treat the whole string as a single line: the . in the pattern is allowed to match newlines instead of stopping at them.
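As a quick check of why the /m in the question did not help, this small sketch runs the same pattern with each modifier. The string is a shortened, assumed copy of the question's file held in a variable.

```perl
use strict;
use warnings;

my $xml = "<profile>\n<emp>\n<name>Mahesh</name>\n</emp>\n</profile>\n";

# /m only changes ^ and $; '.' still refuses to cross a newline.
my $with_m = $xml =~ m/(<emp>.*<\/emp>)/m ? 'match' : 'no match';

# /s lets '.' match newlines too, so the pattern can span lines.
my $with_s = $xml =~ m/(<emp>.*<\/emp>)/s ? 'match' : 'no match';

print "/m: $with_m\n";
print "/s: $with_s\n";
```

So even after slurping the whole file, the pattern needs /s rather than /m to reach from <emp> to </emp>.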

    Of course, if you are doing more than just toy work with XML, you might want to look at the libraries available at www.cpan.org.

Re: Unable to get more than one line
by jimc (Sexton) on Jan 02, 2003 at 21:56 UTC
    SYNOPSIS
    use XML::Simple;
    use Data::Dumper;
    my $ref = XMLin($filename);
    print Dumper $ref;
    
    From there, it's a simple structure deref. You should use the right tool for the job. Regexes are great, but not for everything.
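For anyone new to references, the deref step could look like the sketch below. The hash is a hand-written copy of the Dumper output shown earlier in the thread, used here so that the example runs without XML::Simple installed; in a real script $ref would come from XMLin.

```perl
use strict;
use warnings;

# Hand-written stand-in for the structure XMLin returned in the thread.
my $ref = {
    emp => {
        name    => 'Mahesh',
        age     => '24',
        address => 'New york',
        desig   => 'Developer',
    },
};

# Arrows between adjacent subscripts are optional.
my $name = $ref->{emp}{name};

# Walk every field of the record in a stable (sorted) order.
my @fields;
for my $key (sort keys %{ $ref->{emp} }) {
    push @fields, "$key=$ref->{emp}{$key}";
}

print "$name\n";
print "$_\n" for @fields;
```

The shape of the structure changes when the file holds more than one <emp>, so it is worth checking the Dumper output before hard-coding the keys.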
