PerlMonks  

Parse a large string

by 1001jrlight (Initiate)
on Mar 10, 2009 at 11:34 UTC ( [id://749560]=perlquestion )

1001jrlight has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I would be grateful for any help. I am trying to parse a large string of data. The problem is that the string is all condensed onto one line. I can think of a couple of different ways to parse and print the data I need if it were broken up into different lines, but since it's all technically on one line, I don't know what to do. I need to search for one of three expressions and, if it matches, print the sentence that contains it. For example:

data.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla commodo dignissim dui. Mauris egestas nunc non justo. Praesent consectetur pharetranulla. Mauris sed magna. Fusce sit amet lectus. Aliquam bibendum mi sollicitudin nulla. Pellentesque volutpat. Morbi ac nibh ut mauris tempor molestie. Nullam sit amet mi at neque lacinia suscipit. Nunc sem erat, porta fermentum, tempus sed, porttitor et, nibh. Nulla turpis orci, egestas eget, lacinia id, tincidunt vel, ligula.Donec sit amet libero. Pellentesque ac felis vel erat interdum elementum. Praesent luctus tellus sit amet velit. Cras lacinia molestie nibh. Suspendisse cursus. Sed facilisis magna id nisl blandit malesuada. Cras commodo. Nam gravida dolor eu purus. Sed et velit. Nulla rhoncus hendrerit lectus. Ut nisi. Nam suscipit eros accumsan quam. Nam ornare. Morbi a ipsum non urna adipiscing tempus. Duis in dui a enim malesuada tempus. Ut vehicula sollicitudin tellus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Vivamus gravida adipiscing purus. Phasellus varius nisi et mauris.Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed ullamcorper erat sit amet magna. Sed porta nisi quis leo. Integer elementum elit vel libero. Fusce vulputate magna sed nisi imperdiet fringilla. Nullam quis augue. Suspendisse mauris tortor, sollicitudin non, posuere ut, bibendum id, enim. Aenean id purus. Donec pretium. Nam blandit nisi at elit. Fusce ac erat et quam porta eleifend. Sed imperdiet bibendum nulla. Morbi varius sagittis justo. Phasellus hendrerit ullamcorper risus. Phasellus nisl ante, ullamcorper nec, pellentesque quis, rutrum in, ligula. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.Nulla facilisi. Suspendisse commodo diam ut dui. Mauris neque est, consequat vitae, vestibulum vel, sodales quis, mauris. Ut pharetra mauris sit amet metus. Nulla hendrerit sapien eleifend massa. Aliquam lacinia tempus augue. Nullam congue congue lectus. 
Suspendisse nulla lectus, rhoncus eu, dapibus et, tempus ut, sapien. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Cras libero. Curabitur scelerisque metus quis tortor facilisis ornare. Aenean sodales ante vitae eros. Suspendisse potenti. Integer auctor nisi a diam. Mauris tristique laoreet leo. Integer eu tortor. Quisque lacinia mauris et elit.Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Nam luctus mauris nec lorem. Suspendisse leo est, ornare quis, volutpat quis, imperdiet quis, diam. Duis vestibulum. Vestibulum iaculis diam in mauris. Donec sollicitudin. Proin justo turpis, vestibulum in, lacinia sit amet, ultrices sit amet, dolor. Sed placerat, leo non facilisis dictum, lectus ligula vulputate purus, id varius metus turpis sed tortor. Praesent eu erat ut justo imperdiet cursus. Donec et magna id diam pulvinar sodales. Sed eu libero sit amet tortor mollis pretium. Nullam a orci. Proin ac massa.

This text file is comprised of one line of data. I can do a regexp search within an if statement and it will find the line I need, but on print, it will print the entire line.

For example, I know that the sentence I am looking for will always start with "Nullam" and will end with either "augue" or "libero".

Thanks for your help.

Replies are listed 'Best First'.
Re: Parse a large string
by ELISHEVA (Prior) on Mar 10, 2009 at 12:29 UTC

    If your one long string is coming from a file, you don't have to use a newline to divide it up into sentences. You can set $/ to any string, including a period - see perlvar for details. This has the additional advantage of not consuming large amounts of memory by trying to slurp in a long chunk of text without any newlines all at once.
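    A minimal sketch of the file case follows. The sub name matching_sentences and the sample text are made up for illustration, and the demo reads from an in-memory filehandle; the assumption is that a handle opened on data.txt would be read the same way.

```perl
use strict;
use warnings;

# Return every sentence read from $fh that starts with "Nullam" and ends
# with "augue" or "libero". Sentences are assumed to end with a period.
sub matching_sentences {
    my ($fh) = @_;
    local $/ = '.';                 # a "line" now ends at each period
    my @found;
    while ( my $sentence = <$fh> ) {
        chomp $sentence;            # chomp strips the trailing $/, i.e. the period
        $sentence =~ s/^\s+//;      # drop the space that followed the previous period
        push @found, "$sentence."
            if $sentence =~ /^Nullam\b[^.]*\b(?:augue|libero)$/;
    }
    return @found;
}

# Demo on an in-memory filehandle (hypothetical sample text).
my $text = 'Nulla turpis orci.Donec sit amet libero. Nullam quis augue. Nullam a orci.';
open( my $fh, '<', \$text ) or die "Cannot open in-memory string: $!";
print "$_\n" for matching_sentences($fh);
close $fh;
```

    Only one sentence is held in memory at a time, which is the point of setting $/ instead of slurping.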

    If the one long string is coming from elsewhere, you can still make use of $/ by opening an input stream on a string. Here's an example of a solution that lets you freely choose the starting words, ending words, and end of sentence marker. It uses both $/ and a string turned into an input stream.

    use strict;
    use warnings;

    sub findWords {
        my ($sText, $sEndSentence, $reStart, $reEnd) = @_;

        local $/ = $sEndSentence;
        open(LONG_STRING, "<", \$sText) || die "Can't open string";
        while (my $line = <LONG_STRING>) {
            chomp $line;
            if (($line =~ $reStart) && ($line =~ $reEnd)) {
                print "match: $line\n";
            } else {
                print "no match: $line\n";
            }
        }
    }

    findWords("a b c. Nullam est libero.Nullam est augue", '.',
              qr(^\s*Nullam), qr((?:libero|augue)\s*$));

    Best, beth

Re: Parse a large string
by Anonymous Monk on Mar 10, 2009 at 12:17 UTC

    Split the line into sentences first:

    my $data = getDataFromFile();
    my @sentences = split /(?<=\.)\s+(?=[A-Z])/, $data;
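
    A small worked version of the split idea, as a sketch only. The sub name split_sentences and the sample text are hypothetical, and \s* is used here instead of \s+ on the assumption that sentences such as "ligula.Donec" (no space after the period, as in the posted data) should also be split.

```perl
use strict;
use warnings;

# Split text into sentences at every period that is followed by optional
# whitespace and a capital letter. Using \s* rather than \s+ is an
# assumption made here so that "ligula.Donec" is split as well.
sub split_sentences {
    my ($data) = @_;
    return split /(?<=\.)\s*(?=[A-Z])/, $data;
}

# Hypothetical sample text standing in for the contents of data.txt.
my $data = 'Nulla turpis orci, ligula.Donec sit amet libero. Nullam quis augue. Nullam a orci.';
for my $sentence ( split_sentences($data) ) {
    print "$sentence\n"
        if $sentence =~ /^Nullam\b[^.]*\b(?:augue|libero)\.$/;
}
```

    Because the period sits in a lookbehind, it stays attached to the sentence it ends.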
Re: Parse a large string
by hbm (Hermit) on Mar 10, 2009 at 12:32 UTC
    use strict;
    use warnings;

    {
        local $/ = undef;
        print map  { "$_.\n" }
              grep { m/^Nullam[^.]+(?:augue|libero)$/i }
              split /\.\s+/, <DATA>;
    }

    __DATA__
    (same text as data.txt in the question above)

    Prints "Nullam quis augue."

Re: Parse a large string
by Marshall (Canon) on Mar 10, 2009 at 13:26 UTC
    This is a brain twisting regex problem.
    Here's my go at it:
    #!/usr/bin/perl -w
    use strict;

    $/ = undef;
    my $data = <DATA>;
    my @text = ($data =~ /(Nullam.+(?:augue|libero)\.)/g);
    print @text;
    #__prints:
    #Nullam quis augue.

    Update: Including grandfather's idea and cleaning up the code to account for Nullam and augue or libero being on different lines (take out the \n's, if any) and to print each sentence on its own line; see below. The [^.]+ works well here, as we don't have to worry about using, say, the /s regex modifier to allow "." to also match newlines (by default "." matches anything except a newline).

    $/ = undef;
    my $data = <DATA>;
    my @text = ($data =~ /(Nullam\b[^.]+(?:augue|libero)\.)/g);
    @text = map { tr/\n/ /; $_ } @text;
    print join("\n", @text), "\n";
    #print join(" ",@text); #alternative to put a space after the period.

      Consider what happens with the following string (disregarding any speeling misadventures errors and grammatical):

      Nullamie a orci. Nullam quis augue. Aliquam lacinia tempus Praugue.

      It is considered good practice to avoid using .* and .+ - they tend to be greedier than you often intend. Very often you are better off using a negated character class: [^.]+ would help a lot in this case. Also, the word break anchor \b will help get the intended behavior.
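
      To make the difference concrete, here is a sketch that runs both patterns over the sample string above. The sub names greedy_matches and bounded_matches are made up for illustration.

```perl
use strict;
use warnings;

# Greedy .+ with no word boundary: can start inside "Nullamie" and
# runs on to the last "augue." or "libero." in the string.
sub greedy_matches {
    my ($s) = @_;
    return $s =~ /(Nullam.+(?:augue|libero)\.)/g;
}

# Negated class plus \b: cannot cross a period, and "Nullam" must be
# a whole word.
sub bounded_matches {
    my ($s) = @_;
    return $s =~ /(Nullam\b[^.]+(?:augue|libero)\.)/g;
}

my $string = 'Nullamie a orci. Nullam quis augue. Aliquam lacinia tempus Praugue.';
print "greedy:  $_\n" for greedy_matches($string);
print "bounded: $_\n" for bounded_matches($string);
```

      The greedy version returns the whole string as a single match; the bounded version returns only "Nullam quis augue."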


      True laziness is hard work
        Quite correct! As written the regex would match Nullamie as well as Nullam and the greediness would eat the first augue!

        Another way to calm greediness is the ? modifier: .+ is a maximal match, .+? is a minimal match, like: my @text = ($data =~ /(Nullam\b.+?(?:augue|libero)\.)/g); That's sometimes a good way to go and would work if we didn't have the "." to help us out here. I like your [^.]+ though; your idea looks great to me! There is more than one way to skin these regex cats!
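
        One caveat worth sketching: a minimal match is still free to cross a period, so it can glue two sentences together where [^.]+ cannot. The sub names and the sample string below are hypothetical.

```perl
use strict;
use warnings;

# Minimal match: stops at the first "augue." or "libero." it can reach,
# but nothing prevents it from stepping over a period on the way there.
sub lazy_matches {
    my ($s) = @_;
    return $s =~ /(Nullam\b.+?(?:augue|libero)\.)/g;
}

# Negated class: the match must stay within a single sentence.
sub single_sentence_matches {
    my ($s) = @_;
    return $s =~ /(Nullam\b[^.]+(?:augue|libero)\.)/g;
}

# The "Nullam" sentence ends in "orci"; only the next sentence ends in "augue".
my $string = 'Nullam a orci. Sed quis augue.';
print "lazy:            $_\n" for lazy_matches($string);
print "single sentence: $_\n" for single_sentence_matches($string);
```

        Here the lazy pattern returns both sentences as one match, while the negated class returns nothing, which is the desired result for this input.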

Re: Parse a large string
by johngg (Canon) on Mar 10, 2009 at 14:06 UTC

    Read your file into the long string as before, process each sentence in a foreach loop using a global regular expression with a capture, only printing the sentence if it matches.

    use strict;
    use warnings;

    my $str = <DATA>;

    foreach my $sentance ( $str =~ m{([^.]+\.)\s?}g )
    {
        next unless $sentance =~ m{^Nullam\b.*\b(?:augue|libero)};
        print qq{$sentance\n};
    }

    __END__
    (same text as data.txt in the question above, all on one line)

    The output.

    Nullam quis augue.

    I hope this is useful.

    Cheers,

    JohnGG

Re: Parse a large string
by sundialsvc4 (Abbot) on Mar 10, 2009 at 13:19 UTC

    You can always, if you choose, read “as much as you want” of the file, a “chunk,” into a buffer of some agreeable size. This naturally means that whatever is at the end of the buffer is probably incomplete, so, when you have divided up the information that you've read in the agreed-upon way, set aside the last portion (presuming it to be incomplete). Then, when you read the next “chunk,” prepend that portion to what you have just read, and repeat the process.
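
    The chunk-and-carry-over idea might look something like the sketch below. The sub name each_sentence, the callback interface and the tiny chunk size are assumptions chosen for illustration; a real program would use a much larger buffer.

```perl
use strict;
use warnings;

# Read $fh in fixed-size chunks and call $callback for each complete
# sentence (text up to and including a period). The unfinished tail of
# each chunk is kept in $pending and joined to the next chunk.
sub each_sentence {
    my ( $fh, $chunk_size, $callback ) = @_;
    my $pending = '';
    while ( read( $fh, my $chunk, $chunk_size ) ) {
        $pending .= $chunk;
        # Peel complete sentences off the front of the buffer.
        while ( $pending =~ s/^\s*([^.]*\.)// ) {
            my $sentence = $1;
            $callback->($sentence);
        }
    }
    # Whatever is left has no closing period; hand it over if non-blank.
    $pending =~ s/^\s+//;
    $callback->($pending) if length $pending;
    return;
}

# Demo on an in-memory filehandle with a deliberately small chunk size.
my $text = 'Cras commodo. Nullam quis augue. Nullam a orci. Proin ac massa.';
open( my $fh, '<', \$text ) or die "Cannot open in-memory string: $!";
each_sentence( $fh, 16, sub {
    my ($sentence) = @_;
    print "$sentence\n"
        if $sentence =~ /^Nullam\b[^.]*\b(?:augue|libero)\.$/;
} );
close $fh;
```

    The sentence "Nullam quis augue." straddles two 16-byte chunks in this demo, so it is only recognised once the carried-over tail and the next chunk are joined.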

Re: Parse a large string
by mwah (Hermit) on Mar 10, 2009 at 12:41 UTC
    ...
    print join "\n", $text =~ /(?:Nullam|Donec) [^.]+ (?:augue|libero)/gx;
    ...

    Regards

    mwa

      I'm with you on this one, mwah, but wonder what you need the x modifier for - it appears to make no difference...

      A user level that continues to overstate my experience :-))
        I'm with you on this one, mwah, but wonder what you need the x modifier for - it appears to make no difference...

        Right, it doesn't make any difference here by accident or simplicity of the problem ;-)

        The /x is meant to mark whitespace in regexes as "not meant as part of the expression".
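
        A short sketch of where it does make a difference. The sub names and the sample text are made up; the first sentence has only a single space between "Nullam" and "augue", so literal spaces in the pattern can no longer be satisfied.

```perl
use strict;
use warnings;

# With /x the spaces in the pattern are ignored; they only aid readability.
sub with_x {
    my ($text) = @_;
    return $text =~ /(?:Nullam|Donec) [^.]+ (?:augue|libero)/gx;
}

# Without /x the same spaces are literal and must be present in the text.
sub without_x {
    my ($text) = @_;
    return $text =~ /(?:Nullam|Donec) [^.]+ (?:augue|libero)/g;
}

my $text = 'Nullam augue. Donec sit amet libero.';
print "with /x:    $_\n" for with_x($text);
print "without /x: $_\n" for without_x($text);
```

        With /x both sentences match. Without it only "Donec sit amet libero" does, because "Nullam augue" has no room for a space, at least one more character, and another space between the two words.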

        Thanks & Regards

        mwa

Node Type: perlquestion [id://749560]
Approved by Bloodnok