
Extract Paragraph From Text

by perlbeginneraaa (Novice)
on Sep 08, 2015 at 06:39 UTC ( [id://1141307]=perlquestion )

perlbeginneraaa has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys,

I am trying to use Perl to extract paragraphs from a text. However, the code does not produce the results I expect. Here is the code I wrote:

my $string = <<'TEXT';
Assembly and Manufacturing

The Company's assembly and manufacturing operations include PCB assembly
and the manufacture of subsystems and complete products. Its PCB assembly
activities primarily consist of the placement and attachment of electronic
and mechanical components on printed circuit boards using both SMT and
traditional pin-through-hole ("PTH") technology. The Company also assembles
subsystems and systems incorporating PCBs and complex electromechanical
components, and, increasingly, manufactures and packages final products for
shipment directly to the customer or its distribution channels. The Company
employs just-in-time, ship-to-stock and ship-to-line programs, continuous
flow manufacturing, demand flow processes and statistical process control.
The Company has expanded the number of production lines for finished
product assembly, burn-in and test to meet growing demand and increased
customer requirements. In addition, the Company has invested in FICO, a
producer of injection molded plastic for Asia electronics companies with
facilities in Shenzhen, China.

As OEMs seek to provide greater functionality in smaller products, they
increasingly require advanced manufacturing technologies and processes.
Most of the Company's PCB assembly involves the use of SMT, which is the
leading electronics assembly technique for more sophisticated products. SMT
is a computer-automated process which permits attachment of components
directly on both sides of a PCB. As a result, it allows higher integration
of electronic components, offering smaller size, lower cost and higher
reliability than traditional manufacturing processes. By allowing
increasingly complex circuits to be packaged with the components placed in
closer proximity to each other, SMT greatly enhances circuit processing
speed, and therefore board and system performance. The Company also
provides traditional PTH electronics assembly using PCBs and leaded
components for lower cost products.;
TEXT

local $/ = "";
open my ($str_fh), '<', \$string;
while ( <$str_fh> ) {
    print "New Paragraph: $_\n","*" x 40, "\n" ;
}
close $str_fh;

The text is part of the company's annual report, which is available at https://www.sec.gov/Archives/edgar/data/32272/0000950147-97-000151.txt.

I expect the code returns the paragraphs, however, I got the whole text back. I am quite confused with these errors.

Moreover, is it possible to still get the paragraphs separately even if the current "blank" lines do not count as paragraph separators? Could anyone help me with this issue?

Thanks so much!!! Best Regards

Replies are listed 'Best First'.
Re: Extract Paragraph From Text
by kcott (Archbishop) on Sep 08, 2015 at 07:20 UTC

    G'day perlbeginneraaa,

    Welcome to the Monastery.

    "I expect the code returns the paragraphs, however, I got the whole text back."

    You need to show us exactly what output you got. I ran your code without any problems. Here's the (cut-down) output I got:

    New Paragraph: Assembly and Manufacturing
    ****************************************
    New Paragraph: The Company's assembly and manufacturing operations include PCB assembly and the manufacture of subsystems and complete products. Its PCB assembly
    ...
    Company has invested in FICO, a producer of injection molded plastic for Asia electronics companies with facilities in Shenzhen, China.
    ****************************************
    New Paragraph: As OEMs seek to provide greater functionality in smaller products, they increasingly require advanced manufacturing technologies and processes. Most of
    ...
    performance. The Company also provides traditional PTH electronics assembly using PCBs and leaded components for lower cost products.;
    ****************************************

    "I am quite confused with these errors."

    You don't show any errors. I added this to the start of your code:

    use strict;
    use warnings;
    use autodie;

    No errors or warnings were emitted.

    — Ken

      Hi Ken, Thanks much for the reply! I expect exactly what you got here, that is, the code prints the paragraphs separately. However, when I execute the perl codes on my computer, it returns the whole text back. By errors, I mean the output is different from what I expected. Sorry for the confusion. Best Regards

        In the code posted, the record separator ($/) is set to "". The paragraphs are separated by a blank line, i.e. a line containing no characters between its start and end. The code works as posted.

        If you are using the same code and your result is a single block of text, then the paragraphs are not separated by empty lines, on your test setup.

        Are you testing with the exact code you posted here? Or is the input data in a file, and you copied it into your code to post here? What happens if you copy and run the code from your OP, using copy/paste from the raw ("download") link?

        Bottom line: the text you are processing must have its paragraphs separated by something other than a blank line.
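To make that point concrete, here is a minimal sketch (not the OP's actual data) of how paragraph mode treats lines that only look blank. The helper name `paragraphs_in_paragraph_mode` and both sample strings are made up for illustration; the only assumption is that a stray space sits on the separator lines.

```perl
use strict;
use warnings;

# Read a string in paragraph mode ($/ = "") and return the records.
# Hypothetical helper, named for this example only.
sub paragraphs_in_paragraph_mode {
    my ($text) = @_;
    local $/ = "";    # paragraph mode: records end at runs of empty lines
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Made-up sample data: the same three paragraphs, separated two ways.
my $truly_blank  = "Heading\n\nFirst paragraph.\n\nSecond paragraph.\n";
my $space_filled = "Heading\n \nFirst paragraph.\n \nSecond paragraph.\n";

printf "empty separator lines:        %d record(s)\n",
    scalar paragraphs_in_paragraph_mode($truly_blank);
printf "separator lines with a space: %d record(s)\n",
    scalar paragraphs_in_paragraph_mode($space_filled);
```

The first string comes back as three records and the second as a single record, because a line holding even one space is not empty as far as `$/ = ""` is concerned. That would match the "whole text back" symptom, though it is only one possible cause.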

        The way forward always starts with a minimal test.

        "... when I execute the perl codes on my computer, it returns the whole text back."

        What Perl codes? Not because I doubted kcott but just to be able to say I did so, I copied out the code of the OP and only added the use statements kcott did, and I got the same output without warnings or errors. Are you saying that you can do the same and get a different output? If so, how does it differ? (It's not enough just to say "It's not what I want.")

        Offhand, I cannot think of any environment variable or OS peculiarity that would cause the originally posted code to differ in its behavior. We need more info about this.


        Give a man a fish:  <%-{-{-{-<

Re: Extract Paragraph From Text
by shadowsong (Pilgrim) on Sep 08, 2015 at 08:35 UTC

    perlbeginneraaa,

    I agree with kcott's assessment; based on your question your script seems to be performing its intended function as expected - in fact, I even took it a step further to apply it to the text file you mentioned in your question.

    The result looks OK - I can't post it here as it's way too long...

    However, could you be more specific as to how its result fails to meet your expectation(s)?

      Hi shadowsong, Thanks for reply! As I replied to Ken, I expect exactly what you got here, that is, the code prints the paragraphs separately. However, when I execute the perl codes on my computer, it returns the whole text back. I am not sure why this happens......
Re: Extract Paragraph From Text
by 2teez (Vicar) on Sep 08, 2015 at 07:31 UTC
      Hi 2teez, Thanks for reply! As I replied to Ken and shadowsong, I expect exactly what you got here, that is, the code prints the paragraphs separately. However, when I execute the perl codes on my computer, it returns the whole text back. I am not sure why this happens......
Re: Extract Paragraph From Text
by CountZero (Bishop) on Sep 08, 2015 at 21:17 UTC
    For what it's worth, that file uses a single LF as its end-of-line character. If you are running your script on Windows, it might get confused, as Windows expects an empty line to be a single CR+LF combination.
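Whichever convention the data turns out to use, one way to rule line endings out is to normalize them before splitting. The sketch below is an illustration under assumptions, not a diagnosis of the OP's setup: `normalize_newlines` and `count_paragraphs` are hypothetical helpers, and the CR+LF sample string is invented. It shows that when carriage returns reach Perl intact, a "blank" line is really "\r\n" and paragraph mode no longer sees two consecutive newlines.

```perl
use strict;
use warnings;

# Convert CR+LF and bare CR line endings to LF.
# Hypothetical helper for this example.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Count the records that paragraph mode finds in a string.
sub count_paragraphs {
    my ($text) = @_;
    local $/ = "";
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my $count = () = <$fh>;    # list-context read, then count the records
    close $fh;
    return $count;
}

# Made-up sample with CR+LF endings throughout.
my $crlf_text = "Heading\r\n\r\nFirst paragraph.\r\n\r\nSecond paragraph.\r\n";

printf "as-is:      %d record(s)\n", count_paragraphs($crlf_text);
printf "normalized: %d record(s)\n", count_paragraphs(normalize_newlines($crlf_text));
```

With the carriage returns left in place the sample reads as one record; after normalization it reads as three.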

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Extract Paragraph From Text
by sundialsvc4 (Abbot) on Sep 08, 2015 at 12:22 UTC

    Careful, careful ... if this text was simply copy-and-pasted into a <<HereDoc, perhaps the text does not in fact contain the expected end-of-line characters.   This could therefore be something as simple (and, not-reproducible, when we try it ourselves ...) as having the wrong record separator specified to Perl.   Hard to spot, easy to fix.

    What I would do, first, is to look at the Perl source-file with a tool such as hexdump which can display the binary content of the file side-by-side with the characters.   Look, within the heredoc section, at how the lines and paragraphs are separated.   Exactly what byte sequence is used within that section?
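If a hex-dump tool is not at hand, a rough equivalent can be sketched in Perl itself. The two helpers below, `count_line_endings` and `show_control_bytes`, are hypothetical names invented for this example, and the sample string is made up; in practice the bytes would come from the script or data file opened with the `:raw` layer.

```perl
use strict;
use warnings;

# Tally the line-ending byte sequences found in a string of raw bytes.
sub count_line_endings {
    my ($bytes) = @_;
    my %count = (CRLF => 0, LF => 0, CR => 0);
    while ($bytes =~ /(\r\n|\n|\r)/g) {
        my $key = $1 eq "\r\n" ? 'CRLF' : $1 eq "\n" ? 'LF' : 'CR';
        $count{$key}++;
    }
    return %count;
}

# Render control bytes visibly, so "a\r\n" is shown as "a<0D><0A>".
sub show_control_bytes {
    my ($bytes) = @_;
    (my $visible = $bytes) =~ s/([\x00-\x1F\x7F])/sprintf '<%02X>', ord $1/ge;
    return $visible;
}

# Made-up sample mixing CR+LF endings with a space-only "blank" line.
my $sample = "Heading\r\n\r\nFirst paragraph.\n \nSecond paragraph.\n";

my %count = count_line_endings($sample);
print "$_: $count{$_}\n" for sort keys %count;
print show_control_bytes($sample), "\n";
```

The tally shows which convention dominates, and the rendered string makes stray carriage returns or spaces on "blank" lines visible at a glance.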

    Further confusion can be introduced if you retrieve the file from some source, and, in handling it (e.g. to put it into a heredoc), you inadvertently mess-up the sequence or introduce more, conflicting bytes.

    For this reason, it might be advantageous to simply read the source-file directly, instead of attempting to embed it into the code.   (Which, I understand, might have been done here for the sake of example ...)

      Hi sundialsvc4, Thanks for your reply! I will look into that and check what paragraph separators the text uses. Maybe it is that in the text the paragraph separator is not a blank line, so I got the unexpected output. I am not sure about this...

        What I would expect is that text such as this might not contain any “end-of-line” character sequences at all.   Instead, the rendering engine would pour the text into the graphic container, line-by-line according to the size of the container and the selected font/font-size ... both of which presumably could change.   The only trustworthy “end-of-something” marker would be “end of paragraph,” but what might that be?   Who knows.

        In this situation, I would suggest two specific things:

        1. Get the information directly from the original source file, and do it in binary mode.   (In other words, don’t tell Perl to expect record-separators of any sort.   All you want Perl to do, is to read exactly the bytes that are there, exactly as they are.   And, you really need to read the entire file at once ... slurp!)
        2. Before writing the code to do that, look at the original source file with the hex-editor as previously discussed, to see what is actually there and what might reasonably be relied-upon.
        Don’t attempt to copy-and-paste into Perl source code:   you have no idea what your text-editor might actually do.   (And anything it might do, would only muddy the waters further.)
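The two suggestions above can be sketched roughly as follows. This is one possible approach under stated assumptions, not the only way to do it: `slurp_raw` and `split_paragraphs` are hypothetical helper names, the separator rule (two or more line breaks, with optional spaces or tabs on the in-between lines) is a guess that should be checked against what the hex dump actually shows, and `report.txt` is a placeholder file name.

```perl
use strict;
use warnings;

# Read a whole file as bytes, with no newline translation.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # undefined record separator: read everything at once
    my $raw = <$fh>;
    close $fh;
    return $raw;
}

# Split raw text on any run of two or more line breaks, tolerating
# spaces or tabs on the "blank" lines and LF, CR+LF, or bare CR endings.
# The atomic group (?>...) keeps a CR+LF pair from counting as two breaks.
sub split_paragraphs {
    my ($raw) = @_;
    my @pieces = split /(?:[ \t]*(?>\r\n|\n|\r)){2,}/, $raw;
    return grep { /\S/ } @pieces;    # drop empty or whitespace-only pieces
}

# Demonstration on a made-up string; for a real file one would call
# split_paragraphs(slurp_raw('report.txt')) with the actual path.
my $sample = "Heading\r\n \r\nFirst line\r\nsecond line\n\n\nLast paragraph\n";
print "New Paragraph: $_\n", "*" x 40, "\n" for split_paragraphs($sample);
```

Because the split is done with a regular expression instead of `$/`, the definition of "paragraph separator" can be adjusted to whatever the file really contains.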

        Perl is an extremely powerful data-extraction tool that can most certainly do whatever-it-is that you determine needs to be done.   So, please follow-up in this thread and tell us what you’ve found.   We’ll be happy to then help you further.
