Extract Paragraph From Text

perlbeginneraaa has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys,

I am trying to use Perl to extract paragraphs from a text. However, the code does not generate the results I expect. Here is the code I wrote:

my $string = <<'TEXT';
     Assembly and Manufacturing

     The Company's assembly and manufacturing operations include PCB assembly
and the manufacture of subsystems and complete products. Its PCB assembly
activities primarily consist of the placement and attachment of electronic and
mechanical components on printed circuit boards using both SMT and traditional
pin-through-hole ("PTH") technology. The Company also assembles subsystems and
systems incorporating PCBs and complex electromechanical components, and,
increasingly, manufactures and packages final products for shipment directly to
the customer or its distribution channels. The Company employs just-in-time,
ship-to-stock and ship-to-line programs, continuous flow manufacturing, demand
flow processes and statistical process control. The Company has expanded the
number of production lines for finished product assembly, burn-in and test to
meet growing demand and increased customer requirements. In addition, the
Company has invested in FICO, a producer of injection molded plastic for Asia
electronics companies with facilities in Shenzhen, China.

     As OEMs seek to provide greater functionality in smaller products, they
increasingly require advanced manufacturing technologies and processes. Most of
the Company's PCB assembly involves the use of SMT, which is the leading
electronics assembly technique for more sophisticated products. SMT is a
computer-automated process which permits attachment of components directly on
both sides of a PCB. As a result, it allows higher integration of electronic
components, offering smaller size, lower cost and higher reliability than
traditional manufacturing processes. By allowing increasingly complex circuits
to be packaged with the components placed in closer proximity to each other, SMT
greatly enhances circuit processing speed, and therefore board and system
performance. The Company also provides traditional PTH electronics assembly
using PCBs and leaded components for lower cost products.;
TEXT

local $/ = "";
open my ($str_fh), '<', \$string;
while ( <$str_fh> ) {
     print "New Paragraph: $_\n","*" x 40, "\n" ;
}
close $str_fh;

The text is part of the company's annual report, which is available at https://www.sec.gov/Archives/edgar/data/32272/0000950147-97-000151.txt.

I expect the code returns the paragraphs, however, I got the whole text back. I am quite confused with these errors.

Moreover, is it still possible to get the paragraphs separately even if the current "blank" lines do not count as paragraph separators? Would anyone help me with this issue?

Thanks so much!!! Best Regards

Re: Extract Paragraph From Text by kcott (Archbishop) on Sep 08, 2015 at 07:20 UTC
G'day perlbeginneraaa,

Welcome to the Monastery.

"I expect the code returns the paragraphs, however, I got the whole text back."

You need to show us exactly what output you got. I ran your code without any problems. Here's the (cut-down) output I got:

New Paragraph:      Assembly and Manufacturing
****************************************
New Paragraph:      The Company's assembly and manufacturing operations include PCB assembly
and the manufacture of subsystems and complete products. Its PCB assembly
...
Company has invested in FICO, a producer of injection molded plastic for Asia
electronics companies with facilities in Shenzhen, China.
****************************************
New Paragraph:      As OEMs seek to provide greater functionality in smaller products, they
increasingly require advanced manufacturing technologies and processes. Most of
...
performance. The Company also provides traditional PTH electronics assembly
using PCBs and leaded components for lower cost products.;
****************************************

"I am quite confused with these errors."

You don't show any errors. I added this to the start of your code: `use strict; use warnings; use autodie;` No errors or warnings were emitted.

-- Ken
Re^2: Extract Paragraph From Text by perlbeginneraaa (Novice) on Sep 08, 2015 at 15:18 UTC
Hi Ken, Thanks very much for the reply! I expect exactly what you got here, that is, the code prints the paragraphs separately. However, when I execute the perl codes on my computer, it returns the whole text back. By errors, I mean the output is different from what I expected. Sorry for the confusion. Best Regards
Re^3: Extract Paragraph From Text by 1nickt (Canon) on Sep 08, 2015 at 15:28 UTC
In the code posted, the record separator (`$/`) is set to `""`, which puts Perl into paragraph mode: records are separated by one or more blank lines, i.e. lines containing no characters at all between their start and end. The code works as posted. If you are using the same code and your result is a single block of text, then on your test setup the paragraphs are not separated by empty lines. Are you testing with the exact code you posted here? Or is the input data in a file, and you copied it into your code to post here? What happens if you copy and run the code from your OP, using copy/paste from the raw ("download") link? Bottom line: the text you are processing must have its paragraphs separated by something other than a blank line. The way forward always starts with a minimal test.
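To make that minimal test concrete, here is a small sketch (not from the original posts) that feeds three inputs through paragraph mode. The helper name `paragraphs_in` and the sample strings are made up for illustration. The inputs look the same on screen, but only the first one has truly empty lines between paragraphs.

```perl
use strict;
use warnings;

# Read a string in paragraph mode and return the records Perl finds in it.
# (paragraphs_in is a hypothetical helper name, not something from the thread.)
sub paragraphs_in {
    my ($text) = @_;
    local $/ = "";    # paragraph mode: records end at one or more truly empty lines
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Three made-up inputs that look identical when displayed.
my %sample = (
    'empty lines (LF only)' => "Heading\n\nFirst paragraph.\n\nSecond paragraph.\n",
    'CRLF line endings'     => "Heading\r\n\r\nFirst paragraph.\r\n\r\nSecond paragraph.\r\n",
    'whitespace-only lines' => "Heading\n \nFirst paragraph.\n\t\nSecond paragraph.\n",
);

for my $name (sort keys %sample) {
    my @records = paragraphs_in($sample{$name});
    printf "%-24s %d record(s)\n", $name, scalar @records;
}
```

On a Unix-like system I would expect three records for the first input and a single record for the other two, because a "blank" line that still holds a carriage return, a space or a tab is not empty as far as paragraph mode is concerned. That would match the "whole text comes back" symptom, although it is only one possible explanation.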
Re^4: Extract Paragraph From Text by perlbeginneraaa (Novice) on Sep 08, 2015 at 15:37 UTC
Re^5: Extract Paragraph From Text by 1nickt (Canon) on Sep 08, 2015 at 15:50 UTC
Re^3: Extract Paragraph From Text by AnomalousMonk (Archbishop) on Sep 08, 2015 at 15:51 UTC
"... when I execute the perl codes on my computer, it returns the whole text back."

What Perl codes? Not because I doubted kcott but just to be able to say I did so, I copied out the code of the OP and only added the use statements kcott did, and I got the same output without warnings or errors. Are you saying that you can do the same and get a different output? If so, how does it differ? (It's not enough just to say "It's not what I want.") Offhand, I cannot think of any environment variable or OS peculiarity that would cause the originally posted code to differ in its behavior. We need more info about this.

Give a man a fish: `<%-{-{-{-<`
Re: Extract Paragraph From Text by shadowsong (Pilgrim) on Sep 08, 2015 at 08:35 UTC
perlbeginneraaa, I agree with kcott's assessment; based on your question, your script seems to be performing its intended function as expected. In fact, I took it a step further and applied it to the text file you mentioned in your question. The result looks OK; I can't post it here as it's way too long. However, could you be more specific as to how the result fails to meet your expectation(s)?
Re^2: Extract Paragraph From Text by perlbeginneraaa (Novice) on Sep 08, 2015 at 15:27 UTC
Hi shadowsong, Thanks for the reply! As I replied to Ken, I expect exactly what you got here, that is, the code prints the paragraphs separately. However, when I execute the perl codes on my computer, it returns the whole text back. I am not sure why this happens...
Re: Extract Paragraph From Text by 2teez (Vicar) on Sep 08, 2015 at 07:31 UTC
What output exactly are you looking for? Please check "How do I post a question effectively?" UPDATE: kcott got here before me! :)! If you tell me, I'll forget. If you show me, I'll remember. If you involve me, I'll understand. --- Author unknown to me
Re^2: Extract Paragraph From Text by perlbeginneraaa (Novice) on Sep 08, 2015 at 15:28 UTC
Hi 2teez, Thanks for the reply! As I replied to Ken and shadowsong, I expect exactly what they got, that is, the code prints the paragraphs separately. However, when I execute the perl codes on my computer, it returns the whole text back. I am not sure why this happens...
Re: Extract Paragraph From Text by CountZero (Bishop) on Sep 08, 2015 at 21:17 UTC
For what it's worth, that file has a single LF as the end-of-line character. If you are running your script on Windows, it might get confused, as Windows expects an empty line to be a single CR+LF combination. CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
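If mismatched line endings do turn out to be the culprit, one way around it is to stop relying on paragraph mode, normalise the line endings yourself, and split on lines that are empty or contain only spaces and tabs. This is just a sketch under that assumption; `split_paragraphs` is a name invented here, not an existing function.

```perl
use strict;
use warnings;

# Split text into paragraphs regardless of whether lines end in LF, CRLF or CR.
# (split_paragraphs is a hypothetical helper name.)
sub split_paragraphs {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;                         # CRLF or bare CR becomes LF
    my @paras = split /\n(?:[ \t]*\n)+/, $text;    # separator: one or more blank-looking lines
    return grep { /\S/ } @paras;                   # drop pieces with no visible content
}

# Made-up sample with CRLF endings and a separator line that holds a space.
my $text = "Heading\r\n\r\nFirst paragraph,\r\nsecond line.\r\n \r\nLast paragraph.\r\n";

print "New Paragraph: $_\n", "*" x 40, "\n" for split_paragraphs($text);
```

The design choice is to treat "blank" loosely (whitespace-only counts as blank), which is more forgiving than `$/ = ""` but requires the whole text to be in memory.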
Re: Extract Paragraph From Text by sundialsvc4 (Abbot) on Sep 08, 2015 at 12:22 UTC
Careful, careful ... if this text was simply copy-and-pasted into a `<<HereDoc`, perhaps the text does not in fact contain the expected end-of-line characters. This could therefore be something as simple (and not reproducible when we try it ourselves ...) as having the wrong record separator specified to Perl. Hard to spot, easy to fix. What I would do, first, is to look at the Perl source file with a tool such as `hexdump`, which can display the binary content of the file side-by-side with the characters. Look, within the heredoc section, at how the lines and paragraphs are separated, and note exactly what byte sequence is used within that section. Further confusion can be introduced if you retrieve the file from some source and, in handling it (e.g. to put it into a heredoc), you inadvertently mess up the sequence or introduce more, conflicting bytes. For this reason, it might be advantageous to simply read the source file directly, instead of attempting to embed it into the code. (Which, I understand, might have been done here for the sake of example ...)
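If `hexdump` is not at hand (on Windows, say), roughly the same check can be done in Perl itself. The sketch below is only an illustration: `count_line_endings` and `show_bytes` are names made up here, and the file to inspect is whatever path you pass on the command line.

```perl
use strict;
use warnings;

# Count each kind of line terminator in a chunk of raw bytes.
# (Both helper names below are hypothetical, chosen for this example.)
sub count_line_endings {
    my ($bytes) = @_;
    my %count = (CRLF => 0, LF => 0, CR => 0);
    while ($bytes =~ /(\r\n|\n|\r)/g) {
        my $key = $1 eq "\r\n" ? 'CRLF' : $1 eq "\n" ? 'LF' : 'CR';
        $count{$key}++;
    }
    return %count;
}

# Make invisible bytes visible, so a CRLF pair is printed as \x0d\x0a.
sub show_bytes {
    my ($bytes) = @_;
    (my $shown = $bytes) =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    return $shown;
}

# Usage: perl inspect.pl path/to/file   (does nothing if no path is given)
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the file with no translation
    close $fh;
    my %count = count_line_endings($bytes);
    print "$_: $count{$_}\n" for sort keys %count;
    print show_bytes(substr($bytes, 0, 200)), "\n";    # first 200 bytes, made visible
}
```

A file whose counts are all CRLF, or whose "blank" lines show up as `\x20\x0a` or `\x0d\x0a`, would explain why paragraph mode sees one big record.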
Re^2: Extract Paragraph From Text by perlbeginneraaa (Novice) on Sep 08, 2015 at 15:31 UTC
Hi sundialsvc4, Thanks for your reply! I will look into that and check what paragraph separators the text uses. Maybe the paragraph separator in the text is not a blank line, and that is why I got the unexpected output. I am not sure about this...
Re^3: Extract Paragraph From Text by sundialsvc4 (Abbot) on Sep 09, 2015 at 22:11 UTC
What I would expect is that text such as this might not contain any “end-of-line” character sequences at all. Instead, the rendering engine would pour the text into the graphic container, line-by-line according to the size of the container and the selected font/font-size ... both of which presumably could change. The only trustworthy “end-of-something” marker would be “end of paragraph,” but what might that be? Who knows.

In this situation, I would suggest two specific things:

(1) Get the information directly from the original source file, and do it in binary mode. (In other words, don’t tell Perl to expect record-separators of any sort. All you want Perl to do is to read exactly the bytes that are there, exactly as they are. And, you really need to read the entire file at once ... slurp!)

(2) Before writing the code to do that, look at the original source file with the hex-editor as previously discussed, to see what is actually there and what might reasonably be relied upon.

Don’t attempt to copy-and-paste into Perl source code: you have no idea what your text-editor might actually do. (And anything it might do would only muddy the waters further.) Perl is an extremely powerful data-extraction tool that can most certainly do whatever-it-is that you determine needs to be done. So, please follow up in this thread and tell us what you’ve found. We’ll be happy to then help you further.
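On the OP's second question (getting paragraphs when blank lines cannot be trusted), one marker that is visible in the posted sample is indentation: each paragraph starts with an indented line and continues flush-left. Assuming that holds for the rest of the text, which has not been verified here, a sketch could look like this. `paragraphs_by_indent` is an invented name and the sample is shortened from the original post.

```perl
use strict;
use warnings;

# Split text into paragraphs, treating an indented line as the start of a new one.
# (paragraphs_by_indent is a hypothetical helper; the indentation rule is an assumption.)
sub paragraphs_by_indent {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;                 # normalise CRLF or CR to LF first
    my @paras;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;         # ignore empty and whitespace-only lines
        if ($line =~ /^[ \t]+\S/ or !@paras) {
            push @paras, $line;            # indented line (or very first line): new paragraph
        }
        else {
            $paras[-1] .= "\n$line";       # flush-left line: continuation of the current one
        }
    }
    return @paras;
}

# Shortened sample with no blank lines at all between paragraphs.
my $sample = join "\n",
    "     Assembly and Manufacturing",
    "     The Company's assembly and manufacturing operations include PCB assembly",
    "and the manufacture of subsystems and complete products.",
    "     As OEMs seek to provide greater functionality in smaller products, they",
    "increasingly require advanced manufacturing technologies and processes.",
    "";

print "New Paragraph: $_\n", "*" x 40, "\n" for paragraphs_by_indent($sample);
```

For a real file, the text would first be slurped in raw mode as suggested above and then passed to the sub. Whether indentation is a reliable marker for the whole document is something only a look at the actual bytes can confirm.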

