Extracting a chapter from text file

jwkuo87 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone. I'm fairly new to Perl and am trying to extract a specific chapter from a text file. In the example below, I would like to retrieve the text from Chapter 2.

Table of Contents
Chapter 1. Introduction
Chapter 2. Main
Chapter 3. Conclusion

==============================

Chapter 1. Introduction
This is the introduction preceding Chapter 2.

Chapter 2. Main
This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000.

Chapter 3. Conclusion
This is the conclusion.

The Perl script should extract "This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000." from the file and write the output to a new file. Unfortunately, the code below only gives me the first match, i.e. the text from the table of contents.

#!/usr/bin/perl -w
#use strict;

my $startstring='Chapter\s2\.\sMain';
my $endstring='Chapter\s3\.\sConclusion';

{
local $/;
open (SLURP, "C:\\Text\\1.txt") or die $!; 
$data = <SLURP>; 


close SLURP or die $!;

{
@finds=$data=~m/($startstring.*?$endstring)/ismo;
}

open my $OUTFILE, ">", "C:\\Text\\Chapter2\\1.txt" or die $!;
print $OUTFILE "@finds";
close $OUTFILE;
}

Is there a way to refine my search so it works as I would like it to? For example, by including a rule where the startstring must be skipped if the preceding 5 strings contain "Chapter 1. Introduction", and/or a rule that the output should contain at least 100 words? Thanks in advance! :)


Replies are listed 'Best First'.
Re: Extracting a chapter from text file by davido (Cardinal) on Apr 15, 2014 at 17:01 UTC
I haven't seen a solution that uses the `..` range operator, discussed in perlop, and I think this is the sort of thing it was designed for. Here's an example:

    while( <DATA> ) {
        my $in_toc = 1 .. /^={5,}/;
        if( !$in_toc && /^Chapter 2\./ .. /^Chapter 3\./ ) {
            next if /^Chapter \d+\./;
            print;
        }
    }

    __DATA__
    Table of Contents
    Chapter 1. Introduction
    Chapter 2. Main
    Chapter 3. Conclusion

    ==============================

    Chapter 1. Introduction
    This is the introduction preceding Chapter 2.

    Chapter 2. Main
    This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000.

    Chapter 3. Conclusion
    This is the conclusion.

This produces the following output:

    This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000.

Substitute your own operations in place of 'print'. That is probably where you would also include your word-count check.

Dave
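The word-count check mentioned above could look roughly like this. It is only a sketch: the sub names `word_count` and `long_enough` are made up for illustration, and the threshold of 100 is an assumption taken from the "at least 100 words" wording in the question. In the loop you would append each line to a string instead of printing it, then test that string once the chapter ends.

```perl
use strict;
use warnings;

# Assumed threshold, taken from "at least 100 words" in the question.
use constant MIN_WORDS => 100;

# Count whitespace-separated words. split ' ' ignores leading whitespace
# and treats any run of spaces, tabs, or newlines as one separator.
sub word_count {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# True if the text has at least $min words (MIN_WORDS when $min is omitted).
sub long_enough {
    my ($text, $min) = @_;
    $min = MIN_WORDS unless defined $min;
    return word_count($text) >= $min;
}

# Stand-in for the text collected from the chapter.
my $chapter = "This is the text contained in Chapter 2.\n";
printf "%d words, long enough: %s\n",
    word_count($chapter), long_enough($chapter) ? 'yes' : 'no';
```

Counting words this way treats punctuation attached to a word as part of that word, which is probably fine for a rough length check.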
Re: Extracting a chapter from text file by bigj (Monk) on Apr 15, 2014 at 14:43 UTC
You are simply missing the g modifier on your regexp. Without it, `@finds` contains only the first match; with it, all matches are captured. Then you can just print `$finds[1]` and you will get what you want.

Greetings, Janek Schleicher
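A minimal sketch of the `g` modifier suggestion, assuming the table of contents comes before the chapter text as in the sample. The sub name `find_all_chapters` and the shortened sample text are made up for illustration; the pattern is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return every span that runs from the Chapter 2 title
# to the next Chapter 3 title. /g collects all matches in list context,
# /s lets . match newlines, /i ignores case.
sub find_all_chapters {
    my ($data) = @_;
    my @finds = $data =~ /(Chapter\s2\.\sMain.*?Chapter\s3\.\sConclusion)/gis;
    return @finds;
}

# Shortened stand-in for the file from the question.
my $sample = <<'END';
Table of Contents
Chapter 1. Introduction
Chapter 2. Main
Chapter 3. Conclusion

==============================

Chapter 1. Introduction
This is the introduction preceding Chapter 2.

Chapter 2. Main
This is the text contained in Chapter 2.

Chapter 3. Conclusion
This is the conclusion.
END

my @finds = find_all_chapters($sample);
print scalar(@finds), " matches\n";   # the TOC entry and the real chapter
print $finds[1], "\n";                # the second match is the chapter itself
```

Note that each match still includes both chapter titles, because they sit inside the capturing parentheses.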
Re: Extracting a chapter from text file by duff (Parson) on Apr 15, 2014 at 14:48 UTC
Here's a simple version that does what you wanted. It's not very robust, however, so you may need to tweak it.

    #!/usr/bin/env perl
    use 5.010;
    use strict;
    use warnings;

    my $in_toc = 1;     # we start off in the Table of Contents
    my $in_ch2 = 0;
    while (<>) {
        $in_toc = 0 if /^\s*=+\s*$/;
        next if $in_toc;
        $in_ch2 = 0 if /^Chapter 3/;
        say if $in_ch2;
        $in_ch2 = 1 if /^Chapter 2/;
    }

You'd run it like so:

    programname textfile

The above doesn't use your "rules", but if you wanted to, you could use a hash to keep track of the chapters from the table of contents and look for those exact chapter titles in the text, and/or use length to check the length of the string.

duff
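The hash idea could be sketched along these lines. This is only an illustration under two assumptions: the table of contents comes first and ends at a line of `=` characters, and the headings in the text are spelled exactly as in the table of contents. The sub name `chapter_body` is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-blank lines of the chapter whose
# title is exactly $wanted_title.
sub chapter_body {
    my ($wanted_title, @lines) = @_;
    my %toc;             # exact chapter titles listed in the table of contents
    my $in_toc    = 1;   # we start off in the table of contents
    my $in_wanted = 0;
    my @body;
    for my $line (@lines) {
        (my $text = $line) =~ s/\s+\z//;      # drop trailing whitespace/newline
        if ($in_toc) {
            if    ($text =~ /^=+$/)           { $in_toc = 0 }
            elsif ($text =~ /^Chapter \d+\./) { $toc{$text} = 1 }
            next;
        }
        if ($toc{$text}) {                    # a title from the TOC marks a heading
            $in_wanted = $text eq $wanted_title;
            next;
        }
        push @body, $text if $in_wanted && length $text;
    }
    return @body;
}

# Shortened stand-in for the file from the question, split into lines.
my @lines = split /^/, <<'END';
Table of Contents
Chapter 1. Introduction
Chapter 2. Main
Chapter 3. Conclusion

==============================

Chapter 1. Introduction
This is the introduction preceding Chapter 2.

Chapter 2. Main
This is the text contained in Chapter 2.

Chapter 3. Conclusion
This is the conclusion.
END

print "$_\n" for chapter_body('Chapter 2. Main', @lines);
```

Because headings are recognised by exact title, a body line that merely starts with "Chapter 3" does not end the chapter early.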
Re: Extracting a chapter from text file by kcott (Archbishop) on Apr 16, 2014 at 10:46 UTC
G'day jwkuo87,

Welcome to the monastery.

The following technique uses paragraph instead of slurp mode (see `$/` in perlvar: Variables related to filehandles). This gets around matching `Chapter 2` in the TOC. I've also used the '`..`' range operator (which davido showed earlier). Finally, we stop reading the input file as soon as `Chapter 3` is found.

Here's the code. Note that, for testing purposes, I've added a second (dummy) paragraph to Chapter 2 and truncated the existing first paragraph.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    {
        local $/ = '';

        while (<DATA>) {
            next unless /^Chapter 2/ .. /^Chapter 3/;
            last if /^Chapter 3/;
            print;
        }
    }

    __DATA__
    Table of Contents
    Chapter 1. Introduction
    Chapter 2. Main
    Chapter 3. Conclusion

    ==============================

    Chapter 1. Introduction
    This is the introduction preceding Chapter 2.

    Chapter 2. Main
    This is the text contained in Chapter 2 and will ...

    Assume more than one paragraph.

    Chapter 3. Conclusion
    This is the conclusion.

Output:

    Chapter 2. Main
    This is the text contained in Chapter 2 and will ...

    Assume more than one paragraph.

-- Ken
Re: Extracting a chapter from text file by jwkuo87 (Initiate) on Apr 16, 2014 at 18:59 UTC
Thank you for the suggestions! The suggestion made by bigj is a possibility, but since the location of the table of contents is random, I would have to add a function that checks which item in the array is the largest. Also, in case it matters, I'm planning to include multithreading in the code, since it has to process about 50,000 files this way. Quality is more important to me than quantity (so a few files with missing matches are okay). Is there any difference between the approaches? I will try the other code tomorrow and report back!
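The check for the largest item in the array could be as small as the sketch below. It assumes the real chapter is always longer than the table-of-contents match, and the sub name `longest_match` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the longest string in a list
# (the empty string if the list is empty).
sub longest_match {
    my @finds = @_;
    my $best = '';
    for my $candidate (@finds) {
        $best = $candidate if length($candidate) > length($best);
    }
    return $best;
}

# Stand-ins for what a /g match might return: the TOC entry and the chapter.
my @finds = (
    "Chapter 2. Main\nChapter 3. Conclusion",
    "Chapter 2. Main\nThis is the text contained in Chapter 2.\n\nChapter 3. Conclusion",
);
print longest_match(@finds), "\n";
```

This picks the chapter regardless of whether the table of contents appears before or after it in the file.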

