PerlMonks  

Extracting a chapter from text file

by jwkuo87 (Initiate)
on Apr 15, 2014 at 14:28 UTC ( [id://1082347]=perlquestion )

jwkuo87 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone. I'm fairly new to Perl and am trying to extract a specific chapter from a text file. In the example below, I would like to retrieve the text from Chapter 2.

Table of Contents
Chapter 1. Introduction
Chapter 2. Main
Chapter 3. Conclusion
==============================
Chapter 1. Introduction
This is the introduction preceding Chapter 2.
Chapter 2. Main
This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000.
Chapter 3. Conclusion
This is the conclusion.

The Perl script should extract "This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000." from the file and write the output to a new file. Unfortunately, the code below only gives me the first matches, i.e. the text from the table of contents.

#!/usr/bin/perl -w
#use strict;

my $startstring='Chapter\s2\.\sMain';
my $endstring='Chapter\s3\.\sConclusion';

{
    local $/;
    open (SLURP, "C:\\Text\\1.txt") or die $!;
    $data = <SLURP>;
    close SLURP or die $!;
    {
        @finds=$data=~m/($startstring.*?$endstring)/ismo;
    }
    open my $OUTFILE, ">", "C:\\Text\\Chapter2\\1.txt" or die $!;
    print $OUTFILE "@finds";
    close $OUTFILE;
}

Is there a way to refine my search so that it works the way I would like? For example, by adding a rule that the start string must be skipped if the preceding 5 strings contain "Chapter 1. Introduction", and/or that the output must contain at least 100 words? Thanks in advance! :)

Replies are listed 'Best First'.
Re: Extracting a chapter from text file
by davido (Cardinal) on Apr 15, 2014 at 17:01 UTC

    I haven't seen a solution that uses the .. range operator, discussed in perlop. And I think this is the sort of thing it was designed for. Here's an example:

    while( <DATA> ) {
        my $in_toc = 1 .. /^={5,}/;
        if( !$in_toc && /^Chapter 2\./ .. /^Chapter 3\./ ) {
            next if /^Chapter \d+\./;
            print;
        }
    }

    __DATA__
    Table of Contents
    Chapter 1. Introduction
    Chapter 2. Main
    Chapter 3. Conclusion
    ==============================
    Chapter 1. Introduction
    This is the introduction preceding Chapter 2.
    Chapter 2. Main
    This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000.
    Chapter 3. Conclusion
    This is the conclusion.

    This produces the following output:

    This is the text contained in Chapter 2 and will contain a lot of text with at least 100 words and probably somewhere around 1000-5000.

    Substitute your own operations in place of 'print'. That's probably where you would also include your word-count check.
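
    A minimal sketch of the word-count check mentioned above, assuming that a "word" simply means a whitespace-separated token. The helper names word_count and long_enough are hypothetical, and the default of 100 is taken from the "at least 100 words" wording in the question:

```perl
use strict;
use warnings;

# Hypothetical helper: counts whitespace-separated tokens in a string.
sub word_count {
    my ($text) = @_;
    my @words = split ' ', $text;   # split ' ' also skips leading whitespace
    return scalar @words;
}

# Hypothetical threshold check. The default of 100 is an assumption
# based on the "at least 100 words" requirement in the question.
sub long_enough {
    my ($text, $min_words) = @_;
    $min_words = 100 unless defined $min_words;
    return word_count($text) >= $min_words;
}
```

    Inside the loop, one option is to push each line onto an array instead of printing it, then call long_enough on the joined text once the range has closed.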


    Dave

Re: Extracting a chapter from text file
by bigj (Monk) on Apr 15, 2014 at 14:43 UTC
    You are simply missing the g modifier on your regexp. Without it, @finds holds only the first match; with it, @finds holds all matches. Then you can just print $finds[1] and you will get what you want.
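
    To illustrate the effect of /g in list context, here is a minimal sketch. The sub name find_spans is made up for this example; the patterns are the ones from the question, precompiled with qr//, and the /m and /o flags are left out because nothing here depends on them:

```perl
use strict;
use warnings;

# Patterns taken from the question, compiled once with qr//.
my $start = qr/Chapter\s2\.\sMain/i;
my $end   = qr/Chapter\s3\.\sConclusion/i;

# Hypothetical helper: returns every non-greedy span from a start
# heading to the next end heading. /g makes the match return all
# spans in list context; /s lets '.' cross line breaks.
sub find_spans {
    my ($data) = @_;
    my @finds = $data =~ /($start.*?$end)/gs;
    return @finds;
}
```

    With the sample file from the question, the first element returned is the span from the table of contents and the second is the chapter itself, headings included. That ordering only holds when the table of contents comes before the chapter text.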

    Greetings,
    Janek Schleicher

Re: Extracting a chapter from text file
by duff (Parson) on Apr 15, 2014 at 14:48 UTC

    Here's a simple version that does what you wanted. It's not very robust, however, so you may need to tweak it.

    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;

    my $in_toc = 1;     # we start off in the Table of Contents
    my $in_ch2 = 0;
    while (<>) {
        $in_toc = 0 if /^\s*=+\s*$/;
        next if $in_toc;
        $in_ch2 = 0 if /^Chapter 3/;
        say if $in_ch2;
        $in_ch2 = 1 if /^Chapter 2/;
    }

    You'd run it like so:

    programname textfile

    The above doesn't use your "rules", but if you wanted to, you could use a hash to keep track of the chapters from the table of contents and look for those exact chapter titles in the text and/or use length to check the length of the string.
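
    The hash idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the sub name chapter_body is hypothetical, the table of contents is assumed to come first and to end at a line of '=' characters as in the sample, and chapter titles in the text are assumed to match the table of contents exactly:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the body of chapter $wanted as a string.
# Assumes the TOC comes first and ends at a line of '=' characters.
sub chapter_body {
    my ($text, $wanted) = @_;
    my %number_for;      # exact title line from the TOC => chapter number
    my $in_toc = 1;
    my $current;         # number of the chapter we are currently inside
    my @body;

    for my $line (split /\n/, $text) {
        if ($in_toc) {
            if ($line =~ /^\s*=+\s*$/) {
                $in_toc = 0;                 # separator reached: TOC is over
            }
            elsif ($line =~ /^Chapter (\d+)\./) {
                $number_for{$line} = $1;     # remember the exact title
            }
            next;
        }
        if (exists $number_for{$line}) {     # line equals a title from the TOC
            $current = $number_for{$line};
            next;
        }
        push @body, $line if defined $current && $current == $wanted;
    }
    return join "\n", @body;
}
```

    Calling length on the returned string then gives the simple size check mentioned above.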

Re: Extracting a chapter from text file
by kcott (Archbishop) on Apr 16, 2014 at 10:46 UTC

    G'day jwkuo87,

    Welcome to the monastery.

    The following technique uses paragraph instead of slurp mode (see $/ in perlvar: Variables related to filehandles). This gets around matching Chapter 2 in the TOC.

    I've also used the '..' range operator (which davido showed earlier).

    Finally, we stop reading the input file as soon as Chapter 3 is found.

    Here's the code. Note that, for testing purposes, I've added a second (dummy) paragraph to Chapter 2 and truncated the existing first paragraph.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    {
        local $/ = '';

        while (<DATA>) {
            next unless /^Chapter 2/ .. /^Chapter 3/;
            last if /^Chapter 3/;
            print;
        }
    }

    __DATA__
    Table of Contents

    Chapter 1. Introduction
    Chapter 2. Main
    Chapter 3. Conclusion
    ==============================

    Chapter 1. Introduction

    This is the introduction preceding Chapter 2.

    Chapter 2. Main

    This is the text contained in Chapter 2 and will ...

    Assume more than one paragraph.

    Chapter 3. Conclusion

    This is the conclusion.

    Output:

    Chapter 2. Main

    This is the text contained in Chapter 2 and will ...

    Assume more than one paragraph.
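
    For anyone who hasn't met paragraph mode before, this small sketch shows what it does to the input. The sub name paragraphs is hypothetical, and it reads from an in-memory filehandle only so that the example is self-contained:

```perl
use strict;
use warnings;

# Hypothetical helper: splits a string into blank-line-separated
# paragraphs by reading it in paragraph mode.
sub paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = '';              # paragraph mode: records end at blank lines
    my @paragraphs = <$fh>;     # one record per paragraph
    close $fh;
    chomp @paragraphs;          # in paragraph mode, chomp strips all trailing newlines
    return @paragraphs;
}
```

    Because a pattern such as /^Chapter 2/ (without /m) only matches at the start of each record, a "Chapter 2" line in the middle of a table-of-contents paragraph is not picked up.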

    -- Ken

Re: Extracting a chapter from text file
by jwkuo87 (Initiate) on Apr 16, 2014 at 18:59 UTC
    Thank you for the suggestions! The suggestion made by bigj is a possibility, but since the location of the table of contents is random, I would have to add a function that checks which item in the array is the largest. Also, in case it matters: I'm planning to include multithreading in the code, since it has to process about 50,000 files this way. Quality is more important to me than quantity (so a few files with missing matches are okay). Is there any difference between the approaches? I will try the other code tomorrow and report back!
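
    The "largest item in the array" check described above could look something like the sketch below. The sub name longest_match is hypothetical, and "largest" is assumed to mean the longest string by character count:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the longest string in the list,
# or undef for an empty list. On a tie, the first one seen wins.
sub longest_match {
    my @candidates = @_;
    my $best;
    for my $candidate (@candidates) {
        $best = $candidate
            if !defined $best || length($candidate) > length($best);
    }
    return $best;
}
```

    Given the array of matches from a /g regexp, longest_match(@finds) would pick the chapter text over the short table-of-contents span wherever the table of contents happens to sit.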

Node Type: perlquestion [id://1082347]
Approved by rnewsham
Front-paged by toolic