PerlMonks  

Re: Sort text by Chapter names

by jimpudar (Pilgrim)
on May 24, 2018 at 06:53 UTC ( [id://1215134]=note )


in reply to Sort text by Chapter names

This won't make a good homework answer, but it may help in a real-world scenario where it is not viable to read the entire book into memory.

I do realize that RAM is so cheap these days it would need to be one hell of a book to warrant this kind of treatment :D

#! /usr/bin/env perl

use strict;
use warnings;
use autodie;

# This solution will avoid reading the entire
# file into memory. First we will separate the
# input into one file per chapter.

# This will become a filehandle in a bit
my $fh;

# Create a directory for the chapter output
mkdir 'chapters';

# Keep a list of all chapter names
my @chapter_names;

# Diamond operator reads input line by line.
# This can be a file or even STDIN
while (<>) {

    if (/^Chapter/) {

        close $fh if $fh;

        # Open a new file.
        # The filename is the chapter name.
        chomp;
        open $fh, '>', "chapters/$_";
        push @chapter_names, $_;
        next;
    }

    # Skip any text before the first chapter heading;
    # there is no open file to write it to yet.
    next unless $fh;

    print $fh $_;
}

# Only close if at least one chapter was found
close $fh if $fh;

# Now we can sort the chapter names and
# read them line by line, printing on STDOUT
foreach my $chapter_name (sort @chapter_names) {

    open $fh, '<', "chapters/$chapter_name";
    print "$chapter_name\n";

    while (<$fh>) {
        print $_;
    }

    close $fh;
    unlink "chapters/$chapter_name";
}

rmdir 'chapters';

Best,

Jim

Replies are listed 'Best First'.
Re^2: Sort text by Chapter names
by Anonymous Monk on May 28, 2018 at 16:06 UTC
    Another alternative that I often see used here is to calculate the position and length of each unit within the file as you scan through it looking for markers. Put the marker name, position, and length into an array of hashes, then sort the array by name using a custom sort function. Retrieve each chapter directly from the original file by seeking to the proper position and reading the calculated number of bytes. (If a chapter might itself be huge, read and write it in chunks no larger than a manageable buffer size.)
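A minimal sketch of the approach described above, for an ordinary file rather than DATA. The helper names (`index_chapters`, `copy_range`, `print_sorted`) and the 4096-byte buffer are assumptions made for illustration, not anything from the thread. The file is opened with the `:raw` layer so that `length` and `tell` both count bytes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: scan the file once and record where each chapter
# starts and how many bytes it spans (heading line included).
sub index_chapters {
    my ($fh) = @_;
    my @index;
    while (1) {
        my $pos  = tell $fh;    # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ /^Chapter/) {
            (my $name = $line) =~ s/\R\z//;
            push @index, { name => $name, pos => $pos, len => 0 };
        }
        # Lines before the first heading belong to no chapter and are skipped.
        $index[-1]{len} += length $line if @index;
    }
    return @index;
}

# Hypothetical helper: copy $len bytes starting at $pos,
# reading at most $max bytes at a time.
sub copy_range {
    my ($in, $out, $pos, $len, $max) = @_;
    seek $in, $pos, 0 or die "seek failed: $!";
    while ($len > 0) {
        my $want = $len < $max ? $len : $max;
        my $got  = read $in, my $buffer, $want;
        die "read failed: $!" unless defined $got;
        last if $got == 0;      # unexpected end of file
        print {$out} $buffer;
        $len -= $got;
    }
}

# Index the file, sort the index by chapter name, then copy each
# chapter from its recorded position to the output handle.
sub print_sorted {
    my ($path, $out, $max) = @_;
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    my @index = index_chapters($in);
    # The custom sort function; swap in any other comparison here.
    for my $chapter (sort { $a->{name} cmp $b->{name} } @index) {
        copy_range($in, $out, @{$chapter}{qw(pos len)}, $max);
    }
    close $in;
}

print_sorted($ARGV[0], \*STDOUT, 4096) if @ARGV;
```

Run it as `perl sort_chapters.pl book.txt` (both file names are placeholders). Only the index is held in memory, so the cost grows with the number of chapters rather than the size of the book, and no temporary files are written.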

      For the fun of it ( and also to show you can seek on DATA )

      #!/usr/bin/perl

      # http://perlmonks.org/?node_id=1215128

      use strict;
      use warnings;

      my %chapters;
      my $previous = undef;
      my $buffer;
      my $max = 4096;

      while(<DATA>) {
          if( /^Chapter/ ) {
              $chapters{$_} = $previous = [ tell(DATA) - length, length ];
          }
          elsif( defined $previous ) {
              $previous->[1] += length;
          }
      }

      use Data::Dump 'pp';
      print pp \%chapters;
      print "\n\n";

      for ( sort keys %chapters ) {
          my ($start, $length) = $chapters{$_}->@*;
          seek DATA, $start, 0;
          while( $length > $max ) {
              read DATA, $buffer, $max;
              print $buffer;
              $length -= $max;
          }
          read DATA, $buffer, $length;
          print $buffer;
      }

      __DATA__
      Chapter One
      There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
      Chapter Nine
      This chapter has probably 1000 words.
      Chapter Two
      Here is the text in the second chapter...
      Chapter Five
      Here is the text in the fifth chapter... every chapter is of differing length, some long some short.

      This is an elegant solution. I do like it better than mine as it uses only half the disk space!

      This thread is a fantastic example of TMTOWTDI.

      Best,

      Jim
