Split and print hash based on regex

Maire has asked for the wisdom of the Perl Monks concerning the following question:

Good afternoon all,

I have a hash containing thousands of lines of text from hundreds of different files. Every time a certain phrase appears in the hash (in the SSCCE below, "This is"), I want to create a new text file containing both the phrase and all subsequent text up to the next "This is".

So, for instance, I had hoped that the script below would create six text files (named UserA_1, UserA_2 etc.) where the first file contained the text "This is line 1 from text 1 another line here which should be included in the text file with the above line.", the second file contained the text "This is line 2 from text 1", and so on.

However, although the script below creates the 6 new text files (and names them appropriately), it does not actually print anything into the files.


#!/usr/bin/perl
use strict;
use warnings;

#SSCCE:
my %mycorpus = (

            text1 => "This is line 1 from text 1
another line here which should be included in the text file with the above line.
This is line 2 from text 1
This is line 3 from text 1",
            
            
            text2 => "This is line 1 from text 2
This is line 2 from text 2
another line here which should be included in the text file with the above line.
This is line 3 from text 2",

);


my $count = 1;

foreach my $filename (sort keys %mycorpus) {
    my $outfile;

    while ($mycorpus{$filename} =~ /This is/g) {
        close $outfile if $outfile;
        open $outfile, '>', "UserA_$count.txt"
            or die "could not open";
        $count++;

        print {$outfile} $_;
    }
}

I have been working on this script for nearly a week, but I can't spot my mistake(s), and thus I would be very grateful for any help.

EDIT:

I probably should have mentioned in my original post that my code here is based on a more basic script that I use to split and print text NOT stored in a hash. This script (reproduced as an SSCCE below) works successfully and returns the desired output.

my $count = 1;

my $outfile;
while (<DATA>) {
    if ( my($regex) = /This is/g) {
        close $outfile if $outfile;
        open $outfile, '>', "UserA$1_$count.txt"
            or die "could not open 'UserA$regex.txt' $!";
        $count++;
    }
    print {$outfile} $_;
}

__DATA__
This is line 1 from text 1
another line here which should be included in the text file with the above line.
This is line 2 from text 1
This is line 3 from text 1    
This is line 1 from text 2
This is line 2 from text 2
another line here which should be included in the text file with the above line.
This is line 3 from text 2


Replies are listed 'Best First'.
Re: Split and print hash based on regex by choroba (Cardinal) on Mar 27, 2018 at 14:11 UTC
`print {$outfile} $_;`
Where do you populate $_?
Re^2: Split and print hash based on regex by Maire (Scribe) on Mar 27, 2018 at 15:42 UTC
This is a very good question, thanks! And I'm guessing from your response that this may lie at the heart of the problem? The original script that I am working with (reproduced now in an edit to my original post) used very similar syntax successfully, but I need to think about how the original script manages to populate $_ and my modified script doesn't.
Re^3: Split and print hash based on regex by choroba (Cardinal) on Mar 28, 2018 at 03:43 UTC
`while (<DATA>)` is equivalent to `while ($_ = <DATA>)`, which is interpreted as `while (defined($_ = <DATA>))`. So that's how $_ is populated in the original script.

There's another question, though: how is $1 populated? Note that the matching uses `=`, not `=~`, so it's equivalent to `my($regex) = ($_ =~ /This is/g)`, where the parentheses after `my` enforce list context on the match, but without a capture group in the regex, there's no way to populate $1.
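To make the point about list context and $1 concrete, here is a minimal sketch. The helper name `probe_match` and the second pattern with a capture group are illustrative assumptions, not part of either script in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: shows what a list-context match returns and
# whether $1 gets set, with and without a capture group.
sub probe_match {
    my ($line) = @_;

    # List context, /g, no capture group: the match returns the
    # matched text itself, but $1 stays undefined.
    my ($whole) = $line =~ /This is/g;
    my $first = $1;    # copy right away, before the next match

    # With a capture group, both the returned list and $1 hold the group.
    my ($grouped) = $line =~ /(This is \w+)/;
    my $second = $1;

    return ( $whole, $first, $grouped, $second );
}

my ( $whole, $first, $grouped, $second ) =
    probe_match("This is line 1 from text 1");

print "returned without group: $whole\n";
print "\$1 without group: ", ( defined $first ? $first : 'undef' ), "\n";
print "returned with group: $grouped\n";
print "\$1 with group: $second\n";
```

This would explain why the file name built from "UserA$1_$count.txt" in the original script only works by accident: $1 contributes nothing (and triggers an "uninitialized" warning under `use warnings`).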
Re^4: Split and print hash based on regex by Maire (Scribe) on Mar 28, 2018 at 07:20 UTC
Re^5: Split and print hash based on regex by AnomalousMonk (Archbishop) on Mar 28, 2018 at 15:09 UTC
Re: Split and print hash based on regex by AnomalousMonk (Archbishop) on Mar 27, 2018 at 18:53 UTC
WRT your first SSCCE: `while` does not automatically assign the result of its `CONDITION` evaluation to `$_` (in contrast to the `while (<FILEHANDLE>) { do_something_with($_); }` special case):

c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"foreach my $filename (qw(a b c)) {
     dd 'before while loop, $filename is', $filename;
     while ($filename) {
         dd 'in while loop, $_ is', $_;
         last;
     }
 }
"
("before while loop, \$filename is", "a")
("in while loop, \$_ is", undef)
("before while loop, \$filename is", "b")
("in while loop, \$_ is", undef)
("before while loop, \$filename is", "c")
("in while loop, \$_ is", undef)
Re^2: Split and print hash based on regex by Maire (Scribe) on Mar 28, 2018 at 07:22 UTC
Great, thank you!
Re: Split and print hash based on regex by tybalt89 (Monsignor) on Mar 27, 2018 at 21:53 UTC
#!/usr/bin/perl
use strict;
use warnings;

#SSCCE:
my %mycorpus = (

    text1 => "This is line 1 from text 1
another line here which should be included in the text file with the above line.
This is line 2 from text 1
This is line 3 from text 1",

    text2 => "This is line 1 from text 2
This is line 2 from text 2
another line here which should be included in the text file with the above line.
This is line 3 from text 2",

);

my $count = 1;

foreach my $filename (sort keys %mycorpus) {
    # each match runs from one "This is" up to (not including) the next
    for ( $mycorpus{$filename} =~ /This is(?:(?!This is).)*/sg ) {
        my $outputname = 'UserA_' . $count++ . '.txt';
        open my $outfile, '>', $outputname or die "$! opening $outputname";
        print $outfile "$_\n";    # \n only if desired
        close $outfile;
    }
}

# for testing file contents
system "more UserA* | cat";
Re^2: Split and print hash based on regex by Maire (Scribe) on Mar 28, 2018 at 07:37 UTC
Thanks!
Re: Split and print hash based on regex by Cristoforo (Curate) on Mar 27, 2018 at 20:10 UTC
Here is a possible solution that makes use of the three-argument open with a reference to the hash value in place of a filename, so that each string can be read line by line like a file. This works if all the data is in a hash.

#!/usr/bin/perl
use strict;
use warnings;

#SSCCE:
my %mycorpus = (

    text1 => "This is line 1 from text 1
another line here which should be included in the text file with the above line.
This is line 2 from text 1
This is line 3 from text 1",

    text2 => "This is line 1 from text 2
This is line 2 from text 2
another line here which should be included in the text file with the above line.
This is line 3 from text 2",

);

my $count = 1;

foreach my $filename (sort keys %mycorpus) {
    my $outfile;
    # open the string itself as an in-memory file
    open my $fh, '<', \$mycorpus{$filename} or die $!;
    while (<$fh>) {
        chomp;
        if (/^This is/) {
            close $outfile if $outfile;
            my $out = "UserA_$count.txt";
            open $outfile, '>', $out or die "could not open '$out' for writing $!";
            $count++;
        }
        print $outfile $_, "\n" if $outfile;
    }
}

Edit: added conditional to print command ('if $outfile')

Edit2: The solution offered by tybalt89, Re: Split and print hash based on regex, is better than this one. His does not rely on the identifying phrase being at the front of the line of text. The post by jh is also better than this one.
Re^2: Split and print hash based on regex by Maire (Scribe) on Mar 28, 2018 at 07:36 UTC
Ah, very nice solution, thanks! I wasn't aware that one could "open" part of a hash in this way: that tip will save me a lot of time in the future!
Re: Split and print hash based on regex by jh (Beadle) on Mar 27, 2018 at 16:05 UTC
Considering you use the word "split" in the title of your post, it's funny you aren't using `split` to process the text.

our $all_text = join "", <ARGV>;    # files, STDIN, etc.
our $key_phrase = "This is ";       # should not be hard-coded
our $base_name = "UserA_";
our $ext = ".txt";

our @bits = split m/\Q$key_phrase\E/, $all_text;
# if the text begins with the key phrase, the first element will be empty:
shift @bits if $all_text =~ m/^\Q$key_phrase\E/;

my $count = 1;
foreach my $bit (@bits) {
    # suggest padding the index number so files sort correctly
    my $filename = sprintf "%s%2.2d%s", $base_name, $count++, $ext;
    open FILE, ">", $filename
        or die "Could not write to \"$filename\": $!\n";
    print FILE "$key_phrase$bit";    # put back what split() excised
    close FILE;
}

This solution assumes that you can read all the data into memory, of course, but unless it's a million lines or an ongoing TCP/IP connection or something, I rarely have issues with that.
Re^2: Split and print hash based on regex by Maire (Scribe) on Mar 28, 2018 at 07:26 UTC
Thanks for this. I've never (successfully!) worked with the split function before, but your script exemplifies it in a way that I (as a relative newbie) can understand, thanks!
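A possible variation on the `split` approach, sketched here as an assumption rather than anything posted in the thread: splitting on a zero-width lookahead keeps the key phrase attached to each piece, so nothing has to be glued back on afterwards. The helper name `split_keeping_phrase` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text in front of every occurrence of
# $phrase, keeping the phrase at the start of each returned chunk.
sub split_keeping_phrase {
    my ( $text, $phrase ) = @_;

    # (?=...) matches a position, not characters, so split removes nothing.
    # grep { length } drops any empty piece, just in case.
    return grep { length } split /(?=\Q$phrase\E)/, $text;
}

my $text = "This is line 1\nanother line\nThis is line 2\n";
my @chunks = split_keeping_phrase( $text, "This is " );

my $n = 1;
for my $chunk (@chunks) {
    print "--- chunk ", $n++, " ---\n", $chunk;
}
```

Note that any text appearing before the first key phrase would come back as its own leading chunk, so real data may need a check for that.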
Re: Split and print hash based on regex by bliako (Monsignor) on Mar 27, 2018 at 15:29 UTC
Your regex does not capture anything. Shouldn't it be capturing from one "This is" to the next "This is"?
Re^2: Split and print hash based on regex by Maire (Scribe) on Mar 27, 2018 at 15:47 UTC
What I was trying to do is ask it to look for the "This is" and then print that and everything else until the next "This is" (as opposed to capturing the text as such, if that makes sense!). This method worked successfully in the original script (not using hashes), which I've now reproduced above. However, I will look into using a capturing regex to see if I can get this modified script to work successfully that way instead, thanks!
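For readers following the capturing-regex idea, here is one way it might look while keeping the `while (... =~ /.../g)` loop shape of the original script. This is a sketch only: the helper name `chunks_from` and the exact pattern are assumptions, and the chunks are collected into a list instead of being written to files so the logic is easy to check.

```perl
use strict;
use warnings;

# Hypothetical helper: return every chunk that starts with "This is"
# and runs up to the next "This is" or the end of the string.
sub chunks_from {
    my ($text) = @_;
    my @chunks;

    # The capture group fills $1 on each iteration; the loop itself
    # never touches $_. /s lets "." cross newlines, and the lookahead
    # stops each chunk just before the next phrase.
    while ( $text =~ /(This is.*?)(?=This is|\z)/sg ) {
        push @chunks, $1;
    }
    return @chunks;
}

my @chunks = chunks_from("This is line 1\nanother line\nThis is line 2\n");
print scalar(@chunks), " chunks found\n";
print "[$_]\n" for @chunks;
```

In the original script, the body of the loop would then print $1 to the freshly opened file instead of $_.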

