Tricky regex problem

by nysus (Vicar)
on Jul 22, 2020 at 15:05 UTC (#11119658)

nysus has asked for the wisdom of the Perl Monks concerning the following question:

Got a regex:

$input_text =~ s/^###*\s+[^\n]+\s+(?=^##)//gsm;

I want it to strip out markdown headers from a file that don't contain anything:

## This gets stripped

## This doesn't
Because it contains this line

## More headers
blah blah

But I also don't want it to strip this:

## This should not get stripped, but it does

### This should prevent it from getting stripped, but it doesn't
stuff

#### This should also not get stripped

##### But it does

Is there any way at all to pull this off with a regex?

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;
Click here if you love Perl Monks

Replies are listed 'Best First'.
Re: Tricky regex problem
by Eily (Monsignor) on Jul 22, 2020 at 15:55 UTC

    Not a single regex, but one way to do what you want is to read your file in paragraph mode (by setting $/ to the empty string) and split the logic:

    {
        local $/ = ""; # Edit: added use of local for good practice
        while (<DATA>) {
            s/^##.*//s unless /^\w/m;
            print;
        }
    }
    __DATA__
    ## Remove

    ## Keep
    this

    ## Remove

    ## Also keep
    that

    Otherwise you could use a negative look ahead assertion (?!^\w)

    Edit: seems like I really didn't read the requirement well enough ^^"

Re: Tricky regex problem
by tybalt89 (Prior) on Jul 22, 2020 at 16:00 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11119658
    use warnings;

    $_ = <<END;
    ## This gets stripped

    ## This doesn't
    Because it contains this line

    ## More headers
    blah blah

    ## this also should be stripped ?

    ## This should not get stripped, but it does

    ### This should prevent it from getting stripped, but it doesn't
    stuff

    #### This should also not get stripped

    ##### But it does
    isn't some text needed here?
    END

    s/^(#+).*\n\n(?!^\1#)//gm;

    print;

    Outputs:

    ## This doesn't
    Because it contains this line

    ## More headers
    blah blah

    ## This should not get stripped, but it does

    ### This should prevent it from getting stripped, but it doesn't
    stuff

    #### This should also not get stripped

    ##### But it does
    isn't some text needed here?

    I think a larger test case may be needed...

Re: Tricky regex problem (updated)
by AnomalousMonk (Bishop) on Jul 22, 2020 at 15:34 UTC
    If that (update: oops... I meant to reply to that node) works for you, try this as a simplification (untested):
    $input_text =~ s{ ^ ([#]{2,5}) \s+ [^\n]+ \s+ (?= ^ \1 [^#]) } {}xmsg;
    Beyond that, I have to say I don't understand your requirements. Can you express them more clearly?


    Give a man a fish:  <%-{-{-{-<

      Every header that is followed only by whitespace, and then by a header with the same number of pound signs or fewer, should get stripped.
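A minimal sketch of that rule as a line-by-line pass instead of a regex. It assumes headers start at column 0 with at least two pound signs followed by a space or tab (matching the patterns in the question), and `strip_empty_headers` is a hypothetical name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: drop each header whose next non-blank line is a
# header with the same number of pound signs or fewer.
sub strip_empty_headers {
    my ($text) = @_;
    my @lines = split /^/m, $text;    # one element per line, newlines kept
    my @out;
    my $i = 0;
    while ($i < @lines) {
        if (my ($hashes) = $lines[$i] =~ /^(#{2,})[ \t]/) {
            # Look past any blank lines for the next line with content.
            my $j = $i + 1;
            $j++ while $j < @lines && $lines[$j] !~ /\S/;
            if (   $j < @lines
                && $lines[$j] =~ /^(#{2,})[ \t]/
                && length($1) <= length($hashes))
            {
                $i = $j;    # skip the header and the blank lines after it
                next;
            }
        }
        push @out, $lines[$i++];
    }
    return join '', @out;
}

# Small demonstration with made-up input.
print strip_empty_headers("## Empty\n\n## Kept\nsome text\n");
```

This is a single pass: a header is judged against the line that originally follows it, so a header that only becomes "empty" after a deeper header beneath it has been removed is left in place. A header at the very end of the input has no following header and is kept.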


Re: Tricky regex problem
by LanX (Cardinal) on Jul 22, 2020 at 16:01 UTC
    use strict;
    use warnings;

    local $/ = "\n##"; #record separator

    while (<DATA>) {
        chomp;
        #print "\n<<<$_>>>\n"; # check input
        my ($head,$rest) = /^ (.*?) \n (.*) $/xs;
        # print record only if $rest contains alphanumerics
        print "##$_" if $rest =~ /\w/;
    }

    __DATA__
    ## This gets stripped

    ## This doesn't
    Because it contains this line

    ## More headers
    blah blah

    ## This should not get stripped, but it does

    ### This should prevent it from getting stripped, but it doesn't
    stuff

    #### This should also not get stripped

    ##### But it does

    C:/Perl_524/bin\perl.exe -w d:/exp/pm_headers.pl
    ## This doesn't
    Because it contains this line
    ## More headers
    blah blah
    ### This should prevent it from getting stripped, but it doesn't
    stuff

    Compilation finished at Wed Jul 22 17:59:57

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Tricky regex problem
by nysus (Vicar) on Jul 22, 2020 at 15:11 UTC

    I came up with this, which is faintly ridiculous:

    $input_text =~ s/^##\s+[^\n]+\s+(?=^##[^#])//gsm;
    $input_text =~ s/^###\s+[^\n]+\s+(?=^###[^#])//gsm;
    $input_text =~ s/^####\s+[^\n]+\s+(?=^####[^#])//gsm;
    $input_text =~ s/^#####\s+[^\n]+\s+(?=^#####[^#])//gsm;

    I just googled "conditional regular expression" and it seems that is a thing and it might help me. Not sure yet.
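For what it's worth, the four substitutions can probably be folded into one by capturing the run of pound signs and referring back to it, without needing a conditional. The sketch below is an assumption-laden variant, not the code from this post: it strips a header when the next header has the same number of pound signs or fewer (the four lines above only cover the same-level case), it has no upper limit of five, and `strip_with_backreference` is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: one substitution instead of one per header level.
sub strip_with_backreference {
    my ($text) = @_;
    $text =~ s{
        ^ (\#{2,}) [ \t]+ [^\n]+   # a header line; group 1 is its run of pound signs
        \s+                        # nothing but whitespace after it
        (?= ^ \#{2} )              # the next content is another header
        (?! \1 \# )                # that does not have more pound signs than group 1
    }{}gmx;
    return $text;
}

# Small demonstration with made-up input.
print strip_with_backreference("### Empty\n\n## Kept\nsome text\n");
```

Requiring `[ \t]+` straight after the captured group keeps the regex engine from backtracking to a shorter run of pound signs, so group 1 always holds the full header prefix.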

