PerlMonks  

Apply regex to entire file, not just individual lines ?

by Anonymous Monk
on May 24, 2000 at 13:59 UTC ( [id://14521]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question: (regular expressions)

I'm trying to extract a specific block of recurring text from a daily-updated Web page and write the result to a local file. I'm happy with my HTML retrieval, but applying regexes on a line-by-line basis requires way too much tweaking on my part. How can I substitute across multiple lines? Preferably across the entire file.

Originally posted as a Categorized Question.

Replies are listed 'Best First'.
Re: Apply regex to entire file, not just individual lines ?
by nuance (Hermit) on May 24, 2000 at 17:46 UTC
    You can read the entire file into a scalar variable like this:
    {
        open(FILE, $filename) or die "Can't open $filename: $!\n";
        local $/ = undef;    # no record separator, so one read returns the whole file
        $lines = <FILE>;
        close(FILE);
    }
    Then you can just use your normal regular expression, but you'll probably want to use at least one of the following modifiers (from the perlre manpage):

    m

    Treat string as multiple lines. That is, change ``^'' and ``$'' from matching at only the very start or end of the string to the start or end of any line anywhere within the string.

    s

    Treat string as single line. That is, change ``.'' to match any character whatsoever, even a newline, which it normally would not match. The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force ``^'' to match only at the beginning of the string and ``$'' to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the ``.'' match any character whatsoever, while yet allowing ``^'' and ``$'' to match, respectively, just after and just before newlines within the string.
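To make the two modifiers concrete, here is a minimal sketch. The page content, the `<!-- begin block -->` / `<!-- end block -->` markers, and the variable names are all made up for illustration, and an in-memory filehandle stands in for the real file so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical page content; in practice this would be the retrieved HTML.
my $page = <<'END';
<html>
<!-- begin block -->
first line
second line
<!-- end block -->
</html>
END

# Slurp it in one read. The in-memory handle is a stand-in for a real file.
open(my $fh, '<', \$page) or die "Can't open in-memory file: $!\n";
my $text = do { local $/; <$fh> };    # $/ is undef inside this block
close($fh);

# /s lets "." match newlines, so a single match can span several lines.
my ($block) = $text =~ /<!-- begin block -->\n(.*?)<!-- end block -->/s;

# /m lets "^" and "$" match at the start and end of every line in the string.
my @starts = $text =~ /^(\w+) line$/mg;

print $block;
print "@starts\n";
```

Here the `/s` match pulls out both lines between the markers in one go, while the `/m` match picks the first word off each matching line.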

Re: Apply regex to entire file, not just individual lines ?
by juahonen (Novice) on May 24, 2000 at 17:22 UTC
    After you've opened and read the file (or web page) into an array, join all lines with join().
    open(FILE, $filename) or die "Can't open $filename: $!\n";
    @lines = <FILE>;
    close(FILE);
    
    $content = join('', @lines);
    
    After this, $content holds the whole file as a single string (newlines included), so it is easy to apply your existing regexes to it in one go.
Re: Apply regex to entire file, not just individual lines ?
by vxp (Pilgrim) on Aug 16, 2002 at 16:14 UTC
    You might not want to have your WHOLE file in one variable. Depending on the size of the file, it could eat a LOT of your memory. In my experience, it is usually enough to set $/ = "\n\n" (double quotes, so that \n is interpreted as a newline), which makes the record separator two newlines instead of one. I was doing this to parse a bounce file of about 300 MB, daily. That's a LONG 300 MB line. $/ = "\n\n"; took care of it. I ended up with smaller "big lines" and was able to do what I wanted without consuming a lot of RAM.
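A rough sketch of that record-at-a-time approach follows. The log format (`From:` / `Status:` lines separated by blank lines) is invented for the example and is not the actual bounce file described above; the in-memory filehandle again stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical blank-line-delimited records; a real bounce file will differ.
my $log = "From: a\nStatus: bounced\n\n"
        . "From: b\nStatus: ok\n\n"
        . "From: c\nStatus: bounced\n";

open(my $fh, '<', \$log) or die "Can't open in-memory file: $!\n";

my @bounced;
{
    local $/ = "\n\n";    # double quotes: two real newlines end each record
    while (my $record = <$fh>) {
        # Only one record is in memory at a time, yet the regex can
        # still match across the lines inside that record.
        if ($record =~ /^From: (\S+)\n.*^Status: bounced$/ms) {
            push @bounced, $1;
        }
    }
}
close($fh);

print "@bounced\n";
```

Setting `$/ = ""` instead selects Perl's paragraph mode, which treats any run of blank lines as a single separator; whether that is preferable depends on the input.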
Re: Apply regex to entire file, not just individual lines ?
by dsb (Chaplain) on Jan 24, 2001 at 02:49 UTC
    The key is to get the whole file into one scalar (the first 'while' loop). Then the 'g' modifier (the condition in the second 'while' loop) keeps the position of the last match and continues from there until no more matches are found.
    open( FH, "filename" ) || die "couldn't open: $!\n";
    while ( <FH> ) {
        $data .= $_;
    }
    close( FH );
    while ( $data =~ m/PATTERN/g ) {
        # executed code
        # executed code...etc.
    }
    -kel
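The way /g remembers its place can be observed with pos(). This is only an illustrative sketch: the sample data and the `id=` pattern are assumptions standing in for PATTERN.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the slurped file.
my $data = "id=1\nname=foo\nid=22\nname=bar\n";

my (@ids, @offsets);
while ($data =~ /^id=(\d+)$/mg) {
    push @ids, $1;
    # pos() gives the offset at which the next /g match will resume.
    push @offsets, pos($data);
}

print "ids: @ids\n";
print "resume offsets: @offsets\n";
```

In scalar context each pass through the loop condition finds the next match after the previous one; when no match is left, the condition fails and the position resets.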
RE: Apply regex to entire file, not just individual lines ?
by KM (Priest) on May 25, 2000 at 08:28 UTC
    If the only trouble you are having is that it isn't writing to a file, the likely cause is that you are not printing to a filehandle. Look at the open() docs (perldoc -f open) and perlopentut to learn the different ways to open a file and write to it.

    Cheers,
    KM
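For reference, printing to a filehandle might look like the sketch below. The extracted text and the output path are placeholders: a temporary file from File::Temp is used here only so the example is self-contained, where a real script would name its own output file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Placeholder for whatever text the regex extracted.
my $block = "first line\nsecond line\n";

# A temporary file stands in for the real output path.
my (undef, $outfile) = tempfile(UNLINK => 1);

open(my $out, '>', $outfile) or die "Can't write $outfile: $!\n";
print {$out} $block;    # goes to the file, not to STDOUT
close($out) or die "Can't close $outfile: $!\n";

# Read the file back to confirm what was written.
open(my $in, '<', $outfile) or die "Can't read $outfile: $!\n";
my $written = do { local $/; <$in> };
close($in);

print $written;
```

Without the filehandle argument, print writes to the currently selected handle, which is normally STDOUT.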

RE: Apply regex to entire file, not just individual lines ?
by perlcgi (Hermit) on May 25, 2000 at 15:09 UTC
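    Remember that if the unwanted stuff appears more than once per line you'll need a /g to match globally: $lines =~ s/^unwantedstuff//gsm; (Note that with the ^ anchor under /m, each match must begin at the start of a line, so drop the ^ if the text can occur mid-line.)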
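The difference the ^ anchor makes under /gm can be seen in a small sketch; the sample text and the word "junk" are made up for the purpose.

```perl
use strict;
use warnings;

# Hypothetical input with the unwanted word twice on the first line.
my $lines = "junk one junk\njunk two\n";

# With ^ and /m, /g removes at most one occurrence per line (the leading one).
(my $anchored = $lines) =~ s/^junk //gm;

# Without the anchor, /g removes every occurrence wherever it appears.
(my $everywhere = $lines) =~ s/junk ?//g;

print $anchored;
print $everywhere;
```

The anchored version leaves the trailing "junk" on the first line in place, while the unanchored version strips all three occurrences.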
