PerlMonks  

parsing file/regex question

by smackdab (Pilgrim)
on Oct 23, 2003 at 20:36 UTC (id://301707)

smackdab has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need to parse a file and validate some data. The thing that I can't figure out is how to allow '\n' or '\t' in my data...

Here is my test case, using __DATA__ instead of a file. I want to see "2 yeps ;-)"
$PRE    = '\[\s*(';
$VALID1 = '[-a-zA-Z0-9_.*\s]';
$PST    = ')\s*\]';

while (<DATA>) {
    print "yep\n" if /$PRE($VALID1+)$PST/;
}

__DATA__
[TEST \n DATA]\n
[ TEST DATA ]\n

Replies are listed 'Best First'.
Re: parsing file/regex question
by tadman (Prior) on Oct 23, 2003 at 20:40 UTC
    Are those \n characters supposed to be newlines?

    Try this:
    while (<DATA>) {
        s/\\n/\n/g;
        print "yep\n" if /$PRE($VALID1+)$PST/;
    }
    Two yeps.
      Thanks, that does make sense, but it does "break" the data-driven approach I am trying to come up with... I have expanded the example and maybe someone will come up with a different idea... if not, I'll do it the way suggested ;-)
      $PRE    = '\[\s*(';
      $VALID1 = '[-a-zA-Z0-9_.* \t\n]';
      $VALID2 = '[-a-z0-9_.*\n]';
      $VALID3 = '[a-zA-Z]';
      $VALID4 = '[-a-zA-Z0-9]';
      $PST    = ')\s*\]';

      while (<DATA>) {
          s/\\n/\n/g;   #Are these harmless if
          s/\\t/\t/g;   #not needed???
          print "yep\n" if m/$PRE($VALID1+)$PST\s*
                             $PRE($VALID2+)$PST\s*
                             $PRE($VALID3+)$PST\s*
                             $PRE($VALID4+)$PST\s*
                            /x;
      }

      __DATA__
      [TEST \n DATA] [ TEST DATA ] [ 2345423 ] [ TEST\tDATA ]\n
      [TEST \n DATA] [ TEST DATA ] [ 2345423 ] [ TEST DATA ]\n
      [TEST \n DATA] [ TEST DATA ] [ 2345423 ] [ TEST\tDATA ]\n
One more q: on: parsing file/regex question
by smackdab (Pilgrim) on Oct 23, 2003 at 23:25 UTC
    Thanks for all of the help on this so far...I took the suggestions and I expanded the sample program to see if that makes a difference...

    I am hoping to get this as data driven as possible to reduce errors (especially when I cut-n-paste ;-)

    I am looking to process some lines in a file and validate text (I am not yet using Taint, but will at some point ;) My problem is how to validate \n or \t, as sometimes it is allowed in the text field.

    The following code should work, but I just want to make sure that the s/\\t/\t/g; (and the other ones that I might need) is the best way to go.

    Thanks again for any help!!!!
    $PRE    = '\[\s*';
    $VALID1 = '[-a-zA-Z0-9_.* \t\n]';
    $VALID2 = '[-a-z0-9_.*\n]';
    $VALID3 = '[a-zA-Z]';
    $VALID4 = '[-a-zA-Z0-9]';
    $PST    = '\s*\]';

    while (<DATA>) {
        s/\\n/\n/g;   #Are these harmless if
        s/\\t/\t/g;   #not needed???
        print "yep\n" if m/$PRE($VALID1+)$PST
                           $PRE($VALID2+)$PST
                           $PRE($VALID3+)$PST
                           $PRE($VALID4+)$PST
                          /ox;
    }

    __DATA__
    [TEST \n DATA] [ TEST DATA ] [ 2345423 ] [ TEST DATA ]\n
    [TEST \n DATA] [ TEST DATA ] [ 2345423 ] [ TEST\tDATA ]\n
    [TEST \n DATA] [ TEST DATA ] [ 2345423 ] [ TEST DATA ]\n
      You didn't say which (if any) of the three data records is supposed to yield "yep"... it looks like none of them will, because $VALID3 specifies letters only, and all three data lines have only digits in the third field. Also, for any of them to match, $PST should include "\s*" after the close bracket, as well as before it (or maybe this should be added before the open bracket in $PRE).

      You do have the right notion for converting a literal (two character) '\n' or '\t' into the corresponding whitespace character before matching.
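On the "are these harmless if not needed" question: a substitution that finds nothing to replace leaves the string unchanged, so the extra s/// lines cost only a scan. If more escapes are needed later, one way to keep this data driven is a lookup table instead of one s/// per escape. The following is only a sketch; the %ESCAPE table and the unescape() helper are made-up names, not something from the original program.

```perl
use strict;
use warnings;

# Hypothetical lookup table: escape letter => the character it stands for.
my %ESCAPE = (
    n => "\n",
    t => "\t",
);

# Replace each two-character sequence such as \n or \t with the real
# character. Sequences whose letter is not in the table are left alone.
sub unescape {
    my ($text) = @_;
    my $letters = join '', map { quotemeta } sort keys %ESCAPE;
    $text =~ s/\\([$letters])/$ESCAPE{$1}/g;
    return $text;
}

my $raw    = 'TEST \n DATA\tEND';   # single quotes: backslashes stay literal
my $cooked = unescape($raw);
print length($raw), " characters before, ", length($cooked), " after\n";
```

Adding another escape then means adding one entry to the table rather than another substitution line.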

      Note that some portions of your regexes can be simplified:  [a-zA-Z0-9_] is really just "\w", and if you want to match space, newline and tab, you might as well just use "\s".

      Are $VALID1 and $VALID2 really supposed to accept periods and asterisks, as well as alphanumerics and whitespace? (Just checking... sometimes people tend to make the mistake of putting ".*" inside of square brackets when they really have something else in mind.)
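Putting the points above together, here is a rough sketch of a four-field validator that allows whitespace between the bracketed fields. The character classes are copied from the post; build_record_regex(), validate() and the two sample records are invented for illustration. Note that with the thread's own records, field 2 (uppercase letters and a space against a lowercase-only class) would be rejected as well as field 3 (digits against a letters-only class), so the first sample here is made up to satisfy every class.

```perl
use strict;
use warnings;

# Character classes from the post, one per field, each compiled with qr//.
my @VALID = (
    qr/[-a-zA-Z0-9_.* \t\n]/,
    qr/[-a-z0-9_.*\n]/,
    qr/[a-zA-Z]/,
    qr/[-a-zA-Z0-9]/,
);

# Build one regex for a whole record. Each field is: optional space, "[",
# optional space, the captured value, optional space, "]".
sub build_record_regex {
    my @classes = @_;
    my $pattern = '';
    for my $class (@classes) {
        $pattern .= qr/\s*\[\s*($class+)\s*\]/;
    }
    return qr/\A$pattern\s*\z/;
}

# Turn literal \n and \t into real whitespace, then test the record.
sub validate {
    my ($line, $regex) = @_;
    $line =~ s/\\n/\n/g;
    $line =~ s/\\t/\t/g;
    return $line =~ $regex ? 1 : 0;
}

my $regex = build_record_regex(@VALID);

# Made-up records: the first satisfies every class, the second does not.
my @samples = (
    '[TEST \n DATA] [ test.data ] [ abc ] [ TEST-2345 ]',
    '[TEST \n DATA] [ TEST DATA ] [ 2345423 ] [ TEST\tDATA ]',
);
for my $sample (@samples) {
    print validate($sample, $regex) ? "yep\n" : "nope\n";
}
```

The \A and \z anchors are a design choice in this sketch: they make the whole record validate or fail, rather than letting a valid run of four fields match somewhere inside a longer line.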

Re: parsing file/regex question
by tcf22 (Priest) on Oct 23, 2003 at 20:49 UTC
    I'm assuming that the '\n' in DATA are actually new lines.

    Maybe something like this is what you want:
    my $re = qr/\[\s*([-a-zA-Z0-9_.*\s]+)\s*\]/;
    my ($last);
    while (<DATA>) {
        chomp;
        $_ = $last.$_ if($last);
        if (/$re/){
            print "yep\n";
            $last = '';
        }elsif(substr($_, -1, 1) ne ']'){
            $last = $_;
        }else{
            $last = '';
        }
    }

    __DATA__
    [TEST 
     DATA]
    [ TEST DATA ]

    - Tom

Re: parsing file/regex question
by TomDLux (Vicar) on Oct 23, 2003 at 21:05 UTC
    1. The regular expression doesn't change, so you should notify Perl that it's safe to compile it once, rather than each time that line is encountered. For two invocations, it obviously doesn't matter, but small scripts have a habit of growing, so you might as well get off to a good start right away:
      if /$PRE($VALID1+)$PST/o
    2. You parenthesize $VALID1 twice: once via the last character of $PRE and the first character of $PST, then explicitly when you string them together:
      /$PRE($VALID1+)$PST/.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      /o must go (look around for many threads about the /o bug ...)
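For anyone wondering what to use instead: with /o the pattern is compiled the first time it runs, and later changes to the interpolated variables are silently ignored. A qr// object is the usual replacement, since it is compiled at the point where the qr// statement executes and can then be reused. A minimal sketch, assuming the single-field pattern from the original question with the doubled parentheses removed; count_matches() and the three test lines are made up here.

```perl
use strict;
use warnings;

my $PRE    = '\[\s*';
my $VALID1 = '[-a-zA-Z0-9_.*\s]';
my $PST    = '\s*\]';

# Compiled once, right here, from the current values of the three pieces.
my $field = qr/$PRE($VALID1+)$PST/;

# Count how many of the given lines match the precompiled regex.
sub count_matches {
    my ($re, @lines) = @_;
    my $count = grep { $_ =~ $re } @lines;
    return $count;
}

# Double-quoted here, so the first line holds a real newline.
my @lines = ("[TEST \n DATA]", "[ TEST DATA ]", "[ bad! ]");
print count_matches($field, @lines), " yeps\n";
```

If the pieces ever do need to change, building a fresh qr// from the new values makes that explicit, which /o cannot do.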
