Regex, Newline, Wilcard

by boblawblah (Scribe)
on Jan 27, 2009 at 21:41 UTC ( [id://739359]=perlquestion )

boblawblah has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have a file that looks something like this:
    <TABLE name="table1">
    FieldName1 VARCHAR(20)
    FieldName2 INT(20)
    FieldName3 BOOL
    </TABLE>

    <TABLE name="table"2>
    FieldName1 VARCHAR(20)
    FieldName2 INT(20)
    FieldName3 BOOL
    </TABLE>
Now I expect to be able to say:
    while ($file =~ s/<TABLE(.*)>([\n.]*)<\/TABLE>//i) {
        # $1 should equal ' name="table1"'
        my $attributes = $1;
        #$2 should equal
        #
        #FieldName1 VARCHAR(20)
        #FieldName2 INT(20)
        #FieldName3 BOOL
        #
        my $fields = $2;
    }
What I get is nothing. When I say
$file =~ s/<TABLE(.*)>(\n.*)//i
Then
$2 = 'FieldName1 VARCHAR(20)'
- the first line of text. Or:
$file =~ s/<TABLE(.*)>([\n.]*)//i
Then
$2 = "\n";
I know that the . (wildcard) character doesn't match newlines, hence the character class [\n.] in my original example. Does anyone know why this doesn't work, and what I should do instead? Thank you!

Replies are listed 'Best First'.
Re: Regex, Newline, Wilcard
by ikegami (Patriarch) on Jan 27, 2009 at 22:32 UTC
    Additionally, you have a greediness problem.
    /<TABLE(.*?)>(.*?)<\/TABLE>//si)
              ^     ^
              |     |
               \   /
               added
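
To make the greediness point concrete, here is a minimal sketch comparing .* with .*? under /s. The sample string and variable names are assumptions made up for illustration (modelled on the question's file), not the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the file in the question.
my $file = <<'END';
<TABLE name="table1">
FieldName1 VARCHAR(20)
FieldName2 INT(20)
</TABLE>

<TABLE name="table2">
FieldName1 VARCHAR(20)
</TABLE>
END

# Greedy: with /s, the first .* swallows as much as it can and only
# backtracks far enough for the rest of the pattern to match.
my ($greedy_attr, $greedy_body) = $file =~ /<TABLE(.*)>(.*)<\/TABLE>/si;

# Non-greedy: .*? stops at the first '>' and the first </TABLE>.
my ($lazy_attr, $lazy_body) = $file =~ /<TABLE(.*?)>(.*?)<\/TABLE>/si;

print "greedy attributes: [$greedy_attr]\n";
print "lazy attributes:   [$lazy_attr]\n";
print "lazy body:         [$lazy_body]\n";
```

In the greedy case the "attributes" capture runs from the first table's name all the way into the second table's opening tag, which is why both question marks are needed.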
Re: Regex, Newline, Wilcard
by kyle (Abbot) on Jan 27, 2009 at 21:47 UTC

    The dot in a character class matches a literal dot, not any character. You'd need to say "(?:.|\n)" or use the /s modifier to get the dot to match newline also (and then don't use the character class).

      You'd need to say "(?:.|\n)"

      or "(?s:.)"
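
A small sketch of the point above, using a made-up test string (an assumption for illustration): inside a character class the dot is literal, while the three spellings mentioned in this subthread all match any character including newline.

```perl
use strict;
use warnings;

# Hypothetical test string: contains a literal dot and a newline.
my $text = "a.b\ncd";

# Inside [...] the dot is literal, so [\n.] matches only "\n" or ".".
my @class_hits = $text =~ /([\n.])/g;

# Three ways to say "any character, newline included".
my ($alt)    = $text =~ /^((?:.|\n)*)$/;   # alternation
my ($flag)   = $text =~ /^(.*)$/s;         # /s modifier on the whole regex
my ($inline) = $text =~ /^((?s:.)*)$/;     # /s scoped to one group

print "character class matched ", scalar(@class_hits), " characters\n";
print "alternation matched everything\n" if $alt    eq $text;
print "/s matched everything\n"          if $flag   eq $text;
print "(?s:.) matched everything\n"      if $inline eq $text;
```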

Re: Regex, Newline, Wilcard
by shmem (Chancellor) on Jan 27, 2009 at 22:58 UTC

    Since the file you are reading seems, judging from the sample, to contain blocks separated by a double newline, you could set $/ (the input record separator) to "\n\n", which would give you one <TABLE> .. </TABLE> block per record. You can then process each record further as a multi-line string with m//mg.
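
A minimal sketch of the record-separator idea, assuming the blocks really are separated by a blank line. The in-memory sample data and the %fields_for hash are hypothetical names introduced for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample data; blocks are separated by a blank line.
my $data = <<'END';
<TABLE name="table1">
FieldName1 VARCHAR(20)
FieldName2 INT(20)
</TABLE>

<TABLE name="table2">
FieldName1 BOOL
</TABLE>
END

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my %fields_for;
{
    local $/ = "\n\n";    # one <TABLE> .. </TABLE> block per read
    while (my $record = <$fh>) {
        my ($name) = $record =~ /<TABLE name="([^"]+)"/ or next;

        # m//mg in list context returns every non-blank line of the record;
        # grep then drops the opening and closing tag lines.
        my @fields = grep { !/^<\/?TABLE/ } $record =~ /^(\S.*)$/mg;
        $fields_for{$name} = \@fields;
    }
}

for my $name (sort keys %fields_for) {
    print "$name: ", join(', ', @{ $fields_for{$name} }), "\n";
}
```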

    But you could also use a flip-flop:

    my $table_name;
    my $sqldef;
    while (<>) {
        if (my $flipflop = /<TABLE name="([^"]+)"/ .. /<\/TABLE/) {
            if ($1) {
                $table_name = $1;
                next;
            }
            elsif ($flipflop =~ /E0/) {
                print "table name: $table_name\n";
                print "sqldef: |$sqldef|\n";
                $table_name = $sqldef = '';
            }
            else {
                $sqldef .= $_;
            }
        }
    }
    __END__
    table name: table1
    sqldef: |FieldName1 VARCHAR(20)
    FieldName2 INT(20)
    FieldName3 BOOL
    |
    table name: table
    sqldef: |FieldName1 VARCHAR(20)
    FieldName2 INT(20)
    FieldName3 BOOL
    |

    See ".." in perlop.
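
For readers wondering about the /E0/ test: in scalar context the range operator returns a sequence number while it is "on", and the last number of each range has "E0" appended. A minimal sketch with made-up input lines (the @lines and @seen names are assumptions for illustration):

```perl
use strict;
use warnings;

# Hypothetical input lines: one line outside the block on each side.
my @lines = (
    "before\n",
    "<TABLE name=\"t\">\n",
    "a INT\n",
    "</TABLE>\n",
    "after\n",
);

my @seen;
for (@lines) {
    # Empty string while "off"; 1, 2, ... while "on"; the final
    # value of a range carries an "E0" suffix.
    my $seq = /<TABLE/ .. /<\/TABLE/;
    push @seen, $seq;
}

print join(', ', map { "'$_'" } @seen), "\n";
# prints: '', '1', '2', '3E0', ''
```

The "E0" suffix leaves the numeric value unchanged, so the result can still be used as a number while also marking the end of the range.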

Re: Regex, Newline, Wilcard
by hbm (Hermit) on Jan 27, 2009 at 23:59 UTC

    Almost the same as shmem, but I thought I'd use "</TABLE>" as the record separator. That gives you one extra record at the end, which you can easily ignore, but works if you don't have "\n\n". Or flip-flop; or slurp-n-split.

    use strict;
    my $line = '='x60 ."\n";

    #&RS;
    &DotDot;
    #&SlurpAndSplit;

    sub RS {
        $/=q{</TABLE>};
        while (<DATA>){
            if (s|\s*<TABLE (.+?)>|$1|) {
                chomp;
                print $line.$_;
            }
        }
    }

    sub SlurpAndSplit {
        $/=undef;
        chomp(my @tables = split(/<\/TABLE>\s*/, <DATA>));
        print map { s|<TABLE (.+?)>|$1|; "$line$_" } @tables;
    }

    sub DotDot {
        my @wanted;
        while (<DATA>) {
            chomp;
            if (/<TABLE / .. /<\/TABLE>/ ) {
                if (/<TABLE (.+?)>/) {
                    push(@wanted,$1);
                }
                elsif (/<\/TABLE>/) {
                    print $line, map {"$_\n"} @wanted;
                    @wanted = ();
                }
                else {
                    push(@wanted,$_);
                }
            }
        }
    }
    __DATA__
    <TABLE name="table1">
    FieldName1 VARCHAR(20)
    FieldName2 INT(20)
    FieldName3 BOOL
    </TABLE>

    <TABLE name="table"2>
    FieldName1 VARCHAR(20)
    FieldName2 INT(20)
    FieldName3 BOOL
    </TABLE>
Re: Regex, Newline, Wilcard
by JavaFan (Canon) on Jan 28, 2009 at 09:36 UTC
    I prefer to write my regexes as restrictive as possible. Hence, something like (untested):
    $file =~ m{<TABLE \s+ ([^>]*+)> ( [^<]*+ (?: < (?!/TABLE>) [^<]*+ )* ) </TABLE>}x;
    This puts 'name="table1"' in $1 and the content of the table in $2. If the perl you're using is pre-5.10, or if you know your input is 'valid' (that is, each <TABLE> has a matching </TABLE>) use '*' instead of '*+'.
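
A minimal sketch of how that regex might be driven over a whole file with /g, assuming perl 5.10 or later (for the possessive quantifiers). The sample data and the @found array are hypothetical, introduced only for illustration.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers (*+) need perl 5.10 or later

# Hypothetical sample input, modelled on the file in the question.
my $file = <<'END';
<TABLE name="table1">
FieldName1 VARCHAR(20)
FieldName2 INT(20)
</TABLE>

<TABLE name="table2">
FieldName1 BOOL
</TABLE>
END

my @found;
# /g in scalar context resumes after the previous match on each pass.
while ($file =~ m{<TABLE \s+ ([^>]*+)> ( [^<]*+ (?: < (?!/TABLE>) [^<]*+ )* ) </TABLE>}xg) {
    my ($attributes, $content) = ($1, $2);
    push @found, [ $attributes, $content ];
}

for my $table (@found) {
    print "attributes: $table->[0]\n";
    print "content: |$table->[1]|\n";
}
```

Unlike the s/// loop in the question, this leaves $file untouched and relies on /g to move through the string.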
