this function reads an ASP file (yes, sometimes I have to fight with such stuff) separating code blocks from HTML blocks. the function returns an array where each element is a 2-element array. the first element is either "HTM" or "ASP", and the second one is the block itself. ASP tags (<% and %>) are removed.
cheers,
Aldo
sub get_asp_blocks {
    my($file) = @_;
    open(my $fh, '<', $file) or die "can't open '$file': $!\n";
    my @blocks = ( ["HTM", ""] );
    my $state = "HTM";
    my $last = "";   # previous character; empty so the first comparison is defined
    my $char;
    while(read($fh, $char, 1)) {
        if($last eq "<" && $char eq "%" && $state eq "HTM") {
            chop $blocks[-1][1];   # drop the "<" already appended to the HTML block
            $state = "ASP";
            push(@blocks, ["ASP", ""]);
        } elsif($last eq "%" && $char eq ">" && $state eq "ASP") {
            chop $blocks[-1][1];   # drop the "%" already appended to the ASP block
            $state = "HTM";
            push(@blocks, ["HTM", ""]);
        } else {
            $blocks[-1][1] .= $char;
        }
        $last = $char;
    }
    close($fh);
    return @blocks;
}
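To show what a caller does with the returned list, here is a minimal sketch. The sample data is hand-built in the shape the function returns for the input `<p>Hello <%= name %></p>`; it is illustrative, not captured output, and the `@code` variable is just a name picked for the example.

```perl
use strict;
use warnings;

# Hand-built sample in the shape get_asp_blocks() returns:
# a list of [TYPE, TEXT] pairs (illustrative data only).
my @blocks = (
    ["HTM", "<p>Hello "],
    ["ASP", "= name "],
    ["HTM", "</p>\n"],
);

# Keep only the code blocks and trim surrounding whitespace from each.
my @code = map  { my $t = $_->[1]; $t =~ s/^\s+|\s+$//g; $t }
           grep { $_->[0] eq "ASP" } @blocks;

print "$_\n" for @code;
```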
Re: parsing an ASP file
by Juerd (Abbot) on May 12, 2004 at 17:15 UTC
I think (but have not tested) that even an inefficient regex is faster than reading one character at a time. It is certainly easier to write :)
my @parsed;
while ($asp =~ /\G ((?: [^<]+ | <(?!%) )*) (?: <%(.*?)%> | ((?=<%)) )? /gsx) {
$1 and push @parsed, [ html => $1 ];
$2 and push @parsed, [ asp => $2 ];
defined $3 and die "Unclosed ASP code block near '",
$asp =~ /\G(<%\s*\n?.*)/g, "'.\n";
}
But, of course,
<% foo = "a mere %> breaks either simple minded solution." %>
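The regex version assumes `$asp` already holds the whole file as one string. A minimal sketch of getting it there follows; `slurp_file` is a hypothetical helper name, not something from the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entire contents of a file as one string.
sub slurp_file {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "can't open '$file': $!\n";
    local $/;            # undefine the record separator: read everything at once
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : "";   # an empty file yields ""
}
```

With that in place, something like `my $asp = slurp_file($file);` feeds the loop above.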
yep. one thing I forgot to mention is that, for the application I'm currently writing (which is basically an ASP cross-reference generator) I need to have the line number where each block appears. so, the code I'm using is something more like:
sub get_asp_blocks {
    my($file) = @_;
    open(my $fh, '<', $file) or die "can't open '$file': $!\n";
    my $dot = 1;     # current line number
    my @blocks = ( ["HTM", $dot, ""] );
    my $state = "HTM";
    my $last = "";   # previous character; empty so the first comparison is defined
    my $char;
    while(read($fh, $char, 1)) {
        $dot++ if $char eq "\n";
        if($last eq "<" && $char eq "%" && $state eq "HTM") {
            chop $blocks[-1][-1];   # drop the "<" already appended to the HTML block
            $state = "ASP";
            push(@blocks, ["ASP", $dot, ""]);
        } elsif($last eq "%" && $char eq ">" && $state eq "ASP") {
            chop $blocks[-1][-1];   # drop the "%" already appended to the ASP block
            $state = "HTM";
            push(@blocks, ["HTM", $dot, ""]);
        } else {
            $blocks[-1][-1] .= $char;
        }
        $last = $char;
    }
    close($fh);
    return @blocks;
}
this way, each element of the returned array contains three elements: the type (ASP or HTM), the line number, and the block itself.
cheers,
Aldo
King of Laziness, Wizard of Impatience, Lord of Hubris
my $state = "HTM";
The state is what I don't like. It means that everything needs to be done manually. So to get the line numbers, I'd probably just extend the regex with one set of all-enclosing parens (or for simple stand-alone scripts just use $&), and then count the number of \n characters found in it.
my @parsed;
my $line = 1;
while ($asp =~ /\G( ((?: [^<]+ | <(?!%) )*) (?: <%(.*?)%> | ((?=<%)) )? )/gsx) {
$2 and push @parsed, [ $line, html => $2 ];
$3 and push @parsed, [ $line, asp => $3 ];
defined $4 and die "Unclosed ASP code block starting on line $line near '",
$asp =~ /\G(<%\s*\n?.*)/g, "'.\n";
$line += $1 =~ tr/\n//;
}
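The newline counting is the only non-obvious part, so here it is in isolation. This is just a sketch with a made-up `$chunk`; the variable names are chosen for the example.

```perl
use strict;
use warnings;

my $chunk = "<table>\n<tr>\n<% x = 1 %>";

# tr/\n// with an empty replacement list and no /d modifier leaves the
# string unchanged and returns the number of matching characters.
my $newlines = ($chunk =~ tr/\n//);

my $line = 1;          # line on which $chunk starts
$line += $newlines;    # line on which the next chunk starts
print "next chunk starts on line $line\n";   # prints: next chunk starts on line 3
```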
Ah, but a more complete version is easy to write too! :) (Although, I admit, a bit more longwinded...)
use re 'eval';
my $string = qr[
" [^"\\]* (?:\\.|[^"\\])* "
| ' [^'\\]* (?:\\.|[^'\\])* '
]x;
my $alist = qr[(?: [^"'>]* | $string )*]x;
my $ehead = qr[ <\w+ $alist /? > ]x;
my $textarea = qr[
<textarea $alist>
(?:
[^<]*
| < (?!/textarea>)
)*
</textarea>
]x;
my $asp = qr[
<%
(?:
(?> [^%"']* )
| $string
| % (?! > )
)+
%>
]x;
my $html = qr[
(?:
(?> [^<"'] )
| $textarea
| $ehead
| </\w+>
)+
]x;
my @parsed;
# $source is assumed to hold the ASP text to parse (e.g. the slurped file)
() = $source =~ /
($asp) (?{ push @parsed, [asp => $1] })
| ($html) (?{ push @parsed, [html => $2] })
/gx;
Re: parsing an ASP file
by perlinux (Deacon) on May 12, 2004 at 14:03 UTC
Good job! Do you think it could work for other "tag" languages (PHP, JSP) just by changing the tags? Or would the code itself need major changes? I have never worked with ASP...
I followed the discussion of this code in the chat... :-)
PHP is configurable to use many tags (http://us4.php.net/basic-syntax): <?, <%, <?php and
<script language="php">
--
I'm not belgian but I play one on TV.
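For the "only changing the tag" question, a minimal sketch of a splitter that takes the delimiters as arguments might look like this. `split_tagged` is a hypothetical name, not code from the thread, and like the simple versions above it ignores delimiters that appear inside string literals. An unclosed block is silently treated as HTML here.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into [type, body] pairs for any pair of
# literal open/close delimiters.
sub split_tagged {
    my ($text, $open, $close) = @_;
    my ($o, $c) = (quotemeta $open, quotemeta $close);
    my @blocks;
    my $pos = 0;                                 # end of the previous code block
    while ($text =~ /$o(.*?)$c/gs) {
        my $start = $-[0];                       # offset where the open tag begins
        push @blocks, [ html => substr($text, $pos, $start - $pos) ]
            if $start > $pos;
        push @blocks, [ code => $1 ];
        $pos = $+[0];                            # offset just past the close tag
    }
    push @blocks, [ html => substr($text, $pos) ] if $pos < length $text;
    return @blocks;
}

my @php = split_tagged('<b><?php echo 1; ?></b>', '<?php', '?>');
my @asp = split_tagged('<b><% x %></b>', '<%', '%>');
```

Supporting several tag styles at once (as PHP allows) would need one pass per delimiter pair or an alternation in the pattern; this sketch handles one pair per call.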
Re: parsing an ASP file
by iburrell (Chaplain) on May 25, 2004 at 21:43 UTC
The type element ('HTM' or 'ASP') is not required if you leave the ASP tags in the strings. The parser would effectively split the file into a list of chunks. It would be easy to tell ASP chunks apart because they start with '<%'.
I have seen a regex-based XML parser that works this way. It breaks the XML into strings which can be identified by looking at the first couple of characters.
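A minimal sketch of that idea, assuming the whole file is already in `$asp` (the sample text and variable names are made up for the example). It has the same weakness as the other simple versions: a `%>` inside a string literal ends the chunk early.

```perl
use strict;
use warnings;

my $asp = "<p>Hello <%= name %></p>\n<% x = 1 %>";

# split with a capturing group keeps the delimiters (here, whole ASP blocks)
# in the result list; grep drops the empty strings split can produce at the
# start or between adjacent blocks.
my @chunks = grep { length } split /(<%.*?%>)/s, $asp;

for my $chunk (@chunks) {
    # The type is recovered from the first two characters of the chunk.
    my $type = $chunk =~ /^<%/ ? "ASP" : "HTM";
    print "$type: $chunk\n";
}
```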