
parsing an ASP file

by dada (Chaplain)
on May 12, 2004 at 13:27 UTC ( [id://352721]=CUFP )

this function reads an ASP file (yes, sometimes I have to fight with such stuff), separating code blocks from HTML blocks. the function returns a list in which each element is a reference to a 2-element array: the first element is either "HTM" or "ASP", and the second one is the block itself. the ASP tags (<% and %>) are removed.


sub get_asp_blocks {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "can't open '$file': $!\n";
    my @blocks = ( ["HTM", ""] );
    my $state  = "HTM";
    my $last   = "";    # previous character read
    my $char;
    while (read($fh, $char, 1)) {
        if ($last eq "<" && $char eq "%" && $state eq "HTM") {
            chop $blocks[-1][1];    # drop the "<" already appended
            $state = "ASP";
            push(@blocks, ["ASP", ""]);
        }
        elsif ($last eq "%" && $char eq ">" && $state eq "ASP") {
            chop $blocks[-1][1];    # drop the "%" already appended
            $state = "HTM";
            push(@blocks, ["HTM", ""]);
        }
        else {
            $blocks[-1][1] .= $char;
        }
        $last = $char;
    }
    close($fh);
    return @blocks;
}

Replies are listed 'Best First'.
Re: parsing an ASP file
by Juerd (Abbot) on May 12, 2004 at 17:15 UTC

    I think (but have not tested) that even an inefficient regex is faster than reading one character at a time. It is certainly easier to write :)

    my @parsed;
    while ($asp =~ /\G ((?: [^<]+ | <(?!%) )*) (?: <%(.*?)%> | ((?=<%)) )? /gsx) {
        $1 and push @parsed, [ html => $1 ];
        $2 and push @parsed, [ asp  => $2 ];
        defined $3 and die "Unclosed ASP code block near '", $asp =~ /\G(<%\s*\n?.*)/g, "'.\n";
    }
    But, of course,
    <% foo = "a mere %> breaks either simple minded solution." %>
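To make that failure concrete, here is a minimal sketch that runs the example line through a delimiter-only splitter. The helper name `naive_split` is hypothetical (it is not code from this thread); it simply treats the first `%>` after each `<%` as the end of the block, with no awareness of string literals.

```perl
use strict;
use warnings;

# Hypothetical helper: ends each block at the first "%>" it sees,
# ignoring whether that "%>" sits inside a quoted string.
sub naive_split {
    my ($text) = @_;
    my @parsed;
    while ($text =~ /\G (.*?) <% (.*?) %> /gcsx) {
        push @parsed, [ html => $1 ] if length $1;
        push @parsed, [ asp  => $2 ] if length $2;
    }
    # Whatever follows the last closing tag is treated as HTML.
    my $rest = substr($text, pos($text) || 0);
    push @parsed, [ html => $rest ] if length $rest;
    return @parsed;
}

my @parsed = naive_split(
    q{<% foo = "a mere %> breaks either simple minded solution." %>}
);
print "$_->[0]: [$_->[1]]\n" for @parsed;
```

The single ASP block comes back as two pieces: an "asp" chunk that stops in the middle of the string literal, and an "html" chunk holding the rest of the statement.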

    Juerd # { site => '', plp_site => '', do_not_use => 'spamtrap' }

      yep. one thing I forgot to mention is that, for the application I'm currently writing (which is basically an ASP cross-reference generator) I need to have the line number where each block appears. so, the code I'm using is something more like:
      sub get_asp_blocks {
          my ($file) = @_;
          open(my $fh, '<', $file) or die "can't open '$file': $!\n";
          my $dot    = 1;    # current line number
          my @blocks = ( ["HTM", $dot, ""] );
          my $state  = "HTM";
          my $last   = "";   # previous character read
          my $char;
          while (read($fh, $char, 1)) {
              $dot++ if $char eq "\n";
              if ($last eq "<" && $char eq "%" && $state eq "HTM") {
                  chop $blocks[-1][-1];    # drop the "<" already appended
                  $state = "ASP";
                  push(@blocks, ["ASP", $dot, ""]);
              }
              elsif ($last eq "%" && $char eq ">" && $state eq "ASP") {
                  chop $blocks[-1][-1];    # drop the "%" already appended
                  $state = "HTM";
                  push(@blocks, ["HTM", $dot, ""]);
              }
              else {
                  $blocks[-1][-1] .= $char;
              }
              $last = $char;
          }
          close($fh);
          return @blocks;
      }
      this way, each element of the returned array contains three elements: the type (ASP or HTM), the line number, and the block itself.


      King of Laziness, Wizard of Impatience, Lord of Hubris

        my $state = "HTM";

        The state is what I don't like. It means that everything needs to be done manually. So to get the line numbers, I'd probably just extend the regex with one set of all-enclosing parens (or for simple stand-alone scripts just use $&), and then count the number of \n characters found in it.

        my @parsed;
        my $line = 1;
        while ($asp =~ /\G( ((?: [^<]+ | <(?!%) )*) (?: <%(.*?)%> | ((?=<%)) )? )/gsx) {
            # an ASP block starts after any newlines in the HTML that precedes it
            my $asp_line = $line + ($2 =~ tr/\n//);
            $2 and push @parsed, [ $line, html => $2 ];
            $3 and push @parsed, [ $asp_line, asp => $3 ];
            defined $4 and die "Unclosed ASP code block starting on line $asp_line near '", $asp =~ /\G(<%\s*\n?.*)/g, "'.\n";
            $line += $1 =~ tr/\n//;
        }

        Juerd # { site => '', plp_site => '', do_not_use => 'spamtrap' }

      Ah, but a more complete version is easy to write too! :) (Although, I admit, a bit more longwinded...)

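      use re 'eval';

      # a single- or double-quoted string, with backslash escapes
      my $string = qr[ " [^"\\]* (?:\\.|[^"\\])* " | ' [^'\\]* (?:\\.|[^'\\])* ' ]x;
      # attribute list of a tag, which may contain quoted strings
      my $alist = qr[(?: [^"'>]* | $string )*]x;
      # an opening (or self-closing) HTML tag
      my $ehead = qr[ <\w+ $alist /? > ]x;
      my $textarea = qr[ <textarea $alist> (?: [^<]* | < (?!/textarea>) )* </textarea> ]x;
      # an ASP block; a %> inside a quoted string does not end it
      my $asp = qr[ <% (?: (?> [^%"']* ) | $string | % (?! > ) )+ %> ]x;
      my $html = qr[ (?: (?> [^<"'] ) | $textarea | $ehead | </\w+> )+ ]x;

      my @parsed;
      # $input is assumed to hold the ASP source text
      () = $input =~ /
            ($asp)  (?{ push @parsed, [asp  => $1] })
          | ($html) (?{ push @parsed, [html => $2] })
      /gx;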
Re: parsing an ASP file
by perlinux (Deacon) on May 12, 2004 at 14:03 UTC
    Good job! Do you think it could work for other "tag" languages (PHP, JSP) just by changing the tag? Or would the code itself need major changes? I have never worked with ASP...

    I followed the discussion of this code in chat... :-)
      as far as I know (which isn't very far :-) PHP uses <? .. ?> as delimiters, so you could just change "%" to "?" in the eq comparisons above (and of course, change "ASP" to "PHP" if you prefer) and it should work.

      JSP seems to be using its own tag library (things like <jsp:getProperty .. /> and so on) as well as blocks like <% .. %>. perhaps you could use this tool and a full-blown HTML (or XHTML) parser to recognize JSP tags in HTML blocks, but I really don't know.
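The delimiter swap described above for PHP can also be made a parameter instead of an edit to the code. The following is only a sketch under assumptions: `get_blocks`, its argument order, and the `$label` argument are hypothetical names, and it uses `split` with a capturing group rather than the character-by-character loop from the original post. JSP custom tags are not handled at all.

```perl
use strict;
use warnings;

# Hypothetical generalisation: $open and $close are the delimiters to look
# for, and $label is the type stored for code blocks.
sub get_blocks {
    my ($text, $open, $close, $label) = @_;
    my ($o, $c) = (quotemeta $open, quotemeta $close);
    my @blocks;
    # split with a capturing group returns the text between matches
    # interleaved with the captured code, so odd indexes are code.
    my @parts = split /$o(.*?)$c/s, $text;
    for my $i (0 .. $#parts) {
        next unless length $parts[$i];
        push @blocks, [ ($i % 2 ? $label : "HTM"), $parts[$i] ];
    }
    return @blocks;
}

my @php = get_blocks('<b><? echo 1; ?></b>', '<?', '?>', 'PHP');
my @asp = get_blocks('<b><% x = 1 %></b>',   '<%', '%>', 'ASP');
print "$_->[0]: $_->[1]\n" for @php, @asp;
```

An opening delimiter that is never closed is left inside the surrounding HTM block here, and a closing delimiter inside a string literal still ends the block early, just as in the simple versions.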


      King of Laziness, Wizard of Impatience, Lord of Hubris

        PHP is configurable to use many tags: <? <% <?php and <script language="php">

        I'm not belgian but I play one on TV.

Re: parsing an ASP file
by iburrell (Chaplain) on May 25, 2004 at 21:43 UTC
    The type element ('HTM' or 'ASP') is not required if you leave the ASP tags in the strings. The parser would effectively split the file into a list of chunks. It would be easy to tell ASP chunks apart because they start with '<%'.

    I have seen a regex-based XML parser that works this way. It breaks the XML into strings which can be identified by looking at the first couple of characters.
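A rough sketch of that chunking idea, assuming the whole file is already in a string: the helper name `split_keeping_tags` is hypothetical, and an ASP block that is never closed is assumed to run to the end of the input.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into chunks, leaving "<%" and "%>" in place.
sub split_keeping_tags {
    my ($text) = @_;
    # Either a whole ASP block (running to end of input if it is never
    # closed), or a run of characters in which no "<%" begins.
    return $text =~ /( <% .*? (?: %> | \z ) | (?: (?! <% ) . )+ )/gsx;
}

my $text   = "<p>hi</p><% x = 1 %><b>bye</b>";
my @chunks = split_keeping_tags($text);
for my $chunk (@chunks) {
    # The chunk type is recovered from its first two characters.
    my $type = index($chunk, '<%') == 0 ? 'ASP' : 'HTM';
    print "$type: $chunk\n";
}
```

Because nothing is removed, joining the chunks reproduces the input exactly, which makes the split easy to sanity-check.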
