http://qs321.pair.com?node_id=412384

BUU has asked for the wisdom of the Perl Monks concerning the following question:

Here's the situation. There's a programming language I've been using called JASS (for the Warcraft 3 map editor, if you care), and very few tools exist for it. So I want to create something that will syntax highlight it.

Good news: I have a grammar for the language in Extended Backus-Naur Form. Bad news: I can't figure out what to do with it! I played around with PRD for a while, but the grammar it takes certainly isn't EBNF, and I can't seem to convert the grammar I have into one it will recognize.

To reiterate, all I really want to do is syntax highlight the stupid language. Since I had the EBNF grammar, it seemed the easiest route to take, but now I'm not so sure.

The long grammar lists follow this point. You have been warned:
This is the EBNF form as it was given to me ( at least it claims it's EBNF ):
//----------------------------------------------------------------------
// Global Declarations
//----------------------------------------------------------------------
program         := file+
file            := newline? (declr newline)* func*
declr           := typedef | globals | native_func
typedef         := 'type' id 'extends' ('handle' | id)
globals         := 'globals' newline global_var_list 'endglobals'
global_var_list := ('constant' type id '=' expr newline | var_declr newline)*
native_func     := 'constant'? 'native' func_declr
func_declr      := id 'takes' ('nothing' | param_list) 'returns' (type | 'nothing')
param_list      := type id (',' type id)*
func            := 'constant'? 'function' func_declr newline local_var_list statement_list 'endfunction' newline

//----------------------------------------------------------------------
// Local Declarations
//----------------------------------------------------------------------
local_var_list  := ('local' var_declr newline)*
var_declr       := type id ('=' expr)? | type 'array' id

//----------------------------------------------------------------------
// Statements
//----------------------------------------------------------------------
statement_list  := (statement newline)*
statement       := set | call | ifthenelse | loop | exitwhen | return | debug
set             := 'set' id '=' expr | 'set' id '[' expr ']' '=' expr
call            := 'call' id '(' args? ')'
args            := expr (',' expr)*
ifthenelse      := 'if' expr 'then' newline statement_list else_clause? 'endif'
else_clause     := 'else' newline statement_list | 'elseif' expr 'then' newline statement_list else_clause?
loop            := 'loop' newline statement_list 'endloop'
exitwhen        := 'exitwhen' expr // must appear in a loop
return          := 'return' expr?
debug           := 'debug' (set | call | ifthenelse | loop)

//----------------------------------------------------------------------
// Expressions
//----------------------------------------------------------------------
expr            := binary_op | unary_op | func_call | array_ref | func_ref | id | const | parens
binary_op       := expr ([+-*/><]|'=='|'!='|'>='|'<='|'and'|'or') expr
unary_op        := ('+'|'-'|'not') expr // expr must be integer or real when used with unary '+'
func_call       := id '(' args? ')'
array_ref       := id '[' expr ']'
func_ref        := 'function' id
const           := int_const | real_const | bool_const | string_const | 'null'
int_const       := decimal | octal | hex | fourcc
decimal         := [1-9][0-9]*
octal           := '0'[0-7]*
hex             := '$'[0-9a-fA-F]+ | '0'[xX][0-9a-fA-F]+
fourcc          := ''' .{4} '''
real_const      := [0-9]+'.'[0-9]* | '.'[0-9]+
bool_const      := 'true' | 'false'
string_const    := '"' .* '"' // any double-quotes in the string must be escaped with \
parens          := '(' expr ')'

//----------------------------------------------------------------------
// Base RegEx
//----------------------------------------------------------------------
type            := id | 'code' | 'handle' | 'integer' | 'real' | 'boolean' | 'string'
id              := [a-zA-Z]([a-zA-Z0-9_]* [a-zA-Z0-9])?
newline         := '\n'+


That's nice and lovely, isn't it? Too bad I can't seem to figure out how to use it.

I tried to munge it so PRD would take it, and this is what I came up with:
program         : file(s)
file            : newline(?) declr_newline(s?) func(s?)
declr_newline   : declr newline
declr           : typedef | globals | native_func
typedef         : 'type' id 'extends' ('handle' | id)
globals         : 'globals' newline global_var_list 'endglobals'
global_var_list : tmp_g_v_l(s)
tmp_g_v_l       : 'constant' type id '=' expr newline | var_declr newline
native_func     : constant(?) 'native' func_declr
func_declr      : id 'takes' ('nothing' | param_list) 'returns' (type | 'nothing')
param_list      : type id tmp_p_l(s)
tmp_p_l         : ',' type id
func            : 'constant'(?) 'function' func_declr newline local_var_list statement_list 'endfunction' newline

local_var_list  : tmp_l_v_n(s)
tmp_l_v_n       : 'local' var_declr newline
var_declr       : type id tmp_e_e(?) | type 'array' id
tmp_e_e         : '=' expr

statement_list  : tmp_stm_nl(s)
tmp_stm_nl      : statement newline
statement       : set | call | ifthenelse | loop | exitwhen | return | debug
set             : 'set' id '=' expr | 'set' id '[' expr ']' '=' expr
call            : 'call' id '(' args? ')'
args            : expr (',' expr)(s)
ifthenelse      : 'if' expr 'then' newline statement_list else_clause? 'endif'
else_clause     : 'else' newline statement_list | 'elseif' expr 'then' newline statement_list else_clause?
loop            : 'loop' newline statement_list 'endloop'
exitwhen        : 'exitwhen' expr
return          : 'return' expr?
debug           : 'debug' (set | call | ifthenelse | loop)

expr            : binary_op | unary_op | func_call | array_ref | func_ref | id | const | parens
binary_op       : expr (/[+-*/><]/|'=='|'!='|'>='|'<='|'and'|'or') expr
unary_op        : ('+'|'-'|'not') expr
func_call       : id '(' args(?) ')'
array_ref       : id '[' expr ']'
func_ref        : 'function' id
const           : int_const | real_const | bool_const | string_const | 'null'
int_const       : decimal | octal | hex | fourcc
decimal         : /[1-9][0-9]*/
octal           : /0[0-7]*/
hex             : /\$[0-9a-fA-F]+/ | /0[xX][0-9a-fA-F]+/
fourcc          : "'" /.{4}/ "'"
real_const      : /[0-9]+\.[0-9]*/ | /\.[0-9]+/
bool_const      : 'true' | 'false'
string_const    : /".*"/
parens          : '(' expr ')'

type            : id | 'code' | 'handle' | 'integer' | 'real' | 'boolean' | 'string'
id              : /[a-zA-Z]([a-zA-Z0-9_]* [a-zA-Z0-9])?/
newline         : /\n+/

However when I attempt to create a PRD object using this grammar, the new method returns undef and no error messages are set anywhere I can find. It just prints out about 300 semi-colons.

Does anyone see a good way forward?

Replies are listed 'Best First'.
Re: Syntax highlighting EBNF grammar language
by dimar (Curate) on Dec 04, 2004 at 19:58 UTC

    Reading between the lines, it seems like there are 2 separate and distinct issues for you to address here:

    1) How do I familiarize myself with PRD ?

    2) How do I find a quick and painless way to work with JASS?

    These are separate and distinct because the answers are not necessarily mutually compatible. Getting closer to one may take you further from the other.

    A quick Google search reveals that PRD may be *overkill* as far as answering 2). Nevertheless it is a good tool to have in your 'toolbelt', so hats off for you trying to use it. It's just that there may be way easier paths if all you *really* want to do is answer 2).

    eg ... 3 verbatim google searches:

    JASS warcraft 3 "syntax highlighting"
    warcraft "jass IDE"
    warcraft "jass Editor"
    ... each one of these pulls up some links, each one appears to have directly relevant answers ... and you appear to have an internet connection, so the ball is in your court ;-)

Re: Syntax highlighting EBNF grammar language
by ikegami (Patriarch) on Dec 05, 2004 at 08:27 UTC

    I found a few problems when I read through your grammar quickly.

    1) By default, anything matching /\s+/ between tokens is ignored. That includes newlines, yet one of your tokens is a newline. I don't think that's going to work. You have to use <skip>.

    2) Are you using $::RD_AUTOACTION or <autotree>? You won't get much from the grammar if you don't use either of these, or actions ({ ... }) to return selected tokens at the end of every rule. See 410587 for an example which uses actions to return selected tokens.

    3) It would probably be better if you defined program as file(s) /^\Z/ rather than just file(s).

    4) You defined fourcc as "'" /.{4}/ "'", which is the same as /'\s*.{4}\s*'/ (with the default <skip>). I think you want fourcc: /'.{4}'/.

    5) const must be above id in expr, or else true and false will be considered ids instead of bool_consts.

    6) type : id | 'code' | 'handle' | 'integer' | 'real' | 'boolean' | 'string'
    can be simplified to
    type : id

    7) I don't know if PRD can handle your expr and binary_op. Fix:

    expr      : binary_op(s?) term
    binary_op : term (/[+-*/><]/|'=='|'!='|'>='|'<='|'and'|'or')
    term      : unary_op    # Must be above id, array_ref & func_ref
              | func_call   # Must be above id, array_ref & func_ref
              | const       # Must be above id, array_ref & func_ref
              | array_ref   # Must be above id
              | func_ref    # Must be above id
              | id
              | parens
    unary_op  : ('+'|'-'|'not') term

    8) Actually, unary_op is probably slightly more efficient when written as:

    unary_op : '+' term
             | '-' term
             | 'not' term

    9) All your binary operators have the same precedence. How to fix:

    expr        : binary_op(s?) term      # Lowest precedence.
    binary_op   : binary_op_2 /and|or/
    ...
    binary_op_8 : binary_op_9 /[+-]/
    binary_op_9 : term /[*/]/             # Highest precedence.
    term        : ...

    10) What you called "const" are really literals. Literals are constant, but constants are not necessarily literals.

    11) Your definition of id has a space in it, and it shouldn't. Also concerning the definition of id, you should use (?:...) instead of (...). The former is faster, and you don't need to capture. Result: /[a-zA-Z](?:[a-zA-Z0-9_]*[a-zA-Z0-9])?/

    12) Your definition of string_const is way too greedy. It'll match up to the last double-quote in the file. You didn't include the escape mechanism. Finally, you shouldn't allow newlines in it. Try string_const : /"(?:[^\\"\n]|\\[^\n])*"/
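
    The regex points above (4, 11 and 12) can be checked outside Parse::RecDescent with plain Perl. The following is only a small sketch: the variable names and the helper first_match are made up for illustration, and the test strings are invented, not taken from real JASS code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patterns quoted from the thread; the variable names are hypothetical.
my $string_old = qr/".*"/;                                   # original string_const
my $string_new = qr/"(?:[^\\"\n]|\\[^\n])*"/;                # suggested string_const
my $id_new     = qr/[a-zA-Z](?:[a-zA-Z0-9_]*[a-zA-Z0-9])?/;  # id without the stray space
my $fourcc_new = qr/'.{4}'/;                                 # fourcc as a single token

# Hypothetical helper: return the first match of $re in $text, or undef.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ /($re)/ ? $1 : undef;
}

# Two string literals on one line, the first containing an escaped quote.
my $line = 'call Foo("a\"b", "c")';

# The greedy pattern runs on to the last double-quote on the line.
print "old string: ", first_match($string_old, $line), "\n";

# The fixed pattern stops at the first unescaped double-quote.
print "new string: ", first_match($string_new, $line), "\n";

print "id:         ", first_match(qr/^$id_new/, 'udg_max_units = 3'), "\n";
print "fourcc:     ", first_match($fourcc_new, q{set x = 'hfoo'}), "\n";
```

    The old pattern swallows both literals and the comma between them, while the suggested one returns only the first literal, escaped quote included.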

Re: Syntax highlighting EBNF grammar language
by ikegami (Patriarch) on Dec 10, 2004 at 01:12 UTC

    If what you want to do is syntax highlighting, you don't need a parser, just a tokenizer. It might make small mistakes (e.g. if you have a variable name that's identical to a keyword), but they should be tolerable.
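
    To illustrate the tokenizer-only idea without any parser module, here is a rough sketch using plain regexes anchored with \G. It is not the implementation discussed in this thread: the function name tokenize_line and the token type names are assumptions made up for the example, and the number pattern is a simplification. The keyword list is the one implied by the grammar in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keywords taken from the quoted terminals in the grammar above.
my %is_keyword = map { $_ => 1 } qw(
    array call constant debug else elseif endfunction endglobals endif
    endloop exitwhen extends function globals if local loop native
    nothing return returns set takes then type
);

# Hypothetical helper: split one line of JASS into [type, text] pairs.
sub tokenize_line {
    my ($line) = @_;
    my @tokens;
    pos($line) = 0;
    while (pos($line) < length $line) {
        # /gc keeps pos() where it was when a pattern fails, so each
        # branch gets a try at the same position.
        if ($line =~ /\G(\s+)/gc) {
            push @tokens, [ ws => $1 ];
        }
        elsif ($line =~ m{\G(//[^\n]*)}gc) {
            push @tokens, [ comment => $1 ];
        }
        elsif ($line =~ /\G("(?:[^\\"\n]|\\[^\n])*")/gc) {
            push @tokens, [ string => $1 ];
        }
        elsif ($line =~ /\G([a-zA-Z](?:[a-zA-Z0-9_]*[a-zA-Z0-9])?)/gc) {
            push @tokens, [ ($is_keyword{$1} ? 'keyword' : 'ident') => $1 ];
        }
        elsif ($line =~ /\G([0-9]+\.[0-9]*|\.[0-9]+|0[xX][0-9a-fA-F]+|\$[0-9a-fA-F]+|[0-9]+)/gc) {
            push @tokens, [ number => $1 ];
        }
        elsif ($line =~ /\G(.)/gcs) {
            # Anything else (operators, brackets, stray characters).
            push @tokens, [ other => $1 ];
        }
    }
    return @tokens;
}

for my $tok (tokenize_line('set i = i + 1 // bump')) {
    my ($type, $text) = @$tok;
    next if $type eq 'ws';
    printf "%-8s %s\n", $type, $text;
}
```

    Because it works one line at a time, a loop over a filehandle would never need the whole file in memory. As noted above, a variable that happens to be spelled like a keyword would be coloured as a keyword.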

    What follows is a tokenizer written in Parse::RecDescent. That's inefficient because the whole input file must be in memory, and the whole file is processed before anything is output.

    make_tokenizer.pl -- Run once to create JassTokenizer.pm

    #!/usr/bin/perl

    # To debug:
    # perl -s make_tokenizer.pl -RD_HINT -RD_TRACE > make_tokenizer.out 2>&1

    use strict;
    use warnings;

    use Parse::RecDescent ();

    my $grammar = <<'__EOI__';

       {
          my %keywords = (
             ( map { $_ => 'keyword' } qw(
                array
                call
                constant
                debug
                else
                elseif
                endfunction
                endglobals
                endif
                endloop
                exitwhen
                extends
                function
                globals
                if
                local
                loop
                native
                nothing
                return
                returns
                set
                takes
                then
                type
             ) ),
             and   => 'operator',
             or    => 'operator',
             not   => 'operator',
             null  => 'null',
             false => 'bool',
             true  => 'bool',
          );
       }

       tokenize : <skip: ''> token(s?) { $item[2] }

       token    : ident    { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
       #        | keyword  { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | ws       { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | comment  { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | list_sep { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | paren    { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | bracket  { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | operator { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | assign   { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | string   { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | real     { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | decimal  { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | hex      { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | octal    { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | packed   { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
       #        | bool     { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
       #        | null     { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }
                | unknown  { [ $item[1], $itempos[1]{offset}{from}, $itempos[1]{offset}{to} ] }

       ident    : /[a-zA-Z](?:[a-zA-Z0-9_]*[a-zA-Z0-9])?/ { ($keywords{$item[1]} || 'ident') }
       ws       : m'\s+' { $item[0] }
       comment  : m'//[^\n]*' { $item[0] }
       list_sep : ',' { $item[0] }
       paren    : m'[()]' { $item[0] }
       bracket  : m'[\[\]]' { $item[0] }
       operator : m'[-+*/]|[=!]=|>=?|<=?' { $item[0] }
       assign   : '=' { $item[0] }
       #string  : /"(?:[^\\"\n]|\\[^\n])*"/ { $item[0] }
       string   : /"(?:[^\\"\n]|\\[^\n])*(?:"|(?=\n))/ { $item[0] }
       real     : /[0-9]+\.[0-9]*/ { $item[0] }
                | /\.[0-9]+/ { $item[0] }
       decimal  : /[1-9][0-9]*/ { $item[0] }
       #hex     : /\$[0-9a-fA-F]+/ { $item[0] }
       #        | /0[xX][0-9a-fA-F]+/ { $item[0] }
       hex      : /\$[0-9a-fA-F]*/ { $item[0] }
                | /0[xX][0-9a-fA-F]*/ { $item[0] }
       octal    : /0[0-7]*/ { $item[0] }
       #packed  : /'[^'\n]{4}'/ { $item[0] }
       packed   : /'[^'\n]{0,4}'?/ { $item[0] }
       unknown  : /./ { $item[0] }

    __EOI__

    rename('JassTokenizer.pm', 'JassTokenizer.pm.bak');

    Parse::RecDescent->Precompile($grammar, 'JassTokenizer')
       or die("Bad grammar.\n");

    script.pl

    #!/usr/bin/perl

    use strict;
    use warnings;

    use JassTokenizer ();

    # To debug, until proper <error> and <error: ...> directives are added to the grammar:
    # perl -s script.pl -RD_HINT -RD_TRACE > script.out 2>&1

    my $parser = JassTokenizer->new();

    my $sample;
    foreach (<<'__EOI__', <<'__EOI__')
    function AsciiCharToInteger takes string char returns integer
        local string charMap = " !\"#$%%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
        local string u = SubString(char, 0, 1)
        local string c
        local integer i = 0
        loop
            set c = SubString(charMap, i, i + 1)
            exitwhen c == ""
            if c == u then
                return i + 32
            endif
            set i = i + 1
        endloop
        return 0
    endfunction

    function IdStringToIdInteger takes string value returns integer
        return AsciiCharToInteger(SubString(value, 0, 1)) * 0x1000000 + AsciiCharToInteger(SubString(value, 1, 2)) * 0x10000 + AsciiCharToInteger(SubString(value, 2, 3)) * 0x100 + AsciiCharToInteger(SubString(value, 3, 4))
    endfunction

    function makeAdvancedUnit takes player who, string id, location where, real angle, string life, string mana, string abil returns unit
        local integer unitid = S2I(id)
        local integer spellmnt = StringLength(abil)/4
        local unit u = null
        local integer i = 0
        if unitid == 0 then
            set unitid = IdStringToIdInteger(id)
        endif
        set u = CreateUnit(who, unitid, GetLocationX(where), GetLocationY(where), angle)
        loop
            exitwhen i>=spellmnt
            call UnitAddAbility(u, IdStringToIdInteger(SubString(abil,i*4,(i+1)*4)) )
            set i = i + 1
        endloop
        if StringCase(SubString(life,StringLength(life)-1,StringLength(life) ),false) == "p" then
            call SetUnitLifePercentBJ(u,S2R(SubString(life,0,StringLength(life)) ))
        else
            call SetUnitLifeBJ(u, S2R(life) )
        endif
        if StringCase(SubString(mana,StringLength(mana)-1,StringLength(mana) ),false) == "p" then
            call SetUnitManaPercentBJ(u,S2R(SubString(mana,0,StringLength(mana)) ))
        else
            call SetUnitManaBJ(u, S2R(mana) )
        endif
        return u
    endfunction
    __EOI__
    function Trig_respawn_Condition takes nothing returns boolean
        return true
    endfunction

    function Trig_respawn_Actions takes nothing returns nothing
        local location respawn_point
        local integer respawn_unit
        local unit u
        local integer i = 0
        call DisplayTextToForce( GetPlayersAll(), "TRIGSTR_013" )
        loop
            exitwhen i > udg_max_units
            if ( GetTriggerUnit() == udg_all_monsters[i] ) then
                set respawn_unit = GetUnitTypeId(GetTriggerUnit())
                set respawn_point = Location( udg_unit_pos_x[i], udg_unit_pos_y[i] )
                call DisplayTextToForce( GetPlayersAll(), "Unit: " )
                call DisplayTextToForce( GetPlayersAll(), I2S( i ) )
                call TriggerSleepAction( 5.00 )
                set u = CreateUnitAtLoc( GetOwningPlayer( GetTriggerUnit() ), respawn_unit, respawn_point, bj_UNIT_FACING )
                set udg_all_monsters[i] = u
            else
            endif
            set i = i + 1
        endloop
    endfunction

    //===========================================================================
    function InitTrig_respawn takes nothing returns nothing
        set gg_trg_respawn = CreateTrigger( )
        call TriggerRegisterAnyUnitEventBJ( gg_trg_respawn, EVENT_PLAYER_UNIT_DEATH )
        call TriggerAddCondition( gg_trg_respawn, Condition( function Trig_respawn_Condition ) )
        call TriggerAddAction( gg_trg_respawn, function Trig_respawn_Actions )
    endfunction
    __EOI__
    {
       printf("Sample %d.\n", ++$sample);

       my $list = $parser->tokenize($_)
          or do {
             print("Bad text.\n\n\n");
             next;
          };

       printf("%-8s from %d to %d.\n", @$_)
          foreach (@$list);

       print("\n\n");
    }

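    The script above only prints the token list. One possible next step, sketched here under assumptions, is to turn the [type, from, to] triples into HTML. The function names highlight and escape_html and the jass-* class names are invented for this example, the token list is written by hand rather than produced by JassTokenizer, and the code assumes that "to" is the inclusive offset of a token's last character, which is worth verifying against the Parse::RecDescent documentation for $itempos.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: wrap each token in a <span>, leaving whitespace
# and plain identifiers bare. Assumes $to is an inclusive offset.
sub highlight {
    my ($source, $tokens) = @_;
    my $html = '';
    for my $tok (@$tokens) {
        my ($type, $from, $to) = @$tok;
        my $text = escape_html(substr($source, $from, $to - $from + 1));
        $html .= ($type eq 'ws' || $type eq 'ident')
            ? $text
            : qq{<span class="jass-$type">$text</span>};
    }
    return $html;
}

# Hand-written token list in the same [type, from, to] shape.
my $source = 'exitwhen i>=n';
my $tokens = [
    [ keyword  => 0,  7  ],
    [ ws       => 8,  8  ],
    [ ident    => 9,  9  ],
    [ operator => 10, 11 ],
    [ ident    => 12, 12 ],
];

print highlight($source, $tokens), "\n";
```

    Escaping happens per token, after the substring is cut out, so the offsets always refer to the untouched source text.
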
Re: Syntax highlighting EBNF grammar language
by BrowserUk (Patriarch) on Dec 05, 2004 at 10:02 UTC

    Do you have a pointer to a piece of moderately complex sample JASS code?


      Sure, here are a couple of samples:

      Sample 1

      function AsciiCharToInteger takes string char returns integer
          local string charMap = " !\"#$%%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
          local string u = SubString(char, 0, 1)
          local string c
          local integer i = 0
          loop
              set c = SubString(charMap, i, i + 1)
              exitwhen c == ""
              if c == u then
                  return i + 32
              endif
              set i = i + 1
          endloop
          return 0
      endfunction

      function IdStringToIdInteger takes string value returns integer
          return AsciiCharToInteger(SubString(value, 0, 1)) * 0x1000000 + AsciiCharToInteger(SubString(value, 1, 2)) * 0x10000 + AsciiCharToInteger(SubString(value, 2, 3)) * 0x100 + AsciiCharToInteger(SubString(value, 3, 4))
      endfunction

      function makeAdvancedUnit takes player who, string id, location where, real angle, string life, string mana, string abil returns unit
          local integer unitid = S2I(id)
          local integer spellmnt = StringLength(abil)/4
          local unit u = null
          local integer i = 0
          if unitid == 0 then
              set unitid = IdStringToIdInteger(id)
          endif
          set u = CreateUnit(who, unitid, GetLocationX(where), GetLocationY(where), angle)
          loop
              exitwhen i>=spellmnt
              call UnitAddAbility(u, IdStringToIdInteger(SubString(abil,i*4,(i+1)*4)) )
              set i = i + 1
          endloop
          if StringCase(SubString(life,StringLength(life)-1,StringLength(life) ),false) == "p" then
              call SetUnitLifePercentBJ(u,S2R(SubString(life,0,String Length(life)) ))
          else
              call SetUnitLifeBJ(u, S2R(life) )
          endif
          if StringCase(SubString(mana,StringLength(mana)-1,StringLength(mana) ),false) == "p" then
              call SetUnitManaPercentBJ(u,S2R(SubString(mana,0,String Length(mana)) ))
          else
              call SetUnitManaBJ(u, S2R(mana) )
          endif
          return u
      endfunction

      Sample 2

      function Trig_respawn_Condition takes nothing returns boolean
          return true
      endfunction

      function Trig_respawn_Actions takes nothing returns nothing
          local location respawn_point
          local integer respawn_unit
          local unit u
          local integer i = 0
          call DisplayTextToForce( GetPlayersAll(), "TRIGSTR_013" )
          loop
              exitwhen i > udg_max_units
              if ( GetTriggerUnit() == udg_all_monsters[i] ) then
                  set respawn_unit = GetUnitTypeId(GetTriggerUnit())
                  set respawn_point = Location( udg_unit_pos_x[i], udg_unit_pos_y[i] )
                  call DisplayTextToForce( GetPlayersAll(), "Unit: " )
                  call DisplayTextToForce( GetPlayersAll(), I2S( i ) )
                  call TriggerSleepAction( 5.00 )
                  set u = CreateUnitAtLoc( GetOwningPlayer( GetTriggerUnit() ), respawn_unit, respawn_point, bj_UNIT_FACING )
                  set udg_all_monsters[i] = u
              else
              endif
              set i = i + 1
          endloop
      endfunction

      //===========================================================================
      function InitTrig_respawn takes nothing returns nothing
          set gg_trg_respawn = CreateTrigger( )
          call TriggerRegisterAnyUnitEventBJ( gg_trg_respawn, EVENT_PLAYER_UNIT_DEATH )
          call TriggerAddCondition( gg_trg_respawn, Condition( function Trig_respawn_Condition ) )
          call TriggerAddAction( gg_trg_respawn, function Trig_respawn_Actions )
      endfunction

        The first sample you provided isn't valid. In two places, "StringLength" was written as "String  Length".

Re: Syntax highlighting EBNF grammar language
by Courage (Parson) on Dec 04, 2004 at 13:30 UTC
    What's the Perl in it?
        You're right,
        I noticed "PRD", but I think the poster should better explain the Perl connection.