http://qs321.pair.com?node_id=11115178

perlancar has asked for the wisdom of the Perl Monks concerning the following question:

Is there something on CPAN or elsewhere that can parse the contents of https://melpa.org/packages/archive-contents into a reasonable Perl representation? I think Data::SExpression chokes on the bracket characters, and that module seems to be pretty much all CPAN has to offer for parsing S-expressions. I guess whipping up a new parser is not hard...

Re: Parsing Emacs Lisp sexpr?
by choroba (Cardinal) on Apr 07, 2020 at 22:12 UTC
    > I guess whipping up a new parser is not hard...

    Not as easy as it seems. I probably started from the wrong end: the logic in the actions collapse and tuple should be handled by the grammar itself. But hey, it's slow, but it works for the input.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Nice work! I wonder why you opt to parse this format specifically instead of the generic lisp format though.

      As for the speed, it's actually roughly on par with Data::SExpression, which uses Parse::Yapp. I commented out the dumping and then:

      % time perl 11115197.pl archive-contents
      
      real    0m7.449s
      user    0m7.036s
      sys     0m0.413s
      
      % time perl -MFile::Slurper=read_text -MData::SExpression -E'$ds=Data::SExpression->new; ($sexp, $text) = $ds->read(read_text "archive-contents.2");'
      real    0m5.411s
      user    0m5.386s
      sys     0m0.025s
      

      archive-contents.2 is just the original file with [ ] replaced by ( ), and the problematic @ atom replaced by "@".
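The substitution described above can be sketched in a few lines of Perl. This is only an illustration, not the command that was actually used to produce archive-contents.2: the helper name `lispify` is hypothetical, and it assumes the input contains no Emacs character literals such as `?"` that would confuse the string matching. It rewrites brackets and the bare `@` atom only outside double-quoted strings, so e-mail addresses and brackets inside summaries are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: turn vector brackets into parens and quote the bare
# @ atom, touching only the text outside of double-quoted strings.
sub lispify {
    my ($text) = @_;

    # The capturing group makes split() return the string literals too, so
    # @chunks alternates between non-string text and "..." literals.
    my @chunks = split /("(?:[^"\\]|\\.)*")/s, $text;

    for my $chunk (@chunks) {
        next if $chunk =~ /\A"/;    # leave string literals untouched
        $chunk =~ tr/[]/()/;        # [ -> ( and ] -> )
        # Quote an @ that stands alone as an atom, i.e. is bounded by
        # whitespace, a paren, or the edge of the chunk.
        $chunk =~ s/(?<![^\s()])\@(?![^\s()])/"\@"/g;
    }
    return join '', @chunks;
}
```

A call such as `lispify(do { local $/; <STDIN> })` would then produce text that a parser without vector support can read, at the cost of losing the distinction between lists and vectors.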

      Perl regex or Regexp::Grammars will probably be several times faster.

        > I wonder why you opt to parse this format specifically instead of the generic lisp format though.

        As I said, I started from the wrong end. I'm kind of busy working from home and staying there with a wife and three children, so I didn't have time to fix it immediately. Here's a much simpler and faster version, which parses melpa's archive-contents in less than 5 seconds on my machine:

        #! /usr/bin/perl
        use warnings;
        use strict;

        use Marpa::R2;

        my $dsl = << '__DSL__';

        :default ::= action => ::first
        lexeme default = latm => 1

        List     ::= ('(') Elements (')')
        Elements ::= Element+                action => [values]
        Element  ::= List | Vector | Atom | String | Pair
        Vector   ::= ('[') Elements (']')
        Atom     ::= identifier
        String   ::= ('"') Quoteds ('"')
        Quoteds  ::= Quoteds Quoted          action => concat
                  |  Quoted
        Quoted   ::= backslash || qq || plain
        Pair     ::= Element (dot) Element   action => pair

        :discard   ~ whitespace
        whitespace ~ [\s]+
        dot        ~ '.'
        backslash  ~ '\\'
        qq         ~ '\"'
        identifier ~ [-\w@:+]+
        plain      ~ [^\\"]+

        __DSL__

        sub concat { $_[1] . $_[2] }
        sub pair { +{ $_[1] => $_[2] } }

        my $grammar = 'Marpa::R2::Scanless::G'->new({source => \$dsl});
        my $lisp = do { local $/; <> };
        my $value_ref = $grammar->parse(\$lisp, {semantics_package => 'main'});

        use Data::Dumper; print Dumper $value_ref;

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        And here's my stab at creating a Marpa-based parser, based on JSON::Decode::Marpa: https://github.com/perlancar/perl-SExpression-Decode-Marpa/. It's unfinished (its number and string rules, particularly, are still not adjusted), but can already parse the original archive-contents file, a bit faster than Data::SExpression:

        % time perl -Ilib -MSExpression::Decode::Marpa=from_sexp -MFile::Slurper=read_text -E'from_sexp(read_text "archive-contents")'
        
        real    0m4.023s
        user    0m3.818s
        sys     0m0.204s
        
        Anyhow, I tried hacking a regex-based parser here. It's "working", with some problems: 1) a segmentation fault for larger data, indicating a leak somewhere; 2) a parsing failure when the NUMBER rule fails to match and ATOM should match instead. For example, the sexp (1a) fails to parse, although (1) and (a) succeed.
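One way to sidestep the NUMBER-versus-ATOM ambiguity is to tokenize first and classify each bare word only after the whole token has been read. The sketch below is not the regex-based parser referred to above; it is a small hand-rolled tokenizer plus recursive descent, and the names `tokenize`, `parse_tokens` and `parse_sexp` are made up for illustration. It makes several simplifying assumptions: lists and vectors both become array references, the dot of a dotted pair is kept as a literal `.` element, strings and symbols both come back as plain Perl strings, string escapes are not decoded, and comments and quote syntax are not handled.

```perl
use strict;
use warnings;

# Split the text into [type, value] tokens: delimiters, strings, bare words.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (1) {
        $text =~ /\G\s+/gc;                       # skip whitespace
        last if pos($text) >= length $text;
        if    ($text =~ /\G([()\[\]])/gc)           { push @tokens, [delim  => $1] }
        elsif ($text =~ /\G"((?:[^"\\]|\\.)*)"/gcs) { push @tokens, [string => $1] }
        elsif ($text =~ /\G([^\s()\[\]"]+)/gc)      { push @tokens, [word   => $1] }
        else  { die "Unexpected character at offset " . pos($text) . "\n" }
    }
    return \@tokens;
}

# Consume one expression from the front of the token list.
sub parse_tokens {
    my ($tokens) = @_;
    my $token = shift @$tokens;
    die "Unexpected end of input\n" unless $token;
    my ($type, $value) = @$token;

    if ($type eq 'delim') {
        die "Unexpected '$value'\n" if $value eq ')' || $value eq ']';
        my $close = $value eq '(' ? ')' : ']';
        my @items;
        while (1) {
            die "Missing '$close'\n" unless @$tokens;
            if ($tokens->[0][0] eq 'delim' && $tokens->[0][1] eq $close) {
                shift @$tokens;
                last;
            }
            push @items, parse_tokens($tokens);
        }
        return \@items;
    }

    return $value if $type eq 'string';           # escapes left undecoded

    # A bare word is a number only if the WHOLE token looks like an integer;
    # anything else, such as "1a", is kept as an atom.
    return $value =~ /\A[-+]?\d+\z/ ? 0 + $value : $value;
}

# Parse exactly one expression and insist that nothing follows it.
sub parse_sexp {
    my ($text) = @_;
    my $tokens = tokenize($text);
    my $value  = parse_tokens($tokens);
    die "Trailing input after expression\n" if @$tokens;
    return $value;
}
```

With this approach `parse_sexp('(1a)')` returns a one-element array holding the atom `1a`, because the classification happens after the token boundary is known rather than by trying NUMBER first on a prefix of the input. Since the recursion is in Perl rather than inside the regex engine, nesting depth is also not limited by regex recursion.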
Re: Parsing Emacs Lisp sexpr?
by Fletch (Bishop) on Apr 07, 2020 at 19:55 UTC

    That's a Lisp vector representation which (e.g.) Emacs Lisp and Clojure use; I believe CL uses a #(1 2 3) reader format instead, though, so that's probably why Data::SExpression doesn't grok it (then again, I don't know offhand if that module'd like the #(1 2 3) notation either . . .)

    Considering that file's basically a serialized list of package-desc instances, I'd probably agree with the suggestion to manipulate it from Lisp instead. If you expounded on what you're trying to do with it you might prompt better suggestions, but so long as you've called package-initialize and Emacs has populated package-archive-contents, you can monkey with things like this. Theoretically you could build up the info you wanted to export and then use json-encode to write out a more convenient representation.

    (require 'cl) ;; Newer emacsen don't need this
    (cl-loop for (pkg-sym pkg) in package-archive-contents
             collect (list pkg-sym
                           (package-desc-archive pkg)
                           (package-desc-version pkg)))

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Aside from the vector syntax, Data::SExpression also currently cannot handle the unquoted @ symbol:

      (@ . [(20181225 1438)
            ((emacs (24 3)))
            "multiple-inheritance prototype-based objects DSL"
            tar
            ((:commit . "0a6189f8be42dbbc5d9358cbd447d471236135a2")
             (:authors ("Christopher Wellons" . "wellons@nullprogram.com"))
             (:maintainer "Christopher Wellons" . "wellons@nullprogram.com")
             (:url . "https://github.com/skeeto/at-el"))])

      By replacing [ ] with ( ) and @ with "@", the data parses.

      2020-04-09 Athanasius fixed formatting of over-long code line.

      Nothing much for now; I just want to see it in tabular ASCII format, which M-x list-packages sort-of already gives me.

      But it would be nice if I can also convert to JSON and use jq or other CLI tools on the data.
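For the JSON side, the core module JSON::PP is enough once the data is in Perl. The sketch below assumes a hypothetical record shape of `[name, [version parts], summary]` (roughly what a parser might hand back for one entry) and a made-up helper name `package_to_json`; it emits one JSON object per line, which is a convenient input format for jq.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical record shape: [name, [version parts], summary].
sub package_to_json {
    my ($entry) = @_;
    my ($name, $version, $summary) = @$entry;

    # canonical() sorts the keys, so the output is stable between runs.
    return JSON::PP->new->canonical->encode({
        name    => $name,
        version => join('.', @$version),
        summary => $summary,
    });
}

# Sample values taken from the @ package entry quoted in this thread.
print package_to_json(
    ['@', [20181225, 1438], 'multiple-inheritance prototype-based objects DSL']
), "\n";
```

Piping the output through something like `jq -r '[.name, .version] | @tsv'` would then give a simple tabular listing.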

Re: Parsing Emacs Lisp sexpr?
by LanX (Saint) on Apr 07, 2020 at 18:41 UTC
    Did you consider using Lisp to parse Lisp and to produce nested Perl arrays?

    Should be easier.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      No, I didn't. Yes, that's one way to do it and the nice thing about it is that I'll be forced to learn some Emacs Lisp in the process :-)