PerlMonks  

Re: Looking for a module that strips an HTML tag and its associated 'TEXT'

by perlfan (Vicar)
on Jul 29, 2020 at 14:56 UTC ( [id://11119988]=note )


in reply to Looking for a module that strips an HTML tag and its associated 'TEXT'

See Bruce Gray's "Refactoring and Readability: Crouching Regex, Hidden Structures" for a nice intro to pure Perl/Raku options.

I do not know if there is a CPAN module that interfaces with these external tools, but you may be interested in hxpipe, one of the HTML-XML-utils that the W3C provides. For example:

hxpipe (1)           - convert XML to a format easier to parse with Perl or AWK
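To make the "easier to parse with Perl" part concrete, here is a minimal sketch of filtering hxpipe-style output so that one element and its text disappear. It assumes the line-oriented format the man page describes: "(name" opens an element, ")name" closes it, "-text" is character data, and "Aname ..." lines carry the attributes of the element that opens next. The sub name prune_pipe_lines is made up for this illustration, and the lines are expected to be chomped already.

```perl
use strict;
use warnings;

# Drop every <$tag> element, and everything inside it, from a list of
# hxpipe-style lines. Assumed format (see the hxpipe man page):
#   (name      start of element
#   )name      end of element
#   -text      character data
#   Aname ...  attribute of the element that starts next
sub prune_pipe_lines {
    my ($tag, @lines) = @_;
    my (@out, @pending_attrs);
    my $depth = 0;    # > 0 while we are inside an element being removed

    for my $line (@lines) {
        if ($line =~ /^A/) {        # attribute line: hold until we know its owner
            push @pending_attrs, $line;
            next;
        }
        if ($line eq "($tag") {     # element to remove: discard it and its attributes
            $depth++;
            @pending_attrs = ();
            next;
        }
        if ($depth > 0) {           # inside a removed element: discard
            $depth-- if $line eq ")$tag";
            @pending_attrs = ();
            next;
        }
        push @out, @pending_attrs, $line;   # anything else passes through
        @pending_attrs = ();
    }
    return @out;
}
```

Wrapped in a small script (call it prune.pl, a hypothetical name) that reads STDIN and prints the surviving lines, this could sit in a pipeline along the lines of hxnormalize -x page.html | hxpipe | perl prune.pl | hxunpipe.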
Here's the full list, which includes tools that support selecting elements.
cexport (1)          - create headerfile of exported declarations from a C file
hxaddid (1)          - add ID's to selected elements
hxcite (1)           - replace bibliographic references by hyperlinks
hxcite-mkbib (1)     - expand references and create bibliography
hxcopy (1)           - copy an HTML file while preserving relative links
hxcount (1)          - count elements and attributes in HTML or XML files
hxextract (1)        - extract selected elements
hxclean (1)          - apply heuristics to correct an HTML file
hxprune (1)          - remove marked elements from an HTML file
hxincl (1)           - expand included HTML or XML files
hxindex (1)          - create an alphabetically sorted index
hxmkbib (1)          - create bibliography from a template
hxmultitoc (1)       - create a table of contents for a set of HTML files
hxname2id (1)        - move some ID= or NAME= from A elements to their parents
hxnormalize (1)      - pretty-print an HTML file
hxnum (1)            - number section headings in an HTML file
hxpipe (1)           - convert XML to a format easier to parse with Perl or AWK
hxprintlinks (1)     - number links & add table of URLs at end of an HTML file
hxremove (1)         - remove selected elements from an XML file
hxtabletrans (1)     - transpose an HTML or XHTML table
hxtoc (1)            - insert a table of contents in an HTML file
hxuncdata (1)        - replace CDATA sections by character entities
hxunent (1)          - replace HTML predefined character entities to UTF-8
hxunpipe (1)         - convert output of pipe back to XML format
hxunxmlns (1)        - replace "global names" by XML Namespace prefixes
hxwls (1)            - list links in an HTML file
hxxmlns (1)          - replace XML Namespace prefixes by "global names"
asc2xml, xml2asc (1) - convert between UTF8 and &#nnn; entities
hxref (1)            - generate cross-references
hxselect (1)         - extract elements that match a (CSS) selector
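For the original question, hxremove looks like the closest fit. Below is a rough sketch of driving it from Perl, assuming hxremove is on your PATH, that it reads XML on standard input, takes a CSS selector as its argument, and writes the result to standard output, as its man page describes. The sub name strip_elements is invented for this example, and the input is assumed to be well-formed already (run it through hxnormalize -x or hxclean first if it is not).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use POSIX ();

# Return $xml with every element matching $selector removed, by running
# the external hxremove tool. Dies if the tool fails or is missing.
sub strip_elements {
    my ($xml, $selector) = @_;

    # Hand the input over via a temp file so we never block on a pipe.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    print {$tmp} $xml;
    close $tmp or die "cannot write $path: $!";

    # Forking open: the child's STDOUT becomes readable through $out.
    my $pid = open(my $out, '-|');
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                         # child process
        open STDIN, '<', $path or POSIX::_exit(126);
        { no warnings 'exec'; exec 'hxremove', $selector; }
        POSIX::_exit(127);                   # only reached if exec failed
    }

    my $result = do { local $/; <$out> };    # slurp everything hxremove printed
    close $out
        or die "hxremove failed (exit status " . ($? >> 8) . ")";
    return $result;
}
```

The list form of exec keeps the selector away from the shell, so selectors containing quotes or brackets need no extra escaping. Character encoding is left alone here; add the appropriate I/O layers if your data is not plain ASCII.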
And FWIW, the Sphinx search engine also provides HTML stripping. I'm not sure how you'd use it on its own, but it can be done when ingesting data for indexing.

Replies are listed 'Best First'.
Re^2: Looking for a module that strips an HTML tag and its associated 'TEXT'
by marto (Cardinal) on Jul 29, 2020 at 15:02 UTC

    A non-Perl dependency makes this unappealing when pure-Perl modules can already achieve this.

      Without too much work, you could create an XS module that just uses this code directly. That way you don't need to exec.

      -Thomas
      "Excuse me for butting in, but I'm interrupt-driven..."
      Just sharing. I've been surprised at how many people find this incredibly useful. Also, there's nothing unappealing about Perl programs that call external programs unless those programs are also Perl. Not sure when it became discouraged to use Perl for one of the reasons it was originally created on Unix systems.
        there's nothing unappealing about Perl programs that call external programs unless those programs are also Perl

        ... or can be trivially reimplemented in Perl. See Hippo's Law of Perl.


        🦛
