PerlMonks  

HTML::Parser guidance

by 4perl (Initiate)
on May 28, 2013 at 12:55 UTC ( [id://1035586]=perlquestion )

4perl has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone

I am having a bit of trouble understanding the methods in HTML::Parser.

My task is basically to parse lines in HTML files, where every line has the same structure. Consider this example:

"{div class="message1"} {span class="date"} yyyy-mm-dd hh:mm:ss {/span} {span class="id"} id1 {/span} {span class="resource"} text1 {/span} - > {span class="id"} id2 {/span} {span class="resource"} text2 {/span} {span class="messagetext"} texttext {/span} {/div}"

What I want to obtain is a file with lines containing only the data, with \t between every chunk of text, so I can play with it further.

Any help is kindly appreciated. Curly brackets used to keep structure visible.

Thank you

Replies are listed 'Best First'.
Re: HTML::Parser guidance
by smls (Friar) on May 28, 2013 at 15:39 UTC

    Well, the first thing you need to do when it comes to parsing data records from HTML documents is to look at the markup structure of the document and determine the simplest "search rule" that matches all data records (without matching any false positives).

    If I understand your post correctly, each record in your case will be of the form <div class="message1">...</div>. But that's not enough information to determine the "search rule" for matching all of them; you need to look at what exactly changes from record to record, and at the page structure they're embedded in.
    Here are some examples of what the "search rule" could be, depending on the exact document structure at hand (sorted from simpler to more complex):

    • "All <div> elements anywhere in the document"
    • "All <div> elements that are direct children of <body>"
    • "All <div> elements that are direct children of a container element that has a unique id"
    • "All <div> elements that have the attribute class="message1""
    • "All <div> elements that have the attribute class="messageX", with X being a number"
    • "All <div> elements that have a <span class="id"> child element"
    • ...
    Once you have determined the simplest rule for matching records (and no false positives!) for your particular use-case, you can then start thinking about how to implement the parsing. Report back once you are at that stage and need more help with that.
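
    For illustration only, here is a minimal sketch of how one of the rules above ("All <div> elements that have the attribute class="message1"") might be implemented with HTML::Parser's version 3 handler API. The sample input, the variable names and the handler names start/text/end are all made up for this example, and it assumes that records never contain nested <div> elements and that only text inside <span> elements is wanted. Your real documents may well need a different rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input, modelled on the record in the question.
my $html = <<'HTML';
<div class="message1"><span class="date">2013-05-28 12:55:00</span> <span class="id">id1</span> <span class="resource">text1</span> -&gt; <span class="id">id2</span> <span class="resource">text2</span> <span class="messagetext">texttext</span></div>
HTML

my @rows;           # finished records, one tab-joined string each
my @fields;         # fields collected for the current record
my $in_record = 0;  # true while inside a matching <div>
my $in_span   = 0;  # true while inside a <span> of a record

# Called for every start tag; receives the tag name and an attribute hashref.
sub start {
    my ($tag, $attr) = @_;
    if ($tag eq 'div' && ($attr->{class} // '') eq 'message1') {
        $in_record = 1;
        @fields    = ();
    }
    elsif ($in_record && $tag eq 'span') {
        $in_span = 1;
        push @fields, '';   # start a new, empty field
    }
}

# Called for text; may fire more than once per element, so append.
sub text {
    my ($text) = @_;
    $fields[-1] .= $text if $in_span;
}

# Called for every end tag; closes the current field or record.
sub end {
    my ($tag) = @_;
    if ($in_record && $tag eq 'span') {
        $in_span = 0;
    }
    elsif ($in_record && $tag eq 'div') {
        s/^\s+|\s+$//g for @fields;       # trim surrounding whitespace
        push @rows, join("\t", @fields);
        $in_record = 0;
    }
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start, 'tagname, attr' ],
    text_h      => [ \&text,  'dtext' ],
    end_h       => [ \&end,   'tagname' ],
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @rows;
```

    The handlers only track two bits of state (inside a record, inside a span), which is about the least an event-based parser needs for this job. Text outside the spans, such as the arrow between the two ids, is deliberately dropped in this sketch.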

    PS: As for formatting questions on PerlMonks, it's best to put code (including HTML markup) in <code>...</code> tags. Among other benefits, you can then keep the angle brackets in the code and they will show up verbatim.

Re: HTML::Parser guidance
by ww (Archbishop) on May 28, 2013 at 15:08 UTC
    1. Reading the documentation is apt to improve your understanding.
    2. What have you tried?
    Markup: Markup in the Monastery
    Effort: On asking for help

    If you didn't program your executable by toggling in binary, it wasn't really programming!

      Well to be fair, HTML::Parser is not an easy module to wrap your head around if you are not already familiar with OO programming and event-based parsing. And its official documentation isn't very newbie-friendly either. So let's cut the OP some slack... :)
