PerlMonks  

Re^4: joining words

by bigup401 (Pilgrim)
on Dec 19, 2020 at 21:04 UTC ( [id://11125470]=note )


in reply to Re^3: joining words
in thread joining words

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re^5: joining words
by bliako (Monsignor) on Dec 19, 2020 at 23:09 UTC
    i removed span and div, because i never wanted them

    And what about this?

    <table> <tr> <td><span>Omonoia 1948</span></td> <td><span>Apoel</span></td> <td><span>3-0</span></td> </tr> </table>

    Your regex for removing spans will remove all the content from the table above.
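
The regex in question is not visible in this thread, so the following is only a sketch under an assumption: that it deleted everything from an opening <span> to its closing </span>. It contrasts that with deleting only the tags themselves, using the table row quoted above. The variable names are illustrative.

```perl
use strict;
use warnings;

# The table row quoted in the post above.
my $html = '<table> <tr> <td><span>Omonoia 1948</span></td> '
         . '<td><span>Apoel</span></td> <td><span>3-0</span></td> </tr> </table>';

# Assumed form of the problematic regex: delete each whole span element,
# which takes the cell text with it.
(my $elements_removed = $html) =~ s{<span\b[^>]*>.*?</span>}{}gis;

# Deleting only the opening and closing tags keeps the cell text.
(my $tags_removed = $html) =~ s{</?span\b[^>]*>}{}gi;

# Collect what is left inside each <td> in both cases.
my @lost = $elements_removed =~ m{<td>(.*?)</td>}gs;
my @kept = $tags_removed     =~ m{<td>(.*?)</td>}gs;

print "elements removed: [", join('|', @lost), "]\n";
print "tags removed:     [", join('|', @kept), "]\n";
```

In the first case the three cells come back empty; in the second the team names and the score survive.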

    My 2nd point is that there are 2 tables in the example URL you posted. Why are you not specific about which table you want to process, first or second? That's sloppy. Really sloppy. Bad. You have to type the code.
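
One way to be specific about which table is meant is to select it by position. The sketch below is a minimal illustration, not a robust parser: the two-table page is made up (the real page's markup is not shown in this thread), `cells_of_table` is a hypothetical helper name, and the regex assumes tables are not nested.

```perl
use strict;
use warnings;

# Hypothetical page with two tables, standing in for the real one.
my $html = <<'HTML';
<table><tr><td>Omonoia 1948</td><td>Apoel</td><td>3-0</td></tr></table>
<p>some text</p>
<table><tr><td>Team C</td><td>Team D</td><td>1-1</td></tr></table>
HTML

# Return the <td> contents of the table at the given 0-based index,
# or an empty list if there is no such table.
# Assumes tables are not nested.
sub cells_of_table {
    my ($page, $index) = @_;
    my @tables = $page =~ m{<table\b[^>]*>(.*?)</table>}gis;
    return unless defined $tables[$index];
    return $tables[$index] =~ m{<td\b[^>]*>(.*?)</td>}gis;
}

my @second = cells_of_table($html, 1);
print join(' | ', @second), "\n";
```

Asking for index 0 or 1 makes the choice of table explicit in the code instead of leaving it to whichever table a regex happens to hit first.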

    My 3rd point is that you are trying to parse HTML fetched from a website. Your code fails for some reason, but the site has still been hit and has delivered the HTML. Then you modify your code and try again ... by asking the site to give you the same HTML again (same format, not necessarily the same content, because in the meantime the score may be 4-0) so that you can try your new regex or whatever. This can happen 15 times per minute: the same URL, hundreds of times, until you finally get your table parsing right, fingers crossed. But the website administrators get angry. Everyone gets angry. "Just find the correct hole goddamit" -- keyhole that is. They are asked by management to install new measures to stop your "attack". You created a lot of hassle. We don't want that. So why not download the webpage once, save it to a file and then keep trying your regex tricks or whatever on that local HTML file? There is no need to hit the site again and again and again. BTW, are you the one hitting MY site all day long???????
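
The download-once idea could be sketched roughly as follows. `cached_html` is a hypothetical helper name, and the URL and file name in the usage line are placeholders. It relies on `getstore` from LWP::Simple, which saves the response to a file and returns the HTTP status code; treating any 2xx status as success is a choice made for this sketch.

```perl
use strict;
use warnings;

# Return the page's HTML, hitting the network only if no local copy exists.
# $url and $file are chosen by the caller.
sub cached_html {
    my ($url, $file) = @_;

    unless (-e $file) {
        require LWP::Simple;    # loaded only when a download is needed
        my $status = LWP::Simple::getstore($url, $file);
        unless ($status >= 200 && $status < 300) {
            unlink $file;       # do not keep a partial or error page around
            die "Download of $url failed with status $status\n";
        }
    }

    open my $fh, '<', $file or die "Cannot read $file: $!\n";
    local $/;                   # slurp mode: read the whole file at once
    my $html = <$fh>;
    close $fh;
    return $html;
}

# Usage (placeholders): the first run downloads, later runs read page.html.
# my $html = cached_html('http://example.com', 'page.html');
```

Every regex experiment after the first run then works on the saved file, so the site is contacted exactly once. Delete the file when a fresh copy is wanted.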

    over and out

    bw, bliako

Re^5: joining words
by Bod (Parson) on Dec 19, 2020 at 22:14 UTC

    I cannot get my head around what you are trying to say...but...there is no more need to remove the <span> and <div> tags than there is to remove any of the rest of the page's HTML.

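    use strict;
    use LWP::Simple;

    my $html = get("http://example.com");
    die "Fetch failed\n" unless defined $html;

    # Non-greedy (.*?) captures each cell separately; /s allows multi-line cells.
    while ($html =~ /<td>(.*?)<\/td>/gs) {
        print "$1\n";
    }
    # untested as written on my mobile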

    This will fetch a webpage and extract and print the content of every <td> tag. No need to strip anything out first or to make more than one request to the webserver.
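
When extracting cells with a regex like this, the quantifier matters if several cells share a line. A small sketch on a local string (no web request; the sample row reuses two cells from the table quoted earlier in the thread) shows the difference between a greedy and a non-greedy capture:

```perl
use strict;
use warnings;

# Two cells on one line.
my $row = '<td>Omonoia 1948</td> <td>Apoel</td>';

# Greedy: .+ runs to the last </td>, so both cells come back as one match.
my @greedy = $row =~ m{<td>(.+)</td>}g;

# Non-greedy: .+? stops at the first </td>, giving one match per cell.
my @lazy = $row =~ m{<td>(.+?)</td>}g;

print scalar(@greedy), " greedy match: $greedy[0]\n";
print scalar(@lazy), " non-greedy matches: @lazy\n";
```

The greedy version returns a single string that still contains a stray `</td> <td>`, while the non-greedy version returns the two cells separately.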

      I cannot get my head around what you are trying to say

      Don't waste your time.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
