PerlMonks  

Personally, I think you're engaging in premature optimization here: when fetching 4M URLs, the DNS traffic is unlikely to be your biggest concern.

Having said that, the cheapest/cleanest method would be to install a caching-only DNS server on your localhost, and let it handle the DNS caching.

Some reasons why your current solution might be slow:

  • Are all those 4M pages each in a flat file, and all the flat files in one directory? Lookups in a directory with millions of entries can get slow on many filesystems, so you'd be better off distributing them over a tree of directories.
  • Do you have enough bandwidth to download all those pages? The line might be saturated with that much data. If you are connected through an asymmetric line (like ADSL), your downloads could be choked by the lack of upstream bandwidth for the ACK traffic.
  • Do you have enough memory for all the processes you've started? If your processes are being swapped out, they will not only be running more slowly as different processes are getting swapped in and out, but they'll probably compete for disk bandwidth with the files you're writing out.
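
To make the first point concrete, here is a minimal sketch of one way to spread the files over a directory tree. It is written in Python for brevity, though the idea carries over to Perl directly. The function names, the choice of SHA-256, and the two-level layout are my own assumptions for illustration, not anything from the original question.

```python
import hashlib
from pathlib import Path


def sharded_path(root: str, url: str, levels: int = 2, width: int = 2) -> Path:
    """Map a URL to a file path nested under hash-derived directories.

    Hypothetical helper: the first `levels` chunks of `width` hex digits
    of the URL's SHA-256 digest become the directory names, and the full
    digest becomes the file name.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root, *parts, digest)


def store_page(root: str, url: str, body: bytes) -> Path:
    """Write a fetched page under its sharded path, creating directories as needed."""
    path = sharded_path(root, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path
```

With the defaults (two levels of two hex digits each) there are 256 × 256 = 65,536 leaf directories, so 4M pages come to roughly 60 files per directory on average. The trade-off is that the file name no longer tells you the URL, so you'd want to keep a URL-to-digest index somewhere if you need to look pages up later.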

In reply to Re: Advice on Efficient Large-scale Web Crawling by matija
in thread Advice on Efficient Large-scale Web Crawling by Anonymous Monk
