http://qs321.pair.com?node_id=839384

keeper12 has asked for the wisdom of the Perl Monks concerning the following question:

I have a project in which I have to download a few files using a Perl script and then mail them. For emailing the files I am using Email::Send and Email::MIME. The script is more or less ready, but I have a small problem.

The URLs are specified in a text file.

The problem is that every time I download a file, I have to specify the name by which it is to be saved along with the URL. What I want is that whenever a file is downloaded, its name should be extracted from the URL itself.

e.g. if the URL is www.abcd.com/search or www.efgh.in#found,

the file should be saved under the name www.abcd.com.ppt or www.efgh.in.doc respectively.

The code is posted below. Please help me out.

#!/usr/bin/perl
use warnings;
use LWP::Simple;
use Tie::File;

my $testfolder = "/Users/Apurv/Desktop/";
tie @file, 'Tie::File', $testfolder . "file1.txt" or die;
foreach $URL (@file) {
    my $name = substr $URL, 29, 13;
    my $add  = substr $URL, 0, 29;
    my $file = $testfolder . "$name";
    my $status = mirror($add, $file);
    die "Cannot retrieve $add" unless is_success($status);
    # ... {Mailing Part code}
}

Here I was manually specifying the name along with the URL in the text file.

I found out that I can use split, but being a newbie to Perl programming, I could not understand it properly.

Replies are listed 'Best First'.
Re: Split Help
by almut (Canon) on May 11, 2010 at 07:55 UTC

    It's not exactly clear to me how you want to extract www.abcd.com.ppt from www.abcd.com/search, or www.efgh.in.doc from www.efgh.in#found... (looks like the .ppt and .doc should rather come from the content type of the page; also, what if there is more than one URL with the same host part?)

    Anyhow, maybe the URI module could help.  It provides various methods for getting the path, etc. components of a URL.

      What I am trying to say is that I want to extract www.abcd.com or www.efgh.in from the URLs using the split function.

      The URLs I am supposed to fetch files from each point to a single file only. You are right that the .ppt and .doc will come from the content type of the page.

      What I want is that wherever a '/' or a '#' is first encountered in the URL, the part before it should be taken as the filename, i.e.

      if www.abcd.com?file/search is the URL, then www.abcd.com?file should be the file name,

      and if www.abcd.com/search is the URL, then www.abcd.com should be the file name. The same is the case with '#'.

      I want to split the URL at the first '/' or first '#' and use the part before it.

      BTW, thank you for your reply.
        I want to split the URL at the first '/' or first '#' and use it.

        Then maybe just try

        for my $url ("www.abcd.com?file/search", "www.abcd.com/search", "www.efgh.in#found") {
            my ($fname) = split /[\/#]/, $url;
            print "$fname\n";
        }
        __END__
        www.abcd.com?file
        www.abcd.com
        www.efgh.in

        split takes a regular expression by which to split the string, and [\/#] is a character class comprising the two characters '/' and '#', which means to split on either of those characters.  The parentheses around $fname in the assignment are needed to supply list context for split.
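
        Plugged back into the original download loop, that split could replace the hard-coded substr offsets. A minimal sketch (the scheme-stripping step is my own addition, assuming the text file may contain full URLs):

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        # Take everything before the first '/' or '#' as the filename,
        # as described above. A leading http(s):// scheme, if present,
        # is stripped first so it doesn't interfere with the split.
        sub name_from_url {
            my ($url) = @_;
            (my $stripped = $url) =~ s{^https?://}{};
            my ($name) = split /[\/#]/, $stripped;
            return $name;
        }

        print name_from_url("www.abcd.com?file/search"), "\n";  # www.abcd.com?file
        print name_from_url("www.abcd.com/search"), "\n";       # www.abcd.com
        print name_from_url("http://www.efgh.in#found"), "\n";  # www.efgh.in
        ```

        The returned name could then be appended to $testfolder to form the mirror() target path.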

Re: Split Help
by sam_bakki (Pilgrim) on May 11, 2010 at 10:56 UTC
    You can even use a regular expression to match the exact phrase you want and then use it as the file name. Ex:
    $url = 'http://www.abcd.com/search';
    if ($url =~ m/www\.(.*)\//) {
        $fn = "www.$1.ppt";
        print $fn;
    }
Re: Split Help
by lbutler (Initiate) on May 11, 2010 at 20:18 UTC

    Use the URI module. It allows you to access bits of a URL without trying to hack a regular expression. Moreover, the code will be clearer (self-documenting).

    Here is an example from the command line:
    $ perl -wl -MURI -e '$url = URI->new("http://www.perlmonks.org/?parent=839419;node_id=3333"); print $url->host;'

    And the output:

    www.perlmonks.org
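
    In script form, the same idea could derive the filename for each download. A sketch under one assumption: URI's host() only works on absolute URLs, so a scheme is prepended when the input lacks one:

    ```perl
    use strict;
    use warnings;
    use URI;

    # Build a local filename from a URL's host part. URI->host
    # requires an absolute URL, so prepend a scheme if missing.
    sub host_name {
        my ($url) = @_;
        $url = "http://$url" unless $url =~ m{^\w+://};
        return URI->new($url)->host;
    }

    print host_name("www.abcd.com/search"), "\n";      # www.abcd.com
    print host_name("http://www.efgh.in#found"), "\n"; # www.efgh.in
    ```

    Note that host() deliberately excludes any query part, so www.abcd.com?file/search would yield just www.abcd.com, unlike the split approach above.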
Re: Split Help
by elwarren (Priest) on May 11, 2010 at 22:26 UTC
    Emailing a file attachment named "file.com" will set off virus scanners, because *.com files are executables in DOS/Windows. I would suggest adding a check to avoid this, possibly appending another extension onto these files.
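
    A minimal guard along those lines (the list of risky extensions and the .txt renaming scheme are just one possibility):

    ```perl
    use strict;
    use warnings;

    # Append a harmless extension to filenames that Windows treats
    # as executables, so mail virus scanners don't flag the attachment.
    sub safe_attachment_name {
        my ($name) = @_;
        return $name =~ /\.(?:com|exe|bat|scr)$/i ? "$name.txt" : $name;
    }

    print safe_attachment_name("www.abcd.com"), "\n";  # www.abcd.com.txt
    print safe_attachment_name("www.efgh.in"), "\n";   # www.efgh.in
    ```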