Combining File::Fetch with MySQL

by justin423 (Scribe)
on Jul 18, 2022 at 02:58 UTC [id://11145569]

justin423 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to download about 50 documents a month. The links all have the same format and can be segregated by month. The code prints out the links, so I know that part works, but File::Fetch is not downloading the data to the data folder. Here is the fetch code that isn't working. Does anybody see what is wrong with it?
while (my $ref = $query->fetchrow_hashref()) {
    print "url: $ref->{url}\n";
    my $ff = File::Fetch->new( uri => $ref->{url} );
    my $where = $ff->fetch( to => '/data/' );
}
$query->finish;

Replies are listed 'Best First'.
Re: Combining File::Fetch with MySQL
by haukex (Archbishop) on Jul 18, 2022 at 07:24 UTC
    I guess alternatively, automatically renaming the saved file to uniquedata.pdf would work as well, but I wasn't able to figure out how to do that. Has anyone done something like this?

    Since it sounds like the URLs are unique, you could use Corion's Text::CleanFragment on them to get filenames. For the rest, I think it would be better if you show some example code, see SSCCE.
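
    As an untested sketch of that idea (assuming the $query handle from the original post, the /data/ target folder, and that Text::CleanFragment exports clean_fragment()):

    use File::Fetch;
    use Text::CleanFragment;   # exports clean_fragment()

    while ( my $ref = $query->fetchrow_hashref() ) {
        my $url = $ref->{url};
        # derive a filesystem-safe, unique name from the (unique) URL
        my $name = clean_fragment($url) . '.pdf';
        my $ff = File::Fetch->new( uri => $url );
        my $where = $ff->fetch( to => '/data/' )
            or die "fetch of '$url' failed: " . $ff->error;
        # File::Fetch saves under the URL's own basename; rename to the clean name
        rename $where, "/data/$name"
            or die "can't rename '$where' to '/data/$name': $!";
    }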

Re: Combining File::Fetch with MySQL
by GrandFather (Saint) on Jul 18, 2022 at 04:17 UTC

    Can you not munge "www.example.com/uniquedata/blah/blah/blah/document.pdf" to be "www.example.com-uniquedata-blah-blah-blah-document.pdf" or some variant of that to get a unique file name (maybe omitting the domain part of the URL)?
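
    For example, one untested variant of that munging:

    my $url = 'www.example.com/uniquedata/blah/blah/blah/document.pdf';
    (my $fname = $url) =~ s{^[^/]+/}{};   # drop the domain part, if you want
    $fname =~ tr{/}{-};                   # slashes -> dashes
    # $fname is now "uniquedata-blah-blah-blah-document.pdf"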

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Combining File::Fetch with MySQL
by Marshall (Canon) on Jul 18, 2022 at 05:05 UTC
    I am not sure what the naming convention is. Is "uniquedata" something that changes on a per-month basis? And if so, can you predict with certainty what it will be next month?

    Could you take a couple of example documents and show the full URL to these documents for, say, June 2022 and July 2022?

    Are you saving the .pdf doc as a "BLOB" inside the SQLite DB or are you just saving a directory path on your local machine?

    I have a similar web application. My app runs once per hour and looks around on part of a particular website for any new links. If it finds one, it "clicks" on it to see whether there is anything interesting there. If so, the interesting data is saved in an SQLite DB. In any event, I save the URL in the DB so that I don't go there again. If nothing has changed on the website, it figures that out very efficiently. This thing has been running every hour for the past 6 years, so it is possible for apps like this to work out very well. I use WWW::Mechanize, but you don't seem to need the sophistication of making sense of a webpage? Or do you?

    Update:
    Could you show your table schema?
    I would be thinking along the lines of:
    URL text (where the data came from)
    Version datetime (actually just text, not numeric: yyyy-mm-dd hh:mm:ss)
    - you can omit the time, and I think also the dd, if not available
    - leading zeroes are mandatory because this column must be in ASCII sort order
    Downloaded datetime (optional, but often handy to know)
    title text (name of the document)
    pdf blob (the actual pdf file)
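
    Rendered as untested DDL (table and column names are illustrative only), that might be:

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS documents (
            url        TEXT,   -- where the data came from
            version    TEXT,   -- 'yyyy-mm-dd hh:mm:ss', zero-padded so it sorts in ASCII order
            downloaded TEXT,   -- optional: when the doc was fetched
            title      TEXT,   -- name of the document
            pdf        BLOB    -- the actual pdf file
        )
    });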

      The uniquedata is a standard industry identifier that only occurs in that link, and I get a download of all the identifiers and the dates the files are published. Because each file is a PDF, I want to download them into a folder for review later (the links are only temporary, valid for 60 days after month end, and then the PDFs are removed from the website) rather than save the blob in the database, though I could do that. The schema is one table: Identifier CHAR(9), publish_date CHAR(8), LINK CHAR(255). Publish date is YYYYMMDD, and the link is just concat('staticlink/', identifier, '/staticlink') as link. I just need to get the links from the database into Perl so Perl can download them.
        Ok, now I understand better. You are using an existing DB, not making one? You need to use the Perl DBI - the Data Base Interface. There are links to the FAQs and some tutorials. You can see some Monk links here: https://www.perlmonks.org/?node=Tutorials#Database-Programming.

        A wild guess and some untested example code would be:
        Instead of printing the link, you would use your code to download that doc, then do whatever you are going to do with it.

        Update: I see that you are using MySQL instead of SQLite. Brain cramp on my part. But the code is essentially the same; the only difference is how you get the DB connection. The DSN driver would be dbi:mysql, and you need an account and password. For SQLite, these are null strings. See the docs referenced above for a connection example with username/password.
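
        For example, an untested MySQL connection (database name, user, and password are placeholders):

        use DBI;

        my $dsn = "dbi:mysql:database=yourdb;host=localhost;port=3306";
        my $dbh = DBI->connect( $dsn, 'youruser', 'yourpassword', { RaiseError => 1 } )
            or die "Couldn't connect to database: " . DBI->errstr;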

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Data::Dumper;
        use DBI qw(:sql_types);

        my $dbfile = "./YourDbName.sqlite";
        my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", "", "", { RaiseError => 1 } )
            or die "Couldn't connect to database: " . DBI->errstr;

        my $get_links_4date_sql = "SELECT Identifier, LINK FROM YourTableName WHERE Publish_date = ?";
        my $get_links_4date     = $dbh->prepare($get_links_4date_sql);

        my $date = "20220701";
        $get_links_4date->execute($date);
        my $array_ref = $get_links_4date->fetchall_arrayref;
        foreach my $row_ref (@$array_ref) {
            my ( $id, $link ) = @$row_ref;
            print "$id \t $link\n";
        }
        I can't anticipate exactly what kind of SQL you need. The above just gets the docs for a particular date, although it sounds like what you need is: for each unique id, download the latest version of the document. The SQL for that is of course more involved, but doable; a sketch follows.
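
        As an untested sketch (assuming the zero-padded Publish_date column sorts correctly as text, and the same illustrative table names as above):

        # one row per Identifier, carrying the most recent Publish_date and its LINK
        my $latest_sql = q{
            SELECT t.Identifier, t.Publish_date, t.LINK
            FROM   YourTableName t
            JOIN   ( SELECT Identifier, MAX(Publish_date) AS max_date
                     FROM   YourTableName
                     GROUP  BY Identifier ) m
              ON   t.Identifier = m.Identifier
             AND   t.Publish_date = m.max_date
        };
        my $latest = $dbh->prepare($latest_sql);
        $latest->execute();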

        Hope this gets you further along your way.

Re: Combining File::Fetch with MySQL
by bliako (Monsignor) on Jul 19, 2022 at 14:00 UTC

    Are you asking how to loop over the results of an SQL select and then feed each result to File::Fetch?

    use DBI;
    use File::Fetch;

    my $dbh = DBI->connect( $dsn, $user, $password )
        or die "Couldn't connect to database: " . DBI->errstr;
    my $query = "select url FROM tablexxx where date = ?";   # << minor edit
    my $sth = $dbh->prepare($query) or die "prepare: " . $dbh->errstr;
    $sth->execute('202207') or die "execute: " . $dbh->errstr;
    while ( my $row = $sth->fetchrow_hashref ) {
        my $url = $row->{url};
        print "url: $url\n";
        # logic to get unique filename for output
        my $outfile = ...;
        my $ff = File::Fetch->new( uri => $url );
        $ff->fetch( to => $outfile )
            or die "failed to fetch '$url' to '$outfile'.";
        # or get the file contents in a scalar and then process them, e.g. to extract
        # pdf metadata (date, author) and an md5 hash
        my $contents;
        $ff->fetch( to => \$contents );
        # use PDF::API2
        my %options;
        my $pdf = PDF::API2->from_string( $contents, %options );
        my $author = $pdf->author();   # I wish it was that simple... you may need to sieve through all the metadata
    }

    totally untested, bw, bliako

      OK, here is the code I put together from the really helpful responses. It runs with no errors, it just doesn't download... I must be missing something.

      $host = "localhost";
      $port = "3306";
      $dbdriver = 'mysql';
      $tablename = "linktable";
      print "Enter year and month to download in format YYYYMM:";
      $date = <STDIN>;
      chomp $date;
      $dsn = "dbi:$dbdriver:$database:$host:$port";
      $path = '/data/';
      my $dbh = DBI->connect($dsn, $user, $password)
          or die "Couldn't connect to database: " . DBI->errstr;
      $dbh->do("DROP TABLE IF EXISTS LINKTABLE");
      $dbh->do("CREATE TABLE IF NOT EXISTS LINKTABLE(document_id CHAR(9),INDEX(DOCUMENT_ID),PUBLISH_DATE CHAR(8),URL CHAR(150))");
      $dbh->do("LOAD DATA LOCAL INFILE '/data/DATA.CSV' INTO TABLE $tablename(DOCUMENT_ID,PUBLISH_DATE)");
      $dbh->do("UPDATE LINKTABLE SET URL=CONCAT('www.example.com/document/',DOCUMENT_ID,'document.pdf')");
      my $SQL = "select url,document_id FROM $tablename where left(publish_date,6) ='$date'";
      my $query = $dbh->prepare($SQL) or die "prepare: ".$dbh->errstr;
      $query->execute() or die "execute: ".$dbh->errstr;
      while ($url,$document_id = $query->fetchrow_array()) {
          print "url: $row->{url} \n";
          my $outfile = '$path.$document_id';
          my $ff = File::Fetch->new( uri => url );
          $ff->fetch(to => $outfile)
              or die "failed to fetch '$url' to '$outfile'.";
          #rename("document.pdf","/output/$document_id.pdf") || die "Can't rename file: ";
      }
        my $dbh = DBI->connect($dsn, $user, $password) or die "Couldn't connect to database: " . DBI->errstr;

        Here you test that the connection has succeeded and bail out if it hasn't. This is good. However, on the subsequent 4 $dbh->do statements you make no such tests. How can you tell if one or more of these have failed?
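
        One untested way to surface those failures: turn on RaiseError for the whole handle, or test each statement yourself.

        # Option 1: have DBI die automatically on any failed statement
        my $dbh = DBI->connect( $dsn, $user, $password, { RaiseError => 1 } )
            or die "Couldn't connect to database: " . DBI->errstr;

        # Option 2: check each do() explicitly
        $dbh->do("DROP TABLE IF EXISTS LINKTABLE")
            or die "DROP failed: " . $dbh->errstr;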

        my $SQL = "select url,document_id FROM $tablename where left(publish_date,6) ='$date'";
        my $query = $dbh->prepare($SQL) or die "prepare: ".$dbh->errstr;
        $query->execute() or die "execute: ".$dbh->errstr;

        No reason at all not to use a placeholder here:

        my $SQL = "select url,document_id FROM $tablename where left(publish_date,6) = ?";
        my $query = $dbh->prepare($SQL) or die "prepare: ".$dbh->errstr;
        $query->execute($date) or die "execute: ".$dbh->errstr;
        it runs with no errors, it just doesn't download

        Does it print the URL correctly each time through the loop or not?


        🦛
