Combining File::Fetch with MySQL

by justin423 (Scribe)
on Jul 18, 2022 at 02:58 UTC [id://11145569]

justin423 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to download about 50 documents a month. The links all have the same format and can be segregated by month. The code prints out the links, so I know that part works, but File::Fetch is not downloading the data to the data folder. Here is the fetch code that isn't working. Does anybody see what is wrong with it?
while (my $ref = $query->fetchrow_hashref()) {
    print "url: $ref->{url}\n";
    my $ff = File::Fetch->new( uri => $ref->{url} );
    my $where = $ff->fetch( to => '/data/' );
}
$query->finish;

Replies are listed 'Best First'.
Re: Combining File::Fetch with MySQL
by haukex (Archbishop) on Jul 18, 2022 at 07:24 UTC
    I guess alternatively, automatically renaming the saved file to uniquedata.pdf would work as well, but I wasn't able to figure out how to do that. Has anyone done something like this?

    Since it sounds like the URLs are unique, you could use Corion's Text::CleanFragment on them to get filenames. For the rest, I think it would be better if you show some example code, see SSCCE.
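
    As an untested sketch of that idea (assuming the $query handle from the original post, the /data/ target folder, and that Text::CleanFragment exports clean_fragment()):

    use File::Fetch;
    use Text::CleanFragment;   # exports clean_fragment()

    while ( my $ref = $query->fetchrow_hashref() ) {
        my $url = $ref->{url};
        # derive a filesystem-safe, unique name from the (unique) URL
        my $name = clean_fragment($url) . '.pdf';
        my $ff = File::Fetch->new( uri => $url );
        my $where = $ff->fetch( to => '/data/' )
            or die "fetch of '$url' failed: " . $ff->error;
        # File::Fetch saves under the URL's own basename; rename to the clean name
        rename $where, "/data/$name"
            or die "can't rename '$where' to '/data/$name': $!";
    }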

Re: Combining File::Fetch with MySQL
by GrandFather (Saint) on Jul 18, 2022 at 04:17 UTC

    Can you not munge "www.example.com/uniquedata/blah/blah/blah/document.pdf" to be "www.example.com-uniquedata-blah-blah-blah-document.pdf" or some variant of that to get a unique file name (maybe omitting the domain part of the URL)?
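
    For example, one untested variant of that munging:

    my $url = 'www.example.com/uniquedata/blah/blah/blah/document.pdf';
    (my $fname = $url) =~ s{^[^/]+/}{};   # drop the domain part, if you want
    $fname =~ tr{/}{-};                   # slashes -> dashes
    # $fname is now "uniquedata-blah-blah-blah-document.pdf"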

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Combining File::Fetch with MySQL
by Marshall (Canon) on Jul 18, 2022 at 05:05 UTC
    I am not sure what the naming convention is. Is "uniquedata" something that changes on a per-month basis? And if so, can you predict with certainty what it will be next month?

    Could you take a couple of example documents and show the full URL to these documents for, say, June 2022 and July 2022?

    Are you saving the .pdf doc as a "BLOB" inside the SQLite DB or are you just saving a directory path on your local machine?

    I have a similar web application. My app runs once per hour and looks around on part of a particular website for any new links. If it finds one, it "clicks" on it to see whether there is anything interesting there. If so, the interesting data is saved in an SQLite DB. In any event, I save the URL in the DB so that I don't go there again. If nothing has changed on the website, it figures that out very efficiently. This thing has been running every hour for the past 6 years, so it is possible for apps like this to work out very well. I use WWW::Mechanize, but you don't seem to need the sophistication of making sense of a webpage? Or do you?

    Update:
    Could you show your table schema?
    I would be thinking along the lines of:
    URL text (where the data came from)
    Version datetime (actually just text, not numeric: yyyy-mm-dd hh:mm:ss)
    - you can omit the time, and I think also the dd, if not available
    - leading zeroes are mandatory because this column must be in ASCII sort order
    Downloaded datetime (optional, but often handy to know)
    title text (name of the document)
    pdf blob (the actual pdf file)
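
    Rendered as untested DDL (table and column names are illustrative only), that might be:

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS documents (
            url        TEXT,   -- where the data came from
            version    TEXT,   -- 'yyyy-mm-dd hh:mm:ss', zero-padded so it sorts in ASCII order
            downloaded TEXT,   -- optional: when the doc was fetched
            title      TEXT,   -- name of the document
            pdf        BLOB    -- the actual pdf file
        )
    });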

      The uniquedata is a standard industry identifier that only occurs in that link, and I get a download of all the identifiers and the dates the files are published. Because each file is a PDF, I want to download them into a folder for review later (the links are only temporary, valid for 60 days after month end, and then the PDFs are removed from the website) rather than save the blob in the database, though I could do that. The schema is one table: Identifier CHAR(9), publish_date CHAR(8), LINK CHAR(255). Publish date is YYYYMMDD, and the link is just concat('staticlink/', identifier, '/staticlink') as link. I just need to get the links from the database into Perl so Perl can download them.
        Ok, now I understand better. You are using an existing DB, not making one? You need to use the Perl DBI - the Data Base Interface. There are links to the FAQs and some tutorials. You can see some Monk links here: https://www.perlmonks.org/?node=Tutorials#Database-Programming.

        A wild guess and some untested example code would be:
        Instead of printing the link, you would use your code to download that doc, then do whatever you are going to do with it.

        Update: I see that you are using MySQL instead of SQLite. Brain cramp on my part. But the code is essentially the same; the only difference is how you get the DB connection. The DSN driver would be dbi:mysql, and you need an account and password. For SQLite, these are null strings. See the docs referenced above for a connection example with username/password.
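
        For example, an untested MySQL connection (database name, user, and password are placeholders):

        use DBI;

        my $dsn = "dbi:mysql:database=yourdb;host=localhost;port=3306";
        my $dbh = DBI->connect( $dsn, 'youruser', 'yourpassword', { RaiseError => 1 } )
            or die "Couldn't connect to database: " . DBI->errstr;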

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Data::Dumper;
        use DBI qw(:sql_types);

        my $dbfile = "./YourDbName.sqlite";
        my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", "", "", { RaiseError => 1 } )
            or die "Couldn't connect to database: " . DBI->errstr;

        my $get_links_4date_sql = "SELECT Identifier, LINK FROM YourTableName WHERE Publish_date = ?";
        my $get_links_4date     = $dbh->prepare($get_links_4date_sql);

        my $date = "20220701";
        $get_links_4date->execute($date);
        my $array_ref = $get_links_4date->fetchall_arrayref;
        foreach my $row_ref (@$array_ref) {
            my ( $id, $link ) = @$row_ref;
            print "$id \t $link\n";
        }
        I can't anticipate exactly what kind of SQL you need. The above just gets the docs for a particular date, although it sounds like what you need is: for each unique id, download the latest version of the document. The SQL for that is of course more involved, but doable; a sketch follows.
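
        As an untested sketch (assuming the zero-padded Publish_date column sorts correctly as text, and the same illustrative table names as above):

        # one row per Identifier, carrying the most recent Publish_date and its LINK
        my $latest_sql = q{
            SELECT t.Identifier, t.Publish_date, t.LINK
            FROM   YourTableName t
            JOIN   ( SELECT Identifier, MAX(Publish_date) AS max_date
                     FROM   YourTableName
                     GROUP  BY Identifier ) m
              ON   t.Identifier = m.Identifier
             AND   t.Publish_date = m.max_date
        };
        my $latest = $dbh->prepare($latest_sql);
        $latest->execute();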

        Hope this gets you further along your way.

Re: Combining File::Fetch with MySQL
by bliako (Monsignor) on Jul 19, 2022 at 14:00 UTC

    Are you asking how to loop over the results of an SQL select and then feed each result to File::Fetch?

    use DBI;
    use File::Fetch;

    my $dbh = DBI->connect( $dsn, $user, $password )
        or die "Couldn't connect to database: " . DBI->errstr;
    my $query = "select url FROM tablexxx where date = ?";   # << minor edit
    my $sth = $dbh->prepare($query) or die "prepare: " . $dbh->errstr;
    $sth->execute('202207') or die "execute: " . $dbh->errstr;
    while ( my $row = $sth->fetchrow_hashref ) {
        my $url = $row->{url};
        print "url: $url\n";
        # logic to get unique filename for output
        my $outfile = ...;
        my $ff = File::Fetch->new( uri => $url );
        $ff->fetch( to => $outfile )
            or die "failed to fetch '$url' to '$outfile'.";
        # or get the file contents in a scalar and then process them, e.g. to extract
        # pdf metadata (date, author) and an md5 hash
        my $contents;
        $ff->fetch( to => \$contents );
        # use PDF::API2
        my %options;
        my $pdf = PDF::API2->from_string( $contents, %options );
        my $author = $pdf->author();   # I wish it was that simple... you may need to sieve through all the metadata
    }

    totally untested, bw, bliako

      OK, here is the code I put together from the really helpful responses. It runs with no errors, it just doesn't download... I must be missing something.

      $host = "localhost";
      $port = "3306";
      $dbdriver = 'mysql';
      $tablename = "linktable";
      print "Enter year and month to download in format YYYYMM:";
      $date = <STDIN>;
      chomp $date;
      $dsn = "dbi:$dbdriver:$database:$host:$port";
      $path = '/data/';
      my $dbh = DBI->connect($dsn, $user, $password)
          or die "Couldn't connect to database: " . DBI->errstr;
      $dbh->do("DROP TABLE IF EXISTS LINKTABLE");
      $dbh->do("CREATE TABLE IF NOT EXISTS LINKTABLE(document_id CHAR(9),INDEX(DOCUMENT_ID),PUBLISH_DATE CHAR(8),URL CHAR(150))");
      $dbh->do("LOAD DATA LOCAL INFILE '/data/DATA.CSV' INTO TABLE $tablename(DOCUMENT_ID,PUBLISH_DATE)");
      $dbh->do("UPDATE LINKTABLE SET URL=CONCAT('www.example.com/document/',DOCUMENT_ID,'document.pdf')");
      my $SQL = "select url,document_id FROM $tablename where left(publish_date,6) ='$date'";
      my $query = $dbh->prepare($SQL) or die "prepare: ".$dbh->errstr;
      $query->execute() or die "execute: ".$dbh->errstr;
      while ($url,$document_id = $query->fetchrow_array()) {
          print "url: $row->{url} \n";
          my $outfile = '$path.$document_id';
          my $ff = File::Fetch->new( uri => url );
          $ff->fetch(to => $outfile)
              or die "failed to fetch '$url' to '$outfile'.";
          #rename("document.pdf","/output/$document_id.pdf") || die "Can't rename file: ";
      }
        my $dbh = DBI->connect($dsn, $user, $password) or die "Couldn't connect to database: " . DBI->errstr;

        Here you test that the connection has succeeded and bail out if it hasn't. This is good. However, on the subsequent 4 $dbh->do statements you make no such tests. How can you tell if one or more of these have failed?
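
        One untested way to surface those failures: turn on RaiseError for the whole handle, or test each statement yourself.

        # Option 1: have DBI die automatically on any failed statement
        my $dbh = DBI->connect( $dsn, $user, $password, { RaiseError => 1 } )
            or die "Couldn't connect to database: " . DBI->errstr;

        # Option 2: check each do() explicitly
        $dbh->do("DROP TABLE IF EXISTS LINKTABLE")
            or die "DROP failed: " . $dbh->errstr;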

        my $SQL = "select url,document_id FROM $tablename where left(publish_date,6) ='$date'";
        my $query = $dbh->prepare($SQL) or die "prepare: ".$dbh->errstr;
        $query->execute() or die "execute: ".$dbh->errstr;

        No reason at all not to use a placeholder here:

        my $SQL = "select url,document_id FROM $tablename where left(publish_date,6) = ?";
        my $query = $dbh->prepare($SQL) or die "prepare: ".$dbh->errstr;
        $query->execute($date) or die "execute: ".$dbh->errstr;
        it runs with no errors, it just doesn't download

        Does it print the URL correctly each time through the loop or not?


        🦛
