Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Evaluate my news retriever... (pretty please!)

by jarich (Curate)
on Dec 15, 2003 at 05:54 UTC ( [id://314770]=note: print w/replies, xml ) Need Help??


in reply to Evaluate my news retriever... (pretty please!)

I've spotted a few things for you.

1. Drop diagnostics

Btw. This is production code that fills a database with news articles which are extracted from XML code. The script needs around 5 minutes to parse through 1200 articles/stories

# modules for easier debugging and # better coding standards use warnings; use diagnostics; use strict;

It won't cut the code time down a lot, but it'll help a bit if you drop the use of diagnostics once this code goes into production. Using diagnostics will make the code take longer to compile and if your code is running warning free then you don't need this.

2. Consider standard modules

Have you considered using Getopt::Std for your script argument handling? What you're doing to find a single flag is fine, but if you want to add further switch handling you may want to take a look at the module.

3. Use the right quotes (and consider pulling out common strings)
# URL that returns list of YBX categories my $category_list_http = qw !http://api.yellowbrix.com/api/?service=ca +tegory_list&method=xml&id=verizon&pass­word=xxxxxx!;

I think you mean to use qq// or q// here rather than qw//

I also think that these would be better written to use just a couple more variables:

my $url = qq!http://api.yellowbrix.com/api/?!; my $common = qq!method=xml&id=verizon&pass­word=xxxxxx;! # URL that returns list of YBX categories my $category_list_http = $url . qq!service=category_list&$common!; # URL to xml file on yellowbrix server. CATEGORY must be provided my $category_http_start = $url . qq!service=headlines&$common&category +=!; my $category_http_end = qw !&enabletext=1!; # URL to weather on yellowbrix server. weather query must be appended. my $weather_http = $url . qq!service=weather&$common&query=!;
because this will allow you to change the site, id or password, if necessary, in one place rather than many.

4. Code order

A general note of good programming practice. Usually subroutines appear right at the bottom of the main code rather than in the middle of them. This means that someone reading the code gets a better idea of code flow.

5. Don't escape your own SQL (use placeholders)

# mySQL escape character my $sql_escape_char = '\\'; ... $data{pub_source} =~ s/(')/$sql_escape_char'/gs; $data{headline} =~ s/(')/$sql_escape_char'/gs; $data{summary} =~ s/(')/$sql_escape_char'/gs;

You're using DBI, there should be no need whatsoever for you to be self-escaping these values. And if you can think of a need then you really should be using the DBD's quote function not your own.

sqlStatement ('INSERT articles ('. 'story_id,'. 'arrival_date,'. 'display_date,'. 'category,'. 'pub_source,'. 'xml_url,'. 'ybh_url,'. 'img_url,'. 'headline,'. 'summary,'. 'summary_url'. ') VALUES ('. "'$article_data{story_id}',". "'$article_data{arrival_date}',". "'$article_data{display_date}',". "'$article_data{category}',". "'$article_data{pub_source}',". "'$article_data{xml_url}',". "'$article_data{ybh_url}',". "'$article_data{img_url}',". "'$article_data{headline}',". "'$article_data{summary}',". "'$article_data{summary_url}'". ')');

Yup, as I feared. If you use placeholders here and update your sqlStatement subroutine here you will get much better escaping of your data. Consider the following case:

$article_data{summary_url} = "some stuff \'); drop table articles;";
After your escape attempt this will now cause you problems. Rely on DBI/DBD to do your quoting correctly!

6. Verify your inputs

It's also up to you to check that the data is meaningful before throwing it into your db.

sqlConnectToDB; sqlStatement ("TRUNCATE TABLE articles"); .... # retrieve xml file with all articles my $headline_xml = getHTTPAddress($category_http_start.uri_escape($cat +egory_name).$category_http_end);
You're throwing away your articles before you've even verified that you can connect to server, yet alone checked if there were any articles to retreive. Maybe this is okay, but I think it's a bug. You do the same with weather.

You should also consider making sure that the values you really want to have data in them aren't null, and maybe doing some other sanity checking.

I hope this helps.
jarich

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://314770]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2024-04-18 06:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found