http://qs321.pair.com?node_id=11102938


in reply to Solving possible missing links

Scanning the links that point to external sites is also useful for catching spam links, so this is indeed desirable.

I currently lack the time to do it myself, and database access is somewhat scarce, but the relevant DB schema is (roughly):

create table node (
    node_id        integer not null unique primary key,
    type_nodetype  integer not null references node,
    author_user    integer not null references node,
    lastedit       timestamp
);

create table user (
    user_id integer not null references node
);

create table document (
    document_id integer not null unique primary key references node(node_id),
    doctext     text    not null default ''
);

And Real, Working SQL to query these tables (also posted at Replies with outbound links, but that node is accessible only to gods):

select node_id, doctext
from node
left join document on node_id = document_id
where lastedit > date_sub( current_date(), interval 10 day )
  and type_nodetype = 11 -- note
  and doctext like '%http://%'
order by lastedit desc

This SQL should be refined to also catch https:// links, and then some Perl code needs to be written to verify that the text is an actual link.
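A minimal sketch of the refined query wired up through DBI, assuming a MySQL backend; the DSN and credentials are placeholders, and the loop merely prints candidates (the verification step is sketched further below):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder DSN and credentials; adjust for the real database.
my $dbh = DBI->connect( 'dbi:mysql:database=everything',
                        'user', 'password', { RaiseError => 1 } );

# Refined query: catch both http:// and https:// links in recent notes.
my $sth = $dbh->prepare(<<'SQL');
select node_id, doctext
from node
left join document on node_id = document_id
where lastedit > date_sub( current_date(), interval 10 day )
  and type_nodetype = 11 -- note
  and ( doctext like '%http://%' or doctext like '%https://%' )
order by lastedit desc
SQL

$sth->execute;
while ( my ($node_id, $doctext) = $sth->fetchrow_array ) {
    # Each candidate still needs a check that the URL really is a
    # link, not just link-like text.
    print "candidate: $node_id\n";
}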

Test cases for text containing links would be, for example:

<p>It's right [https://cpants.cpanauthors.org/release/GWHAYWOOD/sendmail-pmilter-1.20_01|here].</p>
---
<a href="http://www.groklaw.net">Groklaw</a>
---
[href://http://www.perlmonks.org/?node=Tutorials|Monastery Tutorials]

Negative test cases would be:

<P><A>http://matrix.cpantesters.org/?dist=sendmail-pmilter%201.20_01</A></P>
---
"http://localhost:3000"
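A regex-based sketch of the verification step, run against the cases above; the function name has_outbound_link and the exact patterns are assumptions. The idea is that href attributes and PerlMonks-style [proto://...|label] bracket links count as real links, while bare URLs in the text do not:

#!/usr/bin/perl
use strict;
use warnings;

# A URL counts as a real link only when it appears inside an href
# attribute or a PerlMonks-style [proto://...|label] bracket link.
sub has_outbound_link {
    my ($text) = @_;
    return 1 if $text =~ m{href\s*=\s*["']?https?://}i;
    return 1 if $text =~ m{\[\s*(?:href://)?https?://[^|\]]+\|};
    return 0;
}

my @positive = (
    q{<p>It's right [https://cpants.cpanauthors.org/release/GWHAYWOOD/sendmail-pmilter-1.20_01|here].</p>},
    q{<a href="http://www.groklaw.net">Groklaw</a>},
    q{[href://http://www.perlmonks.org/?node=Tutorials|Monastery Tutorials]},
);
my @negative = (
    q{<P><A>http://matrix.cpantesters.org/?dist=sendmail-pmilter%201.20_01</A></P>},
    q{"http://localhost:3000"},
);

print has_outbound_link($_) ? "link:    ok\n" : "MISSED:  $_\n" for @positive;
print has_outbound_link($_) ? "FALSE+:  $_\n" : "no link: ok\n" for @negative;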

Ideally, we will be able to refine this code later to highlight outbound links that are not on the whitelist of Perlmonks links.
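A minimal sketch of that refinement, assuming the URI module for host extraction; the whitelist contents below are illustrative only and the real list of PerlMonks hosts may differ:

#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Illustrative whitelist of hosts whose links need no highlighting.
my %whitelist = map { $_ => 1 }
    qw( perlmonks.org www.perlmonks.org perlmonks.com www.perlmonks.com );

# Extract every http(s) URL from the text and return those whose
# host is not on the whitelist.
sub offsite_links {
    my ($text) = @_;
    my @offsite;
    while ( $text =~ m{(https?://[^\s"'|\]<>]+)}g ) {
        my $host = URI->new($1)->host;
        push @offsite, $1 unless $whitelist{$host};
    }
    return @offsite;
}

print "$_\n" for offsite_links(
    q{[https://www.perlmonks.org/?node=Tutorials|ok] and <a href="http://example.com/spam">spam</a>}
);
# prints: http://example.com/spam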