Re^3: Extracting a (UK) Address

by gone2015 (Deacon)
on Jan 02, 2009 at 12:06 UTC

in reply to Re^2: Extracting a (UK) Address
in thread Extracting a (UK) Address

So you are looking for three or more lines together, the last ending in something that looks like a post code...

$letter =~ m/((?:[^\n]+\n){2,}[^\n]*?[a-zA-Z]+[0-9]+\s+[0-9]+[a-zA-Z]+\s*?\n)\s*?\n/

...seemed to do the trick, with the entire letter read into $letter. Obviously this will miss addresses with no post code, or with a really rubbish one. Alternatively, you could just extract all groups of 3 or more lines, and then apply some more cunning address recogniser to the result -- perhaps from one of the modules recommended elsewhere.
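
As a rough sketch of how the regex above might be used, here it is wrapped in a small sub and run against a made-up letter. The sample text and the name extract_address are assumptions for illustration only; the pattern itself is unchanged.

```perl
use strict;
use warnings;

# Hypothetical sample letter; the name and address are invented.
my $letter = <<'END';
Dear Sir,

Please send the parcel to:

Mr J Smith
12 High Street
Anytown
AB1 2CD

Many thanks,
A Customer
END

# Return the first block of three or more lines whose last line ends
# in something shaped like a post code, or nothing if there is none.
sub extract_address {
    my ($text) = @_;
    if ($text =~ m/((?:[^\n]+\n){2,}[^\n]*?[a-zA-Z]+[0-9]+\s+[0-9]+[a-zA-Z]+\s*?\n)\s*?\n/) {
        return $1;
    }
    return;
}

my $address = extract_address($letter);
print defined $address ? $address : "no address found\n";
```

Two things to note about the pattern as written: the trailing \s*?\n means the address has to be followed by a blank line, so an address that is the very last thing in the letter is not found; and a post code such as EC1A 1BB, whose first half ends in a letter, does not fit the letters-digits, space, digits-letters shape.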

(I haven't tried to figure out how much work this is asking the regex engine to do on difficult input. I'd worry about that only if it becomes a problem.)
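
The other approach mentioned above, pulling out every group of 3 or more lines and then filtering, could be sketched roughly as follows. The names line_blocks and looks_like_address are hypothetical, and the post code test inside looks_like_address is only a loose stand-in of my own for a proper recogniser (such as one of the modules recommended elsewhere).

```perl
use strict;
use warnings;

# Hypothetical sample letter; the name and address are invented.
my $letter = <<'END';
Dear Sir,

Thank you for your letter.
I am sorry for the delay
in replying to you.

Mr J Smith
12 High Street
Anytown
AB1 2CD

Yours faithfully,
A Customer
END

# Return every run of $min or more consecutive non-blank lines.
sub line_blocks {
    my ($text, $min) = @_;
    $min = 3 unless defined $min;
    my @blocks;
    # Blocks are separated by lines that are empty or whitespace only.
    for my $chunk (split /\n\s*\n/, $text) {
        my @lines = grep { /\S/ } split /\n/, $chunk;
        push @blocks, join("\n", @lines) if @lines >= $min;
    }
    return @blocks;
}

# Stand-in recogniser (an assumption, not a full UK post code check):
# does the last line of the block end in something post-code shaped?
sub looks_like_address {
    my ($block) = @_;
    my $last = (split /\n/, $block)[-1];
    return $last =~ /\b[A-Za-z]{1,2}[0-9][0-9A-Za-z]?\s*[0-9][A-Za-z]{2}\s*\z/ ? 1 : 0;
}

my @candidates = grep { looks_like_address($_) } line_blocks($letter);
print "$_\n---\n" for @candidates;
```

Keeping the block splitting separate from the recogniser means the second stage can be swapped for something stricter without touching the first.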

Re^4: Extracting a (UK) Address
by jvector (Friar) on Jan 04, 2009 at 20:33 UTC
    The module Geo::Postcode may be of use in recognising the last line of a block as a (UK) postcode. Apparently there are a few gotchas among UK postcodes that diverge from the expected patterns.

    It may be a bit of a sledge-hammer to crack a nut, since the module can also do lots of good geo stuff you don't need, but it could still be helpful. From its documentation:

    "Geo::Postcode will accept full or partial UK postcodes, validate them against the official spec, separate them into their significant parts, translate them into map references or co-ordinates and calculate distances between them. It does not check whether the supplied postcode exists: only whether it is well-formed according to British Standard 7666."

