Parsing "free form" documents is always tricky, and you may end up having to encode some special cases. But the trick is to find regularities. As a first step, here's my take:
The news items themselves include the titles, so I would say you can just skip everything up to the line of equal signs. Then, you can read each news item as a paragraph, and consider the first line to be the title.
The following snippet of code stores the news items in %db, with the title as the key and the "body" as the value (they could just as well be stored in an array, if you want to preserve the order).
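use strict;
use warnings;
my $f=0;    # set once the line of equal signs has been seen
my %db;     # title => body
$/="";      # paragraph mode: each read returns one blank-line-separated block
while (<>) {
    $f=1,next if /^==========/;
    next unless $f;
    # The first line is the title, the rest is the body
    my @item=split /\n/, $_, 2;
    $db{$item[0]}=defined $item[1] ? $item[1] : "";
}
foreach (keys %db) {
    print "Title: $_\nBody: $db{$_}";
}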
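If the order of the items matters, the same idea works with an array instead of a hash. This is only a minimal sketch: the sub name parse_items and the sample text are made up for illustration, and it assumes the whole input has already been read into one string.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into news items, keeping file order.
# Assumes items are blank-line-separated paragraphs that follow a line
# of '=' signs, and that the first line of each item is its title.
sub parse_items {
    my ($text) = @_;
    my @items;
    my $seen_separator = 0;
    for my $para (split /\n{2,}/, $text) {
        if ($para =~ /^==========/) { $seen_separator = 1; next; }
        next unless $seen_separator;
        my ($title, $body) = split /\n/, $para, 2;
        push @items, { title => $title, body => defined $body ? $body : '' };
    }
    return @items;
}

# Made-up sample input, only to show the call
my $sample = "Title A\nTitle B\n\n==========\n\n"
           . "Title A\nSource: X\nText of A\n\nTitle B\nText of B\n";
for my $item (parse_items($sample)) {
    print "Title: $item->{title}\nBody: $item->{body}\n\n";
}
```

Each element is a small hash reference, so further fields (headers, date, and so on) can be added later without changing the loop.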
A further step would be to parse the body. There again, the trick is to find any regularities. In the example data you gave, there are 3 lines of "headers" followed by the text. If this is always the case, something like this could do the trick:
my @body=split /\n/, $body, 4;
And you would end up with the three headers in @body[0,1,2] and the text in $body[3]. If the "three header lines" rule does not apply, you could use some other heuristic. For example, are header lines always less than 40 characters in length? Then you could use something like this (untested):
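my @lines=split /\n/, $body;
my @hdr;
# Move the leading lines of 40 characters or fewer into @hdr
while (@lines && length($lines[0])<=40) {
    push @hdr, shift @lines;
}
# Whatever remains is the text
$body=join("\n", @lines);
This leaves all the initial lines of 40 characters or fewer in @hdr, and the rest re-joined with newlines in $body. Testing @lines instead of the shifted value keeps the loop from stopping early on an empty line or a line containing just "0".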
--ZZamboni
If the titles always appear first in their own block, then it would be useful to extract them first and store them away somewhere. Then, as you process each "message" block, you could mark each title to show that you have found and processed the corresponding message. This would give you a useful check at the end that you haven't missed anything.
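A rough sketch of that check, assuming you already have one list of the titles announced in the first block and one list of the titles of the items you actually parsed (the sub name missing_titles and the sample data are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: compare the titles announced up front with the
# titles of the items actually parsed. Returns the announced titles
# that were never matched by a parsed item.
sub missing_titles {
    my ($announced, $parsed) = @_;       # two array references
    my %found = map { $_ => 0 } @$announced;
    for my $title (@$parsed) {
        $found{$title} = 1 if exists $found{$title};
    }
    return grep { !$found{$_} } @$announced;
}

# Made-up data, only to show the call
my @announced = ('Title A', 'Title B', 'Title C');
my @parsed    = ('Title A', 'Title C');
print "Missing item: $_\n" for missing_titles(\@announced, \@parsed);
```

The comparison is an exact string match, so stray whitespace or different capitalization in the title block would need to be normalized first.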
A different way of looking at things would be to provide a form-based way for the users to input the article. Then you could have fields for title, source, date, text, etc. And of course you have more control and can do validation and reformatting at the data entry point.
I definitely suggest you use a state machine. It'll help you with the maintenance of the parser in case the format changes (and chances are it will).
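Something along these lines, as a sketch only. The three states and the input layout (a title list, a line of equal signs, then blank-line-separated items whose first line is the title) are assumptions taken from the discussion above, and parse_with_states is just an illustrative name:

```perl
use strict;
use warnings;

# Sketch of a line-oriented state machine.
# PREAMBLE: everything before the line of '=' signs
# TITLE:    waiting for the first line of the next item
# BODY:     collecting lines until a blank line ends the item
sub parse_with_states {
    my (@lines) = @_;
    my $state = 'PREAMBLE';
    my (@items, $current);
    for my $line (@lines) {
        if ($state eq 'PREAMBLE') {
            $state = 'TITLE' if $line =~ /^==========/;
        }
        elsif ($state eq 'TITLE') {
            next if $line =~ /^\s*$/;          # skip blank lines between items
            $current = { title => $line, body => [] };
            push @items, $current;
            $state = 'BODY';
        }
        elsif ($state eq 'BODY') {
            if ($line =~ /^\s*$/) { $state = 'TITLE' }   # blank line ends the item
            else                  { push @{ $current->{body} }, $line }
        }
    }
    return @items;
}

# Made-up sample input, only to show the call
my @input = split /\n/,
    "Title A\n==========\n\nTitle A\nText of A\n\nTitle B\nText of B";
for my $item (parse_with_states(@input)) {
    print "$item->{title}: @{ $item->{body} }\n";
}
```

If the format changes, for example if header lines get their own marker, that becomes one more state and one more branch rather than a rewrite. Note that this sketch treats every blank line as the end of an item, so bodies with several paragraphs would need an extra rule.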