PerlMonks  

Re: How to get started with scraping my IMAP emails

by bliako (Monsignor)
on Mar 01, 2022 at 15:44 UTC ( [id://11141730]=note )


in reply to How to get started with scraping my IMAP emails

Thank you all for your insights and shared code. I have not replied until now because I am still struggling with this. Thanks to your input I have solved the first part, fetching a message from the server. I do something like this:

use strict;
use warnings;
use Mail::IMAPClient;
use Email::MIME;
use MIME::Parser;
use Data::Dumper;

# new() connects and logs in by itself when Server, User and Password are given
my $imap = Mail::IMAPClient->new(
    Server   => 'abc.com',
    User     => 'xxx',
    Password => 'xxx',
    Ssl      => 1,
    Uid      => 1,
    # Starttls => 1,
);
die "Could not connect: $@\n" unless defined $imap;

my $folders = $imap->folders
    or die "List folders error: ", $imap->LastError, "\n";
print "Folders: @$folders\n";

$imap->select( 'INBOX' )
    or die "Select 'INBOX' error: ", $imap->LastError, "\n";

my $list = $imap->search('SUBJECT', 'a new email');
for my $msgid (@$list){
    my $from  = $imap->get_header( $msgid, "From" );
    my $subj  = $imap->get_header( $msgid, "Subject" );
    my $bsdat = $imap->fetch( $msgid, "bodystructure" );
    # note: body_string() returns only the body, without the message headers
    my $bss   = $imap->body_string($msgid);

    my $parser = MIME::Parser->new();
    $parser->output_to_core(0);
    # this saves message IN ONE BIG FILE, text+attachments together!!!
    # and the extension is '.txt'!!!!
    $parser->extract_nested_messages(1);
    $parser->output_under('./out');
    my $entity = $parser->parse_data($bss);
    # $entity->parts does not give me the parts
    # even if message is 'Content-type: MULTIPART/mixed'
}

I am still struggling with the second part: unpacking a message to local disk, with each attachment in its own file. I am looking for a way to do this seemingly simple and surely long-solved task, either with MIME::Parser or with some other package. Alas, the prospects look bleak.
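One possible explanation, offered as a guess rather than a diagnosis: body_string() returns the body without the headers, so MIME::Parser never sees the top-level Content-Type with its boundary and may treat everything as a single text part. Below is a minimal sketch of handing MIME::Parser a complete message instead. The two-part test message, the temp directory and the saved_files() helper are made up for illustration; in the IMAP loop the string would presumably come from $imap->message_string($msgid).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use MIME::Parser;

# Hypothetical test message: headers AND body, as one string.
my $raw = <<'EOM';
From: sender@example.com
Subject: a new email
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

Hello, see attachment.
--XYZ
Content-Type: text/csv; name="data.csv"
Content-Disposition: attachment; filename="data.csv"

a,b,c
--XYZ--
EOM

my $outdir = tempdir(CLEANUP => 1);   # throwaway directory for this sketch

my $parser = MIME::Parser->new;
$parser->output_to_core(0);           # write decoded bodies to disk
$parser->extract_nested_messages(1);
$parser->output_dir($outdir);         # all files directly in $outdir

my $entity = $parser->parse_data($raw);

# Hypothetical helper: walk the entity tree and return [type, path]
# for every leaf part that has a body on disk.
sub saved_files {
    my ($ent) = @_;
    my @parts = $ent->parts;
    return map { saved_files($_) } @parts if @parts;
    my $body = $ent->bodyhandle or return;
    return [ $ent->mime_type, $body->path ];
}

my @files = saved_files($entity);
printf "%-12s %s\n", @$_ for @files;
```

With the headers present, each leaf part should end up in its own file, named from the part's filename where it has one.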

The above was put together with code from NERDVANA, Discipulus and talexb!

p.s. edit: my bandwidth is very limited, so in order to test this I have set up a minimal mail server (dovecot) on my Linux box, without SMTP or SSL (to keep things simple). I used Thunderbird to copy my multipart test email from a "real" email account's INBOX to the localhost dummy (using 'copy to' in Thunderbird), and now I can test without using the net or bothering my mail service provider. Of course I could have just saved the email into a file and read it from there ...

Replies are listed 'Best First'.
Re^2: How to get started with scraping my IMAP emails
by Corion (Patriarch) on Mar 01, 2022 at 16:16 UTC

    In my (not really elegant, not really recommended) approaches, I recursively descend the MIME message tree and usually print the Content-Type headers, to get a first view of the mail structure:

    sub dump_parts($msg, $level=0) {
        print " " x $level, $msg->content_type, "\n";
        for my $part ($msg->parts) {
            dump_parts($part, $level+1);
        }
    }
    dump_parts( $entity );

    Then, I usually modify dump_parts to actually handle the content types (and other criteria) of the parts I'm interested in.
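A minimal sketch of what such a modified walker might look like, assuming the message is an Email::MIME object (the thread does not say which class $msg is above). It uses subparts() rather than parts(), because Email::MIME's parts() returns the object itself for a single-part message, which would make the recursion loop forever. The test message, the %handler_for table and handle_parts() are made up for illustration.

```perl
use strict;
use warnings;
use Email::MIME;

# Hypothetical two-part test message.
my $raw = <<'EOM';
From: sender@example.com
Subject: a new email
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

Hello, see attachment.
--XYZ
Content-Type: text/csv; name="data.csv"
Content-Disposition: attachment; filename="data.csv"

a,b,c
--XYZ--
EOM

# Hypothetical handlers, keyed by bare MIME type (no parameters).
my @handled;
my %handler_for = (
    'text/plain' => sub { push @handled, [ 'text/plain', length $_[0]->body ] },
    'text/csv'   => sub { push @handled, [ 'text/csv', $_[0]->filename ] },
);

sub handle_parts {
    my ($msg, $level) = @_;
    $level //= 0;

    # content_type returns the whole header value; keep only "type/subtype".
    my ($type) = lc($msg->content_type // 'text/plain') =~ m{^\s*([^;\s]+)};
    print '  ' x $level, $type, "\n";

    # Act on the types we care about, ignore the rest.
    if (my $handler = $handler_for{$type}) {
        $handler->($msg);
    }

    # subparts() is empty for a leaf, so the recursion terminates.
    handle_parts($_, $level + 1) for $msg->subparts;
}

handle_parts(Email::MIME->new($raw));
```

Keeping the handlers in a hash means a new content type needs only a new entry, not a change to the walker itself.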

    This discussion has given me the idea that maybe having an SQL, XPath or CSS-like query language for the parts could improve things, but so far, I haven't come up with a good enough concept to implement this.

      Ouch! Can you trust all those email apps to map the same content to the same MIME content type consistently?

      In the meantime I went back to Email::MIME and had good results (for my one multipart test email) with its walk_parts().

      my $client = Mail::IMAPClient->new(...);
      # ... search mail box
      my $parsed = Email::MIME->new($client->message_string($msgid));
      my @parts_to_save;
      $parsed->walk_parts(sub { push @parts_to_save, $_[0] });
      # the [0] is the whole message, rest are all parts including nested
      for (@parts_to_save){
          print $_->as_string
      }

      Email::MIME also ships a test, t/nested-parts.t, which I used to check that it works fine for nested parts.
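For the original goal of one file per attachment, the walk_parts() approach could be extended roughly as follows. This is only a sketch under assumptions: the test message and the temp directory are made up, multipart containers are skipped by checking subparts(), and filename(1) asks Email::MIME to invent a name when the part does not carry one.

```perl
use strict;
use warnings;
use Email::MIME;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical test message; in practice this would be the string
# returned by $client->message_string($msgid).
my $raw = <<'EOM';
From: sender@example.com
Subject: a new email
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

Hello, see attachment.
--XYZ
Content-Type: text/csv; name="data.csv"
Content-Disposition: attachment; filename="data.csv"

a,b,c
--XYZ--
EOM

my $outdir = tempdir(CLEANUP => 1);   # throwaway directory for this sketch
my @written;

Email::MIME->new($raw)->walk_parts(sub {
    my ($part) = @_;
    return if $part->subparts;        # container only, nothing to save

    # basename() guards against directory components in a supplied name.
    my $name = basename($part->filename(1));
    my $path = File::Spec->catfile($outdir, $name);

    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} $part->body;          # body() undoes the transfer encoding
    close $fh or die "Cannot close $path: $!";
    push @written, $path;
});

print "$_\n" for @written;
```

Two parts with the same filename would overwrite each other here; a real script would probably want to make the names unique first.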

      And it seems I am leaving the dreadful world of email.
