http://qs321.pair.com?node_id=11141594

bliako has asked for the wisdom of the Perl Monks concerning the following question:

Deer Monkees

I want to retrieve emails from a remote server which supports IMAP (and, I guess, POP3 too). I want to search for messages by subject (or content if that's easy), or just get the unread messages, or get the N most recent messages from a folder. After downloading the emails I want to parse them, because they can contain attachments (images, PDFs etc.), and then save the attachments to disk. Finally, I want to mark the message on the server as 'read' or even delete it.

I have been searching for a couple of hours and found that Mail::IMAPClient is recently updated but I failed to tell it to give me the message in a format that Email::MIME can parse. OTOH Net::IMAP::Client worked without too much hassle and Email::MIME seems to be able to parse its fetched messages but I don't know how to save email and its attachments to disk. And it's not recently updated.

To summarise: I am open to using any module(s) which can fetch my emails (searching for 'unread' is great), parse them, give me subject, content and date, and offer me a simple way to save attachments to disk, preferably with their original filenames. Any suggestions AND example code?

I can offer this snippet for Net::IMAP::Client and Email::MIME, which seems to work OK, but I don't know how to save the parts:

use Net::IMAP::Client;
use Email::MIME;
use Data::Dumper;

my $imap = Net::IMAP::Client->new(
    server => 'xyz.com',
    user   => 'xxx',
    pass   => 'xxx',
    ssl    => 0,
    port   => 143,
);
die "failed to instantiate $@." unless defined $imap;

$imap->login
    or die "Could not connect: ".$imap->last_error."\n";

my @folders = $imap->folders
    or die "List folders error: ", $imap->last_error, "\n";
print "Folders: @folders\n";

# get total # of messages, # of unseen messages etc. (fast!)
my $status = $imap->status(@folders); # hash ref!
print Dumper($status);

$imap->select('INBOX')
    or die "Select 'INBOX' error: ", $imap->last_error, "\n";

# do a reverse-date search (most recent first)
my $messages = $imap->search('ALL', '^DATE');
for my $amid (@$messages){
    print "message id: $amid\n";
    my $msg = $imap->get_rfc822_body($amid);
    my $parsed = Email::MIME->new($msg);
    die "failed to parse" unless $parsed;
    my @parts = $parsed->parts; # These will be Email::MIME objects, too.
    my $decoded = $parsed->body;
    my $non_decoded = $parsed->body_raw;
    for my $apart (@parts){
        # indeed they are Email::MIME, how do I save them???
        print "got this email part: $apart\n"
    }
    my $content_type = $parsed->content_type;
    last;
}
$imap->logout();

and this to get me started with Mail::IMAPClient:

use Mail::IMAPClient;
use Email::MIME;
use Data::Dumper;

my $imap = Mail::IMAPClient->new(
    Server   => 'abc.com',
    User     => 'xxx',
    Password => 'xxx',
    Ssl      => 1,
    Uid      => 1,
    # Starttls => 1,
);
die "failed to instantiate." unless defined $imap;

$imap->connect
    or die "Could not connect: $@\n";

my $folders = $imap->folders
    or die "List folders error: ", $imap->LastError, "\n";
print "Folders: @$folders\n";

$imap->select( 'INBOX' )
    or die "Select 'INBOX' error: ", $imap->LastError, "\n";

my @messages = $imap->messages;
my $msg = pop @messages;
my $obj = $imap->get_bodystructure($msg);
print Dumper($obj);

bw, bliako

Replies are listed 'Best First'.
Re: How to get started with scraping my IMAP emails
by Corion (Patriarch) on Feb 23, 2022 at 18:52 UTC

    I've had good success with Mail::IMAPClient, and then using MIME::Parser to extract parts.

    Using MIME::Parser directly isn't all that great, because I found that I often need to recurse through the matryoshka-doll nested MIME parts myself in the code, and finding the relevant part (text/html, or text/plain, or any of these, found recursively) is nasty. But I have the feeling that this is just how MIME parts are...

      Do you mean that there is no built-in and easy way in MIME::Parser to unwrap an email with attachments, fetched from the server, into a dir, each attachment in its own file, with the filename specified by the message (if any, or else a recommended one)? I have managed to tell MIME::Parser to save under a dir, but it does not unwrap the message: message text and attachments are all in one big file. I also found this post of yours, Re^3: read email-message. But I hesitate to start handling all these cases. I am still looking for a built-in solution.

        Yes, that's unfortunately what I mean - at least I am unaware of a good/convenient way of handling that. I've written the linked subroutine in at least two or three incarnations, but I'm not sure that this would generalize in any meaningful way. Maybe having a way to query "the email" would make the interface better, like maybe CSS-style selectors or an SQL interface to select all attachments or everything that is part of an included mail etc. - but I haven't progressed to any approach that is not ad-hoc.

Re: How to get started with scraping my IMAP emails
by Fletch (Bishop) on Feb 23, 2022 at 17:36 UTC

    Rather than use IMAP directly I've had good success in the past with Mail::Box and offlineimap. Use the latter to get a maildir copy of things and then manipulate the messages locally with the former. This has the added benefit of letting me use things like mu / mu4e for reading my mail.

    Edit: Also nmh is handy for manipulating mail from the command line once you've got a local maildir.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: How to get started with scraping my IMAP emails
by talexb (Chancellor) on Feb 23, 2022 at 21:24 UTC

    I'm also using Mail::IMAPClient, then $client->select ( 'Inbox.Github' ) to choose a specific folder and $list = $client->search ( 'UNSEEN' ); to grab any unread messages from my github folder. I can then use $client->set_flag ( 'Seen', @list_of_msgs ) to mark a bunch of them as seen. It's a great way for me to automate the whole "I can ignore all of these github messages because that PR has already been approved" process.
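The select / search / set_flag sequence described above could be sketched roughly as follows. This is a minimal sketch, not talexb's actual code: the helper name fetch_unseen, the folder name and the connection details in the usage comment are hypothetical, while select, search, message_string, set_flag and LastError are Mail::IMAPClient methods. The helper takes an already connected client, so it can be tried against any object that offers those methods.

```perl
use strict;
use warnings;

# Hypothetical helper: return { message id => raw message text } for every
# unread message in $folder, then flag the fetched messages as Seen.
# $imap is assumed to be a connected, logged-in Mail::IMAPClient object.
sub fetch_unseen {
    my ($imap, $folder) = @_;

    $imap->select($folder)
        or die "Select '$folder' error: ", $imap->LastError, "\n";

    # In list context search() returns the matching message ids
    # (UIDs when the client was created with Uid => 1).
    my @ids = grep { defined } $imap->search('UNSEEN');

    my %raw;
    for my $id (@ids) {
        # message_string() returns headers and body together
        my $string = $imap->message_string($id);
        next unless defined $string;
        $raw{$id} = $string;
    }

    # Flag only the messages that were actually fetched
    my @fetched = sort { $a <=> $b } keys %raw;
    $imap->set_flag('Seen', @fetched) if @fetched;

    return \%raw;
}

# Usage, with placeholder server and credentials:
#   use Mail::IMAPClient;
#   my $imap = Mail::IMAPClient->new(
#       Server => 'abc.com', User => 'xxx', Password => 'xxx',
#       Ssl => 1, Uid => 1,
#   ) or die "Could not connect: $@\n";
#   my $unseen = fetch_unseen($imap, 'INBOX');
#   print "$_: ", length($unseen->{$_}), " bytes\n" for keys %$unseen;
#   $imap->logout;
```

Fetching a full message normally sets the Seen flag on the server as a side effect unless the client was created with Peek => 1, so the explicit set_flag call mainly matters in that case.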

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Re: How to get started with scraping my IMAP emails
by NERDVANA (Deacon) on Feb 23, 2022 at 22:02 UTC
    Here are some code snippets you might find helpful, from stuff from work I wrote that I probably can't share in full:

    my $p= MIME::Parser->new;
    $p->output_to_core(1);
    $p->parse($msg_or_fh);

    # Find every MIME part which is not a container for other parts
    sub _leaf_parts {
        my @parts= $_->parts;
        @parts? ( map { _leaf_parts() } @parts ) : ( $_ )
    }
    my @leaf_parts= map { _leaf_parts() } $email;

    # Open a handle to each part which is an attachment
    my @attachments= map +{
        name         => $_->head->recommended_filename,
        content_type => _decoded_mime_header($_->head, 'Content-Type'),
        handle       => $_->bodyhandle->open('r'),
        mimepart     => $_,
        email        => $email
    }, grep length($_->head->recommended_filename//''), @leaf_parts;

    # Convert zipfile attachments to the list of files within
    @attachments= map { $_->{name} =~ /\.zip$/? _extract_zipfile($_) : ($_) } @attachments;

    # Takes one file info, and returns a list of file infos for each file within the zip file.
    # Since these are not directly MIME parts, they are simply:
    # {
    #   name => $original_filename,
    #   handle => $io_handle
    # }
    sub _extract_zipfile {
        my ($file)= @_;
        my @files;
        my $zipfile= IO::Uncompress::Unzip->new($file->{handle})
            or die "Can't open zip file: $UnzipError";
        my $status;
        for ($status= 1; $status > 0; $status= $zipfile->nextStream()) {
            my $name= $zipfile->getHeaderInfo->{Name};
            $log->info("Extracting $name from zip file");
            my $tmp= File::Temp->new(TEMPLATE => 'email-zip-content-XXXXXXX');
            my $buf;
            while (($status= $zipfile->read($buf)) > 0) {
                $tmp->print($buf) or die;
            }
            last if $status < 0;
            push @files, { name => $name, handle => $tmp };
        }
        die "Error processing zip file" if $status < 0;
        return @files;
    }
Re: How to get started with scraping my IMAP emails
by Discipulus (Canon) on Feb 24, 2022 at 08:20 UTC
Re: How to get started with scraping my IMAP emails
by bliako (Monsignor) on Mar 01, 2022 at 15:44 UTC

    Thank you all for your insights and shared code. I have not replied since because I am still struggling with this. I have solved the first part: fetching a message from server, thanks to your input. I do something like this:

    use Mail::IMAPClient;
    use MIME::Parser;
    use Email::MIME;
    use Data::Dumper;

    my $imap = Mail::IMAPClient->new(
        Server   => 'abc.com',
        User     => 'xxx',
        Password => 'xxx',
        Ssl      => 1,
        Uid      => 1,
        # Starttls => 1,
    );
    die "failed to instantiate." unless defined $imap;

    $imap->connect
        or die "Could not connect: $@\n";

    my $folders = $imap->folders
        or die "List folders error: ", $imap->LastError, "\n";
    print "Folders: @$folders\n";

    $imap->select( 'INBOX' )
        or die "Select 'INBOX' error: ", $imap->LastError, "\n";

    my $list = $imap->search('SUBJECT', 'a new email');
    for my $msgid (@$list){
        my $from  = $imap->get_header( $msgid, "From" );
        my $subj  = $imap->get_header( $msgid, "Subject" );
        my $bsdat = $imap->fetch( $msgid, "bodystructure" );
        my $bss   = $imap->body_string($msgid);

        my $parser = MIME::Parser->new();
        $parser->output_to_core(0);
        # this saves message IN ONE BIG FILE, text+attachments together!!!
        # and the extension is '.txt'!!!!
        $parser->extract_nested_messages(1);
        $parser->output_under('./out');
        my $entity = $parser->parse_data($bss);
        # $entity->parts does not give me the parts
        # even if message is 'Content-type: MULTIPART/mixed'
    }

    I am still struggling with the 2nd part: unwrap a message to local disk, each attachment on its own file. And I am looking for a way to do that seemingly simple and solved-by-now problem either by MIME::Parser or some other package. Alas the prospects look bleak.

    The above was put together with code from NERDVANA, Discipulus, talexb !

    p.s. edit: my bandwidth is very limited, so in order to test this I have set up a minimal mail server (dovecot) on my Linux box without SMTP or SSL (to keep things simple). I used Thunderbird to copy my multipart test email from a "real" email account's INBOX to the localhost dummy (using 'copy to' in Thunderbird), and now I can do the testing without using the net or bothering my MailSP. Of course I could have just saved the email into a file and read from there ...
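One possible explanation for the single '.txt' file, offered as an assumption rather than a confirmed diagnosis: Mail::IMAPClient's body_string returns the message body without its headers, so MIME::Parser never sees the top-level Content-Type line and treats everything as plain text. Handing the parser the complete message (for example from message_string) should let it split the parts itself. A minimal sketch under that assumption follows; the helper names unpack_message and _leaf_paths are hypothetical, while output_dir, extract_nested_messages, parse_data, parts and bodyhandle->path are MIME::Parser / MIME::Entity / MIME::Body calls.

```perl
use strict;
use warnings;
use MIME::Parser;
use File::Path qw(make_path);

# Hypothetical helper: let MIME::Parser decode each part of a complete
# message (headers included) into its own file under $out_dir, and
# return the paths of the files it wrote.
sub unpack_message {
    my ($raw_message, $out_dir) = @_;
    make_path($out_dir);

    my $parser = MIME::Parser->new;
    $parser->output_to_core(0);           # bodies go to disk, not memory
    $parser->output_dir($out_dir);        # one flat directory
    $parser->extract_nested_messages(1);  # descend into message/rfc822 parts

    my $entity = $parser->parse_data($raw_message);
    return _leaf_paths($entity);
}

# Recursively collect the on-disk path of every non-container part.
sub _leaf_paths {
    my ($entity) = @_;
    my @parts = $entity->parts;
    return map { _leaf_paths($_) } @parts if @parts;

    my $body = $entity->bodyhandle or return;
    my $path = $body->path;
    return defined $path ? ($path) : ();
}
```

With a connected client this might be called as unpack_message($imap->message_string($msgid), './out'). MIME::Parser picks the file names itself: it uses the name recommended by the part when it considers that name safe, and generates one otherwise.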

      In my (not really elegant, not really recommended) approaches, I recursively descend down the MIME message tree and usually output the Content-Type headers, to get a first view of the mail structure:

      use feature 'signatures';
      no warnings 'experimental::signatures';

      sub dump_parts($msg, $level=0) {
          print " " x $level, $msg->content_type, "\n";
          for my $part ($msg->parts) {
              dump_parts($part, $level+1);
          }
      }
      dump_parts( $entity );

      Then, I usually modify dump_parts to actually handle the content types (and other criteria) of the parts I'm interested in.

      This discussion has given me the idea that maybe having an SQL, XPath or CSS-like query language for the parts could improve things, but so far, I haven't come up with a good enough concept to implement this.

        Ouch! Can you trust all those email apps to map the same content to the same MIME content type consistently?

        In the meantime I went back to Email::MIME and had good results (for my one multipart test email) with its walk_parts().

        my $client = Mail::IMAPClient->new(...);
        # ... search mail box
        my $parsed = Email::MIME->new($client->message_string($msgid));
        my @parts_to_save;
        $parsed->walk_parts(sub { push @parts_to_save, $_[0] });
        # the [0] is the whole message, rest are all parts including nested
        for (@parts_to_save){
            print $_->as_string
        }

        Email::MIME has also a t/nested-parts.t which I used to check that it works fine for nested parts.
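Building on the walk_parts approach above, saving each attachment to its own file might look like the sketch below. It is a minimal sketch under stated assumptions, not code from the thread: the helper name save_attachments is hypothetical, only leaf parts that declare a filename are written, and a later part with the same name overwrites an earlier one. walk_parts, subparts, filename and body are Email::MIME methods.

```perl
use strict;
use warnings;
use Email::MIME;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: write every leaf part that carries a filename
# into $out_dir and return the list of paths written.
sub save_attachments {
    my ($raw_message, $out_dir) = @_;
    make_path($out_dir);

    my $email = Email::MIME->new($raw_message);
    my @saved;

    $email->walk_parts(sub {
        my ($part) = @_;
        return if $part->subparts;      # multipart container, not a leaf

        my $name = $part->filename;     # undef when the part declares none
        return unless defined $name && length $name;

        # Never trust a sender-supplied path: keep only the last component
        $name =~ s{.*[/\\]}{}s;
        return if $name eq '' || $name =~ /^\.+$/;

        my $path = File::Spec->catfile($out_dir, $name);
        open my $fh, '>:raw', $path or die "Cannot write $path: $!\n";
        print {$fh} $part->body;        # body() undoes the transfer encoding
        close $fh or die "Cannot close $path: $!\n";
        push @saved, $path;
    });

    return @saved;
}
```

With the client from the snippet above, this could be called as save_attachments($client->message_string($msgid), './out'). Passing a true value to filename, as in $part->filename(1), asks Email::MIME to invent a name for parts that have none, which would be one way to also save unnamed text or HTML bodies.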

        And it seems I am leaving the dreadful world of email.

Re: How to get started with scraping my IMAP emails
by LanX (Saint) on Feb 23, 2022 at 22:38 UTC
    > Deer Monkees

    Reminds me of

    • The Deer Monkees

      Epic war movie about an American casting band trapped in a Vietnamese jungle prison, where they are forced to sing Russian roulette songs.

    ;)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery