PerlMonks  

Extracting files from .7z using Perl

by zarath (Beadle)
on Jun 12, 2017 at 15:59 UTC ( [id://1192612]=perlquestion )

zarath has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody!

Quite new to Perl, but picking it up at a steady pace.

The point of the code given below is the following: I have 7 .7z files. The files are not that big (500 - 700 MB), but each one contains a folder which in turn contains around 2.2 million .txt files. Needless to say, Windows and 7-Zip are having a pretty difficult time extracting all the files from the archives (the estimated remaining time was around 1500 hours at 5% done, and my bosses and I don't have that kind of time).

I'm going to give up on extracting everything and want to try only extracting the files which have not yet been extracted and putting them into the correct subfolder (which is D:\Some\Specific\Folder\Archive\yyyy\mm\dd\ - based on the date in the filenames ; these folders already exist). All this using 1 little Perl-script so I can just start it up and let it run as long as it needs to without having to worry about it anymore until it's done.

    # !perl
    use strict; # because we should
    use warnings; # because we should

    my $base = 'D:\Some\Specific\Folder\Archive\\';

    # This might hurt your eyes, but for some obscure reason, DateTime is not installed and I can seriously not be bothered to install it myself
    my @years = ('2011','2012','2013','2014','2015','2016','2017');
    my @months = ('01','02','03','04','05','06','07','08','09','10','11','12');
    my @days = ('01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31');

    my $archive = glob qq($base.Gridfee0?.7z\\);
    #Tried glob qq($base.Gridfee0?.7z\\Gridfee*\\) and glob qq($base.Gridfee0?.7z) too, same result

    foreach my $year (@years) {
        foreach my $month (@months) {
            foreach my $day (@days) {
                my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt);
                # The zipfiles are zipped folders, so there is 1 extra layer to traverse, hence the \\Gridfee?\\
                # The name is definitely correct: for example Gridfee05.7z contains folder named Gridfee5
                # But possibly made a mistake in the syntax?
                foreach my $file (@files) {
                    my $extdir = '-o $base.$year\$month\$day\\';
                    system ('C:\Program Files\7-Zip\7z.exe',' e ',$archive,' -r',$extdir,$file,' n n') or die $!;
                    # Many things in Perl are a mystery to me and the use of 7z via Perl is most definitely one of them
                    # This is the syntax I have distilled from what info I could find on interaction between Perl and 7z
                    # and from the 'Help'-file that comes with the 7zip-program
                    # The ' n n' is my attempt at making 7zip skip the files which can already be found in $extdir
                    say('Moved '.$file.' to '.$extdir);
                    # Just a check to see what has been done, if anything
                }
            }
        }
    }

    exit 0;

The code as it is right now outputs absolutely nothing, it just seems to run for a couple of seconds and then stops and I have no idea why.

Can you spot my mistakes?

Replies are listed 'Best First'.
Re: Extracting files from .7z using Perl
by Tux (Canon) on Jun 12, 2017 at 16:10 UTC
      Did not even know that exists. Will have a look if I can get it installed. Thank you.
Re: Extracting files from .7z using Perl
by zentara (Archbishop) on Jun 12, 2017 at 16:22 UTC
    Hi, I don't run Windows, but just from general knowledge and the way your program is written, I would suspect the quoting in your extended command line options. For instance, what are the n n's for?
        system ('C:\Program Files\7-Zip\7z.exe',' e ',$archive,' -r',$extdir,$file,' n n') or die $!;
        # maybe n n should be \n\n ?
        system ('C:\Program Files\7-Zip\7z.exe', ' e ', $archive,' -r ,$extdir,$file' , "\n\n\") or die $!;
    or the suspicious:
        my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt);
    Or some other quoting inconsistencies. Your program silently dies. Remember, " and ' are not the same. " interpolates.

    Also start printing out your data arrays to make sure they were filled as a debugging aid.

    Also, just test extraction first. Can you get your script to do a 7z with the x option on the file? That will ensure it all works, before trying to extract files individually.
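
    A minimal sketch of that "test one plain extraction first" step, added here for illustration and not taken from the original reply. It also checks the exit status explicitly: system() returns 0 on success, so "system(...) or die" dies exactly when the command worked. The helper name run_or_die and the paths in the commented-out call are assumptions; -aos is, as far as I know, 7-Zip's switch for skipping files that already exist in the output folder.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as a list and die if it fails.
# system() returns 0 on success, so the status has to be compared with 0.
sub run_or_die {
    my @cmd    = @_;
    my $status = system(@cmd);
    if ($status == -1) {
        die "Could not start '$cmd[0]': $!\n";
    }
    if ($status != 0) {
        die sprintf("'%s' exited with code %d\n", $cmd[0], $status >> 8);
    }
    return 1;
}

# Example call (assumed paths, left commented out so the sketch runs anywhere).
# Every switch is its own list element, with no padding spaces around it.
# run_or_die('C:/Program Files/7-Zip/7z.exe', 'x',
#            'D:/Some/Specific/Folder/Archive/Gridfee05.7z',
#            '-oD:/temp/Gridfee05', '-aos');
```

    Passing the command as a list means Perl does not hand it to a shell, so the space in "Program Files" needs no extra quoting.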


    I'm not really a human, but I play one on earth. ..... an animated JAPH
Re: Extracting files from .7z using Perl
by poj (Abbot) on Jun 12, 2017 at 17:36 UTC

    Not sure what you are doing with the glob statements. Try this test program which should list the files in an archive that match the invoic_b2c etc pattern

        #!perl
        use strict;
        use warnings;

        my $base = 'D:/Some/Specific/Folder/Archive/';
        my $archive = $base.'Gridfee05.7z';
        my $exe = '"C:/Program Files/7-Zip/7z.exe"';

        my @files = grep /invoic_b2c_201\d{5}.*\.txt/, qx "$exe l $archive";
        print scalar @files." files found in $archive - see filelist.txt\n\n";

        open OUT,'>',"filelist.txt" or die "$!";
        print OUT join "\n",@files;
        close OUT;
    poj
Re: Extracting files from .7z using Perl
by Lotus1 (Vicar) on Jun 12, 2017 at 17:53 UTC

    It appears you are attempting concatenation with the '.' operator, but inside of qq(). In that context the period is just text and not an operator. Try getting your glob to work first and keep adding small tested chunks to your program until you have built up what you need. Also, the backslash '\' inside a glob is seen as quoting the next character. You need to use the forward slash '/' inside a glob.

    Just glancing at the rest of the program, I don't think millions of calls to 7z.exe is going to work like this. Inside your nested loop when you call 7zip each time it will need to extract the archive to find the file you need. It might make more sense to unzip everything at once to a temporary folder and then operate on the files, then remove the temporary folder.

        # !perl
        use strict; # because we should
        use warnings; # because we should

        my $base = 'C:\usr\pm\temp\test1\\';

        my $search = qq($base.Gridfee0?.7z\\);
        print "1: $search \n";
        my $search2 = qq(${base}Gridfee0?.7z\\);
        print "2: $search2 \n";
        my $search3 = qq(C:/usr/pm/temp/test1/Gridfee0?.7z\\);
        print "3: $search3 \n";

        my $archive = glob qq($base.Gridfee0?.7z\\);
        #Tried glob qq($base.Gridfee0?.7z\\Gridfee*\\) and glob qq($base.Gridfee0?.7z) too, same result
        print "archive = $archive\n";

        my @archs = glob $search3;
        #Tried glob qq($base.Gridfee0?.7z\\Gridfee*\\) and glob qq($base.Gridfee0?.7z) too, same result
        print "archs = @archs\n";

        __END__
        Use of uninitialized value $archive in concatenation (.) or string at C:\usr\pm\temp\1192612.pl line 18.
        1: C:\usr\pm\temp\test1\.Gridfee0?.7z\
        2: C:\usr\pm\temp\test1\Gridfee0?.7z\
        3: C:/usr/pm/temp/test1/Gridfee0?.7z\
        archive =
        archs = C:/usr/pm/temp/test1/Gridfee02.7z\ C:/usr/pm/temp/test1/Gridfee05.7z\
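
    The suggestion to unpack each archive in one pass, instead of calling 7z.exe once per file, could look roughly like the sketch below. It is an illustration only: the helper extract_command and the scratch folder D:/temp are assumptions, and the loop just prints the commands (a dry run) rather than running them. The x command and the -o and -aos switches are standard 7-Zip options as far as I know.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for ONE 7z call per archive.
# 'x' extracts with full paths, -o sets the output folder,
# -aos skips files that already exist there.
sub extract_command {
    my ($exe, $archive, $outdir) = @_;
    return ($exe, 'x', $archive, "-o$outdir", '-aos');
}

my $base = 'D:/Some/Specific/Folder/Archive';   # from the question
my $exe  = 'C:/Program Files/7-Zip/7z.exe';     # from the question
my $tmp  = 'D:/temp';                           # assumed scratch folder

for my $n (1 .. 7) {
    my $name = sprintf 'Gridfee%02d', $n;
    my @cmd  = extract_command($exe, "$base/$name.7z", "$tmp/$name");

    # Dry run: show the command, quoting arguments that contain whitespace.
    print join(' ', map { /\s/ ? qq("$_") : $_ } @cmd), "\n";

    # To run it for real:
    # system(@cmd) == 0 or die "7z failed for $name (status $?)\n";
}
```

    That is seven invocations in total, one per archive, rather than one per .txt file.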
Re: Extracting files from .7z using Perl
by Marshall (Canon) on Jun 13, 2017 at 03:54 UTC
    I see a number of problems.
    First, forget this '\' stuff! That is ancient DOS. The modern Windows command line is NOT DOS and it works fine with '/'. In your code always use forward slash ('/') instead of backslash ('\'). This, amongst other things, avoids the need to "double escape" the backslash. Again, forget this '\' stuff!
        my $base = 'D:\Some\Specific\Folder\Archive\\';
        # should be:
        my $base = 'D:/Some/Specific/Folder/Archive'; # a path to a directory
        or
        my $base = "D:/Some/Specific/Folder/Archive"; # a path to a directory
        # do not put a trailing '/' or '\' on a directory name
        # this is not needed and can confuse the shell
        # pre-pend a '/' when you expand the path
        my $new = "$base/$extra_path";
    Your triple loop over $year,$month,$day is truly bizarre.
        my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt);
    What are you trying to do there? I don't quite "get it".

    I am curious as to why you are using .7z suffixes? I like 7z. It generates .zip files faster which use less memory than the MS zip program does. These .zip files generated by 7z are compatible with Windows .zip. I've never used the .7z specific format because the .zip MS compatible format appears to be just fine for my applications. Again, if 7z makes that .zip it will be smaller than what MS does, be generated faster and yet be compatible with MS .zip.

Re: Extracting files from .7z using Perl
by BillKSmith (Monsignor) on Jun 12, 2017 at 20:48 UTC

    I have had problems with directory names which include a space. Try the "short name" instead. You can use the /X option of dir to find it. It probably is PROGRA~1 for Program Files. Even if this 'works', I would consider it a debug aid, not a permanent fix.

    UPDATE: Add example of using system with windows. I expect that it will run on most windows machines. It demonstrates how much the short names and forward slashes can simplify the quoting. The following example displays the perlmonks forum using Microsoft internet explorer.

        use strict;
        use warnings;

        my $browser = 'iexplore.exe';
        my $arg1 = 'http://perlmonks.com';
        my $cmd;

        $cmd = "C:/PROGRA~1/INTERN~1/$browser";
        print "$cmd, $arg1\n";
        print "hit any key to run command. Close IE to continue.";
        <>;
        system $cmd, $arg1;
        print "Back again\n";

        $cmd = "\"C:\\Program Files\\Internet Explorer\\$browser\"";
        print "$cmd, $arg1\n";
        print "hit any key to run command. Close IE to continue.";
        <>;
        system $cmd, $arg1;
        print "Back again\n";
    Bill
      Howdy!

      ...assuming that the file actually has a short name. One cannot rely on that being the case, I have learned, to my great annoyance.

      yours,
      Michael
        Yes, an excellent reason not to leave this in production code. If it works now, why not use it to debug the quoting by eliminating one big problem?
        Bill
Re: Extracting files from .7z using Perl
by zarath (Beadle) on Jun 13, 2017 at 08:22 UTC

    Thank you for all the tips everyone!

    Quite a few things to get through, love the challenge.

    Since the glob thing has been mentioned more than once, I'll give a quick explanation of what I'm trying to do with them.

    The first one: my $archive = glob qq($base.Gridfee0?.7z\\); I need to use a wildcard, there are 7 zipfiles named 'Gridfee01' up to 'Gridfee07' which i need to get through.

    The second one: my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt); Need to use several wildcards here (? twice and a *) and the glob qq() thing is the easiest way that has worked for me until now.

      There is nothing wrong with using glob for this purpose but there are some pitfalls. You have to make sure you are giving the glob what it needs which you are not as has been pointed out already. Do your files have whitespace in the names or paths? Do you need to call glob in scalar or array context? There are a few different versions of glob available and you need to specify if you need one other than the built in one. Wildcards are what glob was made for but make sure you understand which metacharacters you are using. Have a look at File::Glob and another look at the replies you have already gotten.
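
      Two of those pitfalls, whitespace and context, can be shown with a small self-contained sketch. It is illustrative only: the temporary folder and its name "my archives" are made up. Core glob splits its argument on whitespace, while bsd_glob from File::Glob treats the whole string as one pattern. In list context both return every match; core glob in scalar context hands back one match per call, which is why "my $archive = glob ..." can never hold all seven archives.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Throwaway directory whose name contains a space (assumed layout).
my $root = tempdir(CLEANUP => 1);
my $dir  = "$root/my archives";
mkdir $dir or die "mkdir $dir: $!";
for my $name (qw(Gridfee01.7z Gridfee02.7z)) {
    open my $fh, '>', "$dir/$name" or die "create $name: $!";
    close $fh;
}

# Core glob splits the pattern at the space, so none of its results
# is an existing archive; keep only names that really exist.
my @split = grep { -e } glob "$dir/Gridfee0?.7z";

# bsd_glob keeps the pattern in one piece and finds both files.
my @whole = bsd_glob "$dir/Gridfee0?.7z";

printf "core glob found %d, bsd_glob found %d\n", scalar @split, scalar @whole;
```

      Note the list assignment to @whole: that is what collects every archive name at once.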

      The first one: my $archive = glob qq($base.Gridfee0?.7z\\); I need to use a wildcard, there are 7 zipfiles named 'Gridfee01' up to 'Gridfee07' which i need to get through.
      Don't include a trailing "/" in $base or in the glob.
      my $archive = glob qq($base/Gridfee0?.7z);
      The second one: my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt); Need to use several wildcards here (? twice and a *) and the glob qq() thing is the easiest way that has worked for me until now.
      The wildcard glob pattern should only be at the end of the "filespec". You cannot expect this to work with multiple wildcards in the "path". There are at least 3 versions of glob() that I have encountered. If you are using glob, keep what it does as simple as possible and use grep{} to further refine the search. You will need to code a loop here or use another method. Do not expect glob() to do implicit looping. For the end target, I would use *.txt and use grep{} to filter out files which do not match /_(\d+)\.txt$/. Looping at multiple levels for $year$month$day*.txt makes no sense. Get all .txt files in one "go" and then filter as needed.
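
      As a rough sketch of "get every .txt name in one go, then filter", the date can be pulled out of each name with a regex instead of looping over years, months and days. This is an illustration under assumptions: the helper target_dir is hypothetical, the sample names are made up, and the names are assumed to look like invoic_b2c_YYYYMMDD<anything>.txt as described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: map a file name to its yyyy/mm/dd target directory.
# Returns nothing when the name carries no date in the expected place.
sub target_dir {
    my ($base, $file) = @_;
    my ($y, $m, $d) = $file =~ /invoic_b2c_(\d{4})(\d{2})(\d{2})[^\/\\]*\.txt$/
        or return;
    return "$base/$y/$m/$d";
}

my $base  = 'D:/Some/Specific/Folder/Archive';
my @names = (                                   # made-up sample names
    'Gridfee5/invoic_b2c_20150314_0001.txt',
    'Gridfee5/readme.txt',
);

# One pass over all names: keep the .txt files, then derive the folder.
for my $name (grep { /\.txt$/ } @names) {
    my $dir = target_dir($base, $name);
    print defined $dir ? "$name -> $dir\n" : "$name skipped (no date found)\n";
}
```

      The same helper works whether the names come from a glob over an extracted folder or from a listing of the archive.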

Front-paged by Corion