zarath has asked for the wisdom of the Perl Monks concerning the following question:
Hi everybody!
Quite new to Perl, but picking it up at a steady pace.
The point of the code given below is the following: I have 7 .7z-files. The files are not that big (500 - 700 MB), but each file contains a folder which in turn contains around 2.2 million .txt-files. Needless to say, Windows/7zip are having a pretty difficult time extracting all the files from the archives (estimated remaining time is around 1500 hours after 5% done; my bosses and I don't have that kind of time).
I'm going to give up on extracting everything and want to try extracting only the files which have not yet been extracted, putting them into the correct subfolder (which is D:\Some\Specific\Folder\Archive\yyyy\mm\dd\, based on the date in the filenames; these folders already exist). All this using 1 little Perl script, so I can just start it up and let it run as long as it needs to without having to worry about it until it's done.
# !perl
use strict; # because we should
use warnings; # because we should
my $base = 'D:\Some\Specific\Folder\Archive\\';
# This might hurt your eyes, but for some obscure reason, DateTime is not installed and I can seriously not be bothered to install it myself
my @years = ('2011','2012','2013','2014','2015','2016','2017');
my @months = ('01','02','03','04','05','06','07','08','09','10','11','12');
my @days = ('01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31');
my $archive = glob qq($base.Gridfee0?.7z\\);
#Tried glob qq($base.Gridfee0?.7z\\Gridfee*\\) and glob qq($base.Gridfee0?.7z) too, same result
foreach my $year (@years) {
    foreach my $month (@months) {
        foreach my $day (@days) {
            my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt);
            # The zipfiles are zipped folders, so there is 1 extra layer to traverse, hence the \\Gridfee?\\
            # The name is definitely correct: for example Gridfee05.7z contains folder named Gridfee5
            # But possibly made a mistake in the syntax?
            foreach my $file (@files) {
                my $extdir = '-o $base.$year\$month\$day\\';
                system ('C:\Program Files\7-Zip\7z.exe',' e ',$archive,' -r',$extdir,$file,' n n') or die $!;
                # Many things in Perl are a mystery to me and the use of 7z via Perl is most definitely one of them
                # This is the syntax I have distilled from what info I could find on interaction between Perl and 7z
                # and from the 'Help'-file that comes with the 7zip-program
                # The ' n n' is my attempt at making 7zip skip the files which can already be found in $extdir
                say('Moved '.$file.' to '.$extdir);
                # Just a check to see what has been done, if anything
            }
        }
    }
}
exit 0;
The code as it is right now outputs absolutely nothing; it just seems to run for a couple of seconds and then stops, and I have no idea why.
Can you spot my mistakes?
Re: Extracting files from .7z using Perl
by Tux (Canon) on Jun 12, 2017 at 16:10 UTC
|
| [reply] |
|
Did not even know that existed. Will have a look to see if I can get it installed. Thank you.
Re: Extracting files from .7z using Perl
by zentara (Archbishop) on Jun 12, 2017 at 16:22 UTC
Hi, I don't run Windows, but just from general knowledge and the way your program is written, I would suggest checking the quoting in your extended command-line options.
For instance, what are the n n 's for?
system ('C:\Program Files\7-Zip\7z.exe',' e ',$archive,' -r',$extdir,$file,' n n') or die $!;
# maybe n n should be \n\n ?
system ('C:\Program Files\7-Zip\7z.exe', ' e ', $archive, ' -r', $extdir, $file, "\n\n") or die $!;
or the suspicious:
my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt);
Or some other quoting inconsistencies. Your program silently dies.
Remember, " and ' are not the same: " interpolates, ' does not.
As a debugging aid, also start printing out your data arrays to make sure they were filled.
Also, just test extraction first. Can you get your script to run 7z with the x option on the file?
That will ensure it all works before you try to extract files individually.
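The quoting point above can be illustrated with a minimal sketch. The values of $base, $year, $month and $day below are made up for the demonstration, and writing the output directory directly after -o (with no space) is how the 7-Zip command line documents that switch; treat the rest as an assumption about what the original script intended.

```perl
use strict;
use warnings;

# Made-up example values, using forward slashes to avoid escaping.
my ($base, $year, $month, $day) =
    ('D:/Some/Specific/Folder/Archive', '2017', '06', '12');

# Single quotes: no interpolation, the variable names stay literal text.
my $single = '-o$base/$year/$month/$day';

# Double quotes: the variables are replaced by their values.
my $double = "-o$base/$year/$month/$day";

print "single: $single\n";
print "double: $double\n";
```

The first line printed still contains the literal text $base/$year/$month/$day, while the second contains D:/Some/Specific/Folder/Archive/2017/06/12, which is what 7z would need to receive.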
Re: Extracting files from .7z using Perl
by poj (Abbot) on Jun 12, 2017 at 17:36 UTC
Not sure what you are doing with the glob statements. Try this test program, which should list the files in an archive that match the invoic_b2c etc. pattern:
#!perl
use strict;
use warnings;
my $base = 'D:/Some/Specific/Folder/Archive/';
my $archive = $base.'Gridfee05.7z';
my $exe = '"C:/Program Files/7-Zip/7z.exe"'; # inner quotes because of the space in the path
# "7z l" lists the archive contents; keep only the lines that name invoice files
my @files = grep /invoic_b2c_201\d{5}.*\.txt/, qx "$exe l $archive";
print scalar(@files)." files found in $archive - see filelist.txt\n\n";
open my $out,'>','filelist.txt' or die "$!";
print $out @files; # the lines returned by qx already end in a newline
close $out;
poj
Re: Extracting files from .7z using Perl
by Lotus1 (Vicar) on Jun 12, 2017 at 17:53 UTC
It appears you are attempting concatenation with the '.' operator, but inside of qq(). In that context the period is just text and not an operator. Try getting your glob to work first and keep adding small tested chunks to your program until you have built up what you need. Also, the backslash '\' inside a glob is seen as quoting the next character. You need to use the forward slash '/' inside a glob.
Just glancing at the rest of the program, I don't think millions of calls to 7z.exe are going to work like this. Each time you call 7zip inside your nested loop, it will need to extract the archive to find the file you need. It might make more sense to unzip everything at once to a temporary folder, operate on the files there, and then remove the temporary folder.
# !perl
use strict; # because we should
use warnings; # because we should
my $base = 'C:\usr\pm\temp\test1\\';
my $search = qq($base.Gridfee0?.7z\\);
print "1: $search \n";
my $search2 = qq(${base}Gridfee0?.7z\\);
print "2: $search2 \n";
my $search3 = qq(C:/usr/pm/temp/test1/Gridfee0?.7z\\);
print "3: $search3 \n";
my $archive = glob qq($base.Gridfee0?.7z\\);
#Tried glob qq($base.Gridfee0?.7z\\Gridfee*\\) and glob qq($base.Gridfee0?.7z) too, same result
print "archive = $archive\n";
my @archs = glob $search3;
#Tried glob qq($base.Gridfee0?.7z\\Gridfee*\\) and glob qq($base.Gridfee0?.7z) too, same result
print "archs = @archs\n";
__END__
Use of uninitialized value $archive in concatenation (.) or string at C:\usr\pm\temp\1192612.pl line 18.
1: C:\usr\pm\temp\test1\.Gridfee0?.7z\
2: C:\usr\pm\temp\test1\Gridfee0?.7z\
3: C:/usr/pm/temp/test1/Gridfee0?.7z\
archive =
archs = C:/usr/pm/temp/test1/Gridfee02.7z\ C:/usr/pm/temp/test1/Gridfee05.7z\
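The suggestion above to extract each archive in one go, rather than calling 7z.exe once per file, might be sketched as follows. The temporary folder path is hypothetical, and the sketch only builds and prints the command instead of running it. The x command (extract with full paths), the -o switch and the -aos switch (skip files that already exist) are taken from the 7-Zip command-line help; check them against the installed version before relying on this.

```perl
use strict;
use warnings;

# Paths follow the forward-slash advice in this thread; $tempdir is hypothetical.
my $exe     = 'C:/Program Files/7-Zip/7z.exe';
my $archive = 'D:/Some/Specific/Folder/Archive/Gridfee05.7z';
my $tempdir = 'D:/Some/Specific/Folder/Temp';

# One list element per argument, with no padding spaces inside the strings.
my @cmd = ($exe, 'x', $archive, "-o$tempdir", '-aos');

print join(' ', @cmd), "\n";

# system() returns 0 on success, so "system(...) or die" dies on success.
# To actually run the command, test for a zero status instead:
# system(@cmd) == 0 or die "7z failed with status $?";
```

Passing a list to system() hands each element over as its own argument, which removes most of the manual quoting that the single-string form needs.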
Re: Extracting files from .7z using Perl
by Marshall (Canon) on Jun 13, 2017 at 03:54 UTC
I see a number of problems.
First, forget this '\' stuff! That is ancient DOS. The modern Windows command line is NOT DOS and it works fine with '/'. In your code, always use forward slash ('/') instead of backslash ('\'). Among other things, this avoids the need to "double escape" the backslash. Again, forget this '\' stuff!
my $base = 'D:\Some\Specific\Folder\Archive\\';
# should be:
my $base = 'D:/Some/Specific/Folder/Archive'; # a path to a directory
or
my $base = "D:/Some/Specific/Folder/Archive"; # a path to a directory
# do not put a trailing '/' or '\' on a directory name
# this is not needed and can confuse the shell
# pre-pend a '/' when you expand the path
my $new = "$base/$extra_path";
Your triple loop over $year,$month,$day is truly bizarre.
my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt);
What are you trying to do there? I don't quite "get it".
I am curious as to why you are using the .7z suffix. I like 7z. It generates .zip files faster than the MS zip program does, and the files it produces are smaller. These .zip files generated by 7z are compatible with Windows .zip. I've never used the .7z-specific format because the MS-compatible .zip format appears to be just fine for my applications.
Re: Extracting files from .7z using Perl
by BillKSmith (Monsignor) on Jun 12, 2017 at 20:48 UTC
I have had problems with directory names which include a space. Try the "short name" instead. You can use the /X option of dir to find it. It is probably PROGRA~1 for Program Files. Even if this 'works', I would consider it a debug aid, not a permanent fix.
UPDATE: Added an example of using system on Windows. I expect that it will run on most Windows machines. It demonstrates how much the short names and forward slashes can simplify the quoting. The following example displays the PerlMonks forum using Microsoft Internet Explorer.
use strict;
use warnings;
my $browser = 'iexplore.exe';
my $arg1 = 'http://perlmonks.com';
my $cmd;
$cmd = "C:/PROGRA~1/INTERN~1/$browser";
print "$cmd, $arg1\n";
print "hit any key to run command. Close IE to continue.";
<>;
system $cmd, $arg1;
print "Back again\n";
$cmd = "\"C:\\Program Files\\Internet Explorer\\$browser\"";
print "$cmd, $arg1\n";
print "hit any key to run command. Close IE to continue.";
<>;
system $cmd, $arg1;
print "Back again\n";
| [reply] [d/l] |
|
| [reply] |
|
Yes, an excellent reason not to leave this in production code. If it works now, why not use it to debug the quoting by eliminating one big problem?
Re: Extracting files from .7z using Perl
by zarath (Beadle) on Jun 13, 2017 at 08:22 UTC
Thank you for all the tips everyone!
Quite a few things to get through, love the challenge.
Since the glob thing has been mentioned more than once, I'll give a quick explanation of what I'm trying to do with them.
The first one: my $archive = glob qq($base.Gridfee0?.7z\\); I need to use a wildcard: there are 7 zipfiles named 'Gridfee01' up to 'Gridfee07' which I need to get through.
The second one: my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt); Need to use several wildcards here (? twice and a *) and the glob qq() thing is the easiest way that has worked for me until now.
| [reply] [d/l] [select] |
|
| [reply] |
|
The first one: my $archive = glob qq($base.Gridfee0?.7z\\); I need to use a wildcard, there are 7 zipfiles named 'Gridfee01' up to 'Gridfee07' which i need to get through.
Don't include a trailing "/" in $base or in the glob.
my $archive = glob qq($base/Gridfee0?.7z);
The second one: my @files = glob qq($base\\Gridfee0?.7z\\Gridfee?\\invoic_b2c_$year$month$day*.txt); Need to use several wildcards here (? twice and a *) and the glob qq() thing is the easiest way that has worked for me until now.
The wildcard glob pattern should only be at the end of the "filespec". You cannot expect this to work with multiple wildcards in the "path". There are at least 3 versions of glob() that I have encountered. If you are using glob, keep what it does as simple as possible and use grep{} to further refine the search. You will need to code a loop here or use another method. Do not expect glob() to do implicit looping. For the end target, I would use *.txt and use grep{} to filter out files which do not match /_(\d+)\.txt$/. Looping at multiple levels for $year$month$day*.txt makes no sense. Get all .txt files in one "go" and then filter as needed.
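The "get everything in one go, then filter with grep" advice could look roughly like the sketch below. The file names in @names are invented for illustration, since the thread only shows the invoic_b2c_yyyymmdd prefix; in practice the names would have to come from a listing of an extracted folder or from the output of "7z l", because glob only sees files on disk and cannot look inside a .7z archive.

```perl
use strict;
use warnings;

my $base = 'D:/Some/Specific/Folder/Archive';

# Hypothetical names standing in for a real directory or archive listing.
my @names = (
    'invoic_b2c_20170612_0001.txt',
    'invoic_b2c_20151103_0042.txt',
    'readme.txt',
);

# One pass with grep replaces the triple loop over year, month and day.
my @invoices = grep { /^invoic_b2c_\d{8}.*\.txt$/ } @names;

my %target;
for my $name (@invoices) {
    # Pull yyyy, mm and dd out of the name to build the destination folder.
    my ($y, $m, $d) = $name =~ /^invoic_b2c_(\d{4})(\d{2})(\d{2})/;
    $target{$name} = "$base/$y/$m/$d";
    print "$name -> $target{$name}\n";
}
```

Deriving the folder from the file name means the date lists for years, months and days are no longer needed, and impossible dates such as February 31 are never tried.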