ftumsh has asked for the wisdom of the Perl Monks concerning the following question:
Lo,
I'm trying to identify various types of text file, xml, csv etc.
The idea being that it is presented with a text file and it works out what type it is.
The one file format I am having trouble with is fixed width.
The definition of a fixed width file being:
1) Text file made up of records (ie LF or CRLF delimited)
2) Different records may be of different lengths
3) Records of a particular type may be denoted by starting with particular characters
or by the length of the record.
As you know, variants of the above are legion, so I only expect(hope) to get a largish percentage.
The only test I have at the moment is if the length of every record is the same and it's
failed the tests for other file types, ie I'm testing for fixed width after all else.
Typically in a simple case a file will contain a header record followed by line records.
This will repeat down the file.
eg
Hfoobar
L123456field2
L...
H...
L...
L... etc
In a more complicated file, the header and line will be split across multiple records
eg
Hfield1field2
Ffield1field2 part of header still
Afield3 field4 still part of header
Now I can look at a file by eye and say yes it's fixed width, so it should be possible to do so
programmatically.
The options I have up to press:
1) Try and work out if it's fixed width
2) Say hey, we got this far so it's fixed width (will give false positive on random text files)
3) work out if it's a text file containing prose, if it's not, it's fixed width
The text files my module will be presented with should be computer generated, so prose text is a mistake and should not happen too often. The whole point of this is to try and
cut out humans trying to identify a file. In other words, I don't expect it to catch every fixed width
file.
So, all and any suggestions gratefully received.
John
Re: how to identify a fixed width file
by Cody Pendant (Prior) on May 14, 2008 at 12:41 UTC
|
The definition of a fixed width file being [...] records may be of different lengths
There's your problem.
Nobody says perl looks like line-noise any more
kids today don't know what line-noise IS ...
| [reply] |
|
Also...
Now I can look at a file by eye and say yes it's fixed width, so it should be possible to do so
programmatically.
Didn't the natural language recognition folks start out saying something very similar?
| [reply] |
Re: how to identify a fixed width file
by moritz (Cardinal) on May 14, 2008 at 12:04 UTC
|
The unix utility file is great for generally identifying file types.
But as for your description of the "fixed width" file format: I just don't understand it, and the part that you showed in the example doesn't look very fixed width to me.
Maybe you could show us a few samples of that file? (Real samples, where you can see patterns)
There's a nice trick to determine if something is fixed-width with delimiters: take a long string that consists of the delimiting character, and binary-AND it with many records. If the delimiting character is still there at some places, that is very likely a delimiter within a fixed-width record.
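A minimal sketch of the mask trick described above, assuming the padding character is a space and the records are plain ASCII. One caveat: a bitwise AND against a space leaves 0x20 for most printable characters (for example, 'a' AND ' ' is a space), so this sketch ORs the records together instead; a column then stays 0x20 only if every record has a space there. The helper name common_space_columns and the sample records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based columns that hold a space in
# every record. Assumes ASCII text and a space as the padding character.
sub common_space_columns {
    my @records = @_;
    return () unless @records;

    # Pad every record to the longest length so the OR covers all columns.
    my ($width) = sort { $b <=> $a } map { length } @records;
    my $mask = ' ' x $width;

    for my $record (@records) {
        my $padded = $record . ' ' x ($width - length $record);
        $mask |= $padded;    # string-wise bitwise OR
    }

    # A byte is still 0x20 only if it was a space in every record.
    my @columns;
    while ($mask =~ /\x20/g) {
        push @columns, pos($mask) - 1;
    }
    return @columns;
}

my @records = (
    'C4498 John__ Smith___',
    'C5678 Mary__ Doe_____',
);
print join(',', common_space_columns(@records)), "\n";    # prints 5,12
```

Columns that survive across many records are candidate field boundaries; how many records to sample before trusting them is a judgment call.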
(But since I don't understand your file format, I can't say whether that trick is applicable here.)
| [reply] [d/l] |
|
| [reply] |
|
Excellent. It's similar to moritz' suggestion only with an example which is always better for eejits like me. Thanks for that.
| [reply] |
|
I think it may be easier to work out if it's a prose file, ie plenty of words and if it is prose then it isn't "fixed width"
fwiw, I won't know if the file has delimiters. I'd rather not think about the comma separated fixed width fields format files I have come across ...
Here's the most awkward fixed file I can find. It looks fixed width practically straight away to my eye. The more trained observer will notice it's a weird variation of a Tradacoms EDI message. This is only an example; any computer generated text file will be passed to my module, and it should have a good go at working out what it is.
STX 8888888888888 dfdfdf dfdf dfdfdfdfdfs sdfdff
+d
STXA
TYP 0700 dfderf
SRT 2323232323235 sdertryh aswedrfg gfrfgtgs fgt
SRTAHigh Cross CRRtrR dfdeereeR dsdd
SRTBLoRdoR d34 dfr
SRTC 232323232
CRT 8888888888888 RUNELM RuRRlm sdsd sdsdsdsdsds sdsdsdd
CRTAsdsdsdss sdsdsdR sdsdR sdy
CRTBSystoR sdsdsdsdsdsdsR
CRTCLE7 2NF
RNA 0000
RNAA
RNAB
RNAC
RNAR
RNAE
RNAF
RNAG
FIL 0002 0002 045450 000000
FRT 074550 070520
ACR 0000000000000
CLO 4545454545459 0750
CLOARuRRlm (BFllymRRF) (0750)
CLOBURit2, rtrtrt rtrk trtril rtrk rtrRR rtRk rtFd
CLOCBFllymRRF rtt2 rtA
IRF wewee8 070508 070508
PYT wewewees wewewewewe wewewewe 034438 002500 000 002500
+000
RNAH0000
RNAI
RNAJ
RNAK
RNAL
RNAM
RNAN
RNAO
ORR 5656566820 256562 070508 070508 266528
+ 070508
ORRA000000000000002 0000000000000 0000000000000 0705
+08
ORRB 0000000000000
ORRC0000000000000
ORRR
ILR 0000000000000 20922 00000000000000 000000
+0000000
ILRA000000000000000 000000000000002 000
+0000000000
ILRB 000000000000022 0000000022000 RFch 00000000025000 RFch
ILRC00000000300000 S 027500 0 URimFt - WhitR
ILRR 00000000000000 0000000000
+0000
ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000
ILRF00000000000000
ILR 0000000000000 22294 00000000000000 000000
+0000000
ILRA000000000000000 000000000000002 000
+0000000000
ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch
ILRC00000000075000 S 027500 0 URimFt - CrRFm
ILRR 00000000000000 0000000000
+0000
ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000
ILRF00000000000000
ILR 0000000000000 22270 00000000000000 000000
+0000000
ILRA000000000000000 000000000000002 000
+0000000000
ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch
ILRC00000000075000 S 027500 0 URimFt - PiRk
ILRR 00000000000000 0000000000
+0000
ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000
ILRF00000000000000
ILR 0000000000000 22393 00000000000000 000000
+0000000
ILRA000000000000000 000000000000002 000
+0000000000
ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch
ILRC00000000075000 S 027500 0 URimFt - BluR
ILRR 00000000000000 0000000000
+0000
ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000
ILRF00000000000000
CIA 000000 00000000000000
RNC 0000
RNCA
RNCB
RNCC
RNCR
RNCE
RNCF
RNCG
STL S 027500 0000000004 000000005250 000000000000 000000000000 0000000
+00000
STLA000000000000 000000005250 000000000232 000000005228 000000000896
STLB000000006246 000000006024
TLR 0000000002 000000005250 000000000000 000000000000 000000000000 000
+000000000
TLRA000000005250 000000000232 000000005228 000000000896 000000006246
TLRB000000006024
CLO 5656565656567 0390
CLOAghghgh (ghghghghr) (0390)
CLOBURit 3, ghghhg hgFd ghghil ghgk Oghgh gh
CLOCRoRcFstRr gth ghE
IRF 565629 070508 070508
PYT tytytyys tytytytyFl tytytyRs 070508 002500 000 002500
+000
RNAH0000
RNAI
RNAJ
RNAK
RNAL
RNAM
RNAN
RNAO
ORR 3434343426 242342 070508 070508 266529
+ 070508
ORRA000000000000002 0000000000000 0000000000000 0705
+08
ORRB 0000000000000
ORRC0000000000000
ORRR
ILR 0000000000000 53652 00000000000000 000000
+0000000
ILRA000000000000000 000000000000002 000
+0000000000
ILRB 000000000000002 0000000002000 RFch 00000000029900 RFch
ILRC00000000059800 S 027500 0 ClFssic ShRll ShFpRd BFth Pillow Cr
+RFm
ILRR 00000000000000 0000000000
+0000
ILRE00000000000000 00000000000000 00000000029900 00000000000000 000000
ILRF00000000000000
ILR 0000000000000 20922 00000000000000 000000
+0000000
ILRA000000000000000 000000000000002 000
+0000000000
ILRB 000000000000006 0000000006000 RFch 00000000025000 RFch
ILRC00000000250000 S 027500 0 URimFt - WhitR
ILRR 00000000000000 0000000000
+0000
ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000
ILRF00000000000000
ILR 0000000000000 22270 00000000000000 000000
+0000000
ILRA000000000000000 000000000000002 000
+0000000000
ILRB 000000000000002 0000000002000 RFch 00000000025000 RFch
ILRC00000000050000 S 027500 0 URimFt - PiRk
ILRR 00000000000000 0000000000
+0000
ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000
ILRF00000000000000
CIA 000000 00000000000000
RNC 0000
RNCA
RNCB
RNCC
RNCR
RNCE
RNCF
RNCG
STL S 027500 0000000003 000000002598 000000000000 000000000000 0000000
+00000
STLA000000000000 000000002598 000000000066 000000002532 000000000443
STLB000000003042 000000002975
TLR 0000000002 000000002598 000000000000 000000000000 000000000000 000
+000000000
TLRA000000002598 000000000066 000000002532 000000000443 000000003042
TLRB000000002975
| [reply] [d/l] |
Re: how to identify a fixed width file
by Pancho (Pilgrim) on May 14, 2008 at 12:46 UTC
|
I think the key is figuring out the criteria by which you can test whether a file is fixed width, and that depends on your requirements. If the criteria are too broad, the validity of the test will decrease to the point where the test is useless.
A different approach would be to look for a certain pattern in the record identifier and record length, again depending on your requirements. For example:
First record starts with H and second with D third with T. The pattern repeats and all H records, D records and T records are the same length.
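A rough sketch of that check, assuming the record identifier is the first character and that line endings have already been stripped. The helper name lengths_consistent_by_prefix and the sample records are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every record sharing a first character
# also shares one length. Expects records without line endings.
sub lengths_consistent_by_prefix {
    my @records = @_;
    my %length_for;    # first character => length first seen
    for my $record (@records) {
        next unless length $record;    # skip blank lines
        my $id = substr($record, 0, 1);
        $length_for{$id} //= length $record;
        return 0 if $length_for{$id} != length $record;
    }
    return 1;
}

my @records = ('H0001ACME', 'D001 12.50', 'D002  3.75', 'T0002');
print lengths_consistent_by_prefix(@records) ? "consistent\n" : "mixed\n";
```

This does not check that the H, D, T sequence repeats; that would need a second pass over the order of identifiers.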
Good Luck
| [reply] |
Re: how to identify a fixed width file - do a histogram!
by Narveson (Chaplain) on May 14, 2008 at 14:39 UTC
|
Records of a particular length may be denoted by starting with particular characters
or by the length of the record.
Some of the brethren have boggled at fixed-width files that mix different record lengths, but I think we can make some sense of this, especially if the record type is signaled by the initial character.
use strict;
use warnings;

my %histogram;
my %records_of_length;

while (<DATA>) {
    # No chomp, so each length includes the line ending.
    my $record_length = length;
    my $initial_char  = substr($_, 0, 1);
    $records_of_length{$record_length}++;
    $histogram{$record_length}{$initial_char}++;
}

# Review how many distinct record lengths were seen.
# If all records of given length start with same char,
# rejoice!
for my $rec_len (sort {$a <=> $b} keys %histogram) {
    print "Saw $records_of_length{$rec_len} records";
    print " with length $rec_len:\n";
    for my $char (sort keys %{$histogram{$rec_len}}) {
        print "\t$char: ";
        print $histogram{$rec_len}{$char}, "\n";
    }
}
__DATA__
C4498 John__ Smith___
I0023 widget 004 4.95
I0869 foozle 001 29.50
I7765 gadget 002 340.00
C5678 Mary__ Doe____
I9999 misc__ 003 6.25
prints
Saw 2 records with length 22:
	C: 2
Saw 4 records with length 24:
	I: 4
and now you can work on heuristics to decide if the number of different record types is small enough to usefully classify the file as "mixed fixed width".
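One possible heuristic along those lines, as a sketch only: accept the file when there are few distinct record lengths and every length maps to a single initial character. The helper name looks_mixed_fixed_width and the default cut-off of 10 record types are arbitrary assumptions, not something taken from the histogram code above.

```perl
use strict;
use warnings;

# Hypothetical heuristic: few distinct lengths, and each length is used
# by exactly one initial character. Expects records without line endings.
sub looks_mixed_fixed_width {
    my ($records, $max_types) = @_;
    $max_types //= 10;    # arbitrary cut-off, tune for real data

    my %chars_for_length;
    for my $record (grep { length } @$records) {
        $chars_for_length{ length($record) }{ substr($record, 0, 1) } = 1;
    }

    my @lengths = keys %chars_for_length;
    return 0 if !@lengths || @lengths > $max_types;

    for my $len (@lengths) {
        return 0 if keys(%{ $chars_for_length{$len} }) > 1;
    }
    return 1;
}

my @records = ('C4498 John', 'I0023 widget 004', 'I0869 foozle 001');
print looks_mixed_fixed_width(\@records) ? "mixed fixed width\n" : "not sure\n";
```

A stricter variant could also require a minimum number of records per type before trusting the result.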
| [reply] [d/l] [select] |
Re: how to identify a fixed width file
by jhourcle (Prior) on May 14, 2008 at 16:41 UTC
|
First off, I don't know if I'd specifically call your format 'fixed width', as it doesn't match what I'm used to dealing with -- simple tabular data with lots of whitespace. I haven't had to deal with the formatting you're dealing with, but I could probably deal with whitespace padded tabular data in a consistent manner.
Although this probably will have some false negatives for the odd files that I deal with, I'd probably take some subset of the middle of the file (ie, try to remove headers and footers), and then use something like BrowserUK's unpack mask generator to see if there are columns of consistently white space among columns of non-whitespace.
Obviously, this is going to fail if you include the header or footer, and there's a good chance of it not matching multiline records (which are still fixed width) or files with sub-headings of substantial length. Many of the fixed-width files I deal with have various formatting quirks, but if yours are more consistent, it might be worthwhile.
For the case where you don't have whitespace padding, but you do have data other than strings, you might be able to create masks of where there are numeric vs. alpha columns, and make your decision based on that. (It still wouldn't deal with the multi-line record issue, though.)
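A sketch of such a numeric vs. alpha mask, assuming single-line ASCII records of one type. The helper name column_class_mask and the mask letters are invented here: N for digits only, A for letters only, a space for spaces only, and ? for anything mixed.

```perl
use strict;
use warnings;

# Hypothetical helper: one mask character per column, compared across all
# records. Only columns present in every record are examined.
sub column_class_mask {
    my @records = @_;
    return '' unless @records;

    my ($width) = sort { $a <=> $b } map { length } @records;
    my $mask = '';
    for my $col (0 .. $width - 1) {
        my %classes;
        for my $record (@records) {
            my $ch = substr($record, $col, 1);
            my $class = $ch =~ /[0-9]/    ? 'N'
                      : $ch =~ /[A-Za-z]/ ? 'A'
                      : $ch eq ' '        ? ' '
                      :                     '?';
            $classes{$class} = 1;
        }
        my @seen = keys %classes;
        $mask .= @seen == 1 ? $seen[0] : '?';
    }
    return $mask;
}

print column_class_mask('I0023AB', 'I0869CD', 'I7765EF'), "\n";    # prints ANNNNAA
```

A mask with long runs of N or A and few ? columns would hint at fixed-width fields; what proportion counts as "few" is left open.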
| [reply] |
Re: how to identify a fixed width file
by dragonchild (Archbishop) on May 14, 2008 at 13:56 UTC
|
The reason why XML, CSV, and other similar file formats were created was to address the inherent problems with fixed-width formats. The first formats were fixed width because they are very simple to work with. In essence, they are the serialization of an array of structs in C. So, marshalling one of those in C is really simple. Finding a given record when you know its index (10th, 1024th, etc) is very simple. Overwriting a given record is very simple. It's the ultimate in RAM-backed-to-disk. The only problem is that you have to know the mapping. If you don't know what a fixed-width format means, you're out of luck.
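To illustrate the "find a record by its index" point, here is a small sketch. The layout (5-byte id, 10-byte name, LF terminator), the unpack template, and the helper name fetch_record are all assumptions made up for this example; they show why the mapping has to be known in advance.

```perl
use strict;
use warnings;

# Assumed layout for illustration only: 5-byte id, 10-byte name, LF.
my $template      = 'A5 A10';
my $record_length = 16;    # 5 + 10 + 1 for the line feed

my $data = join '', map { sprintf "%-5s%-10s\n", @$_ }
    ([1, 'alpha'], [2, 'beta'], [3, 'gamma']);

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Hypothetical helper: fetch the record at a 0-based index by seeking
# straight to index * record_length, with no scan of earlier records.
sub fetch_record {
    my ($handle, $index) = @_;
    seek $handle, $index * $record_length, 0 or return;
    my $got = read $handle, my $buffer, $record_length;
    return unless defined $got && $got == $record_length;
    return [ unpack $template, $buffer ];
}

my $third = fetch_record($fh, 2);
print join('|', @$third), "\n";    # prints 3|gamma
```

Without the template and record length, the same bytes are just an undifferentiated run of characters.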
And, furthermore, many fixed-width files have a header and, possibly, a footer. DBM::Deep's file format is a record-based format with two headers (first is fixed, second is variable). Good luck detecting that it's a DBM::Deep file without recognizing the first four bytes.
Frankly, I'd do the following:
- Is it XML, CSV, HTML, etc?
- Is it a fixed-width format I recognize (PNG, JPG, DOC, XLS, etc)?
- Punt.
Which, essentially, is what the file utility does.
My criteria for good software:
- Does it work?
- Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
| [reply] |
|
1) I do recognise XML, CSV etc already, the problem is with fixed width.
2) The formats mentioned are not text files so are of no relevance.
3) I'd rather not punt if possible, tho it seems I may have to.
My code atm uses File::MMagic to get the mime type. If it's a text file I then work out what sort of text file it is ie
1) XML - uses mmagic and XML::LibXML
2) SAIFFE - regex
3) EDIFACT - regex
4) Tradacoms - regex
5) CSV - Text::CSV_XS
6) Fixed width - foobar
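The ordering above can be pictured as a list of detectors tried in turn, with fixed width last. This is only a sketch: the regexes are crude stand-ins for the real File::MMagic, XML::LibXML and Text::CSV_XS checks, and the names @detectors and identify are invented for illustration.

```perl
use strict;
use warnings;

# Ordered detectors: first match wins, fixed width is tried last.
# The tests are deliberately crude placeholders, not the real checks.
my @detectors = (
    [ 'XML' => sub { $_[0] =~ /\A\s*</ } ],
    [ 'CSV' => sub { $_[0] =~ /\A[^\n]*,[^\n]*\n/ } ],
    [ 'fixed width' => sub {
        # Simplest possible test: every record has the same length.
        my %lengths;
        $lengths{ length($_) } = 1 for split /\r?\n/, $_[0];
        return keys(%lengths) == 1;
    } ],
);

sub identify {
    my ($text) = @_;
    for my $detector (@detectors) {
        my ($name, $test) = @$detector;
        return $name if $test->($text);
    }
    return 'unknown';
}

print identify("abc\ndef\n"), "\n";    # prints fixed width
```

Keeping the detectors in an ordered list makes it easy to slot a better fixed-width test into the last position later.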
| [reply] |
Re: how to identify a fixed width file
by reasonablekeith (Deacon) on May 14, 2008 at 15:33 UTC
|
Why don't you try running through the file counting up the number of times a line of a given length is seen...
my %line_count_by_length;
while (<DATA>) {
    my $line_length = length($_);
    $line_count_by_length{$line_length}++;
}
If any (or a sufficiently large portion of) those line counts represent a big percentage of the total line count, you could make a guess that the file was fixed width. Perhaps also giving a weighting on how many different line lengths are represented in the file, compared to how many you might expect given the file's length?
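One way to turn those counts into a number, sketched under assumptions: measure what share of the lines is covered by the few most common lengths. The helper name dominant_length_share and the idea of a caller-supplied top_n are mine; any threshold on the result would have to be tuned against real files.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of lines covered by the $top_n most
# common line lengths. Returns 0 for an empty list.
sub dominant_length_share {
    my ($lines, $top_n) = @_;
    return 0 unless @$lines;

    my %count;
    $count{ length($_) }++ for @$lines;

    # Largest counts first, then add up the first $top_n of them.
    my @by_count = sort { $b <=> $a } values %count;
    my $covered  = 0;
    for my $i (0 .. $top_n - 1) {
        last if $i > $#by_count;
        $covered += $by_count[$i];
    }
    return $covered / @$lines;
}

my @lines = ('aaaa', 'bbbb', 'cccc', 'dd');
printf "%.2f\n", dominant_length_share(\@lines, 1);    # prints 0.75
```

A share close to 1 for a small top_n would support the fixed-width guess; a long tail of rare lengths would argue against it.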
---
my name's not Keith, and I'm not reasonable.
| [reply] [d/l] |
|
My initial stab was a count of record lengths which was fine until the different length files cropped up.
I think bringing that back along with some analysis of the counts, along with tachyon/moritz' text OR, should go a long way to solving this.
| [reply] |