How to process variable length fields in delimited file.

dbach355 has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to process variable length fields in delimited file. by liverpole (Monsignor) on Oct 06, 2016 at 02:03 UTC
Hi dbach355,

My first approach would be to define programmatically (i.e. with a data structure) what the input file contains on each line. Once that's in a script, you run it and prove to yourself that your data does in fact behave as expected.

Since each line is made up of space-delimited items, but some of them are count-prefixed, you could define your line format as an array containing one array reference per item. Each array reference holds the LABEL of the item (e.g. 'ssn' for social security number, 'emp_num' for employee number, etc.) and a compiled regular expression (that's the qr/.../ syntax) used to validate the item. Where the item is prefixed with a count giving its length, you could use a string like 'COUNT' instead of a regex.

Here's an example for what you've defined:

    my @line_format = (
        [ 'ssn',       qr/(\d{9})/    ],
        [ 'emp_num',   qr/(\d+)/      ],
        [ 'emp_name',  'COUNT'        ],
        [ 'hire_date', qr/(\d{8})/    ],
        [ 'address',   'COUNT'        ],
        [ 'state',     qr/([A-Z]{2})/ ],
        [ 'city',      'COUNT'        ],
        [ 'zip',       qr/(\d{5})/    ],
    );

Then you write a subroutine `parse_line` that you call for each line of your input file. (I would also pass in the line number, so that if a line doesn't match your format you can die with an error saying which line was invalid.) For each array ref in `@line_format` you either parse the COUNT and pull off that number of characters, or you pull off the next word and check it against the regex. If the data validates, you assign it into a hash local to the subroutine, with the label as the key. When the subroutine completes successfully, you pass back a reference to that hash.

Here's how you might write the `parse_line` subroutine:

    sub parse_line {
        my ($line, $linenum) = @_;
        my %parsed = ( );
        foreach my $format (@line_format) {
            my ($label, $expected) = @$format;
            if ($expected eq 'COUNT') {
                # Pull the COUNT off the beginning of the line and apply it
                if ($line !~ s/^\s*(\d+) //) {
                    die "Error #1 parsing item '$label' (line #$linenum)\n";
                }
                my $count = $1;
                if ($line !~ s/^(.{$count})//) {
                    die "Error #2 parsing item '$label' (line #$linenum)\n";
                }
                $parsed{$label} = $1;
            } else {
                # Pull off the next non-space word, and test it with the regex
                if ($line !~ s/^\s*(\S+)//) {
                    die "Error #3 parsing item '$label' (line #$linenum)\n";
                }
                my $word = $1;
                if ($word !~ /^$expected\z/) {
                    die "Error #4 validating item '$label' (line #$linenum)\n";
                }
                $parsed{$label} = $word;
            }
        }
        return \%parsed;
    }

When I call that subroutine with the data you defined for a single line:

    use Data::Dumper::Concise;

    my $line = "123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553";
    my $result = parse_line($line, 1);
    die Dumper $result;

This simple program dumps as its result:

    {
      address => "1001 Main Street",
      city => "Atlanta",
      emp_name => "Steve Smith",
      emp_num => 45612,
      hire_date => 11012015,
      ssn => 123445678,
      state => "GA",
      zip => 30553
    }

So I know I'm on the right track. The next steps would be something like:

  - Read all the lines in the file
  - Call the subroutine `parse_line` on each line (and line number), getting back a hash ref
  - Add that hash ref to an array (or do whatever you want with it)

Does that help?

Edit: fixed whom I'm responding to (thanks choroba)
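The three "next steps" above could be sketched roughly as follows. This is only an illustration under stated assumptions: `parse_file` is a hypothetical helper name, and the parser is passed in as a code reference (with a trivial stand-in parser) so that the block runs on its own. In practice you would pass `\&parse_line` from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line from a filehandle, hand each line
# and its line number to a parser callback, and collect the hash refs.
sub parse_file {
    my ($fh, $parser) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;              # skip empty lines
        push @records, $parser->($line, $.);   # $. is the current line number
    }
    return \@records;
}

# Stand-in parser for demonstration only; it just grabs the first word.
my $demo_parser = sub {
    my ($line, $linenum) = @_;
    my ($first) = $line =~ /^\s*(\S+)/
        or die "Cannot parse line #$linenum\n";
    return { first => $first, line => $linenum };
};

# Read from an in-memory string so the sketch runs without a data file.
my $data = "123445678 45612\n234256653 76467\n";
open my $fh, '<', \$data or die "open: $!";
my $records = parse_file($fh, $demo_parser);
close $fh;

print scalar(@$records), " records parsed\n";
```

Keeping the parser as a callback means the same loop can be reused while you experiment with different line formats.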
Re: How to process variable length fields in delimited file. by GrandFather (Saint) on Oct 06, 2016 at 01:37 UTC
If the fixed fields really are fixed length rather than space delimited, then you can pull the lines apart using a template like this (a length of '*' means the field is prefixed with its own length):

    use strict;
    use warnings;

    my @template = (
        'ssn 9', 'employee number 5', 'employee name *', 'hire date 8',
        'address *', 'state 2', 'city *', 'zip 5'
    );

    while (my $line = <DATA>) {
        chomp $line;
        my %fields;

        for my $field (@template) {
            # Split "name length" at the last space
            my ($name, $length) = $field =~ /(.*) (.+)/;

            $line =~ s/^\s+//;
            # For '*' fields, read the length prefix off the line first
            $length = substr $line, 0, index ($line, ' ') + 1, '' if $length eq '*';
            $fields{$name} = substr $line, 0, $length, '';
        }

        print "$_: $fields{$_}\n" for keys %fields;
    }

    __DATA__
    123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553

Prints:

    employee number: 45612
    state: GA
    hire date: 11012015
    city: Atlanta
    zip: 30553
    ssn: 123445678
    employee name: Steve Smith
    address: 1001 Main Street

For output I'd strongly recommend using a module like Text::CSV to generate correctly formatted CSV files.

Premature optimization is the root of all job security
Re: How to process variable length fields in delimited file. by choroba (Cardinal) on Oct 06, 2016 at 07:44 UTC
You can create an unpack template that parses each line. The only problem is that in order to use the length fields, they must be separated by null bytes, not zeroes. But it's easy to change spaces to nulls and then back:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use Syntax::Construct qw{ /r };

    my $template = join 'x', 'A9',   # ssn
                             'A5',   # employee number
                             'Z*/A', # employee name
                             'A8',   # hire date
                             'Z*/A', # address
                             'A2',   # state
                             'Z*/A', # city
                             'A5',   # zip
                             ;

    while (<>) {
        say join ',', map tr/\x0/ /r, unpack $template, tr/ /\x0/r;
    }

Update: used `tr` instead of `s`.
Re: How to process variable length fields in delimited file. by Marshall (Canon) on Oct 06, 2016 at 01:29 UTC
Hi dbach355,

"My goal is to output a delimited file with a unique delimiter such as \f."

I think that you will find that a CSV (Comma Separated Value) line using the pipe character "|" as the delimiter will work out well. CSV is a generic term; you can use something other than a comma. I work with a few "|" separated DBs, some with a million+ records. If you use \f, "Form Feed", you will wind up with something that cannot be printed easily (one page per column is not too friendly!). It also has the problem of being "invisible". Using a tab character (\t) has the same visibility problem.

The real problem with your format is the embedded spaces. These first 10 columns can be handled in a number of ways. What do the other columns look like? Do they contain embedded spaces, like "John Smith"? Do they have a constant field width perhaps? Your goal is achievable. I just need a bit more info.

Update: Once you have the data in "|" delimited form, Perl can process a line like that easily. An example is shown below. There are modules, like Text::CSV, that can be used. However, if the "|" does not appear anywhere in the data, there is no need for that. You are new to Perl and I don't want to overly complicate things if it is not necessary.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line = "ssn|empNo|ncEmpName|empName|hireDate|ncAddr|addr|state|ncCity|city|zip";
    my @columns = split /\|/, $line;
    print "@columns[-1,-4,4,3]\n"; # "zip state hireDate empName"

Update with code: I thought some more about this problem. If you have fixed width fields interspersed with space separated fields, you have a big mess. One way to describe the fields and implement this is shown below. A 'v' field contains no embedded spaces and is variable in length; an 'f' field is a fixed field of a certain number of characters. This code builds a regex (regular expression) and then executes that regex on the input. Anybody's brain would go crazy writing a regex with 100 terms, hence the program builds it from the input table. I do suspect that your problem can be solved "easier" than this, but without more info about the other ~90 columns, I am unsure.

    #!/usr/bin/perl
    use strict;
    use warnings;

    #empNo|ncEmpName|empName|hireDate|ncAddr|addr|state|ncCity|city|zip";
    my $line2 = "123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553 x y z";

    # Note: Looks like ncEmpName is "45612 11", a fixed width field
    my @format_spec = qw(
        v empNo   f8 ncEmpName  f11 empName  v hireDate  v ncAddr
        f16 addr  v state       v ncCity     f7 city     v zip
        v x       v y           v z
    );

    my $regex = "^";
    while (@format_spec)
    {
        my $format = shift @format_spec;  # pairwise in List::Util possible
        my $name   = shift @format_spec;  # here keep it simple
        if ($format =~ /v/)   # variable length (no embedded spaces)
        {
            $regex .= '\s*(\S+)';
        }
        elsif ( (my $width) = $format =~ /\s*f(\d+)/)   # fixed length, means
                                                        # embedded spaces
        {
            $regex .= '\s*(.{' . "$width})";   # \s cannot be within ""
        }
        print "$regex\n";   # for debug, comment this out later
    }
    my (@tokens) = $line2 =~ /$regex/;
    print join ("|", @tokens), "\n";

The regex is built like this:

    ^\s*(\S+)
    ^\s*(\S+)\s*(.{8})
    ^\s*(\S+)\s*(.{8})\s*(.{11})
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)\s*(.{7})
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)\s*(.{7})\s*(\S+)
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)\s*(.{7})\s*(\S+)\s*(\S+)
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)\s*(.{7})\s*(\S+)\s*(\S+)\s*(\S+)
    ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)\s*(.{7})\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)

The "|" separated line is like this:

    123445678|45612 11|Steve Smith|11012015|16|1001 Main Street|GA|7|Atlanta|30553|x|y|z

Of course the fixed length fields can have trailing spaces, but that is easy to get rid of:

    @tokens = map{s/\s*$//; $_;}@tokens; #delete trailing spaces

or some similar formulation. Also, a very long but simple (no backtracking) regex can execute quite quickly. I doubt that a regex approach will be a performance problem even if the regex is so long that it is incomprehensible to a human.
Re: How to process variable length fields in delimited file. by Tux (Canon) on Oct 06, 2016 at 07:40 UTC
If the first 10 fields are of fixed length, I'd use `unpack` on that part. Using `A10` for a 10-character-wide field will strip the trailing spaces. Work on from there.

    my ($ssn, $empno, $empname, ...) = unpack "A10 A20 A12 ...", $buffer;

Enjoy, Have FUN! H.Merijn
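To make the space-stripping behaviour of `A` concrete, here is a minimal sketch. The layout (a 10-character id, a 20-character name, then the rest of the line) and the helper name `split_fixed` are assumptions made up for this illustration; they are not the real widths of the poster's file.

```perl
use strict;
use warnings;

# Hypothetical layout: a 10-character id, a 20-character name, then the rest.
# 'A' fields have trailing spaces stripped by unpack; 'A*' takes what is left.
sub split_fixed {
    my ($buffer) = @_;
    return unpack 'A10 A20 A*', $buffer;
}

# Build a space-padded sample record so the sketch is self-contained.
my $buffer = sprintf '%-10s%-20s%s', '123445678', 'Steve Smith', 'rest of line';
my ($id, $name, $rest) = split_fixed($buffer);

print "[$id] [$name] [$rest]\n";   # prints: [123445678] [Steve Smith] [rest of line]
```

The remainder captured by `A*` can then be handed to whichever approach you pick for the variable-length part.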
Re: How to process variable length fields in delimited file. by johngg (Canon) on Oct 06, 2016 at 11:20 UTC
This is similar to GrandFather's approach, moving along the line a field at a time, but it uses the `@fieldNames` array and a counter to determine whether we have an actual field or the width of the next field.

    use strict;
    use warnings;
    use feature qw{ say };

    open my $dataFH, q{<}, \ <<__EOF__ or die qq{open: < HEREDOC: $!\n};
    123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553
    234256653 76467 8 Joe Blow 06072014 11 83 Low Road CO 6 Denver 12345
    239879583 62098 10 Andy Pandy 03112012 13 10 The Strand NJ 13 Atlantic City 16345
    __EOF__

    my @fieldNames = qw{
        ssn empNo ncEmpName empName hireDate
        ncAddr addr state ncCity city zip
    };

    while ( <$dataFH> )
    {
        chomp;
        my $fieldCt = 0;
        my @fields;
        while ( length )
        {
            s{^\s*}{};
            # Take the next word; stop if only whitespace was left
            last unless s{^(\S+)}{};
            my $next = $1;
            if ( $fieldNames[ $fieldCt ] =~ m{^nc} )
            {
                # $next is a character count: take that many characters
                s{^\s}{};
                push @fields, substr $_, 0, $next, q{};
                $fieldCt ++;
            }
            else
            {
                push @fields, $next;
            }
            $fieldCt ++;
        }
        say join q{|}, @fields;
    }

The output.

    123445678|45612|Steve Smith|11012015|1001 Main Street|GA|Atlanta|30553
    234256653|76467|Joe Blow|06072014|83 Low Road|CO|Denver|12345
    239879583|62098|Andy Pandy|03112012|10 The Strand|NJ|Atlantic City|16345

I hope this is of interest.

Cheers, JohnGG
Re: How to process variable length fields in delimited file. by shadowsong (Pilgrim) on Oct 06, 2016 at 11:37 UTC
Hi dbach355,

Seeing as how "the number of fields is fixed, the number of fixed field lengths and variable field lengths varies": if all you need is another file with custom delimiters, you can achieve this with a one-liner:

    perl -lawpe "$_=qq|$F[0]\f$F[1]\f$F[2]\f$F[3]|" in.txt > out.txt

The offset within the @F array denotes each input field in your file, so offset 0 would represent ssn, offset 1 employee number and so on; simply craft your output line how you'd like it.

See http://www.perl.com/pub/2004/08/09/commandline.html for additional command line options.

Cheers, Shadowsong
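Whichever custom delimiter is chosen (\f, "|", a tab), it is only safe if it never occurs in the field values themselves. A minimal sketch of that check follows; `pick_delimiter` is a hypothetical helper name and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first candidate delimiter that does not
# occur in any of the given field values, or undef if none is safe.
sub pick_delimiter {
    my ($fields, @candidates) = @_;
    for my $delim (@candidates) {
        return $delim unless grep { index($_, $delim) >= 0 } @$fields;
    }
    return;
}

# The third value contains a pipe, so "|" is rejected and the tab is chosen.
my @fields = ('123445678', 'Steve Smith', 'a|b', '1001 Main Street');
my $delim  = pick_delimiter(\@fields, '|', "\t", "\f");

print defined $delim ? "safe delimiter found\n" : "no safe delimiter\n";
print join($delim, @fields), "\n" if defined $delim;
```

For a large file the same test would have to run over every record, or the output could be written with a CSV module that quotes fields instead.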
Re: How to process variable length fields in delimited file. by dbach355 (Initiate) on Oct 06, 2016 at 13:28 UTC
Thank you all for your responses. I will review them for a better understanding. I am very much a novice in Perl, so I would like to read the details to get a good understanding of the proposed methods. I could have posted the exact layout to begin with, but I did not want to get too long-winded. At times, though, details are better.

One reason I was thinking of using the \f character is that I don't care about printing the data (I say that now); once the data is in a readable delimited file it will pass to the SPLUNK application for end use. The problem with the data is that just about every character appears in the text. There are maybe 1,0000,000 lines of text a day and, as the message below shows, this is text from network devices, which includes characters such as #@$^|}{[]<> and about every character I could think of. There were tabs in there too. I finally grepped the file for several days of output and did not find a \f. The other possibility is to use a multicharacter delimiter such as @#!, which is unlikely to occur together in standard text.

Here is the devil in the details of the true layout and an example of 1 data line. I will review and, when I have time, comment on the solution. Thank you all.

For each message:
 1. Record Starter: "====>"
 2. Message ID (uuid)
 3. Condition ID (uuid, for future use)
 4. Network Type of message node:
        IP Node      1
        Non IP Node  5
 5. IP Address (see A.)
 6. String length of the nodename
 7. Nodename
 8. Network Type of message generation node (see 4.)
 9. IP Address of message generation node (see A.)
10. String length of the message generation nodename
11. Nodename of message generation node
12. Log only flag
13. Unmatched flag
14. Message source type
        Console      0x0001
        Message API  0x0002
        Logfile      0x0004
        Monitor      0x0008
        SNMP         0x0010
        Server MSI   0x0020
        Agent MSI    0x0040
        Legacy Link  0x0080
        Schedule     0x0100
        Internal     0x1000
        Subproduct   0x2000
15. Notification flag
16. Trouble ticket flag
17. Acknowledge on troubleticket flag
18. Message creation date and time (see B. for the format)
19. Message receipt date and time (see B. for the format)
20. Unbuffer time
21. Severity
        UNKNOWN   0x01
        NORMAL    0x02
        WARNING   0x04
        CRITICAL  0x08
        MINOR     0x10
        MAJOR     0x20
22. Status of the auto action
        Failed     2
        Started    8
        Finished   9
        Defined    11
        Undefined  12
23. Network Type of auto action node (see 4.)
24. IP address of the node where the auto action is executed (see A.)
25. String length of the nodename where the auto action is executed
26. Nodename of the node where the auto action is executed
27. Auto action creates annotation flag
28. Acknowledge flag of the auto action
29. Status of the operator initiated action (see 15.)
30. Network Type of operator initiated action node (see 4.)
31. IP address of the node where the operator initiated action is executed
32. String length of the nodename where the oper. initiated action is executed
33. Nodename of the node where the operator initiated action is executed
34. Operator initiated action creates annotation flag
35. Acknowledge flag of the operator initiated action
36. Time and date when the message has been acknowledged (see B. for the format)
37. String length of the operator who has acknowledged the message
38. Name of the operator who has acknowledged the message
39. String length of message source
40. Message source
41. String length of application
42. Application
43. String length of messagegroup
44. Messagegroup
45. String length of object
46. Object
47. String length of notification service name(s)
48. Notification service name(s)
49. String length of auto action call
50. Auto action call
51. String length of operator initiated action call
52. Operator initiated action call
53. String length of message text
54. Message text
55. String length of original message text
56. Original message text
57. Number of annotations
58. String length of message type
59. Message type
60. Escalate flag
61. Assign flag
62. Escalation type
63. Date and time when the message was escalated (see B. for the format)
64. Network Type of escalation node (see 4.)
65. Escalation server IP address
66. String length of escalation server node name
67. Escalation server node name
68. String length of the operator who has escalated the message
69. Name of the operator who has escalated the message
70. Instruction type:
        No instruction         0
        Instruction text       1
        Instruction Interface  2
        Internal instruction   3
71. Read only flag
72. Original message number (uuid)
73. Time difference in seconds between agent time zone and GMT
74. String length of instruction ID or name
75. Instruction ID, instruction interface name or message numbers of internal instructions (depends on instruction type)
76. Length of Instruction Interface parameters
77. Instruction Interface parameters
78. String length of service name
79. Service name
80. String length of message key
81. Message key
82. Duplicate count
83. Date/time when last duplicate was received (see B. for the format). This field is 0 if message has no duplicates.
84. CMA count. Number of custom message attributes.

For each CMA:
 1. CMA record starter: "CMA"
 2. String length of the CMA name
 3. CMA name
 4. String length of the CMA value
 5. CMA value

For each annotation:
 1. Annotation record starter: "ANNO"
 2. Date and time of the annotation (see B. for the format)
 3. Annotation number
 4. String length of the author of the annotation
 5. Author of the annotation
 6. String length of the annotation text
 7. Annotation text

A. All IP addresses are in binary format. The following script can be used to convert the IP address:

    #cat convert.sh
    #!/bin/ksh
    # convert.sh
    # usage convert <IP_ADDRESS_IN_BINARY_FORMAT>
    OPC_IP_ADDR=$(echo $1| awk '{printf("%d.%d.%d.%d\n", \
    ((int($1)/16777216)%256), \
    ((int($1)/65536)%256), \
    ((int($1)/256)%256), \
    ((int($1))%256) \
    )}')
    echo "$1 = ${OPC_IP_ADDR}"
    #end of convert.sh

B.
All time specifications are in seconds since 1.1.1970 GMT 1 Example data line ====> 064191a8-7db9-71e6-12cc-abbb01aa0000 45f86528-d563-71e0-03bd-8a2 +39ed50000 1 175337506 39 router174.network.microsoft.com 1 -141380770 +2 44 syslog152.network.microsoft.com 1 0 4 0 0 0 1474214430 147421443 +1 0 2 12 0 0 0 0 0 12 0 0 0 0 0 1474214431 3 OpC 22 GNS_IOS_SYSLOG_ +2(1.71) 35 SYSLOG-cisco-ios-RADIUS-SERVERALIVE 4 DATA 13 mxgamdrnb08e + 0 0 0 116 RADIUS-6-SERVERALIVE: Group ACCT_GROUP: Radius server 1 +7.24.174.55:1645,1646 is responding again (previously dead). 235 2016 +-09-18T10:59:45.932408-05:00 mxgamdrnb08e.microsoft.com local7.info 2 +1395: Sep 18 15:59:44.907 GMT: %RADIUS-6-SERVERALIVE: Group ACCT_GRO +UP: Radius server 17.24.174.55:1645,1646 is responding again (previou +sly dead). 0 0 0 0 0 0 0 0.0.0.0 0 0 0 0 0000000000000000000000000 +00000000000 18000 0 0 44 systlog152.network.microsoft.com 70 SYSLOG +:mxgamdrnb08e:RADIUS-SERVER_STATUS:17.24.174.55:1645,1646:good 0 1474 +214431 20 CMA 15 ATRIUM_CATEGORY 6 SWITCH CMA 13 ATRIUM_IMPACT 0 CMA + 17 ATRIUM_IP_ADDRESS 12 10.15.212.34 CMA 15 ATRIUM_MAILCODE 7 GA8-89 +5 CMA 19 ATRIUM_MANUFACTURER 5 CISCO CMA 17 ATRIUM_NODE_GROUP 50 MANA +GENOC DATA SITE TYPE A2 CSCTG62793_DISABLE_RD CMA 15 ATRIUM_PRIORITY + 10 PRIORITY_5 CMA 14 ATRIUM_PRODUCT 18 Catalyst 3560x-24P CMA 13 ATR +IUM_REGION 2 US CMA 17 ATRIUM_SITE_GROUP 5 US-GA CMA 14 ATRIUM_URGENC +Y 0 CMA 13 ATRIUM_ciName 12 MXGAWDRNB08E CMA 13 MSC_IN_ATRIUM 1 Y CM +A 11 EventSource 10 MS_Network CMA 15 REMEDY_ticketID 1 N CMA 14 cond +ition_name 55 SYSLOG-cisco-ios-RADIUS-SERVERALIVE (resolution) [1628] + CMA 15 gns.alarm.class 8 BreakFix CMA 15 gns.alarm.state 10 REGISTER +ED CMA 19 gns.alarm.subobject 22 17.24.174.55:1645,1646 CMA 25 gns.cm +db.auto.ticket.flag 4 none [download]	[reply] [d/l]
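Since the rest of the processing is likely to happen in Perl, the `convert.sh` helper from note A could also be written as a small Perl function. This is a sketch, not part of the original specification: the name `ip_to_dotted` is hypothetical, and treating negative values as signed 32-bit integers is an assumption made here (the shell script above does not say how it handles them).

```perl
use strict;
use warnings;

# Convert an IP address stored as a decimal 32-bit integer to dotted-quad form.
# Assumption: a negative value is a signed 32-bit integer, so it is masked
# to its unsigned equivalent before being split into four bytes.
sub ip_to_dotted {
    my ($value) = @_;
    my $unsigned = $value & 0xFFFFFFFF;
    return join '.', unpack 'C4', pack 'N', $unsigned;
}

# 175337506 is the value of field 5 in the example data line.
print ip_to_dotted(175337506), "\n";   # prints 10.115.112.34
```

The same function can be applied to fields 9, 24, 31 and 65 once the record has been split.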
Re^2: How to process variable length fields in delimited file. by shmem (Chancellor) on Oct 06, 2016 at 20:10 UTC
"Here is the devil in the details of the true layout and an example of 1 data line"

The squirrel is always in the details, since the devil is a squirrel. But I can't help you here with the data you provided (only one record? seriously?) since in "39 router174.network.microsoft.com", well, "router174.network.microsoft.com" is just 31 chars long, not 39. Even with a NULL terminator it would be 32 chars long, not 39. Hence, the following is just bull. You know, garbage in => garbage out.

    use strict;
    use warnings;

    my (@lengths, %names);

    while (<>) {
        s/\r?\n//; # strip line endings
        # get field numbers and field description
        if (/\s{2,3}(\d+)\. (.+)/) {
            my ($number, $text) = ($1,$2);
            $number--; # since first element of an array is 0, not 1
            # if this field denotes string length, store it
            # (only the first time this field number is seen)
            if ($text =~ /string length/i and not exists $names{$number}) {
                push(@lengths, $number);
            }
            # remember field number and text (only if not previously seen)
            $names{$number} = $text unless $names{$number};
            next; # nothing else to do for this line.
        }
        # now process the one line of data, if at hand
        if (/^====>/) { # Record Starter, right?
            # split line at whitespace
            my @array = split;
            # for all length indicators, concatenate
            # subsequent array elements into one
            # complain if the size doesn't fit
            for my $index (@lengths) {
                my $length = $array[$index];
                my @parts;
                # join array elements with space to rebuild the field
                while (length("@parts") < $length and $index + @parts + 1 < @array) {
                    push @parts, $array[$index + @parts + 1];
                }
                my $string = "@parts";
                warn "length mismatch for $string: $length <=> " . length($string) . "\n"
                    if length($string) != $length;
                # replace the concatenated elements with the rebuilt field
                splice @array, $index + 1, scalar @parts, $string;
            }
            # done, output the fields
            for (sort {$a <=> $b} keys %names) {
                print "$names{$_}: $array[$_]\n";
            }
        }
    }

perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^3: How to process variable length fields in delimited file. by dbach355 (Initiate) on Dec 20, 2016 at 18:16 UTC
You are correct, sorry about giving only 1 record. And some have additional info. Without getting too lengthy, I included 10 records. I am reviewing comments and proceeding.
Re: How to process variable length fields in delimited file. by dbach355 (Initiate) on Oct 17, 2016 at 20:02 UTC
Thank you all for the comments. I have not had a chance to review and test to see if I understand. The job gets in the way :) And the company is forcing 2 weeks off, so I have been trying to clean up some old tasks. Again, thank you and I will update. David