PerlMonks  

Re^2: How to select specific lines from a file

by Laurent_R (Canon)
on Apr 29, 2014 at 21:05 UTC


in reply to Re: How to select specific lines from a file
in thread How to select specific lines from a file

I love it. "Taking the last line..." (which is line 570 of the sample input): 66 occurs in the 6th column of the first row and the 293rd row, 16.840 occurs in the 9th column of line 293 from the sample input, and 'B' occurs on the last row and the 293rd, but not on the first.

You love it? So do I. I also spent quite a bit of time trying to figure out which line the OP was really talking about.

Anyway, this is fixed-width stuff, so you should probably be thinking in terms of unpack or substr rather than regular expressions.

I definitely agree that unpack and substr are the most efficient solutions in terms of computing resources (especially unpack, most probably). But, picking up on your remark about the programmer's standpoint, and assuming that the file is just a few hundred or a few thousand lines long, I might just as well consider a regular expression: not one similar to what the OP posted, but a very simple one in a call to the split function. Sometimes, with data looking similar to the OP's data, I find it easier to use something like:

my ($key, $value, $predicate) = (split /\s+/, $line)[0,3,7];
rather than having to compute the exact position of each piece of data (and testing to make sure that I don't have an off-by-one error). But I do that only insofar as I am reading a relatively small parameter or reference data file before having to process very large or sometimes huge data sets.
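
As a minimal sketch of that idea: the record below is made up for illustration (it is not taken from the OP's file), and pick_fields is a hypothetical helper name. One detail worth knowing is that split ' ' skips leading whitespace, whereas split /\s+/ returns an empty first field when the line starts with whitespace, which shifts every index by one.

```perl
use strict;
use warnings;

# Hypothetical record; the column contents are invented for illustration.
my $line = "  rec01 10 20 alpha x 66 B 16.840\n";

# Return fields 0, 3 and 7 of a whitespace-separated line.
# split ' ' (a single-space string) skips leading whitespace,
# so the indices stay stable even if the line is indented.
sub pick_fields {
    my ($record) = @_;
    return (split ' ', $record)[0, 3, 7];
}

my ($key, $value, $predicate) = pick_fields($line);
print "$key $value $predicate\n";

# For comparison: /\s+/ keeps an empty leading field when the line
# starts with whitespace, shifting every index by one.
my @shifted = split /\s+/, $line;
print "first field with /\\s+/ is empty\n" if $shifted[0] eq '';
```

If the lines are known never to start with whitespace, the two forms behave the same.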

(Typically, my reference data files have a few hundred or a few thousand lines, while the real data files to be analyzed have at least tens of millions of lines, sometimes hundreds of millions. In such cases, I really don't mind spending a split second more reading the reference data if I know that processing the main data will take 20 minutes anyway. In other words, I would most probably use the substr or unpack function for the main data, if appropriate, but I don't mind using a slightly slower process for small reference data if it saves me some development time and makes the code easier to understand at first glance when I have to maintain it.)

But this was just a side note about some rather specific situations; otherwise, I fully agree with just about everything that you said.

Re^3: How to select specific lines from a file
by davido (Cardinal) on Apr 29, 2014 at 22:14 UTC

    I like your comment, and intend to upvote it. In this specific case it appears that the data set is tame enough that the distinction between fixed-width and space-delimited is moot.

    However, one general principle that I try to adhere to is placing as few demands on a data set as possible. This concept can be generalized from some lessons I learned by reading Effective STL, where Scott Meyers makes some strong cases for why a template container class should place as few requirements on the objects it contains as possible. I'd love to go into the details, but it's a big enough concept that I probably wouldn't do it justice in a simple PerlMonks node.

    Let's take it as a given, then, that the generalized practice of placing as few demands as possible on an entity that we don't control is "a good thing". In particular, doing so helps to simplify our parser, allows us to unambiguously reject data that is broken, and probably even makes it easier to generate valid data.

    So what is the simplest, least demanding set of requirements that we can place on our OP's data? As we look it over, it becomes pretty obvious that it is fixed-width, and that it is space delimited. ...or is it? What if one of those numeric fields (66, for example) extends to four digits? We already see in his data set places where it extends to three digits. A fourth would cause it to run up against our "[AB]" field. So there's one requirement we have to place on the data set: no column can become filled to the point that it touches the one next to it. 1000 is illegal for the 6th field. Maybe this is reasonable, but I don't know. I do know that as 66 grows to 100, the field widths haven't shifted, so that field size must always be four or less. But I don't know if four digits is a possible in-range value.

    What about blank fields? The user's data set example has no blank fields (that I can detect, though there are some big gaps). \s+ delimited data requires that every field contain something. There's another demand placed on our data set, or if not placed on the data set, another ambiguity that our parser must deal with.
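
    To make the blank-field problem concrete, here is a small sketch. The three-column layout (widths 6, 4 and 8), both records, and the helper names by_split and by_unpack are hypothetical, chosen only to illustrate the point. When the middle column is blank, whitespace splitting silently returns two fields instead of three, while unpack with a fixed template still returns three.

```perl
use strict;
use warnings;

# Hypothetical layout: three columns of width 6, 4 and 8.
my $template = 'A6 A4 A8';

my $full  = sprintf '%-6s%-4s%-8s', 'id01', '66', '16.840';
my $blank = sprintf '%-6s%-4s%-8s', 'id02', '',   '17.250';

# Split on whitespace: a blank column vanishes and later fields shift left.
sub by_split  { my ($rec) = @_; return split ' ', $rec; }

# unpack with a fixed template: 'A' strips trailing spaces, so a blank
# column comes back as an empty string in its proper position.
sub by_unpack { my ($rec) = @_; return unpack $template, $rec; }

for my $rec ($full, $blank) {
    my @s = by_split($rec);
    my @u = by_unpack($rec);
    printf "split: %d fields, unpack: %d fields\n", scalar @s, scalar @u;
}
```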

    Next, by looking at his data it seems obvious that there cannot be embedded spaces. However, that is not just an observation, it's a requirement placed on the data. If a field ever changes such that it allows embedded spaces, our parser breaks. And if that ever happens, we run into all sorts of additional demands for our data; embedded spaces must be escaped or quoted, quotes must be balanced if used, embedded quotes must be escaped, and so on.

    This will probably never happen with the user's data set; it may never morph into something more complex. Splitting on space may forever be fine. ...it will have to be fine because the parser now demands it. It can never be permitted to morph into something that includes, for example, a notes field (unless it's in the last position, which is another requirement placed on the data and another rule for the parser), completely full fields, or blank fields.

    So here are the choices for how we can parse fixed-width data:

    1. As fixed width: Must be fixed width.
    2. As space delimited: No full fields, no blank fields, no embedded spaces.

    The first rule seems to be the most likely for this data set. If we treat it as fixed width, we impose only one requirement. And probably that requirement is already part of the implementation of the producer. If we treat fixed width data as space delimited, we impose three additional restrictions on the data. Treat fixed width as fixed width for the most robust solution.
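
    A minimal sketch of the fixed-width approach, under stated assumptions: the column widths (4, 4, 1, 8), the field names, and the parse_record helper are all hypothetical, since the real layout would have to come from whoever produces the file. The only demand made on each record is its total width, so a completely full numeric column that touches its neighbor still parses.

```perl
use strict;
use warnings;

# Hypothetical layout as name/width pairs. Real widths would come from
# the producer's format description, not from this sketch.
my @layout = (id => 4, count => 4, flag => 1, score => 8);

my @names;
my $template = '';
my $width    = 0;
while (my ($name, $len) = splice @layout, 0, 2) {
    push @names, $name;
    $template .= "A$len";
    $width    += $len;
}

# Parse one record; die if it is not exactly the expected width.
sub parse_record {
    my ($rec) = @_;
    chomp $rec;
    die "bad record width: " . length($rec) . "\n"
        unless length($rec) == $width;
    my @values = unpack $template, $rec;
    s/^\s+// for @values;    # 'A' strips trailing spaces only
    my %field;
    @field{@names} = @values;
    return \%field;
}

# The count column is completely full (4 digits) and touches the flag.
my $rec = sprintf '%-4s%4d%1s%8.3f', 'r01', 1000, 'B', 16.84;
my $f   = parse_record($rec);
print "$f->{id} $f->{count} $f->{flag} $f->{score}\n";
```

    Splitting the same record on whitespace would yield only three fields, because "1000B" has no space in it.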


    Dave

      Yes, Dave, you are absolutely right. The thing that I did not say is that, when I have such parameter files or reference data, I am usually extracting the data myself (or it is done by one of my colleagues) from another system, so that I know exactly what I can demand from the data, or we have something called an interface agreement specifying exactly what the data should look like.

      When the data comes from an unknown source or the exact format cannot be known for certain, then we are left with trying our best to get the most out of it, and, in the case in point, I fully agree that treating the data as fixed format is the best that can be done on the basis of what the data looks like.

Re^3: How to select specific lines from a file
by monkey_boy (Priest) on Apr 29, 2014 at 21:29 UTC
