PerlMonks  

Re: Suggestion for regular expression speed improvement.

by Corion (Patriarch)
on Jun 15, 2009 at 12:00 UTC


in reply to Suggestion for regular expression speed improvement.

It's likely much better to use Text::CSV_XS or a simple split on /\t/ here; a regular expression is overkill for tab-separated data.
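A minimal sketch of the split approach, assuming one tab-separated record per line. The sample line and variable names are made up for illustration and are not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample record: three tab-separated fields.
my $line = "2009-06-15\tERROR\tdisk full\n";

chomp $line;

# The -1 limit keeps trailing empty fields instead of silently dropping them.
my @fields = split /\t/, $line, -1;

print scalar(@fields), " fields: ", join(' | ', @fields), "\n";
```

The same call works unchanged whether a line carries 3 fields or 25.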

Replies are listed 'Best First'.
Re^2: Suggestion for regular expression speed improvement.
by bala.linux (Novice) on Jun 15, 2009 at 12:32 UTC
    Thanks. Your suggestion works well for properly separated log files such as CSV. But I want my code to work with regular expressions so that I can parse any format of logs. I hope you understand my problem. So, unfortunately, I cannot use split or CSV modules :(
      I do not understand this response. Using a regex such as you described is less flexible than using split, not more flexible: The regex will only match on lines containing at least 25 tab-separated fields. If there are fewer fields, it will fail to match and return no data. If there are more, then some fields will not be separated from each other and returned as a single field [1]. split will work with any number of tab-separated fields right out of the box.

      Going beyond split to a proper CSV-handling module, you will be able to not only read arbitrary numbers of tab-separated columns, but it will also give you the ability to recognize quoting of the fields, so that they can contain embedded tabs without causing false field separations. Accomplishing this with regexes is messy, at best.

      [1] ...unless you switch from (.+) to ([^\t]+), in which case it will only match lines containing exactly 25 fields.
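The difference can be sketched on a smaller scale. The example below uses three fields in place of the 25 from the original question (an assumption made only to keep it short), and the helper name field_counts is hypothetical.

```perl
use strict;
use warnings;

# Three capture groups stand in for the 25 of the original question.
my $greedy_re = qr/^(.+)\t(.+)\t(.+)$/;                 # (.+) may swallow tabs
my $strict_re = qr/^([^\t]+)\t([^\t]+)\t([^\t]+)$/;     # exactly three fields

# Return how many values each approach yields for one line:
# (greedy regex captures, strict regex captures, split fields).
sub field_counts {
    my ($line) = @_;
    my @greedy = $line =~ $greedy_re;   # empty list when the match fails
    my @strict = $line =~ $strict_re;
    my @split  = split /\t/, $line, -1;
    return (scalar @greedy, scalar @strict, scalar @split);
}

for my $line ("a\tb", "a\tb\tc", "a\tb\tc\td") {
    my ($greedy, $strict, $split) = field_counts($line);
    (my $shown = $line) =~ s/\t/<TAB>/g;
    printf "%-20s greedy=%d strict=%d split=%d\n", $shown, $greedy, $strict, $split;
}
```

On the four-field line the greedy pattern still reports three captures, because its first group absorbs an embedded tab; the strict pattern fails outright, and only split returns all four fields.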

        I just want to give you an example. The logs that I need to parse will not have a single definite separator like , or tab, although my question used a simple tab-separated format. I would be parsing lines of this format:
        A=XX;Testing of YY;ZZ;Criticality:WW
        In the above line, I may need to extract XX, YY, ZZ and WW. By allowing regular expressions, I would be able to achieve that with grouping.
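For the sample line above, a grouping regex might look like the following sketch. It assumes the literal parts ("A=", "Testing of ", "Criticality:") are fixed and that ';' never appears inside a value; neither assumption is confirmed in the thread.

```perl
use strict;
use warnings;

my $line = 'A=XX;Testing of YY;ZZ;Criticality:WW';

# Each of the first three groups stops at the next ';' so that one field
# cannot swallow its neighbour; the last group runs to the end of the line.
my $re = qr/^A=([^;]+);Testing of ([^;]+);([^;]+);Criticality:(.+)$/;

my @got = $line =~ $re;

if (@got) {
    print join(', ', @got), "\n";
} else {
    warn "line did not match: $line\n";
}
```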
      …so that I can parse any format of logs.

      Can you elaborate how you hope to handle "any format" with regular expressions?

        By "any format", I meant single-line entries in different formats. Users supply a pattern that matches their lines and use groups to indicate what they need; we then process only the grouped strings. This does not cover logs that spread one event, such as a mail delivery, over multiple lines, like qmail logs :)
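One way such a user-supplied pattern could be applied is sketched below. The helper name extract_groups and the sample pattern are hypothetical, not taken from the poster's code.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern and return whatever
# its capture groups matched on one line. Returns an empty list when the
# line does not match. (A pattern without groups returns 1 on a match.)
sub extract_groups {
    my ($pattern, $line) = @_;

    # qr// dies on a malformed pattern, so trap that and report it clearly.
    my $re = eval { qr/$pattern/ };
    die "invalid pattern '$pattern': $@" unless defined $re;

    return $line =~ $re;
}

my @groups = extract_groups('Criticality:(\w+)$',
                            'A=XX;Testing of YY;ZZ;Criticality:WW');
print "@groups\n";
```

Compiling the pattern once per log file rather than once per line would be the obvious next step if speed matters.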
