 
PerlMonks  

Parsing a text file without newlines

by existem (Sexton)
on Dec 14, 2004 at 11:04 UTC ( #414675=perlquestion )

existem has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone
I have a weird problem... I have a CSV file. I would usually load up the file and parse it line by line, but for some reason all the lines appear to be stuck together, so it is one long string of concatenated rows. I presume this is because there are no line breaks? But when I load the file up in my text editor it shows the file as you would expect, with each row on a new line. So I'm a bit confused.

Here is an example of the code I am using:

open FH, "< $csv_file" or die "$!";
foreach my $offer (<FH>) {
    print "$offer\n";
    my @fields = split(',',$offer);
}
close(FH) or die "$!";

And here are the first few rows of the CSV file:

producttype,price,imageurl,itemname,description,ctpage,smallimageurl,topseller,tagline
Lifestyle,19.9899997711,http://images.iwoot.com/medium/globri_med.jpg,Glow Brick,This light leads an exotic double life. By day it's a quiet mild-mannered acrylic brick; by night it's a show-stopping illumination that lights up as darkness falls...,http://www.dgm2.com/m/iwanto/b.asp?a=1081&i=14497&c=http://www.iwantoneofthose.com/GLOBRI.htm,http://images.iwoot.com/thumbs/globri_thu.jpg,0,
Toys & Games,11.9899997711,http://images.iwoot.com/medium/glofri_med.gif,Glow in the dark frisbee,The 175g Discraft Ultra-Star is the official disc of the Ultimate Players Association - but who cares it's a Frisbee that glows in the dark so it's a blast in the park!,http://www.dgm2.com/m/iwanto/b.asp?a=1081&i=14497&c=http://www.iwantoneofthose.com/GLOFRI.htm,http://images.iwoot.com/thumbs/glofri_thu.gif,0,
Toys & Games,24.9899997711,http://images.iwoot.com/medium/glx200_med.gif,AIR ROCKET - GLX-200,The GL-X200 Rocket Launcher powers rockets up to 250 feet into the air.,http://www.dgm2.com/m/iwanto/b.asp?a=1081&i=14497&c=http://www.iwantoneofthose.com/GLX200.htm,http://images.iwoot.com/thumbs/glx200_thu.jpg,0,
Electronics,54.9900016785,http://images.iwoot.com/medium/surhea_med.jpg,5.1 Surround Sound Headset,Imagine a five speaker surround-sound system buried in your head without the medical complications and you're getting close to the experience of wearing this awesome set of headphones.,http://www.dgm2.com/m/iwanto/b.asp?a=1081&i=14497&c=http://www.iwantoneofthose.com/HEASET.htm,http://images.iwoot.com/thumbs/surhea_thu.jpg,0,

Any ideas how I can parse the file?

Thanks, Tom

Replies are listed 'Best First'.
Re: Parsing a text file without newlines
by si_lence (Deacon) on Dec 14, 2004 at 11:15 UTC
    If you see newlines in your text file then maybe your record separator
    variable has been set to undef (undef $/;) somewhere in
    your code.
    But for reading csv files the module Text::xSV might be worth
    looking at anyway
    si_lence
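
To illustrate the point about the record separator, here is a minimal sketch in core Perl. The sample data and the helper name count_records are made up for the example; it only shows how the value of $/ changes what <$fh> returns, reading from an in-memory filehandle instead of a real CSV file.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the CSV file.
my $data = "a,b\nc,d\ne,f\n";

# Count how many records <$fh> returns for a given value of $/.
sub count_records {
    my ($separator) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return scalar @records;
}

print count_records("\n"), "\n";     # prints 3: one record per line
print count_records(undef), "\n";    # prints 1: the whole file is one record
```

If some other part of the script has cleared $/, wrapping the read in a block with `local $/ = "\n";` should bring back line-by-line reading without disturbing the rest of the code.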
Re: Parsing a text file without newlines
by rev_1318 (Chaplain) on Dec 14, 2004 at 11:22 UTC
    I can't see anything wrong with this code-sample. Is $/ redefined somewhere in the code we don't see?
    Otherwise, my guess would be that your data originated from *nix or MacOS and you are using a Windows environment when parsing the file.
    If so, make sure you transfer the file as a text file, not a binary one.

    Paul

      Well, I'm running the script on a Linux machine, but I'm guessing the CSV file was probably created on Windows and has some funny Windows encoding or something like that...?

      I tried using Text::CSV_XS

      my $csv = Text::CSV_XS->new({ 'quote_char' => '"', 'escape_char' => '"', 'sep_char' => ',', 'binary' => 1 });

      I changed binary to 1 and that solved a similar problem I had with another csv file, but hasn't made a difference to this file.

      Is there a setting somewhere for different types of files, like those created on windows or those from unix?

        It sounds like you need to set the eol flag. Look at the file in a hex editor to find out what the line endings really are (your editor may hide some things) or else just experiment with eol set to \012 or \015\012 or \015.
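
If no hex editor is at hand, a rough alternative is to count the line-ending sequences directly. This is only a sketch using core Perl; count_line_endings is a hypothetical helper name, and the file name is assumed to come from the command line.

```perl
use strict;
use warnings;

# Classify the line endings found in a chunk of raw bytes.
# CR+LF is tried first so it is not counted as a lone CR plus a lone LF.
sub count_line_endings {
    my ($bytes) = @_;
    my %count = (CRLF => 0, CR => 0, LF => 0);
    while ($bytes =~ /(\015\012|\015|\012)/g) {
        my $eol = $1;
        if    ($eol eq "\015\012") { $count{CRLF}++ }
        elsif ($eol eq "\015")     { $count{CR}++ }
        else                       { $count{LF}++ }
    }
    return \%count;
}

# Hypothetical usage: perl count_eol.pl file.csv
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    binmode $fh;               # no newline translation on any platform
    local $/;                  # slurp the whole file
    my $bytes = <$fh>;
    $bytes = '' unless defined $bytes;
    close $fh;
    my $count = count_line_endings($bytes);
    printf "%-4s %d\n", $_, $count->{$_} for qw(CRLF CR LF);
}
```

Whichever count dominates suggests which of \015\012, \015 or \012 to try first.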
Re: Parsing a text file without newlines
by bart (Canon) on Dec 14, 2004 at 13:18 UTC
    My guess is that your local file has, for example, only CR characters as end of line markers, and you upload (FTP) in text mode... Whoops, all the CR characters are gone. Are you by any chance using a Mac on either side of the connection?

    Anyway, if that's the cause, think about uploading in binary mode, which may require an extra conversion on one side of the transmission, and for which you may even use a perl script to fix it. Either that, or you make your parsing script more flexible, so it accepts any conventional line endings (CR only, LF only, CR+LF).
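
One way to make the parsing script accept any of the three conventional line endings is to read the whole file and split it yourself. The following is a sketch under that assumption; split_rows is a hypothetical helper, and the sample string stands in for the slurped file content.

```perl
use strict;
use warnings;

# Split raw file content into rows, accepting CR+LF, lone CR, or lone LF.
sub split_rows {
    my ($content) = @_;
    return split /\015\012|\015|\012/, $content;
}

# Hypothetical sample: three rows, each with a different line ending.
my $content = "a,b\015\012c,d\015e,f\012";
for my $row (split_rows($content)) {
    # The -1 limit keeps trailing empty fields, such as an empty last column.
    my @fields = split /,/, $row, -1;
    print join('|', @fields), "\n";
}
# prints:
# a|b
# c|d
# e|f
```

Splitting on a bare comma is only safe while no field contains a quoted comma; for real CSV data a dedicated module such as the ones mentioned in this thread is the more robust choice.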

    Your file, as you have shown here, is kaput. Upload it again, more carefully this time.

      You've got that reversed. FTP's text mode (usually the default or enabled with a command ascii) converts line endings to platform native; binary preserves the contents of the file verbatim.

        The thing is, uploading a file in text mode from, say, Windows to Linux will simply strip all CR characters, whether there are LF characters present or not. All you have left is one long line.
Re: Parsing a text file without newlines
by Animator (Hermit) on Dec 14, 2004 at 15:43 UTC
    Your line-endings are incorrect.

    Either the file is generated incorrectly or you uploaded it to a different OS using Binary-transfer mode (which you shouldn't do since it's a Text-file, and Text is ASCII).

    My guess is that your data has line ending \x0A (or maybe \x0D) and that $/ is set to \x0D (or \x0A)...

Re: Parsing a text file without newlines
by Anonymous Monk on Dec 14, 2004 at 15:55 UTC
    Tom... I ran across a similar problem myself with some WWW stuff. There's no standard definition of a simple text file shared between DOS, Windows, Unix, etc. Windows uses <CR><LF> (two characters) as its line ending. Other systems use just a <CR>, which is "\r" (\015), while others use just a <LF>, which is "\n" (\012) on Unix. "Smart" editors accommodate these differences. Try making a small change and saving the file; some editors will convert to your machine's native line endings. Lastly, try doing a hex or octal dump of the source file and parse the lines based on what you see there.
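
A hex dump of the first part of the file can also be produced from Perl itself. This is a minimal sketch; hex_bytes is a hypothetical helper, and the 256-byte limit and command-line file name are arbitrary choices for the example.

```perl
use strict;
use warnings;

# Render bytes as space-separated hex pairs, so that 0d (CR) and 0a (LF)
# are easy to spot.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# Hypothetical usage: perl hexhead.pl file.csv
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    binmode $fh;                            # read raw bytes
    my $chunk = '';
    defined read($fh, $chunk, 256) or die "Cannot read $ARGV[0]: $!";
    close $fh;
    print hex_bytes($chunk), "\n";
}
```

For example, hex_bytes("a\015\012") gives "61 0d 0a", which is what a Windows-style line ending after the letter a looks like.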
