PerlMonks  

difficulty reading csv file

by foxycleop (Initiate)
on Aug 08, 2011 at 01:10 UTC ( [id://919131]=perlquestion )

foxycleop has asked for the wisdom of the Perl Monks concerning the following question:

Hello there, I am a new user here and new to Perl as well. I am wondering if someone can help me with an issue I am having with CSV files. I have downloaded CSV files from an internet source (it is marketing data). I have written a Perl program to read the contents of the file so that I can work with the data. However, when I read the file and print it to a different file or to the screen, I get output that looks something like this:

礀漀最愀 洀愀琀 cresce +nt moon yoga mat洀愀栀愀 礀&#2841 +6;最愀 洀愀琀 silk yoga mat bag&# +29952;渀椀焀甀攀 礀漀 +最愀 洀愀琀 tapas ultra yoga mat& +#30976;漀最愀 洀愀琀  +欀椀琀 monster yoga mat漀爀最&#24 +832;渀椀挀 挀漀琀琀&# +28416;渀 礀漀最愀 洀& +#24832;琀 designer yoga mat礀漀最愀&#81 +92;洀愀琀 眀椀瀀攀&#2 +9440; yoga mat care渀漀渀 琀漀&#3 +0720;椀挀 礀漀最愀 &# +27904;愀琀 power yoga mat漀洀 礀& +#28416;最愀 洀愀琀 wholesale yoga + products眀栀漀氀攀猀愀&#27 +648;攀 礀漀最愀 攀&#2 +8928;甀
After playing with the data in different ways, I found a couple of things:

1. If I copy the data (copy/paste) to a new CSV file, the output looks fine.
2. So I thought that since the data is fine, there must be something wrong with the file itself, as I downloaded it from the internet. I went into the file's properties and found the message: "This file came from another computer and might be blocked to help protect this computer". There was an unblock option right next to this message, which I pressed to unblock the file. I tried running the program again on the unblocked file, but the output is still unreadable.

Does anyone know if there is a fix for this? I have many files that I am working with, and it would be a big hassle to copy and paste the data into new files. It would defeat the whole idea of learning programming. Thanks for any help.

===================================

The original CSV looks like (it is all English):
utopian yoga mat 12 58 $0.95 2/98 ads 2/98 ads 105/307 days 2011/07/21 08:01:00 2010/09/20 10:28:00
crescent moon yoga mat 18 58 $0.56 3/140 ads 1/140 ads 117/247 days 2011/07/21 06:29:00 2010/11/19 12:58:00
maha yoga mat 69 91 $1.23 2/107 ads 1/107 ads 130/247 days 2011/07/23 06:16:00 2010/11/19 10:33:00
silk yoga mat bag 59 91 $1.04 4/192 ads 2/192 ads 120/247 days 2011/07/23 06:20:00 2010/11/19 01:12:00
unique yoga mat 170 110 $1.44 4/138 ads 2/138 ads 346/708 days 2011/07/23 09:10:00 2009/08/15 10:35:00
tapas ultra yoga mat 16 110 $0.76 4/92 ads 1/92 ads 71/307 days 2011/07/23 09:46:00 2010/09/20 08:14:00
yoga mat kit 41 140 $1.06 4/197 ads 1/197 ads 115/322 days 2011/07/23 10:00:00 2010/09/05 07:36:00
monster yoga mat 27 140 $0.70 1/89 ads 1/89 ads 128/307 days 2011/07/22 09:31:00 2010/09/20 08:03:00
organic cotton yoga mat 56 140 $1.23 8/261 ads 2/261 ads 130/305 days 2011/07/23 07:12:00 2010/09/22 12:16:00
designer yoga mat 67 170 $1.28 6/206 ads 2/206 ads 128/247 days 2011/07/23 02:55:00
Also, the data is obtained from a paid marketing data site, so there is no direct URL. A window opens up with a JavaScript download link for the CSV file. I am downloading manually and am not sure how you would use LWP::Simple to download that data.

Replies are listed 'Best First'.
Re: difficulty reading csv file
by Albannach (Monsignor) on Aug 08, 2011 at 02:54 UTC
    It would help your readers to know what your original CSV looks like before you process it, and what "looks fine" really means. That said, it certainly appears to be HTML entity codes for CJK unified ideographs, so I presume "looks fine" probably means "is made up of mixed English and Chinese text". Whatever you use to display the contents of this file must be able to decode this HTML, so many basic text editors will probably get it wrong. A browser renders it, though. Here is what I get when I copy a chunk from your example without code tags:

    礀漀最愀 洀愀琀 crescent moon yoga mat

    At this point I don't know where you want to go with this. If you only want the English, it is relatively simple to strip out the HTML entity codes, but if you still want it to be bilingual you will need to keep them and display only through a device that understands them. Fortunately for me I'm blissfully unaware of the wonders of Unicode, but one place to start may be HTML::Entities. I can't offer more than that, but I'm sure many others here can if you can be more specific about your needs and intentions.
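A quick way to inspect those entity codes is to decode them and look at the resulting code points. The sketch below uses only core Perl and handles decimal entities only (decode_entities from HTML::Entities is the more complete tool). It also tests one hypothesis that is an assumption here, not something established in this thread: that each character is a 16-bit value read in the wrong byte order. The helper names swap_bytes and decode_numeric_entities are made up for this illustration.

```perl
use strict;
use warnings;

# Decode decimal numeric entities such as &#30976; into characters.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr $1/ge;
    return $text;
}

# For a code point above U+00FF whose low byte is zero, swapping its two
# bytes leaves just the high byte. Other characters are passed through.
sub swap_bytes {
    my ($text) = @_;
    return join '', map {
        my $cp = ord;
        ($cp > 0xFF && ($cp & 0xFF) == 0) ? chr ($cp >> 8) : $_;
    } split //, $text;
}

# One entity from the garbled output, followed by U+6F00, U+6700 and
# U+6100 written as escapes so this script needs no "use utf8".
my $garbled = "&#30976;\x{6F00}\x{6700}\x{6100}";
my $fixed   = swap_bytes (decode_numeric_entities ($garbled));
print "$fixed\n";
```

If the result reads as plain English, the file is more likely in a 16-bit encoding such as UTF-16 than bilingual text, and decoding it on input would be a better fix than stripping entities.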

    Hope this helps!

    --
    I'd like to be able to assign to an luser

Re: difficulty reading csv file
by Tux (Canon) on Aug 08, 2011 at 06:29 UTC

    Then you could also use perl to download/fetch the file using LWP::Simple. Then use Encode to decode the content. After that, use Text::CSV_XS (or Text::CSV) to parse the decoded data.

    # untested, just an example
    use strict;
    use warnings;
    use autodie;
    use LWP::Simple;
    use Text::CSV_XS;

    my $data = get ("http://some.web.site/location/data.csv");
    defined $data or die "Download failed\n";

    # Use ONE of the two opens below
    open my $fh, "<", \$data;                     # if the site correctly encoded the data
    # open my $fh, "<:encoding(utf-8)", \$data;   # if you have to decode yourself

    my $csv = Text::CSV_XS->new ({ auto_diag => 1, binary => 1 });
    while (my $row = $csv->getline ($fh)) {
        # do something with @$row
        }

    Again, a link to the original data would be a nice pointer to people here that want to help you out.
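For a file that has already been downloaded by hand, the same decoding idea applies when opening it from disk. The sketch below is only an illustration under two assumptions that this thread does not confirm: that the file is UTF-16 with a byte order mark, and that the fields are separated by tabs. It writes a small stand-in file first so that it runs on its own; the two sample rows are shortened to their first four fields.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a stand-in for the downloaded file: UTF-16LE with a BOM and
# tab-separated fields (both are assumptions about the real file).
my ($out, $path) = tempfile (UNLINK => 1);
binmode $out, ":raw:encoding(UTF-16LE)";
print {$out} "\x{FEFF}utopian yoga mat\t12\t58\t\$0.95\n";
print {$out} "maha yoga mat\t69\t91\t\$1.23\n";
close $out or die "close: $!";

# ":encoding(UTF-16)" reads the BOM to pick the byte order and drops it.
open my $fh, "<:raw:encoding(UTF-16)", $path or die "$path: $!";
my @rows;
while (my $line = <$fh>) {
    chomp $line;
    push @rows, [ split /\t/, $line ];
}
close $fh;

print join ("|", @$_), "\n" for @rows;
```

With real data, the split on tabs could be replaced by Text::CSV_XS with sep_char => "\t", which also copes with quoted fields. If the file has no BOM, an explicit layer such as :encoding(UTF-16LE) would be needed instead.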


    Enjoy, Have FUN! H.Merijn
      The original CSV looks like (it is all English):

      utopian yoga mat 12 58 $0.95 2/98 ads 2/98 ads 105/307 days 2011/07/21 08:01:00 2010/09/20 10:28:00
      crescent moon yoga mat 18 58 $0.56 3/140 ads 1/140 ads 117/247 days 2011/07/21 06:29:00 2010/11/19 12:58:00
      maha yoga mat 69 91 $1.23 2/107 ads 1/107 ads 130/247 days 2011/07/23 06:16:00 2010/11/19 10:33:00
      silk yoga mat bag 59 91 $1.04 4/192 ads 2/192 ads 120/247 days 2011/07/23 06:20:00 2010/11/19 01:12:00
      unique yoga mat 170 110 $1.44 4/138 ads 2/138 ads 346/708 days 2011/07/23 09:10:00 2009/08/15 10:35:00
      tapas ultra yoga mat 16 110 $0.76 4/92 ads 1/92 ads 71/307 days 2011/07/23 09:46:00 2010/09/20 08:14:00
      yoga mat kit 41 140 $1.06 4/197 ads 1/197 ads 115/322 days 2011/07/23 10:00:00 2010/09/05 07:36:00
      monster yoga mat 27 140 $0.70 1/89 ads 1/89 ads 128/307 days 2011/07/22 09:31:00 2010/09/20 08:03:00
      organic cotton yoga mat 56 140 $1.23 8/261 ads 2/261 ads 130/305 days 2011/07/23 07:12:00 2010/09/22 12:16:00
      designer yoga mat 67 170 $1.28 6/206 ads 2/206 ads 128/247 days 2011/07/23 02:55:00

      Also, the data is obtained from a paid marketing data site, so there is no direct URL. A window opens up with a JavaScript download link for the CSV file. I am downloading manually and am not sure how you would use LWP::Simple to download that data.

        Doesn't look like CSV to me at all. Maybe the download site has options to specify the format (Excel/CSV/HTML/XML/txt/...).

        And when posting data, it would be wise to use <code> tags.


        Enjoy, Have FUN! H.Merijn
