PerlMonks
Parsing csv without changing dimension of original file

by zillur (Novice)
on Mar 06, 2017 at 17:14 UTC ( id://1183774 )

zillur has asked for the wisdom of the Perl Monks concerning the following question:

Hi Shifus, I was trying to parse a CSV and replace cell contents according to another text file. I was using this:

# This script was excerpted from
# http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
use strict;
use warnings;
use Text::CSV;

open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
my $re = join '|', keys %dict;

open $fh, '<', 'Orthogroups_3.csv' or die $!;
while (<$fh>) {
    next if $. < 2;
    s/($re)/$dict{$1}/g;
    print;
}

It gives me nearly the expected output, but it changed the dimensions of the original file. Now I can't load the parsed CSV in R to do other stuff. Is there any way to replace cell contents of a CSV without changing its dimensions?

> grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
> dim(grpsTbl)
[1] 5791 13
> grpsTbl <- read.csv("parsed_with_perl.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  more columns than column names

My Orthogroups_3.csv contains multiple columns, and each cell may contain zero, one, or many values (names). I want the exact same CSV, just with the values (names) in each cell replaced according to the 1st file. Best Regards Zillur

Replies are listed 'Best First'.
Re: Parsing csv without changing dimension of original file
by Eily (Monsignor) on Mar 06, 2017 at 17:40 UTC

    I don't understand what you mean by "changing dimension", but I do see one problem with your code. $. is the line count on a given filehandle, reset each time you close the handle. Except you never close $fh, so your line count is not 1 for the first line of the second file (but 1 + the number of lines in the first file).
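    This behaviour is easy to see in isolation. Per perlvar, $. is reset when a filehandle is closed, but not when an open handle is reopened without an intervening close. A small self-contained sketch (with throwaway demo files, not the OP's data):

```perl
use strict;
use warnings;

# Create two throwaway demo files (stand-ins for the OP's two inputs).
open my $out, '>', 'demo_a.txt' or die $!;
print $out "a1\na2\na3\n";
close $out;
open $out, '>', 'demo_b.txt' or die $!;
print $out "b1\n";
close $out;

open my $fh, '<', 'demo_a.txt' or die $!;
1 while <$fh>;                            # read all 3 lines; $. is now 3
open $fh, '<', 'demo_b.txt' or die $!;    # reopen WITHOUT close: $. keeps counting
my $line = <$fh>;
my $n = $.;                               # 4, not 1
print "first line of second file is line $n\n";
close $fh;
```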

    You should probably use distinct names for the two handles (eg: $dict_fh and $csv_fh).
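    Putting both suggestions together, a minimal sketch of the corrected flow might look like this (the demo file names and contents below are made up; the OP's real kegg_pathway_title.txt and Orthogroups_3.csv would take their place):

```perl
use strict;
use warnings;

# Made-up miniature inputs so the sketch is self-contained.
open my $out, '>', 'dict_demo.txt' or die $!;
print $out "AAA111 first pathway\nBBB222 second pathway\n";
close $out;
open $out, '>', 'csv_demo.txt' or die $!;
print $out "col1\tcol2\nOG1\tAAA111\nOG2\tBBB222\n";
close $out;

open my $dict_fh, '<', 'dict_demo.txt' or die $!;
my %dict = map { chomp; split ' ', $_, 2 } <$dict_fh>;
close $dict_fh;                      # closing resets $.

my $re = join '|', map quotemeta, keys %dict;

my @lines;
open my $csv_fh, '<', 'csv_demo.txt' or die $!;
while (<$csv_fh>) {
    next if $. < 2;                  # $. really is 1 on the header line now
    s/($re)/$dict{$1}/g;
    push @lines, $_;
}
close $csv_fh;
print @lines;
```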

    Another way (not excluding the previous one) to correct your code is to properly close the handles when done using them. This can be done implicitly by limiting the life of the handle to a block, so that it is closed automatically on destruction. Using do, this can be done like this:

    my %dict = do {
        open my $dict_fh, "<", $path or die "Couldn't open $path: $!";
        map { chomp; split ' ', $_, 2 } <$dict_fh>;
    }; # $dict_fh doesn't exist here, so it is closed

      I totally agree.

      But the only mention of $. in the OP code is the line

      next if $. < 2;

      which suggests that the real problem lies not in the line count, but elsewhere.

      Also, what purpose does this line serve?

      use Text::CSV;

      Edit: The OP seems to have been silently updated while I was writing this. But I still don't understand the problem.

        Thank you very much for your reply. Here is the original code:

        use strict;
        use warnings;

        open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
        my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
        my $re = join '|', keys %dict;

        open $fh, '<', 'Orthogroups_3.csv' or die $!;
        while (<$fh>) {
            s/($re)/$dict{$1}/g;
            print;
        }

        I added the line

        next if $. < 2;

        because it was giving me more columns than the original CSV. My original CSV contains 13 columns. After replacing the content using the dictionary, it produces many more columns and can't be loaded in R. You saw my previous code for loading the file in R. Is there anything I need to do to make it loadable in R? The 2nd column of the dict may contain multiple names. In the original CSV the cells also contain single, multiple, or no entries. Thanks again for helping me. Best Regards Zillur

Re: Parsing csv without changing dimension of original file
by NetWallah (Canon) on Mar 06, 2017 at 20:02 UTC
    It looks like you added the line:
    next if $. < 2;
    to the stackoverflow script.

    Without access to the data - I'm guessing that this suppresses output of the heading line, and might cause the problem you have seen.

    You can avoid speculation about your intent, and get better responses if you show a sample of the original data, and the output you expect.

            ...it is unhealthy to remain near things that are in the process of blowing up.     man page for WARP, by Larry Wall

      Because there is no close $fh; between the two opens, next if $. < 2; is meaningless if there are any lines in kegg_pathway_title.txt.

      Thank you very much for your reply. Here is the sample of my original data "kegg_pathway_title.txt":

      PVX_088085	Protein processing in endoplasmic reticulum
      PVX_114095	Protein processing in endoplasmic reticulum
      PVX_123055	Ribosome biogenesis in eukaryotes
      PYYM_1032000	-
      PYYM_1120600	-
      PCYB_031930	Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism

      The Orthogroups_3.csv has 13 columns

      Cparvum Bmicroti Tparva Pberghei Pchabaudi Pcynomolgi Pfalciparum Pknowlesi Preichenowi Pvivax Pyoelii Pmalariae Tgondii
      OG0000000 PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,
      OG0000001 PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,

      Expected output:

      Cparvum Bmicroti Tparva Pberghei Pchabaudi Pcynomolgi Pfalciparum Pknowlesi Preichenowi Pvivax Pyoelii Pmalariae Tgondii
      OG0000000 - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , -
      OG0000024 - , - , - , -  - , - , -  - , - , -  - , - , - , -  - , - ,  Protein processing in endoplasmic reticulum , - , - , - , -  - , - , -  - , - , -  - , - , -  - , - , - , -  - , - , -  - , - , - , -  - , - , - , - , - , - , - , - , - , - , - , - , - , - , -
      OG0000025 - , - , -  - , - , - , -  - , - , - , -  - , - , - , -  - , - , - , -  Protein processing in endoplasmic reticulum , Protein processing in endoplasmic reticulum , - , Ribosome biogenesis in eukaryotes  - , - , - , -  - , - , - , -  - , - , - , -  - , Protein processing in endoplasmic reticulum , Protein processing in endoplasmic reticulum , Ribosome biogenesis in eukaryotes  - , - , - , -  - , - , - , -  - , - , - , -  - , - , - , -
      OG0000026

      I want the column number (13) in Orthogroups_3.csv and in the parsed results to be the same. Best regards Zillur

        Well, there are many, many things wrong here.

        In your original example page http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary you missed the part where it says "I have a dictionary (dict.txt). It is space separated and it reads like this:". Your kegg_pathway_title.txt instead has a tab after the replace-from field. In a way that is easy to fix; change the line to

        my %dict = map { chomp; split "\t", $_, 2 } <$fh>;

        Next

        > grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
        implies that the fields are separated by a tab (\t). Yes, your column headers are separated by a tab, and there are tabs in your other rows, but row OG0000000 only has 5 tab-separated fields, three of them being blank due to consecutive tabs, and
        "PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,"
        being considered as one field, number 5. There is, however, at least a tab after the OG0000000.

        The next line, OG0000001, does have 13 tab-delimited fields: OG0000001 has a tab after it to put it in its own column, followed by 11 blank fields due to consecutive tabs, and

        "PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,"
        being considered the contents of the 13th column
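        This field-counting can be checked mechanically. A quick sketch (with a shortened, made-up row) shows how consecutive tabs yield blank fields, and also why a negative limit matters when the blanks are trailing:

```perl
use strict;
use warnings;

# A shortened stand-in for the OG0000000 row: four tabs, so five fields,
# three of them blank because the tabs are consecutive.
my $row    = "OG0000000\t\t\t\tPBANKA_0000600, PBANKA_0000701,";
my @fields = split /\t/, $row;
printf "%d fields\n", scalar @fields;

# Beware: split() drops *trailing* empty fields unless the limit is negative.
my @kept = split /\t/, "OG9\tx\t\t", -1;    # 4 fields, last two empty
```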

        Given the following as your dictionary

        PVX_088085	Protein processing in endoplasmic reticulum
        PVX_114095	Protein processing in endoplasmic reticulum
        PVX_123055	Ribosome biogenesis in eukaryotes
        PYYM_1032000	-
        PYYM_1120600	-
        PCYB_031930	Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism
        one notices that none of the replace-from fields in it even occur in your sample Orthogroups_3.csv file at all, so your expected output is a myth.

        And besides the tab after the replace-from in your dictionary file, adding this code identifies a major problem:

        for my $k (keys %dict) {
            my $v = $dict{$k};
            warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v =~ "\t");
        }
        result
        for lookup:PVX_114095 tab in field:Protein processing in endoplasmic reticulum
        for lookup:PYYM_1032000 tab in field:-
        for lookup:PVX_088085 tab in field:Protein processing in endoplasmic reticulum
        Those new tabs introduce "extra columns" to the output.
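        A made-up miniature of the effect: substituting in a dictionary value that itself contains a tab splits one cell into two, which is exactly where the "extra columns" come from.

```perl
use strict;
use warnings;

# Hypothetical one-entry dictionary whose value contains an embedded tab.
my %dict = ( PVX_088085 => "Protein processing in\tendoplasmic reticulum" );
my $row  = "OG0000025\tPVX_088085\tother";
(my $out = $row) =~ s/(PVX_088085)/$dict{$1}/g;

my @before = split /\t/, $row, -1;   # 3 fields
my @after  = split /\t/, $out, -1;   # 4 fields: one "extra column"
printf "before: %d fields, after: %d fields\n", scalar @before, scalar @after;
```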

        The code that identifies all these problems is

        # This script was excerpted from
        # http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
        use strict;
        use warnings;
        #use Text::CSV;
        use Data::Dumper;
        local $Data::Dumper::Deepcopy = 1;
        local $Data::Dumper::Purity   = 1;
        local $Data::Dumper::Sortkeys = 0;
        local $Data::Dumper::Indent   = 3;

        open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
        my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
        warn Dumper \%dict;
        for my $k (keys %dict) {
            my $v = $dict{$k};
            warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v =~ "\t");
        }
        #my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
        my $re = join '|', keys %dict;
        #close $fh;
        open $fh, '<', 'Orthogroups_3.csv' or die $!;
        while (<$fh>) {
            print $. ."\n";
            next if $. < 2;
            my @a0 = split("\t", $_);
            warn Dumper \@a0;
            s/($re)/$dict{$1}/g;
            print;
        }

        All this leads me to think you don't have much of a clue as to what you are doing, and are just trying cookie-cutter examples found on the web. This is a bad thing to do.

        Edit: put code tags around the huge fields, but I'm not sure it's any better.

Re: Parsing csv without changing dimension of original file
by huck (Prior) on Mar 06, 2017 at 20:27 UTC

    I would love to see the output when you run this program first.

    use strict;
    use warnings;

    open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
    while (<$fh>) { print $., ' ', $_ if /\t/; }

      Thank you very much for your comment. This code gives me the original file with line numbers on the left.

Re: Parsing csv without changing dimension of original file
by GotToBTru (Prior) on Mar 06, 2017 at 21:43 UTC

    The only way I could imagine this messing up your columns would be if the data in kegg_pathway_title.txt contains commas.

    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

      tabs ... sep = "\t"

Node Type: perlquestion [id://1183774]
Front-paged by Arunbear