PerlMonks
Parsing csv without changing dimension of original file

by zillur (Novice)
on Mar 06, 2017 at 17:14 UTC ( id://1183774 )

zillur has asked for the wisdom of the Perl Monks concerning the following question:

Hi Shifus, I was trying to parse a CSV and replace cell contents according to another text file. I was using this:

# This script was excerpted from
# http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
use strict;
use warnings;
use Text::CSV;

open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
my $re = join '|', keys %dict;

open $fh, '<', 'Orthogroups_3.csv' or die $!;
while (<$fh>) {
    next if $. < 2;
    s/($re)/$dict{$1}/g;
    print;
}

It gives me nearly the expected output, but it changed the dimensions of the original file. Now I can't load the parsed CSV in R to do other stuff. Is there any way to replace cell contents of a CSV without changing its dimensions?

> grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
> dim(grpsTbl)
[1] 5791 13
> grpsTbl <- read.csv("parsed_with_perl.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  more columns than column names

My Orthogroups_3.csv contains multiple columns, and each cell may contain zero, one, or many values (names). I want the exact same CSV, just with the values (names) in each cell replaced according to the 1st file. Best Regards Zillur

Replies are listed 'Best First'.
Re: Parsing csv without changing dimension of original file
by Eily (Monsignor) on Mar 06, 2017 at 17:40 UTC

    I don't understand what you mean by "changing dimension", but I do see one problem with your code. $. is the line count on a given filehandle, reset each time you close the handle. Except you never close $fh, so your line count is not 1 for the first line of the second file (but 1 + the number of lines in the first file).
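    This behaviour is easy to see in isolation. Per perlvar, $. is reset when a filehandle is closed, but not when an open handle is reopened without an intervening close. A small self-contained sketch (with throwaway demo files, not the OP's data):

```perl
use strict;
use warnings;

# Create two throwaway demo files (stand-ins for the OP's two inputs).
open my $out, '>', 'demo_a.txt' or die $!;
print $out "a1\na2\na3\n";
close $out;
open $out, '>', 'demo_b.txt' or die $!;
print $out "b1\n";
close $out;

open my $fh, '<', 'demo_a.txt' or die $!;
1 while <$fh>;                            # read all 3 lines; $. is now 3
open $fh, '<', 'demo_b.txt' or die $!;    # reopen WITHOUT close: $. keeps counting
my $line = <$fh>;
my $n = $.;                               # 4, not 1
print "first line of second file is line $n\n";
close $fh;
```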

    You should probably use distinct names for the two handles (eg: $dict_fh and $csv_fh).
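    Putting both suggestions together, a minimal sketch of the corrected flow might look like this (the demo file names and contents below are made up; the OP's real kegg_pathway_title.txt and Orthogroups_3.csv would take their place):

```perl
use strict;
use warnings;

# Made-up miniature inputs so the sketch is self-contained.
open my $out, '>', 'dict_demo.txt' or die $!;
print $out "AAA111 first pathway\nBBB222 second pathway\n";
close $out;
open $out, '>', 'csv_demo.txt' or die $!;
print $out "col1\tcol2\nOG1\tAAA111\nOG2\tBBB222\n";
close $out;

open my $dict_fh, '<', 'dict_demo.txt' or die $!;
my %dict = map { chomp; split ' ', $_, 2 } <$dict_fh>;
close $dict_fh;                      # closing resets $.

my $re = join '|', map quotemeta, keys %dict;

my @lines;
open my $csv_fh, '<', 'csv_demo.txt' or die $!;
while (<$csv_fh>) {
    next if $. < 2;                  # $. really is 1 on the header line now
    s/($re)/$dict{$1}/g;
    push @lines, $_;
}
close $csv_fh;
print @lines;
```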

    Another way (not excluding the previous one) to correct your code is to properly close the handles when done using them. This can be done implicitly by limiting the life of the handle to a block, so that it is closed automatically on destruction. Using do, this can be done like this:

    my %dict = do {
        open my $dict_fh, "<", $path or die "Couldn't open $path: $!";
        map { chomp; split ' ', $_, 2 } <$dict_fh>;
    }; # $dict_fh doesn't exist here, so it is closed

      I totally agree.

      But the only mention of $. in the OP code is the line

      next if $. < 2;

      which suggests that the real problem lies not in the line count, but elsewhere.

      Also, what purpose does this line serve?

      use Text::CSV;

      Edit: The OP seems to have been silently updated while I was writing this. But I still don't understand the problem.

        Thank you very much for your reply. Here is the original code:

        use strict;
        use warnings;

        open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
        my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
        my $re = join '|', keys %dict;

        open $fh, '<', 'Orthogroups_3.csv' or die $!;
        while (<$fh>) {
            s/($re)/$dict{$1}/g;
            print;
        }

        I added the line

        next if $. < 2;

        because it was giving me more columns than the original CSV. My original CSV contains 13 columns. After replacing the content using the dictionary, it produces many more columns and can't be loaded in R. You saw my previous code for loading the file in R. Is there anything I need to do to make it loadable in R? The 2nd column of the dict may contain multiple names. In the original CSV the cells also contain single, multiple, or no entries. Thanks again for helping me. Best Regards Zillur

Re: Parsing csv without changing dimension of original file
by NetWallah (Canon) on Mar 06, 2017 at 20:02 UTC
    It looks like you added the line:
    next if $. < 2;
    to the stackoverflow script.

    Without access to the data - I'm guessing that this suppresses output of the heading line, and might cause the problem you have seen.

    You can avoid speculation about your intent, and get better responses if you show a sample of the original data, and the output you expect.

            ...it is unhealthy to remain near things that are in the process of blowing up.     man page for WARP, by Larry Wall

      Because there is no close $fh; between the two opens, next if $. < 2; is meaningless if there are any lines in kegg_pathway_title.txt.

      Thank you very much for your reply. Here is the sample of my original data "kegg_pathway_title.txt":

      PVX_088085	Protein processing in endoplasmic reticulum
      PVX_114095	Protein processing in endoplasmic reticulum
      PVX_123055	Ribosome biogenesis in eukaryotes
      PYYM_1032000	-
      PYYM_1120600	-
      PCYB_031930	Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism

      The Orthogroups_3.csv has 13 columns

      Cparvum Bmicroti Tparva Pberghei Pchabaudi Pcynomolgi Pfalciparum Pknowlesi Preichenowi Pvivax Pyoelii Pmalariae Tgondii
      OG0000000 PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,
      OG0000001 PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,

      Expected output:

      Cparvum Bmicroti Tparva Pberghei Pchabaudi Pcynomolgi Pfalciparum Pknowlesi Preichenowi Pvivax Pyoelii Pmalariae Tgondii
      OG0000000 - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , - , -
      OG0000024 - , - , - , -  - , - , -  - , - , -  - , - , - , -  - , - ,  Protein processing in endoplasmic reticulum , - , - , - , -  - , - , -  - , - , -  - , - , -  - , - , - , -  - , - , -  - , - , - , -  - , - , - , - , - , - , - , - , - , - , - , - , - , - , -
      OG0000025 - , - , -  - , - , - , -  - , - , - , -  - , - , - , -  - , - , - , -  Protein processing in endoplasmic reticulum , Protein processing in endoplasmic reticulum , - , Ribosome biogenesis in eukaryotes  - , - , - , -  - , - , - , -  - , - , - , -  - , Protein processing in endoplasmic reticulum , Protein processing in endoplasmic reticulum , Ribosome biogenesis in eukaryotes  - , - , - , -  - , - , - , -  - , - , - , -  - , - , - , -
      OG0000026

      I want the column number (13) in Orthogroups_3.csv and in the parsed results to be the same. Best regards Zillur

        Well, there are many, many things wrong here.

        In your original example page http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary you missed the part where it says "I have a dictionary (dict.txt). It is space separated and it reads like this:". Your kegg_pathway_title.txt instead has a tab after the replace-from field. In a way that is easy to fix; change the line to

        my %dict = map { chomp; split "\t", $_, 2 } <$fh>;

        Next

        > grpsTbl <- read.csv("Orthogroups_3.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
        implies that the fields are separated by a tab (\t). Yes, your column headers are separated by a tab, and there are tabs in your other rows, but row OG0000000 only has 5 tab-separated fields, three of them being blank due to consecutive tabs, and
        "PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701,"
        being considered as one field, number 5. There is, however, at least a tab after the OG0000000.

        The next line, OG0000001, does have 13 tab-delimited fields: OG0000001 has a tab after it to put it in its own column, followed by 11 blank fields due to consecutive tabs, and

        "PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,"
        being considered the contents of the 13th column
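        This field-counting can be checked mechanically. A quick sketch (with a shortened, made-up row) shows how consecutive tabs yield blank fields, and also why a negative limit matters when the blanks are trailing:

```perl
use strict;
use warnings;

# A shortened stand-in for the OG0000000 row: four tabs, so five fields,
# three of them blank because the tabs are consecutive.
my $row    = "OG0000000\t\t\t\tPBANKA_0000600, PBANKA_0000701,";
my @fields = split /\t/, $row;
printf "%d fields\n", scalar @fields;

# Beware: split() drops *trailing* empty fields unless the limit is negative.
my @kept = split /\t/, "OG9\tx\t\t", -1;    # 4 fields, last two empty
```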

        Given the following as your dictionary

        PVX_088085	Protein processing in endoplasmic reticulum
        PVX_114095	Protein processing in endoplasmic reticulum
        PVX_123055	Ribosome biogenesis in eukaryotes
        PYYM_1032000	-
        PYYM_1120600	-
        PCYB_031930	Purine metabolism; Metabolic pathways; DNA replication; Pyrimidine metabolism
        one notices that none of the replace-from fields in it even occur in your sample Orthogroups_3.csv file at all, so your expected output is a myth.

        And besides the tab after the replace-from in your dictionary file, adding this code identifies a major problem:

        for my $k (keys %dict) {
            my $v = $dict{$k};
            warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v =~ "\t");
        }
        result
        for lookup:PVX_114095 tab in field:Protein processing in endoplasmic reticulum
        for lookup:PYYM_1032000 tab in field:-
        for lookup:PVX_088085 tab in field:Protein processing in endoplasmic reticulum
        Those new tabs introduce "extra columns" to the output.
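        A made-up miniature of the effect: substituting in a dictionary value that itself contains a tab splits one cell into two, which is exactly where the "extra columns" come from.

```perl
use strict;
use warnings;

# Hypothetical one-entry dictionary whose value contains an embedded tab.
my %dict = ( PVX_088085 => "Protein processing in\tendoplasmic reticulum" );
my $row  = "OG0000025\tPVX_088085\tother";
(my $out = $row) =~ s/(PVX_088085)/$dict{$1}/g;

my @before = split /\t/, $row, -1;   # 3 fields
my @after  = split /\t/, $out, -1;   # 4 fields: one "extra column"
printf "before: %d fields, after: %d fields\n", scalar @before, scalar @after;
```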

        The code that identifies all these problems is

        # This script was excerpted from
        # http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
        use strict;
        use warnings;
        #use Text::CSV;
        use Data::Dumper;
        local $Data::Dumper::Deepcopy = 1;
        local $Data::Dumper::Purity   = 1;
        local $Data::Dumper::Sortkeys = 0;
        local $Data::Dumper::Indent   = 3;

        open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
        my %dict = map { chomp; split "\t", $_, 2 } <$fh>;
        warn Dumper \%dict;
        for my $k (keys %dict) {
            my $v = $dict{$k};
            warn 'for lookup:'.$k.' tab in field:'.$v."\n" if ($v =~ "\t");
        }
        #my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
        my $re = join '|', keys %dict;
        #close $fh;
        open $fh, '<', 'Orthogroups_3.csv' or die $!;
        while (<$fh>) {
            print $. ."\n";
            next if $. < 2;
            my @a0 = split("\t", $_);
            warn Dumper \@a0;
            s/($re)/$dict{$1}/g;
            print;
        }

        All this leads me to think you don't have much of a clue as to what you are doing, and are just trying cookie-cutter examples found on the web. This is a bad thing to do.

        Edit: put code tags around the huge fields, but I'm not sure it's any better.

Re: Parsing csv without changing dimension of original file
by huck (Prior) on Mar 06, 2017 at 20:27 UTC

    I would love to see the output when you run this program first.

    use strict;
    use warnings;

    open my $fh, '<', 'kegg_pathway_title.txt' or die $!;
    while (<$fh>) { print $., ' ', $_ if /\t/; }

      Thank you very much for your comment. This code gives me the original file with line numbers on the left.

Re: Parsing csv without changing dimension of original file
by GotToBTru (Prior) on Mar 06, 2017 at 21:43 UTC

    The only way I could imagine this messing up your columns would be if the data in kegg_pathway_title.txt contains commas.

    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

      tabs ... sep = "\t"

Node Type: perlquestion [id://1183774]
Front-paged by Arunbear