http://qs321.pair.com?node_id=1079751

xxArwaxx has asked for the wisdom of the Perl Monks concerning the following question:

Hello, folks;

I'm working on a project in which I'm trying to understand more about the genotyping data that was given to us (.gen files). I wrote a Perl script to extract the lines I'm interested in from the genotyping data and saved them in a comma-delimited txt file (CSV format). The lines in the .txt file have the format:

physical_location, Allele 1, Allele2

34787638,A,C

34787800,A,G

the big question is:

do the locations given in the .gen files actually correspond to the alleles (1, 2) in my txt file? To check that, I need to look up each exact physical_location (one per line) in a genome database (NCBI, Ensembl, etc.) and retrieve the alleles (1, 2) found at that exact location on a specific chromosome.

Doing that manually in a genome browser is time-consuming, and I believe it is a common task, so there should be a BioPerl module that retrieves the alleles at a specific location on a given chromosome.

Any ideas whether there is a BioPerl module that can do that, or how to approach this problem?

EDIT(more info):

The genotyping data was given to us as .gen files, where each file holds the genotype info for ONE chromosome, so I have a total of 22 .gen files (sizes from 700 MB up to 4 GB). Each line in a .gen file is in the format:

(Chromosome, MarkerID, Ph_location, Allele1, Allele2, .......some other irrelevant info)

Then there is another small .txt file that has the 'list of genes' that I'm interested in (50 genes), where each line is a gene in the following format: (GeneName, Chromosome, StartPosition, EndPosition).

The Perl script I wrote extracts the lines I'm interested in from a .gen file, namely the lines whose Ph_location falls in the range from StartPosition to EndPosition of a gene in the 'list of genes' .txt file. I'm currently experimenting with the .gen file for chr1, which happens to cover 10 of the genes. The script saves those lines in a comma-delimited txt file (CSV format) in the format I mentioned at the very beginning of this post, before the edit (location, allele1, allele2).
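For reference, the extraction step described above can be sketched roughly as follows. This is a minimal illustration, not my actual script: the helper names `gene_ranges_for` and `extract_in_ranges`, the gene names and the sample lines are made up, and it assumes the fields are separated by whitespace or commas and that the StartPosition..EndPosition range is inclusive.

```perl
use strict;
use warnings;

# Collect [start, end] ranges for the genes on one chromosome.
# Assumed columns: GeneName, Chromosome, StartPosition, EndPosition.
sub gene_ranges_for {
    my ($chr, @gene_lines) = @_;
    my @ranges;
    for my $line (@gene_lines) {
        my ($name, $c, $start, $end) = split /[\s,]+/, $line;
        next unless defined $end && $c eq $chr;
        push @ranges, [ $start, $end ];
    }
    return @ranges;
}

# Return "location,allele1,allele2" for each .gen line inside any range.
# Assumed columns: Chromosome, MarkerID, Ph_location, Allele1, Allele2, ...
sub extract_in_ranges {
    my ($ranges, @gen_lines) = @_;
    my @out;
    for my $line (@gen_lines) {
        my (undef, undef, $pos, $a1, $a2) = split /[\s,]+/, $line, 6;
        next unless defined $a2 && $pos =~ /^\d+$/;
        # Inclusive range check (an assumption about the gene list).
        next unless grep { $pos >= $_->[0] && $pos <= $_->[1] } @$ranges;
        push @out, join ',', $pos, $a1, $a2;
    }
    return @out;
}

# Made-up sample data, only to show the shape of the input and output.
my @ranges = gene_ranges_for('1',
    'GENE_A 1 34787000 34790000',
    'GENE_B 2 100 200',
);
my @kept = extract_in_ranges(\@ranges,
    '1 marker_x 34787638 A C 1 0 0',
    '1 marker_y 40000000 A G 0 1 0',
);
print "$_\n" for @kept;
```

For the real multi-GB .gen files the same check would be applied line by line while reading from a filehandle, rather than on an in-memory list.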

Now I have a .txt file with the lines I'm interested in, again in location, allele1, allele2 format. What I'm looking for is a BioPerl module that can retrieve, from any genome database (NCBI, Ensembl, etc.), the alleles (1, 2) that correspond to each location in that .txt file and report them in another text file along with their location. I would then end up with two .txt files: one with the lines extracted from the genotype data, which supplies the locations to search for, and another with each location and the corresponding alleles found by the search. Our goal is to validate the alleles given to us by retrieving them from NCBI or wherever using their locations, so that we can eventually work out a different approach to classifying the data moving forward.

I apologize for the lengthy post, I'm trying to make it more clear to get the best out of this discussion. I hope it is more clear now. Thanks in advance! :)

FINAL OUTPUT

Two .txt files. The first has the lines I'm interested in, in the format (physical_location, Allele1, Allele2). File 1 looks like this (these locations are on chromosome 1, in case anyone wants to run them against NCBI or any other genome database):

34787638,A,C

34788686,A,G

34789549,C,T

34789695,C,G

34789808,C,T

347890859,C,G

Then there is another .txt file, file2.txt, that has the same first column (the locations). I need the other two columns (Allele1, Allele2) for all the locations in this file, and these need to be retrieved from a genome database (NCBI, for example) by a BioPerl module that goes through each line in file1, extracts the first column, goes to NCBI and fetches the corresponding alleles. This module should accept two arguments: the location, and the chromosome number from which it should fetch the corresponding alleles. That's our goal, so we can double-check whether the alleles given in file1 actually fall at the given locations.
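Once file2 exists, the double-check itself could look something like the sketch below. It only covers the comparison, not the database lookup; the helper names `allele_map` and `compare_alleles` are hypothetical, the positions and alleles are toy values rather than real database results, and it assumes the order of the two alleles does not matter.

```perl
use strict;
use warnings;

# Build a position => "A/C" map from "location,allele1,allele2" lines.
# Alleles are upper-cased and sorted so that A,C and C,A compare equal.
sub allele_map {
    my %map;
    for my $line (@_) {
        (my $clean = $line) =~ s/^\s+|\s+$//g;    # trim newline and spaces
        my ($pos, @alleles) = split /\s*,\s*/, $clean;
        next unless @alleles == 2;
        $map{$pos} = join '/', sort map { uc } @alleles;
    }
    return \%map;
}

# Classify every position in the given data against the reference data.
sub compare_alleles {
    my ($given, $reference) = @_;
    my %status;
    for my $pos (keys %$given) {
        $status{$pos} =
              !exists $reference->{$pos}           ? 'missing'
            : $given->{$pos} eq $reference->{$pos} ? 'match'
            :                                        'mismatch';
    }
    return \%status;
}

# Toy values only, to show the three possible outcomes.
my $file1  = allele_map("100,A,C\n", "200,A,G\n", "300,C,T\n");
my $file2  = allele_map("100,C,A\n", "200,A,T\n");
my $status = compare_alleles($file1, $file2);
print "$_\t$status->{$_}\n" for sort { $a <=> $b } keys %$status;
```

One caveat I am aware of: a database may report the alleles on the opposite strand from the genotyping data (A/C versus T/G), so a 'mismatch' from a plain string comparison like this would still need a second look.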