Extract a spreadsheet row with a column

Singular · July 2, 2020, 2:56pm

I have a datafile containing genetic variants.
I also have a list of genetic variants which I want to extract from this file.

Here is the setup. I have a file with information about genetic variants. One column has the SNP identifier (rs1, etc.)

rs1  A B C
rs2  B C D 
rs3  C D E

…

Now I want to provide a list and retrieve the information for a subset of the SNPs. Is there a way to do this in LO?

rs2
rs3

{Hmm, this might be a very easy question. One could simply use MATCH.}

Singular · July 2, 2020, 4:39pm

Thank you Lupp! How did you turn them into boxes?

I am making progress!
The next problem I want to solve is how to count the number of effect genotypes.

I have some effect genotypes.

A
C
T
T

And I have the actual genotypes:

AC
GG
AC
TT

I want to count the number of effect genotypes for each of the actual genotypes.
For example, there are 1,0,0,and 2 effect genotypes in the above columns.

Lupp · July 2, 2020, 4:43pm

Quoting @Singular: “…this might be a very easy question.”
If so, you found a perfect way to make it look difficult.

Shall the “A B C” (and so on) be three entries in three colums or what?
In what way should the “subset” be presented.
Why didn’t you present the expected result above?
By what criteria should the selection be defined?

Are you aware of the fact that there are dabases well capable of reurning selections based on simple SQL commands?
How will you be able to take responsibility for the results if you use probably unsuitable means?

Lupp · July 2, 2020, 4:44pm

There is a tool preformatted text. You can also apply it by Ctrl+k to the current selection.
Depending on the situation it may put some lines into a kind of box or emphasize a part of a flowing text in red and fix width.
Otherwise:
Sorry. I don’t feel to understand the problem, and I wouldn’t like to be made responsible for wrong results in a scientific field.
Hopefully somebody familiar with the field can help you.

Singular · July 2, 2020, 4:52pm

Thank you Lupp! I was unaware of SQL commands.
I will investigate that approach further.

First though I want to complete the effect versus actual genotype idea.
Is there an LO command that can count the number of occurances of a text string in text?
E.G., number of times G occurs in GG or AC?

Singular · July 2, 2020, 5:03pm

Very asesome!
LO appears to have the functionality that I need.

=LEN(A1)-LEN(SUBSTITUTE(A1,B1,"")) does the trick.
A1 is the actual genotype.
B1 is the effect genotype.

This is great!
I had tried to do this with 10 columns one for each of the possible genotypes.
Using Len, and Substitute allows me to do this in one line.

Singular · July 2, 2020, 5:35pm

Here is the LO code needed to construct a polygenic score given a genotype file:

Open your genotype file, at the bottom of the file paste in the rs identifiers from a GWAS study, . create a new column besides the rs identifers and adapt the code below to conform to your data.
=INDEX($A$1:$E$1,000,000,MATCH(A1,000,011,$A$1:$A$1000,000,0),3)
Paste the effect alleles and betas for all the rs identifiers (usually given with the GWAS report) as new columns.
Create a new column which determines the number of effect allele that occur in each genotype.
=LEN(BB1,000,011)-LEN(SUBSTITUTE(BB1,000,011,CC1,000,011,""))
Create a new column which determines the beta for each genotype. Adapt code to conform to your data.
=AA1*B11
Sum the column while ignoring #NAs.
=SUMIF(C11:C1000,"<>#N/A")

Genome scale genotype files are now into the tens of GB in length.
Not sure how one might be able to use the same idea using LO with files that large.