Life's Lessons Learned: PLINK: Association Analysis / Accounting for Clusters

It wasn't one of the best days. I had a word with the man above when I bumped into him in the pantry, and he didn't seem pleased that it took me two weeks to look at the paper on placing individuals in their geographical location. I'm taking very long because the methods are fundamental, and unfortunately this isn't knowledge I was born with. I've long suspected that everyone where I am based is assumed to be a genius of some sort, and I did feel stupid for taking longer than "most people" to pick up the basics. I know he thinks I'm too slow to learn the skills needed. It's OK; it is not the first time I have felt like a complete idiot in this group of geniuses. I have been feeling stupid for a while now. This stupidity is suffocating me from within. I guess this is the price to pay to eventually be able to carry the responsibility of having the Cantab post-nominal, along with the permanent head damage from ramming into the world of graduate life. Why in the world did I decide to do this? I am no longer in the mood to work that out right now.

Source: http://thegradstudentway.com/blog/wp-content/uploads/2012/07/PhdComics2.jpg

Anyway, back to my learning progress. Any form of progress is better than nothing at all. After a long break from learning, I'm finally back at the PLINK tutorial. Today I plotted a multidimensional scaling (MDS) plot using the HapMap example of two populations.

Population stratification, also known as population structure, is the presence of systematic differences in allele frequencies between subpopulations in a population due to different ancestries. Stratification analysis uses whole-genome SNP data to cluster individuals into homogeneous groups. In the tutorial, only a simple stratification analysis was performed; the details can be found in another chapter on the PLINK website. I think it is worth spending the whole afternoon going through the main documentation on population stratification, as I will probably need to perform this as a routine data treatment procedure, and if I don't get it now, I won't get it later. I don't want to form bad habits in programming and jeopardize the quality of my research in future.
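To put a number on "systematic difference in allele frequencies", here is a minimal Python sketch (Python rather than R, purely for illustration). The genotype counts and the two subpopulation names are made up, and `allele_frequency` is a hypothetical helper, not anything from PLINK. Each genotype is coded as the number of copies (0, 1 or 2) of one allele that an individual carries.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele, given per-individual copies (0, 1 or 2)."""
    # Each diploid individual contributes two alleles to the denominator.
    return sum(genotypes) / (2 * len(genotypes))

# Made-up genotype counts at a single SNP for two hypothetical subpopulations
subpop_a = [0, 0, 1, 0, 1, 0]
subpop_b = [2, 1, 2, 2, 1, 2]

freq_a = allele_frequency(subpop_a)
freq_b = allele_frequency(subpop_b)
freq_pooled = allele_frequency(subpop_a + subpop_b)

print("subpopulation A:", round(freq_a, 3))
print("subpopulation B:", round(freq_b, 3))
print("pooled sample:  ", round(freq_pooled, 3))
```

In this toy case the pooled frequency sits halfway between two very different subpopulation frequencies, which is the kind of hidden structure that stratification analysis tries to uncover.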

If two or more individuals have identical nucleotide sequences in a DNA segment, the segment is said to be identical by state (IBS). If the segment was also inherited, without recombination, from a common ancestor, then it is identical by descent (IBD). IBS and IBD are among the few pieces of jargon often mentioned during lab meetings, so it is important that I have these two definitions registered in my brain and here. The clustering I did today was based on pairwise IBS distances. No constraint was applied to the process. Usually, constraints such as a phenotype criterion, a cluster size restriction, or external matching criteria are specified.
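The idea of a pairwise IBS value can be sketched in a few lines of Python. This is only my reading of the counting idea, under stated assumptions, and not necessarily how PLINK implements it: genotypes are coded as copies (0, 1 or 2) of one allele, so two individuals share 2 - |a - b| alleles IBS at a SNP, and the similarity is the shared fraction over all SNPs where both genotypes are called. The function names and the genotypes are made up.

```python
def ibs_similarity(g1, g2):
    """Proportion of alleles shared IBS between two individuals.

    g1, g2: per-SNP copies of one allele (0, 1 or 2); None marks a missing call.
    """
    shared = 0
    total = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs where either genotype is missing
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this SNP
        total += 2
    if total == 0:
        raise ValueError("no SNPs with both genotypes called")
    return shared / total


def ibs_matrix(genotypes):
    """Square, symmetric matrix of pairwise IBS similarities."""
    n = len(genotypes)
    return [[ibs_similarity(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]


# Made-up genotypes for three hypothetical individuals at five SNPs
people = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 1, 1],
    [2, 1, 0, None, 0],
]
matrix = ibs_matrix(people)
for row in matrix:
    print(" ".join(f"{value:.3f}" for value in row))
```

Subtracting each value from 1 turns the similarity into a distance, which is what a clustering or MDS step would work on.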

To create the MDS plot for the HapMap example of two populations, I first created a matrix of pairwise IBS values (PLINK writes these as similarities between 0 and 1, which is why the R code further down uses 1-m) with this command line:

plink --bfile mydata --cluster --matrix --out myplot

A few files were generated: myplot.mibs, myplot.cluster0, myplot.cluster1, myplot.cluster2, myplot.cluster3 and myplot.log. The four .cluster files produced by the --cluster option store the clustering information in different formats, while myplot.mibs holds the IBS matrix requested with --matrix.

Using RStudio (that is, the R statistical environment), I created the MDS plot with the code given in the tutorial:

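# Read the pairwise IBS matrix written by PLINK
m <- as.matrix(read.table("myplot.mibs"))
# Classical MDS on the distances 1 - IBS
mds <- cmdscale(as.dist(1 - m))
# Colour the first 45 individuals purple and the remaining 44 orange
k <- c(rep("purple", 45), rep("orange", 44))
plot(mds, pch = 20, col = k)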

# RStudio was run on a Win8 laptop, while PLINK was run on a UNIX server.
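To see roughly what the cmdscale step is doing, here is a pure-Python sketch of classical (Torgerson) MDS: square the distances, double-centre them, and use the top eigenvectors scaled by the square roots of their eigenvalues as coordinates. It is a sketch under stated assumptions, not R's actual implementation: `classical_mds` is a hypothetical helper, the eigenvectors are found by simple power iteration with deflation (adequate only when the leading eigenvalues are positive and well separated), and the 3 x 3 similarity matrix is made up. Axis signs may differ from what R returns.

```python
import math


def classical_mds(dist, k=2, iters=500):
    """Classical MDS coordinates (n x k) from a symmetric distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        # Deterministic, normalised start vector for the power iteration
        v = [math.sin(i + 1 + axis) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract on this axis
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]  # deflate the found component
    return coords


# Made-up IBS similarities for three hypothetical individuals
sim = [
    [1.0, 0.7, 0.6],
    [0.7, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]
dist = [[1 - s for s in row] for row in sim]  # same 1 - m step as in the R code
coords = classical_mds(dist)
for point in coords:
    print(point)
```

For the real data, the matrix would come from myplot.mibs instead of being typed in, and the two coordinate columns are what ends up on the x and y axes of the plot.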

Here's what my plot looked like:

Using both PLINK and R to generate the plot is a very interesting combo. According to the tutorial, I could also generate the MDS plot using the --mds-plot option. I have not tried it, so I'm unsure how it works. I guess it is best to stick to one method that I will be good at, rather than learn 101 alternatives. I am certain I can beautify my MDS plot, but that can wait.

I'd better sign off now. I am attending a training session on "How to Write First Year Report" in the Clinical School later.

Tuesday, 27 May 2014
