Source: http://thegradstudentway.com/blog/wp-content/uploads/2012/07/PhdComics2.jpg
Anyway, back to my learning progress. Any form of progress is better than none at all. After a long break from learning, I'm finally back at the PLINK tutorial. Today I plotted a multidimensional scaling (MDS) plot using the HapMap example of two populations.
Population stratification, also known as population structure, is the presence of systematic differences in allele frequencies between subpopulations in a population due to different ancestries. Stratification analysis uses whole-genome SNP data to cluster individuals into homogeneous groups. In the tutorial, only a simple stratification analysis was performed; the details can be found in another chapter on the PLINK website. I think it is worth spending the whole afternoon going through the main documentation on population stratification, as I will probably need to perform this as a routine data treatment procedure, and if I don't get it now, I won't get it later. I don't want to form bad habits in programming and jeopardize the quality of my research in the future.
If two or more individuals have identical nucleotide sequences in a DNA segment, the segment is said to be identical by state (IBS). If that shared segment was also inherited from a common ancestor without recombination, then it is identical by descent (IBD). IBS and IBD are among the few pieces of jargon often mentioned during lab meetings, so it is important that I have these two definitions registered in my brain and here. The clustering I did today was based on pairwise IBS distance. No constraints were applied to the process. Usually, a phenotype criterion, a cluster size restriction and external matching criteria are specified.
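To make the IBS idea concrete, here is a minimal R sketch of how a pairwise IBS proportion could be computed from genotypes coded as 0, 1 or 2 copies of an allele. The toy genotypes and the function name ibs_similarity are made up for illustration. This only sketches the general idea and is not PLINK's actual implementation.

```r
# Toy genotypes (made up): rows = individuals, columns = SNPs,
# values = number of copies of one allele (0, 1 or 2).
geno <- matrix(c(0, 1, 2, 0,
                 0, 1, 2, 0,
                 2, 1, 0, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("ind1", "ind2", "ind3"), NULL))

# Hypothetical helper: proportion of alleles shared IBS for every pair.
ibs_similarity <- function(g) {
  n <- nrow(g)
  s <- matrix(1, n, n, dimnames = list(rownames(g), rownames(g)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # At each SNP a pair shares 2, 1 or 0 alleles by state.
      shared <- 2 - abs(g[i, ] - g[j, ])
      # Average over SNPs, then divide by 2 alleles per SNP.
      s[i, j] <- mean(shared, na.rm = TRUE) / 2
    }
  }
  s
}

ibs <- ibs_similarity(geno)
ibs_dist <- 1 - ibs   # distance = 1 - similarity

# ind1 and ind2 are identical, so their IBS proportion is 1;
# ind1 and ind3 share 2 of 8 alleles, so theirs is 0.25.
print(ibs)
```

The 1 - similarity step is the same conversion used before the MDS step in the R code further down.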
To create the MDS plot for the HapMap example of two populations, I first created a matrix of pairwise IBS values using this command line:
plink --bfile mydata --cluster --matrix --out myplot
A few files were generated: myplot.mibs, myplot.cluster0, myplot.cluster1, myplot.cluster2, myplot.cluster3 and myplot.log. Information is stored in different formats in the four output files produced by the --cluster option.
Using RStudio (that is, the R statistical environment), I created the MDS plot with the code given. The values in myplot.mibs are IBS similarities, where 1 means identical, which is why the code takes 1 - m to turn them into distances before scaling:
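# Read the square matrix of pairwise IBS values written by PLINK
m <- as.matrix(read.table("myplot.mibs"))
# Convert similarity to distance (1 - IBS), then run classical MDS
mds <- cmdscale(as.dist(1 - m))
# Colour the first 45 individuals purple and the remaining 44 orange
k <- c(rep("purple", 45), rep("orange", 44))
plot(mds, pch = 20, col = k)
# RStudio was used on a Win8 laptop while PLINK was used on a UNIX server.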
Here's how my plot looked:
It's an interesting combo, using both PLINK and R to generate the plot. According to the tutorial, I could also generate the MDS plot using the --mds-plot option. I have not tried it, so I'm unsure how it works. I guess it is best that I stick to one method that I will be good at, rather than learn 101 alternatives. I am certain I can beautify my MDS plot. That can wait.
I'd better sign off now. I am attending a training session on "How to Write First Year Report" in the Clinical School later.
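One way the plot could be tidied up later is sketched below. It is a minimal example under stated assumptions, not the tutorial's method: the IBS matrix is a tiny made-up one (six individuals in two groups), and the names pop, toy_ibs, toy_mds and cols are hypothetical. The idea is to colour points from a vector of population labels instead of hard-coding row counts, and to add axis labels and a legend.

```r
# Made-up population labels for six toy individuals.
pop <- rep(c("popA", "popB"), each = 3)

# Toy IBS similarity matrix (invented values): 0.9 within a
# population, 0.7 between populations, 1 on the diagonal.
toy_ibs <- ifelse(outer(pop, pop, "=="), 0.9, 0.7)
diag(toy_ibs) <- 1

# Classical MDS on 1 - IBS, keeping two dimensions.
toy_mds <- cmdscale(as.dist(1 - toy_ibs), k = 2)

# One colour per population, looked up by label rather than by row count.
cols <- c(popA = "purple", popB = "orange")

plot(toy_mds, pch = 20, col = cols[pop],
     xlab = "MDS dimension 1", ylab = "MDS dimension 2",
     main = "MDS of pairwise IBS distances (toy data)")
legend("topright", legend = names(cols), col = cols, pch = 20)
```

With real data, the labels in pop would have to come from wherever the population of each individual is recorded, in the same order as the rows of the .mibs file.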