Tuesday 13 May 2014

PLINK by Purcell

Source: http://pngu.mgh.harvard.edu/~purcell/plink/gplink.shtml 

Today's task is simple: to start learning PLINK. The version I am learning is the command-line version of PLINK on a UNIX server, whereas the diagram shows gPLINK, a Java-based software package that can run most of the common PLINK operations. I suppose gPLINK is more user-friendly for Windows users (like me) who are used to perpetual mouse-clicking rather than using the keyboard. However, it is more logical for me to use our group server, so I'm trying to pick up the UNIX version. Right now, I'm learning it using the tutorial provided on the PLINK webpage.

Source: http://pngu.mgh.harvard.edu/~purcell/plink/img/gp_overview2.png
A brief introduction to PLINK: it was developed by Shaun Purcell, and it's a free, open-source whole genome association analysis toolset. PLINK itself is used solely to analyse genotype/phenotype data. With gPLINK and Haploview, subsequent visualisation, annotation and storage of results can be performed, which is GREAT news for newbies like me!

Let's recount what I found fascinating (and which would amuse any normal computer scientist immensely). First, I could call the command rm to delete files I no longer needed on the UNIX server. Second, I can finally understand the difference between the PED and BED files mentioned in ADMIXTURE. Oh wow! That is a relief! BED is the binary PLINK file format, which saves space and speeds up subsequent analyses. I tested it using the example dataset "hapmap1":

  1. plink --file hapmap1 --xxxx --xxxx xxxx --out xxxx 
  2. plink --bfile hapmap1 --xxxx --xxxx xxxx --out xxxx

The first one used a normal PLINK file (PED), while the second used a binary PLINK file (BED). Guess what? The first command took about 5 seconds, while the second took about a second. It did speed up the analysis! OK, that is well known, but it fascinated me.
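
To convince myself why the binary file is so much smaller, here is a toy Python sketch (not PLINK code, and not a real file reader). It assumes the 2-bit genotype codes used by PLINK's .bed format and packs four genotypes into each byte, then compares that with a rough PED-style text size of two allele characters plus two separators per genotype. The function names and genotype labels are my own, made up for illustration. A real .ped file also has leading sample columns and a real .bed file has a short header, both of which are ignored here.

```python
# Toy illustration only: why a binary genotype file is smaller than text.
# Assumption: simplified layout; this does not read or write real PLINK files.

# 2-bit genotype codes as used in PLINK's .bed format.
CODES = {
    "hom_a1": 0b00,   # homozygous for the first allele
    "missing": 0b01,  # missing genotype
    "het": 0b10,      # heterozygous
    "hom_a2": 0b11,   # homozygous for the second allele
}


def pack_genotypes(genotypes):
    """Pack genotype labels four per byte, first genotype in the lowest bits."""
    packed = bytearray()
    for start in range(0, len(genotypes), 4):
        byte = 0
        for offset, label in enumerate(genotypes[start:start + 4]):
            byte |= CODES[label] << (2 * offset)
        packed.append(byte)
    return bytes(packed)


def text_size(genotypes):
    """Rough PED-style size: two allele characters and two separators each."""
    return 4 * len(genotypes)


# Hypothetical genotypes for one SNP across five individuals.
example = ["hom_a1", "het", "hom_a2", "missing", "het"]
packed = pack_genotypes(example)
print("text bytes:  ", text_size(example))
print("packed bytes:", len(packed))
```

With five genotypes the text form needs about 20 bytes and the packed form needs 2, so a roughly tenfold or larger saving seems plausible for big datasets. The speed-up I saw probably also comes from not having to parse text, but that is my guess rather than something this sketch shows.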

The first figure is really an overview of PLINK from start to end: how PLINK command(s) eventually generate information that helps others understand what the scientist has been testing (or trying to find out). It is important to have a visualisation of the information, rather than just boring numbers (sorry fellow mathematicians, I know numbers amuse you, but for a general audience, colourful charts still stand out).

Best of all, an answer to my previous error, where I needed to "apply genotype filter to dataset", appeared after going through the first part of the tutorial. YAY!
