Wednesday 28 May 2014

Toolkit Story

Everyone needs a personalised toolkit in order to achieve their goals.

A surgeon needs his tools and team to ensure his patient can undergo a safe and successful operation under him.
Source: http://upload.wikimedia.org/wikipedia/commons/0/08/Surgeons_at_Work.jpg
An artist needs his brushes and paints and canvases to produce beautiful paintings.
Source: http://columbiametro.com/images/cache/cache_3/cache_d/cache_c/3dc8a408a597b5d093d73c57cc67ea3a.jpeg?ver=1400749198&aspectratio=1.5

A mountaineer needs reliable mountain gear to ensure he arrives at the top of a snowy mountain safely.
Source: http://www.mountainguides.com/photos/gear/backpack-tahoma.jpg
As for me, a PhD probationary kid, I need to master part of this network of computational biology.
Source: http://compbio.mit.edu/teaching.jpg

So, the first step is to understand what it is all about, and to find the right tools to start working on it. It seems to me that I am looking more into the area of computational genomics in the sense of biological anthropology these days, so there is a set of tools (packages or whatever one would call them) which I can pack in my imaginary backpack to get me going.
Source: http://www.compbio.cmu.edu/images/background.gif
Will I be able to form a proper research question by the end of my 6 months here? I have another month to ponder this. While I ponder, there is more packing to be done for the next chapter of my journey...

Source: https://cord-global.terradotta.com/_customtags/ct_Image.cfm?Image_ID=57

Tuesday 27 May 2014

PH525x Data Analysis for Genomics

Source: http://media-cache-ec0.pinimg.com/736x/ac/5d/b1/ac5db17580e6bd3ad70f2584b0cb3cc8.jpg
I need a change of learning pace. I've realised it for a while now, but yes, I procrastinated like crazy and persisted in stubbornness. Ahh... well, yeah. I heard about edX a while ago, but thanks to the procrastinator who lives in me, I've delayed utilising this very good website properly. It is high time to enter into a learning frenzy before I fly off to Krakow next Thursday.

I've signed up for two courses, one of which is "Data Analysis for Genomics". This course is conducted by Professor Rafael A Irizarry and Dr. Michael Love from Harvard University, which is one of the founders of edX. Actually the course started on 7th April, but I signed up to audit it a few minutes ago. Fingers crossed it will help me in my PhD. I'm a bit desperate, but I guess being desperate right now can be a good thing, rather than being desperate at the end of my first year, not knowing if I'll successfully proceed to my 2nd year or not.

You can find out more about edX, a leader of the online learning movement known as MOOCs (Massive Open Online Courses), on their website (click on the edX image below). As mentioned, the founders of edX are Harvard University and MIT (Massachusetts Institute of Technology). edX offers online college courses to everyone and anyone in the world. I guess the only reason one will not learn is when one decides not to learn.

So wish me luck and pray for me!

Cheers!




Writing Your First Year Report

I attended the course conducted by Dr. Geraint Wyn Story for GSLS (Graduate School of Life Sciences) PhD students to learn how to write my first year report. I know it is still a few months away before I start freaking out totally about it, but I guess it is better that I prepare myself right from the start rather than stressing myself out not knowing what I need to do when it is time to submit that very report, which will determine if I stay or leave Cambridge. To be honest, I am freaking out already, despite the fact that the first year report seems to be a very informal event in the division which I am in. Furthermore, the man above has been pushing me to produce results. I feel like a child who is asked to run when I am still learning how to crawl. Yes, the sense that I am stupid exists, and it feels very real, though after attending the session I am more assured that it is alright to feel how I feel.

One thing which is encouraged in developing our writing skills is to write regularly as an ongoing process. A new concept (maybe not-so-new) was introduced -- "Writing to Prompts". This concept uses a question or a fragment of a sentence to stimulate the writing process. I do find it beneficial as it helps to focus the mind on writing something. Personally, I tend to break down the things I want to write into subtopics, and then explore the literature and read accordingly, sometimes making notes based on sections of the same paper.

Another interesting idea is to do "Free Writing" for about 5-10 minutes, with a suggested ideal frequency of 3 times a week. Writing without stopping for a set duration on a certain topic will indeed help to free up the writer who is bogged down by the research, and improve self-confidence. When we practised this concept, I came up with a short entry on "First Year Report Training".

One more concept before I end this post is the introduction of "Writing Groups" to act as a social activity plus an encouragement to help us write, and to make writing fun. Would anyone like to be my partner in crime for this?!

The usual culprit of a good report would be the standard format, and everything else is rather dependent on the department. The first year report should focus on the introduction and future work, with some preliminary results and methods if there are any. A little about referencing was covered, as was the importance of knowing who will be reading the first year report and who the examiners are, plus the importance of appointing a secondary adviser. A Gantt chart is suggested as part of the first year report to give the audience an idea of what I plan to do.

This will take me a bit of time to digest.

I'm ending this with something from PhD Comics. Cheers!

Source: http://www.phdcomics.com/comics/archive/phd030712s.gif

First Year Report Training

The day was gloomy when I woke up after the multiple alarms went off. It has been a dread to wake up these days, but I knew how important it was for me to connect with people studying Life Sciences and to learn about the first year report. The people in........***censored***........... It has been a lonely journey.

After a series of missed buses, getting lost, and finally getting there to listen to Dr. Geraint talking and the PhD probationary students speaking of first-year worries, more questions popped up, but I'm assured that I'm not alone in my struggle to bear the future responsibility of being a Cambridge grad. Perhaps it doesn't take a super genius to get a PhD done here.

There's still some time left. Let's see how it goes.

PLINK: Association Analysis / Accounting for Clusters

It wasn't one of the best days. I had a word with the man above when I bumped into him in the pantry, and he didn't seem pleased that it took me two weeks to look at the paper on placing individuals in their geographical location. I'm taking very long because, although the methods are fundamental, they aren't knowledge that I was born with, unfortunately. I've long suspected that everyone is assumed to be a genius of some sort where I am based, and I did feel stupid as I am taking longer than "most people" to pick up the basics. I know he thinks I'm too slow to learn the skills needed. It's ok, it is not the first time I have felt like a complete idiot in this group of geniuses. I have been feeling stupid for a while now. This stupidity is suffocating me from within. I guess this is the price to pay to eventually be able to carry the responsibility of having the Cantab post-nominal, along with the permanent head damage from ramming into the world of graduate life. Why in the world did I decide to do this? I am no longer in the mood to discern it right now.

Source: http://thegradstudentway.com/blog/wp-content/uploads/2012/07/PhdComics2.jpg

Anyway, back to my learning progress. Any form of progress is better than nothing at all. After the long break from learning, I'm finally back at the PLINK tutorial. Today I plotted a multidimensional scaling (MDS) plot using the HapMap example of two populations.

Population stratification is the presence of systematic differences in allele frequencies between subpopulations in a population due to different ancestries, also known as population structure. Stratification analysis uses whole genome SNP data to cluster individuals into homogeneous groups. In the tutorial, simple stratification was performed, but the details can be found in another chapter on the PLINK website. I think it is worth spending the whole afternoon going through the main documentation on population stratification, as I will probably need to perform this as a routine data treatment procedure, and if I don't get it now, I won't get it later. I don't want to form bad habits in programming and jeopardize the quality of my research in future.
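To make the definition a little more concrete, here is a minimal Python sketch of what "a difference in allele frequencies between subpopulations" means at a single SNP. The genotypes and the subpopulation names are made up for illustration (they are not from the HapMap example), and the 0/1/2 coding simply counts copies of one allele per individual.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele, given 0, 1 or 2 copies per individual."""
    return sum(genotypes) / (2 * len(genotypes))

# Hypothetical genotypes at one SNP for two made-up subpopulations.
subpop_a = [0, 0, 1, 0]
subpop_b = [2, 1, 2, 2]

freq_a = allele_frequency(subpop_a)
freq_b = allele_frequency(subpop_b)

# A large gap at many SNPs across the genome is what stratification looks like.
print(freq_a, freq_b, abs(freq_a - freq_b))
```

A single SNP proves nothing on its own; the point of using whole genome data is that such differences add up across many SNPs.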

If two or more individuals have identical nucleotide sequences in a DNA segment, the segment is known as identical by state (IBS). If this segment is inherited without recombination from a common ancestor, and is found in two or more individuals, then it is identical by descent (IBD). I find that IBS and IBD are among the pieces of jargon often mentioned during lab meetings, so it is important that I have these two definitions registered in the brain and here. The clustering which I did today was based on pairwise identity-by-state (IBS) distance. No constraint was applied to the process. Usually a phenotype criterion and a cluster size restriction, plus external matching criteria, are specified.
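As a rough illustration of pairwise IBS, the sketch below scores two individuals by the proportion of alleles they share across SNPs. It assumes unphased genotypes coded as 0, 1 or 2 copies of one allele, with None for a missing call; the function name and the example genotypes are hypothetical, and this is not PLINK's own code.

```python
def ibs_similarity(person_a, person_b):
    """Average proportion of alleles shared identical by state across SNPs.

    Genotypes are copies of one allele (0, 1 or 2); None marks a missing call.
    """
    shared = 0
    compared = 0
    for g_a, g_b in zip(person_a, person_b):
        if g_a is None or g_b is None:
            continue  # skip SNPs where either genotype is missing
        shared += 2 - abs(g_a - g_b)  # 2, 1 or 0 alleles shared at this SNP
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs with both genotypes present")
    return shared / (2 * compared)

# Made-up genotypes for two individuals at five SNPs (last one missing for the first person).
similarity = ibs_similarity([0, 1, 2, 2, None], [0, 1, 0, 2, 1])
print(similarity, 1 - similarity)  # similarity, and the matching distance
```

Note that this only measures IBS. Whether a shared segment is also IBD depends on ancestry, which cannot be read off a score like this directly.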

In order to create the MDS plot for the HapMap example of two populations, I first created a matrix of pairwise IBS values (similarities, which the R code further down turns into distances with 1-m) using this command line:

plink --bfile mydata --cluster --matrix --out myplot

A few files were generated: myplot.mibs, myplot.cluster0, myplot.cluster1, myplot.cluster2, myplot.cluster3, myplot.log. The information is stored in different formats within the four .cluster output files produced by the --cluster option.

Using RStudio (which means I used the R statistical tool), I created the MDS plot with the code given:

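# Read the IBS matrix written by PLINK
m <- as.matrix(read.table("myplot.mibs"))
# Convert similarity to distance (1 - m), then run classical MDS
mds <- cmdscale(as.dist(1 - m))
# Colour the first 45 individuals purple and the remaining 44 orange
k <- c(rep("purple", 45), rep("orange", 44))
plot(mds, pch = 20, col = k)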

# RStudio was used on Win8 laptop while PLINK was used on UNIX server.
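For anyone curious about what cmdscale is doing, here is a rough pure-Python sketch of classical MDS: square the distances, double-centre them, then take the leading eigenvectors. It is only an illustration under some assumptions: the function names are made up, power iteration stands in for a proper eigen solver, and it assumes the leading eigenvalues are positive. It is not how R implements it.

```python
import math

def double_center(dist):
    """Turn a symmetric distance matrix into the centred matrix used by classical MDS."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector by power iteration."""
    n = len(matrix)
    vector = [float(i + 1) for i in range(n)]  # arbitrary fixed starting vector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0, [0.0] * n
        vector = [x / norm for x in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return eigenvalue, vector

def classical_mds(dist, dims=2):
    """Coordinates for each row of dist in `dims` dimensions."""
    n = len(dist)
    b = double_center(dist)
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        eigenvalue, vector = top_eigenpair(b)
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i].append(vector[i] * scale)
        # Deflate: remove this component so the next pass finds the next one.
        b = [[b[i][j] - eigenvalue * vector[i] * vector[j] for j in range(n)]
             for i in range(n)]
    return coords

# Toy check: three points on a line at positions 0, 1 and 3.
toy = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coords = classical_mds(toy, dims=1)
print([round(abs(row[0] - coords[0][0]), 6) for row in coords])
```

In the toy check the recovered coordinates should reproduce the original spacing along the line (up to a flip and a shift), which is all MDS promises. For the real data, the distance matrix would be 1 minus the values in myplot.mibs.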

Here's what my plot looked like:

It's a very interesting combo to use both PLINK and R to generate the plot. According to the tutorial, I could also generate the MDS plot using the --mds-plot option. I have not tried it, so I'm unsure how it works. I guess it is best that I keep to the one which I will be good at, rather than learning 101 alternatives. I am certain I can beautify my MDS plot. That can wait.

I'd better sign off now. I am attending training on "How to Write First Year Report" in the Clinical School later.

Thursday 15 May 2014

Word Count using Python 3

In two days' time I need to submit the oral presentation slides to the IDB Scholars Scientific Symposium organiser. I need to re-organise the data which I got from my Masters dissertation since I sorta made a mess with the data analysis back then. Looking at it almost a year later amazed me how oft I was! *chuckles* I guess I did progress slightly after being at Cambridge for four months. Today is the first day (maybe 2nd day) of my 5th month.

BTW, my MFSc. (Erasmus Mundus) project was on DNA profiling.

As I was arranging the edited data with all the missing alleles, I got to the point of... "Err... How many lines are there in this document?" I know I can always use MS Word's word count function, but I was feeling kinda adventurous, so I used a Python 3 script which I'd recently learned... And it worked fine. So for anyone who wants to count lines, words and characters, the script is as below.

# Replace "filename" with the path to your file,
# e.g. "C:/Users/KherXing/treasure.txt"
n_lines = 0
n_words = 0
n_chars = 0

# "with" closes the file automatically, even if an error occurs
with open("filename") as data:
    for line in data:
        n_lines += 1
        n_words += len(line.split())  # split() breaks on any whitespace
        n_chars += len(line)          # includes the newline character

print(n_lines, n_words, n_chars)

Tuesday 13 May 2014

PLINK by Purcell

Source: http://pngu.mgh.harvard.edu/~purcell/plink/gplink.shtml 

Today's task is simple - to start learning PLINK. The version which I am learning is the command-line version of PLINK on a UNIX server, while the diagram shows gPLINK, the Java-based software package which can run most of the common PLINK operations. I suppose gPLINK is more user-friendly towards Windows users (like me) who are used to perpetual mouse-clicking rather than using the keyboard. However, it will be more logical for me to utilise our group server, so I'm trying to pick up the UNIX version of it. Right now, I'm learning it using the tutorial provided on the PLINK webpage.

Source: http://pngu.mgh.harvard.edu/~purcell/plink/img/gp_overview2.png
A brief introduction to PLINK: it was developed by Shaun Purcell, and it's a free, open-source whole genome association analysis toolset. PLINK itself is used solely to analyse genotype/phenotype data. With the development of gPLINK and Haploview, subsequent visualisation, annotation and storage of results can be performed. Which is GREAT news for newbies like me!

Let me recount what I find fascinating, which would amuse any normal computer scientist immensely: I could call the command rm to delete the files which I no longer needed on the UNIX server. I can also finally understand the difference between the PED and BED files mentioned in ADMIXTURE. Oh wow! That is a relief! BED is the binary PLINK file format, which saves space and speeds up subsequent analyses. I tested it using the example dataset "hapmap1":

  1. plink --file hapmap1 --xxxx --xxxx xxxx --out xxxx 
  2. plink --bfile hapmap1 --xxxx --xxxx xxxx --out xxxx

The first one used a normal PLINK file (PED), while the latter used a binary PLINK file (BED). Guess what? The first command took about 5 secs while the second took a sec. It did speed up the analysis! Ok, it is well-known, but it fascinated me.
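Why would a binary file be smaller and faster to read? A genotype at a biallelic SNP only has a handful of possible states, so it fits in two bits, and four genotypes fit in one byte. The sketch below is a simplified illustration of that idea only: the codes 0-3 and the function names are made up, and this is not the real BED layout, which has its own header bytes and code table.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (each 0-3) four to a byte, two bits apiece."""
    packed = bytearray()
    for start in range(0, len(genotypes), 4):
        byte = 0
        for offset, code in enumerate(genotypes[start:start + 4]):
            if not 0 <= code <= 3:
                raise ValueError("genotype codes must be between 0 and 3")
            byte |= code << (2 * offset)  # place this code in its own bit pair
        packed.append(byte)
    return bytes(packed)

def unpack_genotypes(packed, count):
    """Recover the first `count` genotype codes from packed bytes."""
    codes = []
    for byte in packed:
        for offset in range(4):
            codes.append((byte >> (2 * offset)) & 0b11)
    return codes[:count]

# Made-up genotype codes for seven individuals at one SNP.
genotypes = [0, 1, 2, 3, 2, 2, 0]
packed = pack_genotypes(genotypes)
text_form = " ".join(str(code) for code in genotypes)

print(len(text_form), "characters as text vs", len(packed), "bytes packed")
```

A text file also has to be parsed character by character, while packed bytes can be read in bulk, which is presumably part of why the second command finished sooner.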

The first figure is really an overview of the PLINK workflow from start to end: how PLINK commands eventually generate information which can help others to understand what the scientist has been testing (or trying to find out). It is important that there is a visualisation of the information, rather than just boring numbers (sorry fellow Mathematicians, I know numbers amuse you, but for a general audience, colourful charts still stand out).

Best of all, an answer to my previous error, where I needed to "apply genotype filter to dataset", appeared after going through the first part of the tutorial. YAY!

Friday 9 May 2014

Precious Bytes

I'm currently residing in Fitzwilliam College accommodation. Guess what? Our internet rate is £0.33/GB. For a moment, I forgot I was in my room, and I had to pause my download as it was a 716-MB file! Yes, I suddenly remembered that I would have to pay at that 33p rate if I were to download it. I think I'd better work in the office this weekend so that I can use the free internet in the university. Today, I'm skipping the office and continuing the momentum of working at home on the paper whose results I'm trying to reproduce. I wonder if a week is sufficient for me to reproduce something which experts spent months on. I feel like crying but I guess I'll have to bite my lip and continue with work!

I found this diagram on the website. It is an estimation of all the precious bytes we spend online through all the clicks we make in our browsers. No idea how accurate it is, but I guess we should be concerned about how much we spend. I only started obsessing over counting bytes when I began to be charged per GB of internet usage!
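For the record, the arithmetic for that download can be sketched in a few lines of Python. Two assumptions here are not from the college's terms: that usage is charged pro rata rather than rounded up to a whole GB, and that 1 GB is counted as 1000 MB.

```python
RATE_GBP_PER_GB = 0.33   # rate quoted above
FILE_SIZE_MB = 716       # size of the paused download
MB_PER_GB = 1000         # assumption: decimal gigabytes

# Assumption: the charge is proportional to the data actually used.
cost = FILE_SIZE_MB / MB_PER_GB * RATE_GBP_PER_GB
print(f"£{cost:.2f}")
```

Under those assumptions the file would come to roughly 24p, a little less than a full GB's worth.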

Source: http://techessentials.rogers.com/Libraries/Site_Images/data_calc.sflb.ashx

Sketches

Source: http://fc04.deviantart.net/fs71/i/2012/312/6/c/how_to_paint_digitally_over_a_pencil_sketch__gimp__by_crystal_89-d5kbt1t.jpg

Every artist knows that the first step to amazing paintings is sketching the art framework. The counterpart of artists would usually be the scientists (I define this very loosely here). I'm guilty as charged when it comes to not strictly sketching my experiments, since we somewhat know what the concepts are. I'm not saying that all scientists are like me; I know of many colleagues who have meticulous planning methods.

Upon attending programming classes for absolute beginners, I noticed one repeated warning from all the lecturers - PLAN the logic of the program even before starting. For the past two days, I worked on the basis that I need to know this and that, without realising how dodgy it had become. This morning, I woke up thinking about it after some hiccups last night trying to figure out the meaning of the error I got when trying to run ADMIXTURE. Apparently, I need to apply filters to the dataset. Yes, I have NO idea how to apply filters. It froze me in my tracks.

Then, I began to liken myself to an artist of a different field. Not the colourful flowers blooming in the middle of spring, but a starter of another type of bloom - the programming one. Running back to the previous checkpoint, I'm giving myself some time this morning to sketch the structure of the whole process before I get into the minute details of each step. Yes, ADMIXTURE is just part of the whole process and I was obsessed with it the whole of yesterday.

Source: http://isaiahbowling.com/wp/wp-content/uploads/2011/02/Example_looping_program-full.jpg

This is an example of planning a looping program. Not sure what language they are using. So any computer scientist who might know, please drop a comment! Thanks!

Thursday 8 May 2014

Upgrade Exam


I'm changing the format of this blog. My FB chronicles which I've posted here until October last year will still be available (wade through the entries please) but I will focus more on my PhD progress in this blog. After all, this is the blog of a wimpy scientist (aka me).

I saw myself as an infant of my field the day I stepped into the office and met my supervisor. After the 4 months of swimming in the sea of overwhelmed emotions (both at work and in my personal life), I've finally moved on from infancy to maybe kindergarten stage when I'm sort of done attending the important Absolute Beginners classes for most of the programming tools I need.

Today, I passed a little benchmark. I was finally able to run some basic analyses using the example dataset on ADMIXTURE. I strictly followed the examples given, and my brain crashes when they don't work. I guess I'm officially in Primary 1 right now. Hopefully they will make more sense as I explore this tomorrow. It is already Friday!!

I wonder how long I will stay in Primary 1? A week? Hopefully shorter than that, but I need to fully understand what I am doing with ADMIXTURE before I proceed to analyse the data provided for the papers I'm supposed to "emulate".

This is my very first barplot of Q estimates using R after running ADMIXTURE. Very raw and inadequate. Yet, a benchmark!