Computational biology and more...

# Month: May 2022

Another un-posted thing from a few years ago!  I guess I figured we had already discussed on twitter so “what’s the point”.  Still, nice to summarise with a few more words here…

Chris Cole (@drchriscole) posted a link back in 2019 to

https://www.r-bloggers.com/fizzbuzz-in-r-and-python/

on twitter (you can see the thread here), which poses a simple coding problem that is claimed to be used often to test a candidate’s coding knowledge:

In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

The author of the article shows some simple and clear ways to solve this problem in R and Python that would certainly convince a possible employer that you knew how to do basic coding.

```r
for (i in 1:100) {
  if (i %% 3 == 0 & i %% 5 == 0) {
    print('FizzBuzz')
  } else if (i %% 3 == 0) {
    print('Fizz')
  } else if (i %% 5 == 0) {
    print('Buzz')
  } else {
    print(i)
  }
}
```

Chris, though, correctly pointed out that you can solve the problem with only two “if” statements and wrote some pseudocode to show how:

I only have my phone, but pseudo-code would be something like this.

```
for i in 1:100
  outstr = ""
  num = i
  if i divisible by three
    outstr = "fizz"
    num = ""
  if i divisible by five
    outstr = outstr + "buzz"
    num = ""
  print num + outstr
```
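For comparison, here is a minimal sketch of how that two-“if” pseudocode might look as runnable R. The function name `fizzbuzz_two_if` is my own invention for illustration (it is not from the thread), and I have used the capitalised “Fizz” and “Buzz” from the problem statement rather than the lower-case strings in the pseudocode.

```r
# Hypothetical helper: a runnable take on the two-"if" pseudocode.
fizzbuzz_two_if <- function(n = 100) {
  out <- character(n)
  for (i in seq_len(n)) {
    outstr <- ""
    num <- as.character(i)
    if (i %% 3 == 0) {          # first "if": multiples of three
      outstr <- "Fizz"
      num <- ""
    }
    if (i %% 5 == 0) {          # second "if": multiples of five
      outstr <- paste0(outstr, "Buzz")
      num <- ""
    }
    out[i] <- paste0(num, outstr)  # exactly one of num/outstr is non-empty
  }
  out
}

result <- fizzbuzz_two_if(100)
print(result[1:15])
```

Because “Buzz” is appended to whatever is already in `outstr`, multiples of 15 come out as “FizzBuzz” without needing a third test.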

The big strength of R though is that it is a vector language so I (@gjbarton) came up with the following:

My version in R but with three “if”s. There is prob. a really elegant way to do this in R...

```r
a = 1:100
b = a %% 3                          # 0 where divisible by 3
c = a %% 5                          # 0 where divisible by 5
d = b + c                           # 0 only where divisible by both
t = replace(a, b == 0, "Fizz")
t = replace(t, c == 0, "Buzz")
t = replace(t, d == 0, "FizzBuzz")
t
```

I'm a very rusty R programmer, first code in at least 3 years probably :-(

Here I create three vectors, b, c and d, that have zeros at positions where the number is divisible by 3, by 5, or by both 3 and 5.  I then substitute the appropriate text at those positions in the result vector t using the built-in “replace” function.

David Martin pointed out that I actually used zero if statements… so I guess this is a “win”?

You might ask why solve the problem like this?  It appears to hide the basic logic that is very clear in the first solution with an “if” block.   For a very small task like this, the “if” block is probably clearest, but as soon as you start to extend the logic, if blocks can get very messy and hard to read.  For example, it would be easy to extend the vector method to include more options such as “also divisible by 7” or to explore all combinations while still maintaining fairly concise code.
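As a rough illustration of that extensibility, here is one way the vector approach might be generalised to any set of divisors. This is only a sketch: the function name `fizzbuzz_vec`, the `rules` argument and the label “Bazz” for 7 are all made up for the example, and it builds labels by concatenation rather than with the chain of `replace` calls used above.

```r
# Hypothetical helper: one label per divisor, concatenated in the order given.
fizzbuzz_vec <- function(n = 100,
                         rules = c(Fizz = 3, Buzz = 5, Bazz = 7)) {
  a <- seq_len(n)
  labels <- character(n)             # "" at every position to start with
  for (name in names(rules)) {       # loops over the rules, not the numbers
    hit <- a %% rules[[name]] == 0   # TRUE where divisible by this divisor
    labels[hit] <- paste0(labels[hit], name)
  }
  # Positions that matched no rule keep their number (coerced to text).
  replace(labels, labels == "", a[labels == ""])
}

res <- fizzbuzz_vec(105)
print(res[c(7, 15, 21, 35, 105)])
```

Adding “also divisible by 11” is then a one-word change to `rules`, and all the combinations follow automatically.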

There is also an efficiency consideration as the size of the vector grows.  Generally, computers are great at running down a big vector and doing the same simple operation on it.   Once you introduce complex nested “if” statements on each element of the vector then the code has multiple possible branches and so takes more steps to execute at the CPU level.  This can mean it runs a lot slower or that it might run into memory issues.   If the code is compiled though, the compiler might be able to spot things like this (The people who write compilers are true artists!) and so there will be no performance advantage.
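If you want to see this for yourself, a sketch along the following lines checks that an element-by-element version and a whole-vector version agree, and then times both. The function names are invented for the example and I have not quoted any timings, since they depend on the machine and the version of R. In an interpreted language like R, much of any difference comes from the overhead of interpreting each trip round the loop rather than from the branches themselves.

```r
n <- 20000
a <- seq_len(n)

# Element-by-element version with an if/else chain (hypothetical name).
loop_version <- function(a) {
  out <- character(length(a))
  for (i in seq_along(a)) {
    if (a[i] %% 15 == 0) {
      out[i] <- "FizzBuzz"
    } else if (a[i] %% 3 == 0) {
      out[i] <- "Fizz"
    } else if (a[i] %% 5 == 0) {
      out[i] <- "Buzz"
    } else {
      out[i] <- as.character(a[i])
    }
  }
  out
}

# Whole-vector version built on replace() (hypothetical name).
vector_version <- function(a) {
  out <- replace(a, a %% 3 == 0, "Fizz")
  out <- replace(out, a %% 5 == 0, "Buzz")
  replace(out, a %% 15 == 0, "FizzBuzz")
}

same <- identical(loop_version(a), vector_version(a))
print(same)                              # do the two versions agree?
print(system.time(loop_version(a)))      # timings vary by machine
print(system.time(vector_version(a)))
```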

I hate the expression “Experimentally Validate…” or “Validate Experimentally…” particularly when applied to computational biology studies.

There is an assumption in biological sciences that the only way to be sure of a result suggested by computational analysis is to perform an experiment.   However, if you have used sophisticated computational methods to analyse the results of thousands of experiments published over decades to arrive at a conclusion, it does not make sense to then run a handful of experiments to “validate” your findings!

In biology, since the systems being studied are complex, it is hard to minimise the variables down to the one under investigation.   As a result, it is usually very difficult for a single experiment to “prove” a finding.   The very low threshold of statistical significance that is acceptable in biology is a consequence of this problem.  Whereas in most physical sciences you need a 5 or 6 sigma result to publish, in biology (and medicine) the usual threshold is p < 0.05, which is about 1.6 sigma for a one-sided test (roughly 2 sigma two-sided).  In other words, about a 1 in 20 chance of seeing such a result by chance alone when there is no real effect.
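The link between a sigma level and a probability can be checked with R's normal distribution functions `pnorm` and `qnorm`. The two helper functions below are hypothetical names of my own, and the sketch assumes the test statistic is normally distributed, which real data may not honour.

```r
# Tail probability beyond a given number of sigma (normal distribution assumed).
sigma_to_p <- function(sigma, two_sided = TRUE) {
  p <- pnorm(sigma, lower.tail = FALSE)
  if (two_sided) 2 * p else p
}

# The reverse: how many sigma correspond to a given p-value threshold.
p_to_sigma <- function(p, two_sided = TRUE) {
  if (two_sided) p <- p / 2
  qnorm(p, lower.tail = FALSE)
}

print(p_to_sigma(0.05, two_sided = FALSE))  # about 1.64
print(p_to_sigma(0.05, two_sided = TRUE))   # about 1.96
print(sigma_to_p(5, two_sided = FALSE))     # about 2.9e-07
```

So a 5 sigma result corresponds to odds of roughly 1 in 3.5 million one-sided, against 1 in 20 for the conventional biological threshold.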

Accordingly, experimental biology works by starting with a hypothesis and then gathering evidence for and against that hypothesis by carrying out a series of experiments that address the question from different angles.  At some point, enough evidence is gathered to support the original hypothesis, or a hypothesis modified in the light of the experimental data gathered in the laboratory and from work published or otherwise communicated from laboratories world-wide.   This may be enough evidence to justify writing up the study for publication, with conclusions based on all the evidence accumulated in favour of the hypothesis.  Although rarely combined into a single statistic, the combination of multiple consistent lines of evidence provides confidence that the result is real and not a chance artefact.
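To give a feel for what combining lines of evidence into a single statistic could look like, here is a sketch of Fisher's method for combining p-values. This is one standard textbook technique, not something the argument above depends on; it assumes the experiments are independent tests of the same hypothesis, and the p-values are made up purely for illustration.

```r
# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution
# with 2k degrees of freedom when all k null hypotheses are true.
# fisher_combine is a hypothetical helper name.
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

p_values <- c(0.04, 0.03, 0.05)   # made-up, individually borderline results
combined <- fisher_combine(p_values)
print(combined)
```

Three individually marginal but consistent results give a combined p-value well below any one of them, which is the intuition behind trusting several independent lines of evidence.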

Of course, many experiments don’t work or perhaps give ambiguous results, or possibly give results that contradict what all other experiments on the system seem to indicate.   Contradictory results suggest further investigation is necessary: “Can I believe these contradictory results?   Do I trust that single experiment more than all the others?”   Further experiments might be designed to test this, or perhaps in the worst case, the investigator might put those contradictory results to one side as untrustworthy and push ahead to publish the positive data.   This is not necessarily wrong if the experimental method being excluded is generally known to be unreliable.  Also, it is rarely possible to explore every possible angle in a single study; time and money are not limitless and there is value in publishing results even if not completely conclusive.  However, the emphasis in the current publishing model is that only positive results are valuable, so scientists gain skills in putting a positive “spin” on their data and conclusions when writing up.  This also requires readers of the scientific literature to “read between the lines” of the positive spin to understand the true confidence in the result.

Referees of papers may spot potential deficiencies in the justification of the results given the experimental data and so suggest further experiments or analysis.  Authors may have to do these experiments and present results that satisfy the referees in order to get their work published.  In this way, the final published account should be as good a representation of understanding as is possible with the resources and minds that have been applied to it.

Of course, it may turn out that the one experiment that did not fit the hypothesis was actually showing something fundamentally wrong with the hypothesis.    This may not be obvious immediately, but only emerge years later after substantial resources have been spent building on the erroneous findings.

Good experimentalists are highly skilled at identifying possible flaws in their own experiments and those of others.  They are superb at suggesting further experiments to carry out to help eliminate possible artefacts in the study.   This critique of their own work and that of others is what drives science forward and, while many results will be misleading and contradictory, leads ultimately to greater understanding of biology.

“Validation” in experimental terms might mean using multiple technologies that explore different aspects of the biological system.  For example, NGS to look at transcript expression, proteomics to probe protein complement, the response of both to chemical probes that are known to affect specific processes and pathways, the effect of “knocking down” a specific gene and so on.

However, “validation” is not really the right word to use.  What is being done is seeking support from complementary methods.  A better word would be “Consistent”.  In other words, that the results of analysing experiments from multiple techniques are consistent with an underlying hypothesis.   The scientific process is to “Seek consistency” towards a clearer understanding.

Computational analysis is no different in this respect.  It is fine to do some new experiments based on the computational study to look for consistency.  Indeed, the goal of an analysis is often to suggest new avenues of experimentation.  However, this is rarely a true validation of the computational analysis.

After studying really hard, getting a good B.Sc., flying high in your Ph.D. and as a postdoc, you have finally amassed enough research experience and scientific research publications to apply for Principal Investigator (PI) positions.   You succeed in winning, against massive competition, a prestigious Fellowship or other grant to start a research group in a great department and so can now start to forge your own independent research career as a PI!  Fantastic!   As I explain in my summary of Academic research careers, this is, in some ways, just the beginning!

Once your group grows beyond yourself, there is one important question.  What do you call the Group?

There is a long tradition, at least in places I have worked, to call the group after the person who forms it.   Some would argue this is egotistical and somehow gives the impression that all the work is done by the PI and that the group members are just a supporting act.   I think it is very rare for this to be true though; science is a team effort and in my experience, most PIs do a lot to acknowledge and support their team.    In fact, the “Group called after the PI” tradition reflects the realities of how research groups are created and funded in Universities and many research institutes.  They are built around a PI.  If the Group is to succeed, the PI, with the support of colleagues in their group, has to win grants, support and train research students and postdocs and steer the Group to publish scientific papers and other outputs that advance the field.   If the PI decides to leave their job or retires, then typically the group dissolves.

Some argue that it is better to call the group after the research theme that the PI works in.   Some do this and that is fine.  Unfortunately, PIs often shift the direction of their research, so a name chosen when the group is first created may become unconnected with what the group actually does a few years later.  I suppose you could change the group name if that happens, but that is a chore, particularly if the name is used on the Web and Social Media such as twitter.

That’s the generalities, what about me?

I first established an independent research group in the Laboratory of Molecular Biophysics, Department of Zoology, University of Oxford in 1989.  The Laboratory head was (Sir) David Phillips to begin with and then Louise Johnson (later Dame Louise).  The Laboratory had 8 or 10 research groups in it, each led by an individual PI, and the groups were all referred to by the PI’s name (e.g. Johnson Group, Stuart Group, Barford Group and so on…).  Naturally, when I joined, my group was called the Barton Group by the Laboratory and the Department.

I can’t remember if I thought this was weird or egotistical or not.  I do remember as a fresh PI, enjoying the fact that I was independent of what the other groups were doing and our Head of Laboratory (Louise Johnson for most of the time I was there).  The group name also made clear that I was not someone else’s postdoc, though as is common with young PIs, I had to point this out to people a lot in my early career.

There is a slight computational twist to this.  When I started as a PI I had just enough equipment funds to buy a Unix workstation (A Sun SPARCstation 1) to use for my research.  I had to configure this and manage it myself.  At one point in the slow installation of the operating system from tape, I had to provide a name for the computer.  I was stumped, “Err, what do I call it?”  Since this was the first time I had ever “owned” a computer, I decided for some reason to call it “geoff”.  I thought, “I can change that later to something more sensible…”.  However, I didn’t know how to change the name (it was harder to find out things like this without Google) and so the name stuck.

So, the network name for my computer was “uk.ac.ox.biop.geoff” which, when we went fully onto the internet became “geoff.biop.ox.ac.uk” .  I was one of the first in the UK to establish a group web site in about 1993 I think, so naturally, it was called the Barton Group website and was hosted on www.geoff.biop.ox.ac.uk.  Sigh…   This was a real embarrassment but at least it did make clear whose group it was…

Roll the clock forward a bit to 1997.  I went to work at the European Bioinformatics Institute near Cambridge, UK.   In the early years of the EBI the tradition for the few research groups was to call them after the surname of the PI.  So, my group site became “barton.ebi.ac.uk” alongside “sander.ebi.ac.uk” for Chris Sander’s group and others.

When I moved to the University of Dundee in 2001, I compromised by keeping the name of my group as Barton Group, but naming the website, “compbio.dundee.ac.uk”.   This gave continuity and “Brand identity” since by this time we provided a number of bioinformatics resources to the world and it made them easier for people to find.

This is pretty much where we are today…  There is a problem though.  The generic domain name “compbio.dundee.ac.uk” refers just to the “Barton Group”, but in 2013 we formed a Research Division of which I am one of six PIs, which, after much discussion, is also, confusingly, called “Computational Biology”.  It has its own simple webpages on lifesci.dundee.ac.uk.

Ho hum…

© 2024 Geoff Barton's blog
