The current model of philogenetic evolution is basically a random searchto optimize a fitness function. Genetic programming works by the same principles and it is able to find good approximations to global optima.
Genetic programming is an optimization technique where we maintain a set of candidate solutions, the 'genomes', and modify them randomly by 1) pointlike mutations, or 2) crossovers: creating a new solution by combining halves from existing 'genomes'.
Crossovers are very important; without crossover, genetic programming would be only a local search, capable of finding only local optima.
Seemingly, evolution has the same twofold mechanism: small genetic changes that 'implement' a local search on the fitness landscape; and large, abrupt changes like whole-genome duplication, etc. causing jumps that land on the slopes of faraway peaks.
However, this is only an illusion. Even chromosome or whole-genome duplication is not radical enough for a truly global search for optimum, as no new genes are introduced. Lateral gene transfer works only for bacteria and viruses.
There is one more problem with abrupt, large-scale genetic changes in multicellular, diploid life forms: the offsprings of the new phenotype would be isolated, and cannot find a mate. While this is not a problem for slow genetic drift, where offsprings are similar enough to their parents and relatives, it poses a problem for 'jumps' that are so successful in genetic programming.
One solution could be that we state that evolution is a local search, and the fitness landscape is smooth enough to make every peak attainable on a local search path.
I don't know if the fossil records support this hypothesis or not; I feel that there are large 'gaps' between groups (taxa), that would suggest that from time to time abrupt changes are necessary, and evolution cannot be explained as a smooth, simple local search.
In the following I share my own 'crackpot' theory on evolution; it is not the result of any kind of study or investigation, just my thoughts on the topic; it was not reviewed or approved by any expert in the field (and what regards me, I am an absolute novice or amateur...). So quit reading or take the following cautiously, as it could be the dumbest thing you've ever heard.
I think there are two distinct mechanisms for evolution: one is the well-known darwinian process: small random mutations accumulate in exons and selected or eliminated by natural selection: this is the local search.
And I think there is another mechanism, that is behind the abrupt, jump-like changes in the genome: it is much less known or discussed in mainstream science. These changes accumulate silently, in non-expressed parts of the genome, like introns or 'junk' DNA, and they are passed on to offsprings so that they are widespread in the genome before the Big Change.
Then there comes a time when the genome becomes bistable: a single point-mutation can turn lots of the exons into introns or junk DNA, while the latent, hidden new genes suddenly become expressed.
As the latent new genome is already widespread at this point, this 'flip over' can happen simultaneously in many offsprings, so they will be able to find mates and create their on population.
The crucial notion here is the bistability of the genome: a state where it contains really two genomes, one of which is suppressed, the other expressed, but a single mutation can turn it around, so that the latent genome becomes expressed. Of course there could be thousands of such mutations, so this flipover event is not at all improbable.
Alternative splicing in some genes is a model for this flip-over in small scale.
Saturday, February 21, 2009
Thursday, February 12, 2009
PROPKA vs H++: pKa predictors for protein residues
Recently I decided to try out some free, online servers for predicting pKA values on protein residues: PROPKA, H++, and Karlsberg+. This latter promises to send the results in email, but I never got back anything, so I guess it is dropped from the competition.
First, I tried a famous example for abnormal pKa value: the bovine chymotrypsin, which was one of the first proteases whose catalytic mechanism was unveiled. All textbooks on biochemistry describe this: we have a 'catalytic triad', which are Asp102, His57, and Ser195. Asp102 forms a Hydrogen bond with His57, making its other aromatic nitrogen a strong bases, which deprotonates Ser195. Thus the Serine becomes a nucleophile and attacks the carbon in the peptide bond.
So I expected from the pKa predictors that they find out that the Histidine becomes a strong base, and that the Serine is deprotonated.
First, I tried the PROPKA server, giving 1ab9 (BOVINE GAMMA-CHYMOTRYPSIN)
This is not what I expected. The Histidine is actually getting a bit acidic; while the Aspartate is very acidic. Trying on H++:
I am satisfied with the prediction on the Histidine: it became a strong base; but -5.1 for the Asp is a bit off the mark...
Next, I tried Caspase 3, a cysteine protease, with PDB structure 1rhk. H++ did not accept it, complaining about missing residues. The catalytic residue is Cys165A, for which propka gives pKa = 11.14, instead of the standard 9.0. So the cysteine indeed became a strong base and nucleophile.
First, I tried a famous example for abnormal pKa value: the bovine chymotrypsin, which was one of the first proteases whose catalytic mechanism was unveiled. All textbooks on biochemistry describe this: we have a 'catalytic triad', which are Asp102, His57, and Ser195. Asp102 forms a Hydrogen bond with His57, making its other aromatic nitrogen a strong bases, which deprotonates Ser195. Thus the Serine becomes a nucleophile and attacks the carbon in the peptide bond.
So I expected from the pKa predictors that they find out that the Histidine becomes a strong base, and that the Serine is deprotonated.
First, I tried the PROPKA server, giving 1ab9 (BOVINE GAMMA-CHYMOTRYPSIN)
pKa pKmodel
ASP 102B 0.40 3.80
HIS 57B 6.19 6.50
This is not what I expected. The Histidine is actually getting a bit acidic; while the Aspartate is very acidic. Trying on H++:
HIS57: 11.3
ASP102 -5.1
I am satisfied with the prediction on the Histidine: it became a strong base; but -5.1 for the Asp is a bit off the mark...
Next, I tried Caspase 3, a cysteine protease, with PDB structure 1rhk. H++ did not accept it, complaining about missing residues. The catalytic residue is Cys165A, for which propka gives pKa = 11.14, instead of the standard 9.0. So the cysteine indeed became a strong base and nucleophile.
Labels:
catalytic triad,
H++,
pKa,
pKa predictor,
PROPKA,
proteases,
protein
Sunday, February 01, 2009
MIT OpenCourseWare: there is no such thing as a free lunch
The idea is great. Collect all the course materials: handouts, ppt slides, lecture notes, problem sets & answers, 'further reading lists', and so on from the brilliant professors of the MIT, and put them on the web, so that anyone can access them freely.
A generous gesture for all the students and young people in poor countries, who are eager to learn, but can't afford it. Obviously MIT has nothing to lose with this move, but can gain a lot by its PR value.
I come back to the MIT OCW website from time to time, look around, and leave disappointed. There are appealing titles that I would love to go through: like "7.343 Sophisticated Survival Skills of Simple Microorganisms", or "7.343 Neuron-glial Cell Interactions in Biology and Disease" or "7.343 Protein Folding, Misfolding and Human Disease", - just to mention a few (I am mostly browsing graduate courses in biology); - but when I want to download the 'courseware', it turns out that this includes nothing more than a meagre syllabus, a short abstract of each or some of the lectures, and a list of references to papers in scientific journals that are behind paywalls.
Especially this list of references is the most common 'courseware' that is offered, although I think that someone who goes to MIT OCW for knowledge does not have a ScienceDirect account. Obviously all these courses are taught using a PPT slideshow - why cannot they just put them online? (I dare not think of copyright issues...)
Working at a huge multinational company, comparable in size with MIT, I think I know the answer. Top managers love to be creative, they are enthusiastic about novel ideas and being a 'leader' in an area like OCW (undisputably initiated by MIT), - but they are sluggish and lazy to follow up these ideas and see that they are implemented correctly and completely.
Most probably these top officials of MIT never actually go to the OCW site and click through to see any of the course materials they are so proud of. If they did, they would probably feel cheated and angry, just as any third-world student who find nothing but a teaser in place of the knowledge they seek.
A generous gesture for all the students and young people in poor countries, who are eager to learn, but can't afford it. Obviously MIT has nothing to lose with this move, but can gain a lot by its PR value.
I come back to the MIT OCW website from time to time, look around, and leave disappointed. There are appealing titles that I would love to go through: like "7.343 Sophisticated Survival Skills of Simple Microorganisms", or "7.343 Neuron-glial Cell Interactions in Biology and Disease" or "7.343 Protein Folding, Misfolding and Human Disease", - just to mention a few (I am mostly browsing graduate courses in biology); - but when I want to download the 'courseware', it turns out that this includes nothing more than a meagre syllabus, a short abstract of each or some of the lectures, and a list of references to papers in scientific journals that are behind paywalls.
Especially this list of references is the most common 'courseware' that is offered, although I think that someone who goes to MIT OCW for knowledge does not have a ScienceDirect account. Obviously all these courses are taught using a PPT slideshow - why cannot they just put them online? (I dare not think of copyright issues...)
Working at a huge multinational company, comparable in size with MIT, I think I know the answer. Top managers love to be creative, they are enthusiastic about novel ideas and being a 'leader' in an area like OCW (undisputably initiated by MIT), - but they are sluggish and lazy to follow up these ideas and see that they are implemented correctly and completely.
Most probably these top officials of MIT never actually go to the OCW site and click through to see any of the course materials they are so proud of. If they did, they would probably feel cheated and angry, just as any third-world student who find nothing but a teaser in place of the knowledge they seek.
Labels:
Biology,
MIT,
OCW,
open courseware
Friday, January 16, 2009
GNU make: target-specific settings
I am a great fan of Make and specifically of the GNU Make: I use it as my primary scripting and building tool. Although I tried other tools like ANT, I soon got disappointed and fled back to Make.
An important feature of GNU make is the use of target-specific variable settings. This is the feature you need to create Makefiles that work for several platforms and compilers. Eg. you could write (just an oversimplified illustration):
- so that you can compile your code both with GNU and Intel compiler, depending on the main target (gcc or icc).
I use this technique all the time. But recently I found that these Makefiles are not really portable, since they don't work well on GNU make version 3.79 or earlier. And as the current version of GNU make is 3.81, we cannot say that 3.79 is a terribly old and obsolete release, and in fact lots of computers at my workplace have this version.
Actually the feature of target-specific variable settings does exist on version 3.79, the problem is that it is buggy and crashes at slight changes in the Makefile. For example, consider the following code:
which works fine with make 3.81, printing "Hello, world!", as intended. However, on 3.79, the output is "Error world!". The following Makefile (adding only the export keyword)
works fine with 3.81, but it crashes with 3.79 with the error
It is unfortunate that this bug exists in a widely distributed and used version of GNU Make, since open source projects rely on GNU make on the user's site.
An important feature of GNU make is the use of target-specific variable settings. This is the feature you need to create Makefiles that work for several platforms and compilers. Eg. you could write (just an oversimplified illustration):
gcc: CC = gcc
gcc: CCFLAGS = -g -I.. -std=c99 -pedantic -O3
gcc: CCFLAGS += -Wall -Wwrite-strings -Wmissing-format-attribute -Wstrict-aliasing
gcc: all
icc: CC = icc
icc: CCFLAGS = -g -I.. -std=c99 -fast -fbuiltin -finline -check-uninit
icc: all
all: ... ( the actual targets)
- so that you can compile your code both with GNU and Intel compiler, depending on the main target (gcc or icc).
I use this technique all the time. But recently I found that these Makefiles are not really portable, since they don't work well on GNU make version 3.79 or earlier. And as the current version of GNU make is 3.81, we cannot say that 3.79 is a terribly old and obsolete release, and in fact lots of computers at my workplace have this version.
Actually the feature of target-specific variable settings does exist on version 3.79, the problem is that it is buggy and crashes at slight changes in the Makefile. For example, consider the following code:
.PHONY: test
MYVAR = Error
test: MYVAR = Hello,
test: MYVAR += world!
test:
@echo $(MYVAR)
which works fine with make 3.81, printing "Hello, world!", as intended. However, on 3.79, the output is "Error world!". The following Makefile (adding only the export keyword)
.PHONY: test
export MYVAR = Error
test: MYVAR = Hello,
test: MYVAR += world!
test:
@echo $(MYVAR)
works fine with 3.81, but it crashes with 3.79 with the error
../3.79.1/expand.c:489: failed assertion `current_variable_set_list->next != 0'
0 [sig] make-3.79 7772 open_stackdumpfile: Dumping stack trace to make-3.79.exe.stackdump
Aborted (core dumped)
It is unfortunate that this bug exists in a widely distributed and used version of GNU Make, since open source projects rely on GNU make on the user's site.
Wednesday, January 14, 2009
Automatic Vectorization - Part 2
I wonder if automatic vectorization is also possible without the
When compiling with
We can see that four float values are added in every iteration, using the
Now if we try to compile the following, very similar C function:
the output is much more complicated, see it here. There are two loops: the L7 loop is with scalar instructions (addss), the L5 is with addps (parallel addition of 4 floats). Which one is used is decided according to the alignment of the input data: this is checked by the
instruction at the beginning.
I was not able to modify my C code so that the function arguments are aligned at 16-bytes boundaries. For example, we could try
... but it gives the same assembler code as before, checking the alignment and deciding if it is worth to vectorize the loop or not.
__attribute__((vector_size(16))) annotation, which is cheating in a way, since we give a quite explicit hint about how to pack data in vectors... The simplest example of truly automatic vectorization is this:When compiling with
gcc -O3 -msse3 -mfpmath=sse -std=c99 -S, we get this assembly output:We can see that four float values are added in every iteration, using the
addps instruction, which adds four packed single-precision floating point numbers. - If the length of the arrays is not a multiple of 4 we still obtain this vectorization using addps, up to 1020, and the last 3 elements are added using addss.Now if we try to compile the following, very similar C function:
the output is much more complicated, see it here. There are two loops: the L7 loop is with scalar instructions (addss), the L5 is with addps (parallel addition of 4 floats). Which one is used is decided according to the alignment of the input data: this is checked by the
testb $15, %blinstruction at the beginning.
I was not able to modify my C code so that the function arguments are aligned at 16-bytes boundaries. For example, we could try
... but it gives the same assembler code as before, checking the alignment and deciding if it is worth to vectorize the loop or not.
Labels:
automatic vectorization,
cpu,
GCC,
Intel,
optimization,
SIMD,
SSE
Sunday, December 28, 2008
Automatic Vectorization - does it work?
In my previous post I showed an example how smart GCC can be. But the most incredible feature of GCC is automatic vectorization. Normally, game developers etc use 'intrinsics', that is, predefined small inline functions that translate directly into SIMD instructions, to exploit vector processing capabilities of the CPU. All major compilers: GNU, Intel's, and Microsoft's, provide support for intrinsics. However, GNU is the only one (?) trying to vectorize, without human intervention, the code as a step of optimization. Of course, we need to give some hints how to group data into vectors, using the
will produce this output:
Impressive! The arguments are passed in XMM registers, so there is even no read/write from memory!
Should I worry about my job? Will compilers finally outsmart humans in producing better assembly output? Not yet... If we take another, very similar example:
- we just want vectors of 3 floats, instead of 4, which is in a way natural, living in a 3D space; we get this error message from GCC:
Oops! Good that I don't get this message from malloc() when I want to reserve lets say 95 bytes. Next, look at the following simple function:
Compiling with
What is disturbing for me is that register moves are not optimized. Basically all the
It seems that for performance critical applications or code segments we'd better do it manually, in assembly or using intrinsics.
vector_size attribute. For example, the following C function:will produce this output:
_addv4:
addps %xmm1, %xmm0
ret
Impressive! The arguments are passed in XMM registers, so there is even no read/write from memory!
Should I worry about my job? Will compilers finally outsmart humans in producing better assembly output? Not yet... If we take another, very similar example:
- we just want vectors of 3 floats, instead of 4, which is in a way natural, living in a 3D space; we get this error message from GCC:
error: number of components of the vector not a power of two
Oops! Good that I don't get this message from malloc() when I want to reserve lets say 95 bytes. Next, look at the following simple function:
Compiling with
gcc -S -O3 -msse3 -fomit-frame-pointer -foptimize-register-move givesWhat is disturbing for me is that register moves are not optimized. Basically all the
moveaps instructions could be optimized away, if we pack the floats in the right registers, something like this:It seems that for performance critical applications or code segments we'd better do it manually, in assembly or using intrinsics.
Saturday, December 20, 2008
Optimized code from GCC
Some time ago I got into the habit of writing small, simple functions in C, compiling them, and then check the assembly code output of the compiler. For the GNU C compiler, you may get the assembly listing by using the -S command-line switch. In most cases, the output is more or less what I expected. In some cases, however, the compiler excel my expectations. Here is an example:
This is a stupid, useless function that decrements its argument, and calls itself recursively. The result is obviously zero, but this result is obtained in a tedious and inefficient way. For a big input, eg. 1000000, the program obviously results in stack overflow. Compiling this function without optimization will result in this code:
We can see that _f is indeed calling _f, that is, itself. However, if we compile this function with the -O3 command line option, meaning level-3 optimization, we get a much different object code:
We can see that the whole recursivity is optimized away. Instead, the function argument is removed from the stack, the EAX register is loaded with zero, and the function returns.
I am really, totally impressed. How could the compiler find out that this function will always return zero? OK, but let us modify the function f a little bit:
Did you spot the difference? Now the input value x can be negative. In that case, the function never returns, it exhausts the stack and crashes. Let's see what GCC makes out of it:
- Exactly the same. Did it realize that there is no point to implement the case when x is negative? Or this is simply a bug in GCC? - If you have an opinion or know the answer please leave me a comment.
This is a stupid, useless function that decrements its argument, and calls itself recursively. The result is obviously zero, but this result is obtained in a tedious and inefficient way. For a big input, eg. 1000000, the program obviously results in stack overflow. Compiling this function without optimization will result in this code:
We can see that _f is indeed calling _f, that is, itself. However, if we compile this function with the -O3 command line option, meaning level-3 optimization, we get a much different object code:
We can see that the whole recursivity is optimized away. Instead, the function argument is removed from the stack, the EAX register is loaded with zero, and the function returns.
I am really, totally impressed. How could the compiler find out that this function will always return zero? OK, but let us modify the function f a little bit:
Did you spot the difference? Now the input value x can be negative. In that case, the function never returns, it exhausts the stack and crashes. Let's see what GCC makes out of it:
- Exactly the same. Did it realize that there is no point to implement the case when x is negative? Or this is simply a bug in GCC? - If you have an opinion or know the answer please leave me a comment.
Labels:
assembler,
compilers,
GCC,
optimization,
recursion,
runtime checking
CPUID: getting the brand name of the CPU
Last time I gave an example how to check SIMD support using the CPUID assembly instruction. We can use CPUID also to obtain the exact brand name of the CPU. For example, running this C program on my desktop I get
We need to invoke the CPUID instruction three times with different integer constants in EAX, and copy the 16 bytes from EAX, EBX, ECX and EDX into a char buffer. So the brand name can be 48 characters long; and in my case, it is padded by spaces.
CPU brand: Intel(R) Pentium(R) 4 CPU 2.40GHz
We need to invoke the CPUID instruction three times with different integer constants in EAX, and copy the 16 bytes from EAX, EBX, ECX and EDX into a char buffer. So the brand name can be 48 characters long; and in my case, it is padded by spaces.
Labels:
assembler,
C++,
CPUID,
Intel,
programming
Subscribe to:
Posts (Atom)
