Genome-Wide Association Studies

Dr. Teri Manolio:
— this new approach that everyone’s so excited about in terms of genome-wide association.
And I hope I’m being picked up now. And Larry had asked me to make the point that there’s
a revolution going on. So I went for the most revolutionary picture I could find, which
is Willard’s Spirit of ’76, and actually it is well worth making that point, that really
something dramatic has changed here, and Larry has talked about it a bit, and Emily, and
Francis, in that we’re much more able to scan the genome and look for differences between
people than we have ever been before. So technologic advances now allow us to measure hundreds
of thousands, now millions of variable points across the genome at a relatively low cost,
certainly not the 50 billion that Francis had mentioned earlier, probably about $500
per person, you know, depending on the platform that you use, and using relatively little
DNA. So it used to be we needed, you know, close to a microgram of DNA — and
you only get maybe 20 or 100 micrograms out of a blood sample — in order to measure really
even just one genotype. And then that went down quite a bit, but it was still so much
that you couldn’t really measure this many variants in one person without a whole lot
of blood or DNA. And now we can do all of these measurements in a microgram or even
considerably less than that. What this also means is that these technologies
can be applied to unrelated individuals. And you heard earlier that there have been many
studies funded by NIH, and many other groups around the country and the world, who basically
have identified lots and lots of people and have studied them for the development of a
whole bunch of different diseases, schizophrenia, or autism, or psoriasis, or whatever. And
those folks, many of them, have DNA stored in freezers around the country and the world,
just waiting for these kinds of technologies to be available, and now they can be applied.
We can also identify a multitude of subtle genetic effects that increase the risk of
complex diseases. And Neil Risch, who many of you may have heard speak, likes to say that
genetically complex diseases, which we describe as diseases that are due to multiple genes,
are only complex because we looked for single genes and didn’t find them so they must be
complex. And remember that when we talk about Mendelian diseases, we’re really talking about
a single gene, and those are the sort of the lamp posts that we had been studying for so
many years. So what is a genome-wide association study?
It’s a method for interrogating all of the 10 million or so variable points across the
genome. We’ve heard already that this variation is inherited in groups, luckily for us, so
we don’t have to measure all 10 million points. We can just measure a small subset of them.
These blocks are longer in people who are more closely related. They’re very long in
identical twins. In fact, they’re the entire genome. But in siblings, or parents and offspring,
they are about maybe 10 million base pairs long or perhaps a little bit longer than that,
and the less closely you’re related, the shorter those blocks are. So when you have people
who are not related at all except for 100,000 years ago when we all came out of Africa,
you do need to test more and more of these things, but you don’t have to test them all.
And so we now are able to do studies in unrelated people assuming about a 10,000 base pair length
that’s shared, and that does vary by population. So in older populations like populations in
Africa, that length is much shorter. In younger populations, American Indians, other such
populations, those lengths are much longer.
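
To make that concrete, here is a rough back-of-envelope sketch in Python. The 3-billion-base-pair genome size and the alternative block lengths are illustrative assumptions, not figures from any particular study design.

```python
# Back-of-envelope: if variation is inherited in shared blocks, you need roughly
# one "tag" SNP per block rather than all ~10 million variable points.
GENOME_LENGTH_BP = 3_000_000_000  # approximate length of the human genome

scenarios = [
    ("unrelated individuals, ~10 kb shared blocks", 10_000),
    ("older (African-ancestry) populations, shorter blocks", 5_000),   # illustrative value
    ("younger founder populations, longer blocks", 20_000),            # illustrative value
]

for label, block_bp in scenarios:
    approx_tags = GENOME_LENGTH_BP // block_bp
    print(f"{label}: ~{approx_tags:,} tag SNPs")

# With ~10 kb blocks this comes out to roughly 300,000 tags, which is why chips with a few
# hundred thousand well-chosen SNPs can capture much of the common variation.
```

One of the challenges in studying populations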
of recent African ancestry is that you do need to test more spots. And until we realize
this, a lot of times we would look, and really not see an association in Africans, and yet
we would see it in non-Africans, then we’d say, “Oh, must be something funny here.
Let’s just focus on the non-Africans.” And I think Vence will talk a little bit later
about how that’s been a challenge to deal with but one that we can deal with. So just
to kind of go back over again this concept of linkage disequilibrium, this is a paper
from the “New England Journal” just very recently, talking about what SNPs in genome-wide
association may mean for medicine, and it shows here a chromosome and pulling out a
gene. And here are various SNPs, these little red things are the exons, and you see SNP
1, 2, 3, and 4, just again a hypothetical gene. What’s shown down here, you may see
these kinds of triangles. This is very much like — you remember that AAA would give you
these maps, and they’d say, you know, “How far is it from New York to Chicago, or from
New York to San Francisco, or New York to Tokyo?” and that’s basically what this is.
So this is the correlation between SNP 1 here
and SNP 2. And when this block is very dark, it means that, “Boy, if you know SNP 1,
you can be pretty sure, like Larry’s gray sock, that other sock is probably gray.”
And SNP 1 and SNP 3 are pretty closely correlated, as are SNP 1 and SNP 4. But when you get down
here to SNP 1 and SNP 5, they’re not well correlated at all. So probably what happened
between SNPs 4 and 5 was that there was a recombination event where the DNA crossed
over, and there was some advantage, or there was not, it was just a random event, but at
any rate, those two are not well correlated. So say we just look at these SNPs, 3, 4, and
5, in this very nice diagram that they did here, showing here SNP 3 and SNP 4, this could
be a G or a T in SNP 3, a C or a T in SNP 4, but notice that every place where a person
has a G at SNP 3, they have a C at SNP 4, and every place where there’s a T at SNP 3, there’s
a T at SNP 4. So these two are very closely
correlated. If you know one, you know the other, as opposed to here, SNP 4, you know,
here sometimes you have a C and you’ve got a T, sometimes you have a C and you have an
A. Sometimes you have a T here and there’s an A, sometimes a T and a T. So those SNPs
are not well correlated. And this was the concept of tag SNPs: this one can act as
a proxy for that one, but it can’t act as a proxy for that other one, so you need a different
tag. So that’s all that we’re talking about here.
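
Here is a minimal sketch of what those triangle plots are measuring, using a handful of invented haplotypes rather than the figure’s actual data: the squared correlation (r squared) between pairs of SNPs, which is the statistic behind choosing tag SNPs.

```python
# Hypothetical haplotypes for illustration only (each dict is one chromosome copy).
# SNP3 and SNP4 travel together perfectly (G always with C); SNP5 has been separated
# from them by a historical recombination.
haplotypes = [
    {"SNP3": "G", "SNP4": "C", "SNP5": "A"},
    {"SNP3": "G", "SNP4": "C", "SNP5": "T"},
    {"SNP3": "T", "SNP4": "T", "SNP5": "A"},
    {"SNP3": "T", "SNP4": "T", "SNP5": "T"},
    {"SNP3": "G", "SNP4": "C", "SNP5": "T"},
    {"SNP3": "T", "SNP4": "T", "SNP5": "A"},
]

def r_squared(haps, snp_a, snp_b):
    """Squared correlation between two SNPs, coding one allele as 0 and the other as 1."""
    def code(snp):
        alleles = sorted({h[snp] for h in haps})
        return [alleles.index(h[snp]) for h in haps]
    a, b = code(snp_a), code(snp_b)
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return cov ** 2 / (var_a * var_b)

print(r_squared(haplotypes, "SNP3", "SNP4"))  # 1.0: SNP3 is a perfect proxy (tag) for SNP4
print(r_squared(haplotypes, "SNP3", "SNP5"))  # low: SNP5 needs its own tag
```

So, mapping those was what the HapMap did,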
was to really define which ones are closely related to which others. And while that was
going on, and partly stimulated by the HapMap project, genotyping technology
became much, much more efficient and less costly. This is a slide from my colleague,
Stephen Chanock, showing that way back in 2001, we were probably spending
about a dollar per genotype, maybe a little bit less than that, for the standard genotyping
technology of that day. And over time to 2005, those costs and use of the different platforms
have gone down, and the number of SNPs that one can measure has gone up fairly dramatically.
So here at the end of 2005 we were measuring between 100,000 and 500,000 SNPs at the cost
of about a penny a genotype, or even less. Those costs have continued to decline. This
is only through October, 2006, from my colleague, Stacey Gabriel, and I should probably update
it further, and that shows — now we’re showing these not by cost per SNP but cost per person.
So for a person’s entire genome, both sets of their DNA, initially
starting at about maybe $1,600 per person for the Affymetrix platform in July of 2005,
that has declined dramatically. And other products have come on the market that have
more and more SNPs, and Affymetrix now has one that’s a million SNPs, as does Illumina.
These probably have dropped down to about, oh, 200 or $300, you know, per person genotyped,
and the one-million-SNP ones will come down in cost as well. So this has
been really quite a dramatic change, and it has enabled us to afford these kinds of studies. Larry talked a bit about the chips, and you
see them around there. This is the data that you get off of these. So when a genotyping
lab does this, basically their computer produces a picture like this, which for SNP rs2990510,
shows you the three different genotypes. So here you have someone who’s homozygous for
one allele. I don’t know which. Here’s the heterozygote, and here’s the homozygote, and
likewise here. You can probably ignore these for the moment. But anyway, these are the
numbers of people, and shown up here is basically the intensity of the light that’s reflected
back and read by the computer. And then there’s a clustering algorithm — and these algorithms
are very important and very complicated, and they also change fairly rapidly — that tries
to basically read three different intensities, assuming that you have a SNP that is polymorphic,
so you have two different copies, you have the A and a T. You could have picked up a
sample that just by chance only had Ts. That SNP would be called monomorphic in that population,
and in that case, you should see everybody clustering at one end. Now, sometimes the
computer algorithms get confused when they see that, and they try to make it into two,
or three, or whatever. And so it’s important when you have a SNP that you’re very interested
in, you really want to take a look at these plots, you can’t look at all 300,000 or one
million, but you can look at the two, or three, or five, or 10 that you’re quite interested
in. As you can see here, these purple ones are called heterozygotes. But then there are a couple of folks that
are kind of hanging out here that the algorithm doesn’t quite know what to do with, and so
there are errors or challenges in the technology in being able to read these. These would be
called “not called,” or missing SNPs. There are different rates of missing SNPs
in different platforms, plus different genotype DNA quality will give you different rates.
And all of these things are things that are recommended to be reported in the report of
a genome-wide association study. Unfortunately, these days, the reports are so short that
you end up having to look at that in the supplementary material. But most labs that are doing these
now, you know, will report out their quality control, and it’s very, very good, on the order
of 99.7 percent fidelity for these measures.
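
As a small illustration of that kind of quality-control number, here is a sketch that computes a per-SNP call rate from invented genotype calls, with “NN” standing in for a missing call; the 95 percent cutoff is just a commonly used example, not a fixed rule.

```python
# Toy per-SNP call-rate check. In real GWAS quality control, SNPs and samples with too
# much missingness are flagged or dropped before any association testing.
calls_by_snp = {
    "rs0000001": ["AA", "AT", "TT", "AT", "AA", "AA", "AT", "TT", "AA", "AT"],
    "rs0000002": ["GG", "GG", "NN", "NN", "GC", "GG", "CC", "NN", "GC", "GG"],
}

CALL_RATE_THRESHOLD = 0.95  # an example cutoff; the exact value is a study choice

for snp, calls in calls_by_snp.items():
    called = sum(1 for c in calls if c != "NN")
    call_rate = called / len(calls)
    status = "OK" if call_rate >= CALL_RATE_THRESHOLD else "FLAG: low call rate"
    print(f"{snp}: call rate {call_rate:.1%} -> {status}")
```

So if you wanted to look at a dataset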
from a genome-wide association study, you could actually look at the Coriell Web site.
The National Institute of Neurological Disorders and Stroke (NINDS) has done a study of Parkinson’s
Disease, 297 cases, 297 controls — you can go onto their Web site, agree to keep the data
confidential, not to try to identify anyone, and to use them only for
scientific purposes, and then basically you’d have a chance to look at these data. So if you pulled up chromosome 22, which I
picked because I’m a wimp, and it’s the smallest chromosome there is, and it’s still a huge
dataset, the first two SNPs in that, and the first three cases in that dataset shown here,
and here are the alleles. So allele 1 at this SNP for person 14 is a T, allele 2 is also
a T, so they’re a homozygote. Person 20 is a heterozygote. And then for the controls,
the first three controls are shown here. And for this SNP, you’ll notice that the frequency
of the A allele is much less. It’s about eight percent, actually. And when you get these results back, they
actually give you a file that says, “Okay, in your sample we had eight percent As at
this point, we had 50 percent Ts at this point,” whatever.
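
In code, getting those allele frequencies is just a matter of tallying the two allele columns for each person. A minimal sketch, with invented rows standing in for the file layout described above:

```python
from collections import Counter

# Each row: (person_id, allele_1, allele_2) for one SNP, mimicking the layout described above.
# These rows are invented stand-ins, not the actual NINDS data.
genotypes = [
    ("case_14", "T", "T"),   # homozygote
    ("case_20", "A", "T"),   # heterozygote
    ("case_21", "T", "T"),
    ("ctrl_01", "T", "T"),
    ("ctrl_02", "A", "T"),
    ("ctrl_03", "T", "T"),
]

allele_counts = Counter()
for _person, a1, a2 in genotypes:
    allele_counts[a1] += 1
    allele_counts[a2] += 1

total = sum(allele_counts.values())
for allele, count in sorted(allele_counts.items()):
    print(f"allele {allele}: {count}/{total} = {count / total:.1%}")
```

So what you can then do is do what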
Emily and Francis were showing you, is basically count up all the cases you have, and the controls
that you have, and see how many of them have As and how many of them have Gs. And if you
were to do this — and these are totally made up data, do not report this — but suppose
you took a look at this one SNP, which was the second one that I showed you here, so
allele 2, the one that only about eight percent of people have an A at that spot, say you
took a whole bunch of people, 1,000 people that you collected from greater, you know,
Richmond, Virginia, and you genotyped them, and you find that maybe about eight percent
of them have an A, the variant allele at this particular point, so 920 of them don’t have
the A, they have the G variant there, and then you follow them forward in time, and
you say, “How many of these people actually develop Parkinson’s Disease?” And you find that, gee, of the 80, 10 of them
develop Parkinson’s Disease. And of the 920, only 40 developed Parkinson’s Disease. You
could then estimate a risk, a relative risk, it’s called, and this was sort of the standard
measure of disease risk for many, many years until we got other computer programs that
started calculating other things, which we’ll talk about. But basically you could look at
the risk in the exposed, which is 10 out of 80, or 12½ percent, compare it to the risk
in the unexposed, 40 out of 920, or 4.3 percent, and you would get a relative risk of 2.9.
So somebody carrying this A allele is 2.9 times more likely to develop disease than
somebody not carrying that allele. And that’s a measure of risk. Usually we see estimates
of things like smoking or family history in the three to fourfold range for common diseases.
The measures that we get for genes for common diseases are much less than that, usually
less than 1.5, typically in the 1.2 to 1.3 range.
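
Those made-up Parkinson’s numbers reduce to a two-line calculation; here it is as a small Python sketch using the same hypothetical counts:

```python
# Relative risk from a (hypothetical) cohort followed forward in time.
exposed_cases, exposed_total = 10, 80        # carriers of the A allele who developed disease
unexposed_cases, unexposed_total = 40, 920   # non-carriers who developed disease

risk_exposed = exposed_cases / exposed_total          # 0.125  (12.5%)
risk_unexposed = unexposed_cases / unexposed_total    # ~0.043 (4.3%)
relative_risk = risk_exposed / risk_unexposed

print(f"risk in carriers:     {risk_exposed:.1%}")
print(f"risk in non-carriers: {risk_unexposed:.1%}")
print(f"relative risk:        {relative_risk:.1f}")   # ~2.9
```

Well, there’s a measure called the odds ratio,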
which you’re much more likely to see in genome-wide association studies. And there are two reasons
for that. One is that you have to have a certain study design in order to be able to calculate
a relative risk because you have to know what the denominator of your population is. So
you had to know that there were 80 people total, of whom 10 had the disease and the
allele, and 920 total of whom a certain proportion had the allele and the disease. Sometimes
you don’t know that. In a case control study, you won’t, and we’ll talk about that in a
second. In addition, there are many modeling systems
that basically focus on the odds ratio because it’s computationally simpler. And so just
to talk about odds, and everybody really intuitively I think knows what odds are. Odds are related
to probability. They’re actually the probability of an event over the probability of it not
happening, so the probability of it happening over one minus the probability of it happening,
which is the probability of it not happening. So if the probability of a horse winning a
race is 50 percent, we all know the odds are one to one. If the probability is 25 percent,
the odds are one to three for a win, or three to one against a win, so those are odds. And
again if a probability of a person who’s exposed to a given risk factor — if their probability
of getting a disease is 25 percent, their odds are 25 percent over 75 percent, one to
three, pretty simple. When we don’t have denominators for risk estimates, which is typical in a
case control study, we calculate an odds ratio. You may have heard about this, if any of you
took sort of, you know, basic statistics long ago, as a cross-product ratio, AD over BC,
and I’ll show you a two by two table where we get these names of these cells. And again it’s computationally easier, and
if the disease is rare, the odds ratio approximates the relative risk. It always tends to overestimate
it a little bit, so the relative risk was 2.9 in that example that I showed you earlier,
and if you calculate an odds ratio instead — we call
these cells A, B, C, D, very novel — by taking the cross-product, AD over
BC, we would get an odds ratio of about 3.1, so not that far off. So here, I actually took
the data that Francis is going to show you from the Helgadottir paper,
which is under embargo — and I won’t give you the name of the SNP, but many of
you probably have already seen it anyway — and basically took the data that they had
in their tables, which you had to back-calculate a bit, but at any rate figuring out how many
— they had basically a group of cases, 1,507 people who had myocardial infarction, 6,700
who did not have myocardial infarction, but we really don’t know what the denominators
of these are. These are sort of cases that they identified through a whole series of
different studies. [low audio] Oh, and I’m sorry, this is not in the book.
We were trying to update things and be as up-to-date as we possibly could be, so my
apologies, it’s not in the book. What I had given you was a totally made up example from
the Parkinson’s data, and I thought a real example might be more fun. So, anyway, so
basically if you look at the frequencies of their alleles in their cases and controls,
you can calculate these numbers, and then you can look at the odds in the exposed. So
the odds of disease in the people with the variant allele would be 813 over 3061, right,
that’s P over one minus P, and the odds in people who did not have that allele
would be 794 over 3667. If you wanted to do this cross-product, it would be here, 813
times 3667 over 794 times 3061 equals 1.23. And in the paper they quote 1.22, so hopefully
I did my math right, and maybe we just have a little bit of a rounding error. And again,
just remember that’s embargoed until Thursday, I think.
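
Here is the same arithmetic as a small sketch, using the back-calculated counts from the table above and the A, B, C, D cell labels from the talk:

```python
# 2x2 table for one SNP, laid out as in the talk (cells A, B, C, D):
#                    cases    controls
#   variant allele   A = 813  B = 3061
#   other allele     C = 794  D = 3667
A, B, C, D = 813, 3061, 794, 3667

odds_exposed = A / B             # odds of being a case among carriers of the variant allele
odds_unexposed = C / D           # odds of being a case among non-carriers
odds_ratio = (A * D) / (B * C)   # the cross-product ratio, AD over BC

print(f"odds (variant allele): {odds_exposed:.3f}")
print(f"odds (other allele):   {odds_unexposed:.3f}")
print(f"odds ratio:            {odds_ratio:.2f}")   # ~1.23, versus the 1.22 quoted in the paper
```

So the thing that’s important, that’s, you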
know, really conceptually very, very different for somebody like me, as Larry was describing,
that we used to do these one at a time. At the very end of my talk, I have some slides
that they made me take out because they said they were boring, but what they show is basically
genotypes for one person on chromosome 22, several slides of them, basically about one
page of single-spaced Word type. For chromosome 22 for one person, what you get
from the NINDS Web site is about seven pages, and that’s only chromosome 22. So there are lots of other chromosomes that
are much bigger, and trying to manage these data is just mind-boggling, so, basically,
new approaches are needed for accessing, manipulating, and visualizing these data. And there have
been some very creative approaches to doing this. But it does require an entirely new
perspective, so we’re no longer looking under the lamppost, essentially saying, “Gee,
I know there’s a gene related to angiotensin converting enzyme, ACE, which I know is somehow
related to hypertension, so I’m going to relate my ACE gene polymorphisms to hypertension.”
And in some studies it was associated, and in some studies it was not. In this kind of
a paradigm, we’re basically saying, “We don’t know much at all about the genome. We’re
going to interrogate across the entire thing and see what sort of comes up as being associated.”
And we do have to recognize that when you do two, or five, or 10 kinds of tests like
this, or hundreds of thousands of them, it is possible that the differences we observe
just happen by chance. Differences do happen by chance. That’s why people gamble. And what
you want is to try to sort of filter out the ones that might be due to chance versus the
ones that are likely to be real. So, I’m sorry, I know it’s before lunch, and
we haven’t had a break, but I do have to give you a little bit of statistical, epidemiologic
kinds of stuff. So you probably all have heard of p-values. P-value is the probability of
finding a result as extreme or more extreme than you observed in your study, by chance
alone. We used to focus on p-values of about P less than 10 to the minus fourth, .0001.
When I was in epidemiology school, I was told, “Don’t bother to look for p-values
any smaller than that. They don’t mean anything.” No, really, this is what they used to teach
us. And then the geneticists came along and said, “Hey, we can test 100,000 or 500,000
things. We actually want to know if our p-value is
10 to the minus 10th, or the minus 20th, or the minus 30th, because we want to correct
for the number of times we’ve looked.” And when you’re looking for, you know, a million
times or so, you do want to have a much smaller p-value. You may have heard of type I error
or alpha error. This is the probability of finding a difference when really in the truth
of the universe there isn’t one. And it’s also called sort of a spurious association.
This has been the bane of what we called candidate gene association studies because, you know,
you test the ACE gene, and the angiotensinogen gene, and lots of different genes for a relationship
with hypertension, and if you did them in small samples, and you just happened to get
lucky or unlucky, you might find an association. Very few of those associations were subsequently
replicated. A type II error is the probability here —
so here you find a difference where there isn’t one, here you don’t find a difference
when really there is one. This is one that we tend to worry about a bit more because
we’re concerned that we have done a study that isn’t — basically isn’t big enough in
order to be able to detect a difference. The difference was smaller than we expected it
to be. We didn’t look hard enough, whatever. We might have missed it. Power of a study
is closely related to type II error. Basically, you know, there are two things that could
happen. If there really is a difference, you can either find it or you don’t find it.
So if you don’t find it, you’ve committed a type II error; if you do find it,
the probability of that is just one minus the type II error. So that’s the
power of your study. We usually like to have studies that are powered for about 80 percent
power so you have an 80 percent chance of picking up a difference if it’s really there.
Most people actually prefer 90 percent or even a little bit more than that to pick it
up. And then the effect size is the magnitude of risk associated with a variant. So those
are those measures that I mentioned, the relative risk, 2.9, the odds ratio, 1.23. There’s also
something called a hazard ratio that you’ll see in some of these papers, which is the
risk of a disease occurring over a given time period, and it takes into account the amount
of time it takes for a disease to develop. And just be aware that you need very large
sample sizes for these if you’re looking for very small p-values, which we are tending
to do because we make so many comparisons. If the effect size is smaller, so if you’re
looking for a 1.2-fold relative risk or a 20 percent increase in risk, you need many,
many more people to detect that than for a threefold increase in risk or a fivefold increase
in risk. Now, you might ask yourself, “Well, do I really care if the increased risk is
only 1.2?” And it used to be, again, that we would sort of say, “Mm, you know, that’s
probably not all that important,” but we are finding that there are genes that actually
are quite important pathophysiologically, and sort of as hints to treatment that have
risks about this size. So we probably do want to detect those. Allele frequency, if you
have an allele that’s only present in eight percent of the population, you’re going to
need a lot more people to be able to find the association with the gene, and the disease,
or the trait, than if it’s present in 40 or 50 percent. And the measure that you’re measuring,
if it’s very variable, if you have a lot of error in your measurement, it’s going to be
more difficult to separate out your groups with one variant versus another, even if that
variant is having an effect on that measure. So the more variable a measure is, the harder
that is; blood pressure is a very variable measure, it changes minute
to minute, essentially. So that’s another challenge in needing a large sample size.
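
To give a feel for how sample size, effect size, allele frequency, and the p-value threshold interact, here is a minimal simulation sketch, with illustrative assumptions rather than any published power calculation: it converts an assumed odds ratio and control allele frequency into a case allele frequency, simulates allele counts, and asks how often a basic allelic chi-square test clears a stringent threshold.

```python
import numpy as np
from scipy.stats import chi2_contingency

def simulated_power(n_cases, n_controls, control_freq, odds_ratio,
                    alpha=5e-7, n_sims=2000, seed=1):
    """Fraction of simulated case-control studies whose allelic 2x2 test has p < alpha."""
    rng = np.random.default_rng(seed)
    # Convert the control allele frequency plus an odds ratio into a case allele frequency.
    control_odds = control_freq / (1 - control_freq)
    case_odds = odds_ratio * control_odds
    case_freq = case_odds / (1 + case_odds)

    hits = 0
    for _ in range(n_sims):
        case_alleles = rng.binomial(2 * n_cases, case_freq)          # risk alleles in cases
        control_alleles = rng.binomial(2 * n_controls, control_freq)  # risk alleles in controls
        table = [[case_alleles, 2 * n_cases - case_alleles],
                 [control_alleles, 2 * n_controls - control_alleles]]
        _, p, _, _ = chi2_contingency(table)
        hits += (p < alpha)
    return hits / n_sims

# A modest effect (odds ratio 1.2) at 40% allele frequency needs far more people than a strong one.
for n in (1_000, 5_000, 20_000):
    print(n, "cases and controls, OR 1.2:", simulated_power(n, n, 0.40, 1.2))
print(1_000, "cases and controls, OR 3.0:", simulated_power(1_000, 1_000, 0.40, 3.0))
```

Simulation is used here only because it makes the assumptions explicit; analytic power formulas would give similar answers.

And you’ll see displays like this. This was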
probably the first and best-known truly genome-wide association study published by Klein, et al.,
looking at age-related macular degeneration, and what they did was plot with these little
lines here every spot along the genome that they had tested, 100,000 of them. This is
the log — sorry — minus log 10 of P, so remember your logs were those exponents, so
if you’re looking at a p-value of 10 to the minus fourth, the minus log 10 of that would
be four, and so here is your four level, here is your six level. And just above six, 10
to the minus seventh is where this one that turned out to be very, very suspicious for
being causative, complement factor H, was hiding. So that’s one way of looking at them.
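
The arithmetic behind that threshold line and the vertical axis is simple enough to show directly. In this sketch, the 100,000-test count is the figure quoted for this study, and the 4.8 times 10 to the minus 7 p-value is the complement factor H result discussed later; the rest is a Bonferroni-style division and a log transform.

```python
import math

n_tests = 100_000
nominal_alpha = 0.05

# Bonferroni-style genome-wide threshold: divide the nominal threshold by the number of tests.
genome_wide_threshold = nominal_alpha / n_tests                       # 5e-7
print("threshold:", genome_wide_threshold,
      "-> plotted height:", -math.log10(genome_wide_threshold))       # ~6.30

# The complement factor H signal in the age-related macular degeneration scan:
p_hit = 4.8e-7
print("observed p:", p_hit,
      "-> plotted height:", -math.log10(p_hit))                       # ~6.32, just above the line
```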
This is another way, much more colorful, from the Broad/MIT folks,
looking at their diabetes scan, and basically what they did was to color-code by chromosomes.
And as you can see, this chromosome is very big. This gap here is the centromere,
where you can’t measure it, and then they get smaller and smaller, and here’s my
chromosome 22 way down there, but anyway, looking at the SNPs that are associated, and
just plotting the minus log of the p-value. So here’s one that’s really, really associated
very strongly, at least highly unlikely to be due to chance, could be due to things like
genotyping error, or it could be due to things like having picked a sort of funky population.
So you need to be able to replicate them. But at least it’s not due to chance.
Okay. This same group published recently a genome-wide scan for prostate cancer. What
you’ll sometimes see is that instead of showing you the entire genome, although often they’ll
do this, they’ll say, “You know, this area looks very interesting. We think it’s interesting
because there’s a clump of them and because we know that this particular chromosome,”
which Francis will talk about, “is related to a disease, whatever it might be.” And
that was done here for prostate cancer. This is the 8q24 region, which has everyone sort
of scratching their heads: “Why is it that this is related to prostate cancer
when there don’t seem to be any genes there?” And it’s probably because we know so little
about the genome that we’ll learn a great deal about it from this kind of example. What
they did was to look at this particular area. Here is the SNP that they found most strongly
associated with it. And then basically, statistically, they adjusted for the presence of this SNP.
So if you’re using a model, you’re calculating out an odds ratio for each one of these things,
here’s the p-value for that odds ratio. And once you basically hold this constant,
statistically, then all the rest of these kind of fall down to the bottom. Their association
is much less strong, because they’re correlated with this one. And then this one becomes
the next most strongly associated, and once you adjust for that, then all of the rest
of these kind of fall down below the threshold. And they did this about five times. It’s really
a very nice progression in the paper, Haiman et al., in Nature Genetics. And this is one
that you’ll see from Francis in a little bit. Sorry to have stolen
it from you. We’re looking at chromosome 11 again. Here’s a SNP that was of great interest.
Here are a bunch of SNPs that are associated with it. Once you adjust for that, all of
the rest of these fall out. And here’s one of those triangle AAA diagrams that we showed
you previously. This shows why, that basically there’s a strong block of linkage disequilibrium.
All of these things are correlated with each other. And that’s basically what you’re picking
up with this one SNP.
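
Here is a minimal sketch of what “adjusting for” the top SNP means in practice, using simulated genotypes and a logistic regression from statsmodels. The data and variable names are invented, and the published analyses are considerably more involved.

```python
# Conditional-analysis sketch: test a second SNP while holding the top SNP constant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 4000

# Simulate a causal "top" SNP (0/1/2 risk-allele counts), a second SNP that mostly tracks it
# through linkage disequilibrium, and a case/control outcome driven only by the top SNP.
top_snp = rng.binomial(2, 0.3, size=n)
ld_noise = rng.binomial(2, 0.3, size=n)
second_snp = np.where(rng.random(n) < 0.8, top_snp, ld_noise)   # ~80% copies the top SNP
logit_p = -1.0 + 0.5 * top_snp
disease = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

def snp_p_value(covariates):
    """P-value for the last column of `covariates` in a logistic model for disease."""
    X = sm.add_constant(np.column_stack(covariates))
    fit = sm.Logit(disease, X).fit(disp=0)
    return fit.pvalues[-1]

print("second SNP alone:                 p =", snp_p_value([second_snp]))
print("second SNP, adjusted for top SNP: p =", snp_p_value([top_snp, second_snp]))
# Once the top SNP is held constant, the second SNP's association largely falls away,
# because its signal was just linkage disequilibrium with the real one.
```

Okay. I mentioned about how candidate gene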
associations have had some challenges in replicating
their findings. Did you see here, 600 associations, only six of them were significant in more
than three studies. This is a nice paper by Joel Hirschhorn. And this is not to say that
candidate gene studies are bad. What it is to say is that it’s very easy for us to find
spurious associations when we only look once or twice. And what this taught us was that
when we start doing things like genome-wide association, we have to replicate multiple
times. And as we’ve seen, replication really now is considered to be the sine qua non.
So you’ll see these papers coming through where they’ve done three, or four, or five
studies at a time showing yes, it does replicate in all of these populations. So large sample
sizes, multiple studies are needed to replicate the findings. These produce massive datasets,
the analysis requires a huge and a very specialized effort, and better analytic methods are needed.
And we recognize that if we make these data widely available, that will stimulate the
development of these methods. In addition, once you measure somebody’s genome, you can relate
it to anything. So you’ve already got it measured, you can look at their height, you can look
at their weight, you can look at lots and lots of different things. So these datasets are very rich, and one of
the things that we are focusing on a great deal at NIH is making sure that the datasets
are made available to lots of different investigators so that you don’t have sort of this syndrome
of this first fly on a beached whale who lands on it and says, “Dibs, this is all mine.”
And we certainly don’t want that happening with genome-wide studies. And there has been
a tendency for that to happen. So we are pushing very hard to make these data widely available.
So the revolution is probably here. Extensive characterization is now possible. It can be applied
to unrelated individuals to find putative genetic causes of diseases. Many existing
studies are out there, basically waiting for this technology to be applied to them. But
we do need new approaches to manipulating the data. And we need responsible approaches
to sharing data, so that participants are protected, and the investigators who produce
the datasets also get some recognition for their efforts. And we believe strongly that
collaboration for both replication of findings and investigation of function is absolutely
crucial. So I think at that point I’ll stop, and be happy to take questions. Okay. Yes,
Joe. Male Speaker:
I’m just curious, since the p-values are so important in terms of giving some sort of
a credibility, and you know that you’re getting multiple comparisons, so you have to have
smaller p-values, isn’t there some statistic that accommodates — Dr. Teri Manolio:
Sure. Male Speaker:
— the fact that you are doing — I mean, how does that work? Dr. Teri Manolio:
Sure, well, there are a couple of different ways. People debate, you know, what’s the
best way to correct for that. It’s very interesting, you know, how we ever got to
that — back in the olden days, a p-value cutoff was
.05, and if you had a chance of less than one in 20 of picking up a difference totally
by chance, people were sort of comfortable with that. But where did that come from? Well,
the way it was explained to me is that, if you flip a coin, when you think about flipping
a coin, and you get heads, and you say, “Well, you know, I could have gotten heads, anyway,
it’s about 50 percent.” And you flip it again, and you get heads, and you say, “Well,
you know, 25 percent chance of that.” And you flip it again, and you get heads, and
you kind of say, “Mm, ah, it’s a little bit odd.” But you do it a fourth time, and
you get heads, now you want to look at the coin, okay? So that’s something that’s unusual.
And that’s 6.25 percent. So maybe that’s how we got to five percent being a level that people
were uncomfortable attributing to chance. But, you know, it really is totally arbitrary. And when we
look more than one time, we may say, “Well, you know, if I actually checked five percent,
you know, I take these five percent of differences as being statistically significant or unlikely
to have happened by chance,” well, if I do 20 differences, you know, out of 20 roughly
on average one of them is going to be — you know, appear to be different, even though
it really isn’t. So maybe I need to do — I need to correct
for that, for the number of times that I’ve looked. And one way that people do that is
to divide the p-value threshold by the number of times that you’ve looked. That’s called a Bonferroni
correction. It’s thought to be very conservative, because it assumes that every test you’re
doing is independent. And these tests are not independent, because they’re all stuck
on — you know, many of them are stuck on the same chromosomes, so there are other ways
of doing this. You can permute the genotypes basically so you say, “I’m going to randomly
generate genotypes and see how often, just by — where I know that it’s random — how
often I see an association with my trait.” And that’s another way of correcting.
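
Here is a minimal sketch of that permutation idea for a single SNP, with invented data; a real genome-wide procedure permutes the phenotype labels once per permutation and re-tests every SNP, so the correlation structure among SNPs is preserved.

```python
# Permutation sketch: shuffle the case/control labels to see how extreme an association
# statistic gets when we know there is no true genotype-phenotype relationship.
import random

random.seed(42)

# Invented data: 0/1/2 risk-allele counts and case (1) / control (0) labels.
genotypes = [random.choice([0, 1, 2]) for _ in range(200)]
phenotypes = [1] * 100 + [0] * 100

def mean_difference(genos, phenos):
    """Absolute difference in mean allele count between cases and controls."""
    cases = [g for g, p in zip(genos, phenos) if p == 1]
    controls = [g for g, p in zip(genos, phenos) if p == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

observed = mean_difference(genotypes, phenotypes)

n_perm = 10_000
more_extreme = 0
shuffled = phenotypes[:]
for _ in range(n_perm):
    random.shuffle(shuffled)                      # break any genotype-phenotype link
    if mean_difference(genotypes, shuffled) >= observed:
        more_extreme += 1

# Empirical p-value: how often chance alone does at least as well as the observed statistic.
print("observed difference:", observed)
print("permutation p-value:", (more_extreme + 1) / (n_perm + 1))
```

There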
are a couple of others. Dr. Francis Collins:
But actually — Male Speaker:
Go back to the microphone. Dr. Francis Collins:
Page back to your slide, to the — Dr. Teri Manolio:
Microphone, Dr. Collins. Dr. Francis Collins:
Yes, thank you. [low audio] Please page back to the macular degeneration slide.
You just went past it. [low audio] So see that p-value of 4.8 times 10 to the
minus seven, where does that come from? Well, that comes from the fact that in this study
they tested about 100,000 SNPs, and they were assuming sort of this conservative correction,
so they said they wanted to achieve, effectively, a p-value of .05, but they’re doing 100,000
independent tests, so they’ve got to divide .05 by 100,000, and that gives you that dashed
line. So they were arguing that any result that fell above that, that is, that the p-value
was even better, was likely to be significant and not noise, and anything that fell below
that dotted line might be significant but you haven’t proved it. Dr. Teri Manolio:
Yes. Male Speaker:
Yeah, you said that, that — a p-value like that wouldn’t — can’t happen by chance
alone, but — Dr. Teri Manolio:
I didn’t say that. It wouldn’t be very likely, but if you did,
you know, 10 million tests, you might come across one like that. Male Speaker:
Okay, but actually — but you said, “But it could be a genotyping error.” I guess
I don’t know, what is a genotyping error? Dr. Teri Manolio:
Well, as I showed you before with these calling algorithms, sometimes they get confused, and
particularly if you genotype your cases and your controls differently – sorry — in
sort of different batches. So when you’re doing this test, say that, for some reason,
your controls — the DNA is different, it’s come from a different source, a
buccal source instead of a blood source, you know, for whatever reason,
when you do this test instead of getting these nicely separated, they’re actually more
of a smear together. So just from that kind of an error, you could sort of generate a
difference between them where, really, there is no difference. That’s a technological
error. I’m not explaining it well, and I’m sorry. I’m caffeine deprived. But also you
could get errors — it is possible or conceivable, and maybe, Larry, you’d want to comment
on this as well, that there might be other genes or other variants in the region that
would interfere with this, that might be related to cases, and not in controls, or something.
What other kinds of genotyping errors can give you spurious associations? Larry Thompson:
I think those are the — Dr. Teri Manolio:
Those are the main ones? Larry Thompson:
— the main ones. I mean, some things don’t behave well in the assay, in cases and controls,
and get tossed out. They never get into the final dataset. The other issue that is the
population stratification one, that you might want to discuss — Dr. Teri Manolio:
Oh, yes. Larry Thompson:
— because that’s another place where you can produce false positive results that Teri
can explain. Dr. Teri Manolio:
Right, and population stratification is another really horrible name for differences in the
sort of ancestry between cases and controls. So say you selected all your cases — you
know, not that anyone would do this but it does happen sometimes in not very good studies
— say you selected all of your cases from people from Finland, and all of your controls
from people from Japan, they’re going to have different allele frequencies just because
of the population history, and so any differences between them, if there are differences in
disease as well, you’re going to start ascribing those to the disease, where they might not
be related at all. This actually has been used as a way of finding genes that might
be related to diseases, called admixture mapping, and it has been a technique that’s been
used in the past. It’s not used very often anymore, but that’s another thing that could
cause [unintelligible]. If there are systematic differences between your cases and controls,
those are — you know, old-time epidemiologists call these confounders, and it’s just confounding
by a genotype instead of an environmental exposure. Does that
help? Female Speaker:
Could you go back to the science embargoed slide and talk slowly about the probability
issue again and how you got to where you did? Dr. Teri Manolio:
I can go back. Whether I can talk slowly is something [inaudible]. We tried, and tried,
and just never quite gotten to it, but at any rate, what we have here — what they basically
published in the paper or in the table, you may have it there, is the number of cases,
the number of controls, and the allele frequencies. So for this particular SNP they gave the number
of cases was the 1,507, the number of controls was 6,728, and then they said that
.453 of the cases had the allele A, and point-whatever of the controls had allele
A. So basically what I do is just take those proportions and multiply them by 1,507, and
figure out how many people were both cases and had allele A, how many people
were controls and had allele A, subtracted that from this total number to get this number,
and this from that total number to get that number, and then did the cross-product. [low audio] Sorry, it was allele A in the other example
on [unintelligible]. Does that help? Female Speaker:
Think so. Dr. Teri Manolio:
Okay, yeah, it’s just a matter of filling out [unintelligible]. Yes? Female Speaker:
I think I’m still not understanding why it is that genome-wide association studies
are, as I think you’re saying, less likely to find spurious associations than the earlier,
you know, single candidate gene approach. Could you take another stab at that? Dr. Teri Manolio:
Yes, sure. Please don’t leave this room thinking that genome-wide association studies
are less likely to find spurious associations. They are more likely to, because you’re
doing many more tests. The reason now that we’re less likely, hopefully, to find spurious
associations is that we recognize that so many of them are possibly spurious that we
do replications of them, and we require replication, sometimes even in the
same populations, more likely a different population, but in a completely different study.
And so you may see in one of the Science papers — you know, there were several different
groups around Canada, and Texas, and different places. Some of them used different phenotypes,
which is a little bit risky. Some used MI, and some used coronary calcification, and,
you know, if the gene that you’re looking at is related to both MI and coronary —
myocardial infarction and coronary calcification, then, you know, you’re golden, and it actually
gives you more reassurance that we’re finding something that’s likely to be important.
Yes, Steve [spelled phonetically]. Male Speaker:
Another reason why candidate gene studies were particularly likely to give false positives,
is because there were so few true positives, right? So, if you’re assuming that the genes
that are on your short list must contain some that are actually right, and you keep trying
over and over again, sooner or later, by chance, you’re going to get one that looks encouraging.
There’s a natural pressure, of course, to publish something that looks positive. You
say, “Well, let’s put it out there and see if anybody can validate it.” In most
of those instances, the validation didn’t happen, so you ended up with a paper reporting
the finding, and then a paper refuting the finding. So we filled up the literature. One
of the things that’s different here is with a genome-wide association study for most of
the diseases we go after, if you have a sufficient number of samples, there are going to be true
positives, and as long as you’re rigorous about your statistics and making sure your
cases and controls were well-matched, you’re going to have real results to be able to write
about. And so when you see those publications coming out, it’s likely, if they did everything
right, that the top tier of what they have found will be validated. And I think that’s
certainly been true in the last month or so with these particular studies. So, again,
if you’re trying to do a study where you know you’re doomed to failure, but there’s
still a pressure to publish, there’s going to be stuff coming out. If you have a study
where you’re almost certainly going to find something, then if you do it right you’ll
publish something that’s right. Dr. Teri Manolio:
I think a point that’s important to make as well is in these replication studies, very
often it’s not the most strongly associated SNP in the first study that actually survives
to be replicated. There are those who argue that the ones with estimates
that are really, really extreme are the ones you should be nervous about, whereas others
say, “Oh, no, I like the ones that are associated very strongly.” So regardless, when you
do the replication study, very often it’s a different one that replicates. So
you take a large number of them, replicate them in a second sample, and then a smaller
number of those that are associated in both studies, and replicate them in a third, and
maybe even a fourth, and a fifth before you say, “You know, this looks pretty good.
I think I’ll buy it.” And then still you put it out in the literature and, you know,
20 groups try and replicate it, and hopefully most of them do.
