April 30, 2014

Nature Communications, the Genographic Project, Elhaik et al. re-discover zombies, the Oracle, etc. 3 years after the fact...

... and (sadly) do not care to cite my lowly blog.

From the new paper's Methods:
To infer the putative ancestral populations, we applied ADMIXTURE46 in an unsupervised mode to the filtered data set. This analysis uses a maximum likelihood approach to determine the admixture proportions of the individuals in question assuming they emerged from K hypothetical populations. We speculated that our method will be the most accurate when populations have uniform admixture assignments. In choosing the value of K that seemed to best satisfy this condition, we experimented with different Ks ranging from 6 to 12. We identified a substructure at K=10 in which populations appeared homogeneous in their admixture composition. Higher values of K yielded noise that appeared as ancestry shared by very few individuals within the same populations. ADMIXTURE outputs the speculated allele frequencies of each SNP for each hypothetical population.  
Using these data, we simulated 15 samples for each hypothetical population and plotted them in a PCA analysis with the Genographic populations. We observed that two hypothetical populations were markedly close to one another, suggesting they share the same ancestry and eliminated one of them to avoid redundancy. The remaining nine populations were considered the putative ancestral populations and were used in all further analyses.   
Given nine admixture proportions for a sample of unknown geographic origin obtained using ADMIXTURE’s supervised approach with the nine putative ancestral populations, we calculated the Euclidean distance between its admixture proportions and the N reference populations (GEN). All reference populations were sorted in an ascending order according to their genetic distance from the sample.
I'm sure my readers, and users of DIYDodecad, know exactly why this is a carbon copy of the tools I developed for the Dodecad Project. But, in any case...
The most exciting use of "zombies" is to convert unsupervised ADMIXTURE runs into supervised ones. In unsupervised mode, ADMIXTURE treats all individuals alike, and tries to infer their ancestral proportions. In supervised mode, some individuals are treated as "fixed" (belonging 100% in one of K ancestral components), and the ancestry of the rest is inferred.  
The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; whenever you want to estimate admixture proportions in new samples run supervised ADMIXTURE analysis using the zombie populations.
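A minimal sketch of the zombie-generation step, under stated assumptions: the allele frequencies are taken to be an (SNPs x K) array such as the one ADMIXTURE writes to its `.P` file, each zombie genotype is drawn as two independent allele copies per SNP (a Hardy-Weinberg assumption not stated in the text), and the function name `simulate_zombies` and its parameters are hypothetical. The default of 15 zombies per component simply mirrors the number in the Methods quoted above.

```python
import numpy as np

def simulate_zombies(freqs, n_per_component=15, seed=0):
    """Simulate 'zombie' individuals from per-component allele frequencies.

    freqs: array of shape (n_snps, K); freqs[j, k] is the frequency of the
           counted allele at SNP j in ancestral component k (assumed layout).
    Returns (genotypes, labels): genotypes has shape
    (K * n_per_component, n_snps) with allele counts 0/1/2, and labels gives
    the component index each zombie was drawn from.
    """
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    n_snps, k = freqs.shape

    genotypes = []
    labels = []
    for comp in range(k):
        # Two independent allele draws per SNP: Binomial(2, p).
        g = rng.binomial(2, freqs[:, comp], size=(n_per_component, n_snps))
        genotypes.append(g)
        labels.extend([comp] * n_per_component)

    return np.vstack(genotypes), np.array(labels)
```

The simulated genotypes would then be written out in whatever format the supervised run expects, with each zombie assigned 100% to its own component; that export step is not shown here.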
... and the first post on the Oracle which shows how to find proximity to a population by calculating Euclidean distance in the space of admixture proportions between reference populations and a test individual (and also considers mixtures of populations).
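The distance calculation can be sketched as follows. This is an illustration only, not the Oracle code itself: the function names are hypothetical, each reference population is assumed to be summarized by a single vector of admixture proportions, and the two-way mixture fit uses a simple least-squares projection, which may differ from how any existing tool handles mixtures.

```python
from itertools import combinations

import numpy as np

def rank_populations(sample, references):
    """Sort reference populations by Euclidean distance to the sample.

    sample: sequence of K admixture proportions for the test individual.
    references: dict mapping population name -> sequence of K proportions.
    Returns a list of (name, distance), closest first.
    """
    sample = np.asarray(sample, dtype=float)
    dists = [
        (name, float(np.linalg.norm(sample - np.asarray(p, dtype=float))))
        for name, p in references.items()
    ]
    return sorted(dists, key=lambda item: item[1])

def best_two_way_mixtures(sample, references, top=3):
    """Find the pairs of populations whose mixture best matches the sample.

    For each pair (A, B), choose the fraction alpha in [0, 1] that minimizes
    the distance between the sample and alpha * A + (1 - alpha) * B.
    Returns up to `top` tuples (name_a, name_b, alpha, distance).
    """
    sample = np.asarray(sample, dtype=float)
    results = []
    for (name_a, a), (name_b, b) in combinations(references.items(), 2):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        direction = a - b
        denom = float(direction @ direction)
        if denom == 0.0:
            alpha = 0.5  # identical populations: any split fits equally well
        else:
            # Project the sample onto the segment between B and A.
            alpha = float(np.clip((sample - b) @ direction / denom, 0.0, 1.0))
        mix = alpha * a + (1.0 - alpha) * b
        results.append((name_a, name_b, alpha, float(np.linalg.norm(sample - mix))))
    return sorted(results, key=lambda item: item[3])[:top]
```

Calling `rank_populations` gives the single-population ranking in ascending order of distance, and `best_two_way_mixtures` gives the closest two-population blends for the same individual.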

I am flattered that the zombie approach has been copied and tested, but I doubt that all of the paper's 32 authors were unaware of the previous publication of the gist of their "new" method.

Nature Communications 5, Article number: 3513 doi:10.1038/ncomms4513

Geographic population structure analysis of worldwide human populations infers their biogeographical origins

Eran Elhaik et al.

The search for a method that utilizes biological information to predict humans’ place of origin has occupied scientists for millennia. Over the past four decades, scientists have employed genetic data in an effort to achieve this goal but with limited success. While biogeographical algorithms using next-generation sequencing data have achieved an accuracy of 700 km in Europe, they were inaccurate elsewhere. Here we describe the Geographic Population Structure (GPS) algorithm and demonstrate its accuracy with three data sets using 40,000–130,000 SNPs. GPS placed 83% of worldwide individuals in their country of origin. Applied to over 200 Sardinians villagers, GPS placed a quarter of them in their villages and most of the rest within 50 km of their villages. GPS’s accuracy and power to infer the biogeography of worldwide individuals down to their country or, in some cases, village, of origin, underscores the promise of admixture-based methods for biogeography and has ramifications for genetic ancestry testing.

Link

15 comments:

Davidski said...

Interestingly, here's what I wrote in a blog entry in March last year.

"Another way to look at it is that the ancestry proportions are like map coordinates, and they'll place you with a very high degree of accuracy on a genetic map featuring other users."

http://bga101.blogspot.com.au/2013/03/eurogenes-k36-at-gedmatch.html

The plot thickens...

bellbeakerblogger said...

We used Dodecad this last summer. Thanks for making it available and user-friendly.

Jacques Beaugrand said...

The moral is 'publish in well recognized journals otherwise you'll simply get ignored'.


Matt said...

I really agree it's very unfortunate they didn't cite your pioneering of the zombies technique (simulating individual genotypes based on admixture data).
Getting to the meat of what they did with the zombies, IIRC, as I read it, was:

Generate them based on Admixture. Apply PCA to the zombies. Generate a set of distances between zombies from the (presumably scaled and rotated) PCA distances. Apply distances to plot the "position" of the real samples using admixture fractions (e.g. samples are plotted as a mix of the position of the components that make up the sample).

They called this "GPS". It seems basically a way of using Admixture to "clean" up genetic signals that act as "noise" when extracting geographical signals and other signals that "compress" populations spatially together in a normal PCA.

It's an interesting idea, if you're looking for a fit that represents geography, but I think you could argue it obscures true "relatedness" more than it removes error (I don't understand the math side of this well enough to comment), and there is a certain introduction of more arbitrariness in the choice of reference populations for the supervised admixture.

Did you (or do you intend to) carry out exactly the same kind of "GPS" analysis with Dodecad or World9 or Globe13? If so, that would be interesting to look at as a comparison.

Eran Elhaik said...

Hi Dienekes, following my original email to you, looks like I wasn't clear enough, so perhaps a clarification is warranted.

The GPS algorithm is the novel method and it was NOT invented by you. The GPS algorithm relies on admixture coefficients calculated using putative ancestral populations, which, as I wrote to you, bear resemblance to your zombies. They were invented independently from your zombies, though looking at the timeline, you came up with the concept first. There is still a long way to walk from the concept of Zombies to predicting biogeography as accurately as we did. Looking at your posts, geographic origin is not mentioned at all, but rather ancestry calculations. The GPS code is available in our paper if you wish to challenge that statement.

Again, you came up with a great concept (the Zombies)! and I regret it could not be acknowledged somehow. But it is important to clarify that the GPS algorithm is the novel invention and not the zombies as may be understood from your post.

Adrian Purcell Heathcote said...

Speaking as someone who has had ideas stolen several times I think you should do something about this. Write to Nature, at the very least.

Dienekes said...

@Eran Elhaik

Your paper uses at least two ideas that were first published by me: converting unsupervised ADMIXTURE runs into supervised ones via "zombies"; testing a population's similarity to a reference panel by calculating Euclidean distance over the space of admixture coefficients and finding the closest matching population.

Prior work should be cited when it presents a method that is used in the current work. For example, you cite ADMIXTURE which is a component in your current work. You should have done the same for the ideas used in your paper that were previously published by myself.

You can argue independent invention, but this is hard to believe given the timeline, the known readership of my blog, and the fact that my ideas have spread beyond it and have been used by other genome bloggers and third party tools have been developed around them.

In any case, even if you came up with these concepts independently, it is still proper form to cite prior work that is relevant (which this clearly is). And, if you really "were informed of your work only at the reviewer stage and acknowledgment was not allowed", then you can always write a letter to the editor acknowledging the prior publication of part of your method.

Palisto said...

"Did you (or do you intend to) carry out exactly the same kind of "GPS" analysis with Dodecad or World9 or Globe13? If so, that would be interesting to look at as a comparison."

@Matt
I don't quite see the difference between the described "GPS algorithm" and calculating the "Biogeographical Ancestry using Dodecad Globe13 data"

http://kurdishdna.blogspot.com/2012/11/biogeographical-ancestry-using-dodecad.html

MOCKBA said...

Isn't Elhaik's paper also largely bragging about the performance of an algorithm over exactly the same dataset on which it was optimized? They do mention how much worse the performance is on other datasets further down in the paper, but all in all it looks like a PR piece rather than a discovery...
But following up on some questions and references, I spotted the following intriguing study of phylogeny of weaving culture in the SE Asia ... have you reviewed it before, Dienekes?

ikat weaving: rooted in Bronze Age or Neolithic?

Rokus said...

I am sure the GPS algorithm is a novel method, else it shouldn't have been published as such. Too late for co-authorship, probably a missed chance for the credibility of the paper, though here Eran Elhaik even admits that Dienekes came up with the concept first. Notwithstanding all possible improvements or increased resolution, the authors owe the courtesy to mention this in the text as a historical note, or at least in Acknowledgements. Once my son was mentioned just for being a helpful student (10.1021/pr4005629). Clearly understated, still this is how scientific progress works.

PF said...

Perhaps this is a cost of anonymity.

Also: I wouldn't expect much from Mr. Elhaik. He clearly was pushing the whole Khazar thing out of some twisted personal/political, rather than scientific, motivations.

Matt said...

@Palisto I don't quite see the difference between the described "GPS algorithm" and calculating the "Biogeographical Ancestry using Dodecad Globe13 data"

That method you described didn't immediately spring to mind earlier... Still if I understand the methods correctly:

- The "Biogeographical Ancestry using Dodecad Globe13 data" compares subject proportions to best matches or combinations of best matches within the set of analysed populations, then uses the global latitude and longitude of the best match(es) to generate a predicted latitude and longitude for the sample.

- "GPS" just seems to be PCA on zombies generated using ADMIXTURE to generate a PCA that better fits to geography than using a PCA directly on the samples.

The "Biogeographical Ancestry using Dodecad Globe13 data" would also give you an output latitude and longitude for a sample, which might even be better, but that's not really the "point" of GPS at all, and is kind of more of a relatively trivial side effect of it.

The point of GPS is to give a better set of allele frequency dimensions that map to geography, which can then be used for various forms of geographic genetic analysis (looking for signals of selection, disease, etc.).

I do think this "GPS" seems kind of a low hanging fruit once the zombies idea is out there, so I would be surprised if no genome blogger has looked into this before (but this appears to be the case).

Matt said...

Actually, on reflection, I'm wrong here, having read again, it does seem that the method is only a mere patching together of latitude and longitude data using references and clusters. That does seem unoriginal and not very useful.

Razib Khan said...

bet you more human genomicists read this blog than 99% of human genomics papers. some perspective.

Charles Nydorf said...

You deserve to be credited.