微生物组溯源分析¶

介绍两款发表在Nature Methods上的R包，用于扩增子数据的溯源分析或者污染分析。

SourceTrack¶

SourceTracker软件于2011年被发表在Nature Methods[https://www.nature.com/articles/nmeth.1650]上，由圣地亚哥大学的Scott教授及其团队完成。

该软件称目标样本为Sink，微生物污染源或来源的样品为Source；基于贝叶斯算法，探究目标样本（Sink）中微生物污染源或来源（Source）的分析。根据Source样本和Sink样本的群落结构分布，来预测Sink样本中来源于各Source样本的组成比例。

其官方网站为：danknights/sourcetracker (github.com)

软件有Qimme脚本和R脚本两种形式，建议直接使用R脚本。

metadata手动生成，otu表格必须是纯整数型变量

#Sampleid	Description	Env	SourceSink	Study	Details
Run20100430_H2O-1	PCR	water 1	sink	Lab 1	PCR_water_sample_1_2010_04_30_run
Run20100524_ESC_C-3ss	Swab 2	Lab 2	sink	Lab 2	Sterile_swab_sample_3_2010_05_24_run
F11Fcsp	Gut 1	Gut	source	1
F12Fcsp	Gut 2	Gut	source	1
F13Fcsp	Gut 3	Gut	source	1

# This runs SourceTracker on the original "contamination" data set
# (data included in 'data' folder)

# load sample metadata
metadata <- read.table('data/metadata.txt',sep='\t',h=T,row.names=1,check=F,comment='')

# load OTU table
# This 'read.table' command is designed for a 
# QIIME-formatted OTU table.
# namely, the first line begins with a '#' sign
# and actually _is_ a comment; the second line
# begins with a '#' sign but is actually the header
otus <- read.table('data/otus.txt',sep='\t', header=T,row.names=1,check=F,skip=1,comment='')
otus<-otus[rowSums(otus)>10,]
otus <- t(as.matrix(otus))

# extract only those samples in common between the two tables
common.sample.ids <- intersect(rownames(metadata), rownames(otus))
otus <- otus[common.sample.ids,]
metadata <- metadata[common.sample.ids,]
# double-check that the mapping file and otu table
# had overlapping samples
if(length(common.sample.ids) <= 1) {
    message <- paste(sprintf('Error: there are %d sample ids in common '),
                    'between the metadata file and data table')
    stop(message)
}

# extract the source environments and source/sink indices
train.ix <- which(metadata$SourceSink=='source')
test.ix <- which(metadata$SourceSink=='sink')
envs <- metadata$Env
if(is.element('Description',colnames(metadata))) desc <- metadata$Description


# load SourceTracker package
source('src/SourceTracker.r')

# tune the alpha values using cross-validation (this is slow!)
# tune.results <- tune.st(otus[train.ix,], envs[train.ix])
# alpha1 <- tune.results$best.alpha1
# alpha2 <- tune.results$best.alpha2
# note: to skip tuning, run this instead:
alpha1 <- alpha2 <- 0.001

# train SourceTracker object on training data
st <- sourcetracker(otus[train.ix,], envs[train.ix])

# Estimate source proportions in test data
results <- predict(st,otus[test.ix,], alpha1=alpha1, alpha2=alpha2)

# Estimate leave-one-out source proportions in training data 
results.train <- predict(st, alpha1=alpha1, alpha2=alpha2)

# plot results
labels <- sprintf('%s %s', envs,desc)
plot(results, labels[test.ix], type='pie')

# other plotting functions
# plot(results, labels[test.ix], type='bar')
# plot(results, labels[test.ix], type='dist')
# plot(results.train, labels[train.ix], type='pie')
# plot(results.train, labels[train.ix], type='bar')
# plot(results.train, labels[train.ix], type='dist')

# plot results with legend
# plot(results, labels[test.ix], type='pie', include.legend=TRUE, env.colors=c('#47697E','#5B7444','#CC6666','#79BEDB','#885588'))

FEAST¶

2019年来自加州大学洛杉矶分校的Eran Halperin教授在Nature Methods上发表了快速准确的微生物来源追溯工具FEAST[][FEAST: fast expectation-maximization for microbial source tracking | Nature Methods](https://www.nature.com/articles/s41592-019-0431-x]

快速准确的微生物来源分析一直是本领域的难点，之前发布的SourceTracker仍有速度慢，准确率不高的问题；
本文提出一种新的方法FEAST，可以实现快速、更准确的微生物来源追踪；
软件基于R语言开发，保证了方法跨平台的可用性；
应用于婴儿和厨房两个微生物组项目，结果的微生物来源解释比例更合理；
此方法在分类问题中，也比JSD、加权UniFrac指标有更好的AUC值，在医学诊断中有更好的应用前景。

本R包安装比SourceTrack要复杂一些

注意meta信息需要确定是study是1对1还是1对多，1对多sink study id不要重复，source id留空白。

Installation¶

FEAST will be available on QIIME 2 very soon. Until then you can you can simply install FEAST using devtools:

devtools::install_github("cozygene/FEAST")

Usage¶

As input, FEAST takes mandatory arguments:

C - An m by n count matrix, where m is the number samples and n is the number of taxa.
metadata - An m by 3 table, where m is the number of samples. The metadata table has three columns (i.e., 'Env', 'SourceSink', 'id'). The first column is a description of the sampled environment (e.g., human gut), the second column indicates if this sample is a source or a sink (can take the value 'Source' or 'Sink'). The third column is the Sink-Source id. When using multiple sinks, each tested with the same group of sources, only the rows with 'SourceSink' = Sink will get an id (between 1 - number of sinks in the data). In this scenario, the sources’ ids are blank. When using multiple sinks, each tested with a distinct group of sources, each combination of sink and its corresponding sources should get the same id (between 1 - number of sinks in the data). Note that these names must be respected.
EM_iterations - A numeric value indicating the number of EM iterations (default 1000).
COVERAGE - A numeric value indicating the rarefaction depth (default = minimal sequencing depth within each group of sink and its corresponding sources).
different_sources_flag - A Boolean value indicating the source-sink assignment. different_sources_flag = 1 if different sources are assigned to each sink , otherwise = 0.
dir_path - A path to an output.txt file.
outfile - the prefix for saving the output file.

Value:

FEAST returns an S1 by S2 matrix P, where S1 is the number sinks and S2 is the number of sources (including an unknown source). Each row in matrix P sums to 1. Pij is the contribution of source j to sink i. If Pij == NA it indicates that source j was not used in the analysis of sink i. FEAST will save the file "demo_FEAST.txt" (a file containing matrix P) .

Demo¶

We provide a dataset for an example of FEAST usage. Download the demo files here.

First load the FEAST packages into R:

library(FEAST)

Then, load the datasets:

metadata <- Load_metadata(metadata_path = "~/FEAST/Data_files/metadata_example_multi.txt")
otus <- Load_CountMatrix(CountMatrix_path = "~/FEAST/Data_files/otu_example_multi.txt")

Run FEAST, saving the output with prefix "demo":

FEAST_output <- FEAST(C = otus, metadata = metadata, different_sources_flag = 1, dir_path = "~/FEAST/Data_files/",
                      outfile="demo")
F

FEAST will then save the file demo_FEAST.txt - A file containing an S1 by S2 matrix P, where S1 is the number sinks and S2 is the number of sources (including an unknown source). Each row in matrix P sums to 1.

Graphical representation:

As input, PlotSourceContribution takes mandatory arguments:

SinkNames - A vector with the sink names to plot.
SourceNames - A vector with all the sources' names.
Same_sources_flag - A Boolean value indicating the source-sink plotting assignment. Same_sources_flag = 1 if the same sources are assigned to the pre-defined sink samples , otherwise = 0.
dir_path - A path to an output .png file.
mixing_proportions - A list of vectors, where entry i corresponds to the vector of source contributions (summing to 1) to sink i.
Plot_title - Plot's title and output .png file's name.
N - Number of barplots in each output .png file.

PlotSourceContribution(SinkNames = rownames(FEAST_output)[c(5:8)],
                       SourceNames = colnames(FEAST_output), dir_path = "~/FEAST/Data_files/",
                       mixing_proportions = FEAST_output, Plot_title = "Test_",Same_sources_flag = 0, N = 4)

Input format¶

The input to FEAST is composed of two tab-delimited ASCII text files:

(1) count table - An m by n count matrix, where m is the number samples and n is the number of taxa. Row names are the sample ids ('SampleID'). Column names are the taxa ids. Every consecutive column contains read counts for each sample. Note that this order must be respected.

count matrix (first 4 rows and columns):

	ERR525698	ERR525693	ERR525688	ERR525699
taxa_1	0	5	0	20
taxa_2	15	5	0	0
taxa_3	0	13	200	0
taxa_4	4	5	0	0

(2) metadata - An m by 3 table, where m is the number of samples. The metadata table has three columns (i.e., 'Env', 'SourceSink', 'id'). The first column is a description of the sampled environment (e.g., human gut), the second column indicates if this sample is a source or a sink (can take the value 'Source' or 'Sink'). The third column is the Sink-Source id. When using multiple sinks, each tested with the same group of sources, only the rows with 'SourceSink' = Sink will get an id (between 1 - number of sinks in the data). In this scenario, the sources’ ids are blank. When using multiple sinks, each tested with a distinct group of sources, each combination of sink and its corresponding sources should get the same id (between 1 - number of sinks in the data). Note that these names must be respected.

*using multiple sinks, each tested with the same group of sources:

SampleID	Env	SourceSink	id
ERR525693	infant gut 2	Sink	2
ERR525688	Adult gut 1	Source	NA
ERR525699	Adult gut 2	Source	NA
ERR525697	Adult gut 3	Source	NA

*using multiple sinks, each tested with a different group of sources:

SampleID	Env	SourceSink	id
ERR525688	Adult gut 1	Source	1
ERR525691	Adult gut 2	Source	1
ERR525699	infant gut 2	Sink	2
ERR525697	Adult gut 3	Source	2
ERR525696	Adult gut 4	Source	2

Output -

infant gut 2	Adult gut 1	Adult gut 2	Adult gut 3	Adult skin 1	Adult skin 2	Adult skin 3	Soil 1	Soil 2	unknown
5.108461e-01	9.584116e-23	4.980321e-12	2.623358e-02	5.043635e-13	8.213667e-59	1.773058e-10	2.704118e-14	3.460067e-02	4.283196e-01