【RAD-seq 解析】ipyrad、やってみた

わたしは今までRAD-seq データからのSNP コールにはstacks を使っていたのですが、系統比較にはipyrad のほうが強いかも、という話を聞いてipyrad も使ってみることにしました。stacks の使い方をまだまとめていないのに先にipyrad 書いてしまって後輩各位には申し訳ないのですが、使いながら書いていくのが一番楽なので、こっちを先に書きます。Stacks は今しばらくお待ちください。

というわけで、今回使ったソフトウェアはこちら。

github.com

#install

#一応仮想環境を新しく構築

conda create -n ipyrad -y

conda activate ipyrad

conda install ipyrad -c conda-forge -c bioconda

#parameter file 作成

#hoge は自分で考えたアセンブリ名

ipyrad -n hoge

パラメーターの説明一覧

Assembly: Parameters — ipyrad documentation

------- ipyrad params file (v.0.9.14)-------------------------------------------
iptest                         ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/deren/Documents/ipyrad/tests ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8                              ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l                        ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

## [7] data type

paired ddrad type (2 different cutters) pairgbs

paired gbs type (1 cutter cuts both ends)

ペアエンドシーケンスを行った時はpairddrad もしくはpairgbs とする！

これを見落としていて痛い目を見ました
## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
デマルチプレックスとフィルタリングに使用される制限酵素配列

制限酵素認識配列ではなく、この配列のうち、消化後にシークエンシングされたリードに付着したままの部分を入力する

生データフィルタリング後にやるときは不要

## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read

低クオリティの塩基が最大いくつ続くか

## [13] [maxdepth]: Max cluster depth within samples

繰り返し配列を除くのに役立つが、ただの高カバレッジ領域を除去してしまうリスクもある

## [14] [clust_threshold]: Clustering threshold for de novo assembly

2 つの配列が相同であると識別され、したがってクラスタリングされる配列類似度のレベル、0.95以上は非推奨
## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes

バーコードファイル内のバーコードとシーケンシングされたリードで見つかったバーコードとの間に許容される不一致の最大数、すでに個体ごとのソートが終わっている場合には関係なし
## [18] [max_alleles_consens]: Max alleles per site in consensus sequences

シーケンシングエラーを考慮した後、(個々の)consensリードで許可されるユニークな対立遺伝子の最大数、デフォルトは2（二倍体向け）
## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus

コンセンサス配列に対し、いくつの"N"を許容するか　デフォルトは0.05
## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus

コンセンサス配列に許容されるヘテロ接合塩基の最大割合　デフォルトは0.05
## [21] [min_samples_locus]: Min # samples per locus for output

最終的なデータセットにデータを保持するために、指定された遺伝子座でデータを保持しなければならない最小のサンプル数

データセットの全サンプル数に等しい数値を入力すると、すべてのサンプルでデータが共有されている遺伝子座のみを返す
## [22] [max_SNPs_locus]: Max # SNPs per locus

最終的な遺伝子座で許容されるSNPの最大数

これは、この数以上の SNP を持つ遺伝子座を除外することで、最終データセットの反復領域におけるアライメント不良の影響を除去可能、デフォルトは0.2（かなり高いらしい）

※要検討の値
## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus

遺伝子座における共有多型部位の最大数（または割合）

このオプションは、多くのサンプルで共有されたヘテロ接合部位は、真のヘテロ接合部位ではなく、固定された差を持つパラログのクラスタリングを表している可能性が高いため、潜在的なパラログを検出するために使用される

デフォルトは0.5

## [27] [output_formats]: Output formats (see docs)

ipyrad/output_formats.rst at master · dereneaton/ipyrad · GitHub

p: PHYLIP (Full dataset)
s: PHYLIP (SNPs only)
u: PHYLIP (One SNP per locus)
n: NEXUS
k: STRUCTURE
g: EIGENSTRAT .geno
G: G-PhoCS
v: VCF (SNPs only)

様々な解析に必要なVariant Call Format *.vcf.gz

STRUCTURE 用に一遺伝子座当たり一つのSNP をランダム抽出したい場合には*.u.str を指定

STRUCTURE *.str & *.u.str

There is an additional *.u.geno file output that includes only unlinked SNPS, with one SNP being randomly chosen per locus and the rest ignored.

実行

# select a params file (-p) and steps to run (-s) for this assembly

ipyrad -p params-hoge.txt -s 1234567

このs はどのステップを行うか、です。全部のステップを踏むときは1234567とします。

参考：Assembly: Seven Steps — ipyrad documentation

最後に、インプットファイル名についてですが、ペアエンドの場合は_R1_, _R2_ を含み、それ以外の部分は同じ名前にしないと解析が進みません。わたしはそういう初歩的なミスで３時間を溶かしました。同じ目に合わないように、違う形式になっている人はあらかじめrenameコマンドで名前を変更しておきましょう。

-s 1234567で実行して、次の日に見たら止まっていました。

…

[####################] 100% 1:58:08 | aligning clusters

Encountered an Error.
Message: OSError: [Errno 12] Cannot allocate memory
Error: ipcluster shutdown and must be restarted
Parallel connection closed.

メモリ不足？

aligning clusters、すなわちステップ３までは終わっているようなので、４から再開。

20210211

できました！こんな感じでhoge_outfiles というフォルダが作られて、中にアウトプットファイルが入っています。

f:id:nemunemu_nyanko:20210213151718p:plain

ただ、わたしはvcfファイルとStructure 用インプットが欲しいので、

## [27] [output_formats]: Output formats (see docs) に*.u.str 、vを足してステップ７だけやり直します。

ipyrad.assemble.utils.IPyradError:
Warning: Population assignment file not found. This must be an
absolute path (/home/wat/ipyrad/data/my_popfile.txt) or relative to
the directory where you're running ipyrad (./data/my_popfile.txt)

集団情報がないって警告されました。ただ、今回は全個体を一集団として解析したいので一旦無視します。

もし、細かい集団に分けて比較を行いたいときは ## [28] [pop_assign_file]: Path to population assignment file に集団情報ファイルを追加します。

集団情報ファイルの作り方は28. pop_assign_file を確認してください。

vcf ができたので、TASSEL というソフトウェアで開いてみると

f:id:nemunemu_nyanko:20210215154848p:plain

やたらとNが多い…。

集団情報を入れなかったから？

仕方がないので集団情報をつくってやってみました。やや面倒です。

＃example (Assembly: Parameters — ipyrad documentation)

Sample1 pop1
Sample2 pop1
Sample3 pop1
Sample4 pop2
Sample5 pop2

# pop1:2 pop2:2

公式に乗っている例は上記のようにスペース区切りでしたが、stacks と同じでタブ区切りで保存したデータでも問題なかったので、stacks のときにExcel で作っておいた集団情報をベースに作ることができました。

f:id:nemunemu_nyanko:20210216115903p:plain — Excel で作っておいた集団情報の表

ただ、面倒なのが例の一番下に#付きで示されている集団ごとのmin_samples_locus value（例では# pop1:2 pop2:2 と表記されています）で、これは集団ごとに「個体数*共有割合＝共有個体数」を出して：の後ろに記入しなくてはならないのです。この手順がやや面倒でした。stacks だとpopulations の-r オプションで共有割合を打ち込むだけで済むので。

また、stacks のときは集団を番号表記しましたが、ipyradでは例通り、「pop (group number)」表記をしました。popつけなかった場合は試してないのでどうなるのかはわかりません。

この集団情報をパラメーターファイルの## [28] [pop_assign_file]: Path to population assignment file に追加して実行した結果が以下です。

f:id:nemunemu_nyanko:20210216120741p:plain

うむ、N だらけの状況は改善されました。

しかし、SNP が少なすぎるので改善する必要がありますね…先は長い。