Extract noncall snps from soy
Extract the SNPs that are not called in Soy
From the .ped
file, only the soy samples were extracted, and call rate calculated like this:
<source lang='bash'> ./plink --file soy --out soy --missing-genotype N --freq --noweb </source>
The output data looks like this:
CHR SNP A1 A2 MAF NCHROBS 1 MARC0044150 0 0 NA 0 1 ASGA0000014 A C 0.5 26 1 ASGA0000021 G T 0.375 16 1 ALGA0000009 G A 0.1333 30 1 ALGA0000014 T C 0.5 28 1 H3GA0000032 A C 0.5 38 1 ASGA0000005 C A 0.4 20 1 INRA0000015 A G 0.4643 28 1 M1GA0000060 T C 0.06667 30
Subsequently, only SNPs with less than 4 calls (i.e. 2 individuals) were extracted. This amounts to only a single Soy sample being called for that SNP. <source lang='bash'> cat soy.frq | sed 's/ \+/\t/g' | sort -k6 -nr | awk '$6<4' | cut -f3 >noncallsnps2.txt </source> Note that the 3rd column is selected because the first column is shifted due to leading whitespace.
Make reduced SNP set in Plink
From the original SNP set (bones2.ped), that contains all samples (including the soy samples) for 49802 SNPs that map to autosomes in Ssc10.2, a reduced SNP set is made that only contains a subset of 8552 SNPs for the same individuals (including soy). <source lang='bash'> ./plink --file bones2 --out bones2_reduced --missing-genotype N --noweb --extract noncallsnps2.txt --recode </source>
Calculate heterozygosities from .ped
file
Python3 script to calculate heterozygosity from .ped
file:
<source lang='python'>
import argparse
import sys
import os
parser = argparse.ArgumentParser( description='calculate heterozygosities for individuals in a ped file') parser.add_argument("-p", "--pedfile", help="name of pedfile", nargs=1)
args = parser.parse_args() ped=args.pedfile[0]
def pedfile_to_list_of_lists(pedfile):
flist=[] with open(pedfile) as pedf: for l in pedf.readlines(): l=l.rstrip().split() flist.append(l[:]) return flist
def calc_het(geno):
het=0 hom=0 for i in range(0,len(geno),2): pair=geno[i:i+2] if pair[0] == pair[1] and pair[0] != 'N': hom+=1 elif pair[0] != pair[1] and pair[0] != 'N': het+=1 obshet=het/(het+hom) return [obshet,het,hom]
if __name__=="__main__":
allgenos=pedfile_to_list_of_lists(ped) for genos in allgenos: het=calc_het(genos[6:]) print(genos[0],genos[1],het[0],het[1],het[2],sep='\t')
</source>
The script is invoked like this:
<source lang='bash'> python3 het_stats.py -p bones2_reduced.ped </source>
The output will look something like this:
BABA HAPmap3_box3_G9_BABA_29 0.003549362 27 7580 BABA HAPmap3_box3_H9_BABA_31 0.007805386 60 7627 BK22 WG0093132-DNAH08_SSBK22M01 0.247806248 2118 6429 BK23 WG0093131-DNAA02_SSBK23F09 0.243493552 2077 6453 DU22 WG0093131-DNAC01_SSDU22F01 0.287118287 2454 6093 DU23 WG0093132-DNAD11_SSDU23M06 0.290866511 2484 6056 GL100 HAPmap5_Box7_B03_SSRB417_U01 0.137879911 186 1163 GL101 HAPmap5_Box7_C03_SSRB420_U01 0.010806217 89 8147 GL102 HAPmap5_Box7_D03_SSRB421_U01 0.014296464 114 7860
See also
Bioinformatics tips, tricks, workflows at ABGC