Extract noncall snps from soy

From HPCwiki
Jump to navigation Jump to search

Extract the SNPs that are not called in Soy

From the .ped file, only the soy samples were extracted, and call rate calculated like this:

<source lang='bash'> ./plink --file soy --out soy --missing-genotype N --freq --noweb </source>

The output data looks like this:

CHR           SNP   A1   A2          MAF  NCHROBS
  1   MARC0044150    0    0           NA        0
  1   ASGA0000014    A    C          0.5       26
  1   ASGA0000021    G    T        0.375       16
  1   ALGA0000009    G    A       0.1333       30
  1   ALGA0000014    T    C          0.5       28
  1   H3GA0000032    A    C          0.5       38
  1   ASGA0000005    C    A          0.4       20
  1   INRA0000015    A    G       0.4643       28
  1   M1GA0000060    T    C      0.06667       30

Subsequently, only SNPs with less than 4 calls (i.e. 2 individuals) were extracted. This amounts to only a single Soy sample being called for that SNP. <source lang='bash'> cat soy.frq | sed 's/ \+/\t/g' | sort -k6 -nr | awk '$6<4' | cut -f3 >noncallsnps2.txt </source> Note that the 3rd column is selected because the first column is shifted due to leading whitespace.

Make reduced SNP set in Plink

From the original SNP set (bones2.ped), that contains all samples (including the soy samples) for 49802 SNPs that map to autosomes in Ssc10.2, a reduced SNP set is made that only contains a subset of 8552 SNPs for the same individuals (including soy). <source lang='bash'> ./plink --file bones2 --out bones2_reduced --missing-genotype N --noweb --extract noncallsnps2.txt --recode </source>

Calculate heterozygosities from .ped file

Python3 script to calculate heterozygosity from .ped file: <source lang='python'> import argparse import sys import os

parser = argparse.ArgumentParser( description='calculate heterozygosities for individuals in a ped file') parser.add_argument("-p", "--pedfile", help="name of pedfile", nargs=1)

args = parser.parse_args() ped=args.pedfile[0]

def pedfile_to_list_of_lists(pedfile):

 flist=[]
 with open(pedfile) as pedf:
   for l in pedf.readlines():
     l=l.rstrip().split()
     flist.append(l[:])
 return flist

def calc_het(geno):

  het=0
  hom=0
  for i in range(0,len(geno),2):
     pair=geno[i:i+2]
     if pair[0] == pair[1] and pair[0] != 'N':
        hom+=1
     elif pair[0] != pair[1] and pair[0] != 'N':
        het+=1
  obshet=het/(het+hom)
  return [obshet,het,hom]

if __name__=="__main__":

 allgenos=pedfile_to_list_of_lists(ped)
 for genos in allgenos:
    het=calc_het(genos[6:])
    print(genos[0],genos[1],het[0],het[1],het[2],sep='\t')

</source>

The script is invoked like this:

<source lang='bash'> python3 het_stats.py -p bones2_reduced.ped </source>

The output will look something like this:

 BABA	HAPmap3_box3_G9_BABA_29	0.003549362	27	7580
 BABA	HAPmap3_box3_H9_BABA_31	0.007805386	60	7627
 BK22	WG0093132-DNAH08_SSBK22M01	0.247806248	2118	6429
 BK23	WG0093131-DNAA02_SSBK23F09	0.243493552	2077	6453
 DU22	WG0093131-DNAC01_SSDU22F01	0.287118287	2454	6093
 DU23	WG0093132-DNAD11_SSDU23M06	0.290866511	2484	6056
 GL100	HAPmap5_Box7_B03_SSRB417_U01	0.137879911	186	1163
 GL101	HAPmap5_Box7_C03_SSRB420_U01	0.010806217	89	8147
 GL102	HAPmap5_Box7_D03_SSRB421_U01	0.014296464	114	7860

See also

Bioinformatics tips, tricks, workflows at ABGC

External links

Plink website and manual