FORMAT DESCRIPTION

We describe four different types of files:

  1. ContigCov.txt   : describes each chromosome with a list of coordinates of the aligned regions.
  2. SNPlist.txt     : list of the SNPs and INDELs with external
  3. exon_intron.txt : details of the cDNA-to-genome alignment.
  4. GeneSNPcon.txt  : association of SNPs with gene structure etc., indicating functional context.

----------------------------------------------------------------------------
Please email chickvd@genomics.org.cn if you have any problem with the format.
----------------------------------------------------------------------------

-----------------
1. ContigCov.txt
-----------------

Documentation source:
  The file shows a list of coordinates of the aligned regions. We used reads
  (or contigs assembled with phrap) from one of three strains to align to a
  chromosome of the RJF genome assembly.

*** ContigCov.txt format and data structure ***

Example:
  Covered Region List
  6906 7483
  7895 8377
  8654 9168

An aligned region is a region where at least one read is aligned to the RJF genome assembly.

Notes:
  filename = the name of the chr.
  6906     = start position of the aligned region in chr coordinates.
  7483     = end position of the aligned region in chr coordinates.

----------------
2. SNPlist.txt
----------------

Documentation source:
  SNPs identified, with a detailed information list. This format provides the
  details of each SNP; the individual fields are described in the column list below.

*** SNPlist.txt format and data structure ***

Example:
  snp.0.1001.817.1 S 817 161 97 51 A->C TGTTAACAATGAGTAACAAAATCAGGCAGGGCTCTAGTAA TGTTAACAATGAGTAACAACATCAGGCAGGGCTCTAGTAA rchsxb0_104909.y1.scf F BLS TCTACTCAGTTTTGGATCATTTC CTAGGAATTGCCTCAGTGAC 398 123 H Y

Notes:
  1. filename : the name of the chr where a SNP is identified.
  2. Columns are separated by "\t".
  3. The detailed information of each SNP is presented in columns:
  Column  symbol           Description
  1   snpId            : a unique name assigned to each SNP
            snp.10.103.59910.S.1:
              10.103 indicates Contig.10.103
              59910  indicates the base position in Contig.10.103
              S      indicates the type of polymorphism
              1      strain name and version number,
                     1: Broiler strain
                     2: Layer strain
                     3: Silkie strain
                     4: Wag_BES_BAC
  2   SNP type         : S: substitution; I: insertion; D: deletion
  3   posChr           : position of each SNP in chromosome coordinates.
            Note that insertion-deletion positions are given as two numbers.
            For example,

            1-bp insertion
                                    * *
            RJF sequence    TCCAGAATA-CAGATTTTGTACAGGCATACAGCCTG
            Layer sequence  TCCAGAATAGCAGATTTTGTACAGGCATACAGCCTG

            3-bp deletion
                                                 *   *
            RJF sequence    TCCAGAATACAGATTTTGTACAGGCATACAGCCTGG
            Layer sequence  TCCAGAATACAGATTTTGTACA---ATACAGCCTGG

  4   posReads         : position of each SNP in read coordinates.
            For indels, see the description in posChr.
  5   qualChr          : quality value of each SNP in the chromosome
  6   qualReads        : quality value of each SNP in the reads
  7   Changes          : "C->G": change of the sequence, from the reference (genome)
            sequence to the strain-specific (read) sequence.
            If the SNP type is insertion or deletion (indel), the
            inserted/deleted sequence is given.
  8   flankingSeqLeft  : a 20-bp sequence, left of the SNP position on the reference sequence side.
  9   flankingSeqRight : a 20-bp sequence, right of the SNP position on the reference sequence side.
  10  ReadsName        : the name of the read (or contig assembled with phrap).
            The SNP is found by comparison between the reference contigs (RJF) and the reads.
  11  strand           : "F" means forward, "R" means reverse; relation between genome and reads.
  12  occurStr         : "BLS" means this SNP was found in the B (Broiler), L (Layer) and S (Silkie) strains.
            "BL-": "-" means this position was covered but has no SNP in the S (Silkie) strain.
            "BLX": "X" means this position was not covered by the S (Silkie) strain.
  13  primerL          : the sequence of the left primer
  14  primerR          : the sequence of the right primer
  15  ampliconSize     : the expected amplicon size of the primers
  16  distanceL        : distance from the left primer to the SNP or indel site
  17  questionM        : all SNPs are marked "H"; all dubious indels are marked "L",
            other indels are marked "H"
  18  transposonM      : "Y" means this position was masked by RepeatMasker; "N" means it was not.
  19  reservePos       : column reserved for further use

--------------------
3. exon_intron.txt
--------------------

Documentation source:
  Gene location in a chicken chromosome with map information.

*** exon_intron.txt format and data structure ***

Example:
  gnl|UG|Gga#S7086537 chr1 + 166344623 166349403 166344623 166349403 3 166344623,166345115,166346191, 166344782,166345511,166349403,

Notes:
  1. The details of the location information:

  Column  symbol           Description
  1   GeneName         : the name of the gene
  2   chrNum           : chromosome number, location in chromosome
  3   strand           : + or - for the strand, +: forward; -: reverse
  4   txStart          : transcription start position in chromosome coordinates
  5   txEnd            : transcription end position in chromosome coordinates
  6   cdsStart         : coding region start position in chromosome coordinates
  7   cdsEnd           : coding region end position in chromosome coordinates
  8   exonCount        : number of exons
  9   exonStartList    : exon start positions, separated by commas
  10  exonEndList      : exon end positions, separated by commas
  11  identity         : identity of gene location

-------------------
4. GeneSNPcon.txt
-------------------

Documentation source:
  Association of SNPs with gene structure etc., indicating functional context.

*** GeneSNPcon.txt format and data structure ***

The file is constructed with the following format:

  1.  Rows start with one of the following keywords:
      GN, GP, S5, S3, U5, U3, CS, CN, CF, IR, SS
  2.  Fields are delimited by the tab ('\t') character.
  3.  Each "GN" row gives the name of the gene in which the SNPs are located.
  4.  Each "GP" row gives the gene's location in the genome.
  5.  Each "S5" row gives a SNP located in the 5' upstream region.
  6.  Each "S3" row gives a SNP located in the 3' downstream region.
  7.  Each "U5" row gives a SNP located in the 5' UTR.
  8.  Each "U3" row gives a SNP located in the 3' UTR.
  9.  Each "CS" row gives a SNP located in the coding region, where the SNP is a synonymous SNP.
  10. Each "CN" row gives a SNP located in the coding region, where the SNP is a non-synonymous SNP.
  11. Each "CF" row gives a SNP located in the coding region, where the SNP causes a frame shift.
  12. Each "IR" row gives a SNP located in an intron region.
  13. Each "SS" row gives a SNP located in a splice site region.
      Splice site region means the splice site (GT/AG) position.

The lines and fields reported in the flatfile format are:

  Keyword  Description
  GN       Gene name
  GP       Gene position in genome
  S5       5' upstream
  S3       3' downstream
  U5       5' UTR
  U3       3' UTR
  CS       coding - synonymous
  CN       coding - non-synonymous
  CF       coding - frame shift
  IR       intron
  SS       splice site

== symbol format example ===

  GN
  GP
  S5  n/a  n/a  n/a
  S3  n/a  n/a  n/a
  U5  n/a  n/a  n/a
  U3  n/a  n/a  n/a
  CS
  CN
  CF  n/a  n/a  n/a
  IR  n/a  n/a  n/a
  SS  n/a  n/a

  symbol       Description
  GeneName   : the name of the gene
  chrNum     : chromosome number, location in chromosome
  txStart    : transcription start position in chromosome coordinates
  txEnd      : transcription end position in chromosome coordinates
  strand     : + or - for the strand, +: forward; -: reverse
  identity   : identity of gene location
  partM      : "C" means the start codon and stop codon could be found in the genome.
               "P" means they could not.
  snpId      : a unique name assigned to each SNP
               snp.10.103.59910.S.1:
                 10.103 indicates Contig.10.103
                 59910  indicates the base position in Contig.10.103
                 S      indicates the type of polymorphism
                 1      strain name and version number,
                        1: Broiler strain
                        2: Layer strain
                        3: Silkie strain
  SNP type   : S: substitution; I: insertion; D: deletion
  posGene    : position of each SNP in gene coordinates.
               posGene = 0 in S5, S3, IR, SS.
  posChr     : position of each SNP in chromosome coordinates
  qualChr    : quality value of each SNP in the chromosome
  qualReads  : quality value of each SNP in the reads
  advChanges : "C->G": change of the sequence, from the reference (contig)
               sequence to the strain-specific (read) sequence.
               If the SNP type is insertion or deletion (indel), the
               inserted/deleted sequence is given.
               The indel sequence is reversed when a gene is on the - strand.
               Note: treat genes as forward strand.
  advCondonChanges  : e.g. "CCT->CGT": change of the codon, from CCT to CGT
               (CCT in chr, CGT in reads).
               For places where cDNA and genome disagree by an indel,
               IGNORE these codons, as we did in the rice SNP analysis.
  advCondonChanges2 : e.g. "Phe->Ser": change of the amino acid encoded by the
               codon, here from TTT to TCT (TTT in chr, TCT in reads).
               For places where cDNA and genome disagree by an indel,
               IGNORE these codons, as we did in the rice SNP analysis.
  phase      : phase of the SNP position in the codon
  ssChanges  : changes at a splice site, such as GT -> TT (or AG -> AT)
  SSIndel    : indels in a splice site
  otherInfo  : fields containing PCR primer sequences and all the other
               information from the SNP tables, such as flanking sequences,
               read names, strand, occurStr, questionM, transposonM, etc.
               Pay attention to the flanking sequences: the reverse sequence
               is displayed when the gene is on the reverse strand. The strand
               is changed, too; here strand indicates the relationship between
               reads and genes. These fields are described under SNPlist.txt.
  SIFTresult : for each non-synonymous SNP, we ran SIFT to determine the
               likelihood of it being functional, based on the degree of
               conservation across species. The result is in the last column,
               for example "[A102V TOLERATED 0.27 3.00]".
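
The snpId, occurStr and SIFTresult fields described in this document are compact strings that usually need decoding before use. The following Python sketch shows one possible way to do so. It is not part of the original tools: the function and class names are hypothetical, it assumes that the type letter in snpId is optional (the SNPlist.txt example "snp.0.1001.817.1" has none), that "-" and "X" may appear for any of the three strains in occurStr, and it leaves the two trailing numbers of the SIFT result unnamed because this document does not define them.

```python
import re
from typing import Dict, NamedTuple, Optional, Tuple


class SnpId(NamedTuple):
    contig: str              # e.g. "10.103" for Contig.10.103
    position: int            # base position in the contig
    snp_type: Optional[str]  # "S", "I", "D", or None when absent from the id
    strain_code: str         # "1" Broiler, "2" Layer, "3" Silkie, "4" Wag_BES_BAC


def parse_snp_id(snp_id: str) -> SnpId:
    """Split an id such as snp.10.103.59910.S.1 into its parts."""
    parts = snp_id.split(".")
    if len(parts) < 5 or parts[0] != "snp":
        raise ValueError(f"unexpected snpId: {snp_id!r}")
    strain_code = parts[-1]
    rest = parts[1:-1]
    snp_type = None
    # Assumption: the type letter is optional and sits just before the strain code.
    if rest[-1] in ("S", "I", "D"):
        snp_type = rest.pop()
    position = int(rest[-1])
    contig = ".".join(rest[:-1])
    return SnpId(contig, position, snp_type, strain_code)


def parse_occur_str(occur: str) -> Dict[str, str]:
    """Map each strain letter (B, L, S) to its status in an occurStr value."""
    if len(occur) != 3:
        raise ValueError(f"unexpected occurStr: {occur!r}")
    status = {}
    for strain, char in zip("BLS", occur):
        if char == strain:
            status[strain] = "snp"
        elif char == "-":
            status[strain] = "covered_no_snp"
        elif char == "X":
            status[strain] = "not_covered"
        else:
            raise ValueError(f"unexpected occurStr: {occur!r}")
    return status


# Pattern for a value such as "[A102V TOLERATED 0.27 3.00]".
_SIFT_RE = re.compile(r"\[([A-Z])(\d+)([A-Z])\s+(\S+)\s+([\d.]+)\s+([\d.]+)\]")


def parse_sift_result(text: str) -> Tuple[str, int, str, str, Tuple[float, float]]:
    """Return (ref_aa, aa_position, alt_aa, prediction, (number1, number2))."""
    match = _SIFT_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unexpected SIFTresult: {text!r}")
    ref_aa, pos, alt_aa, prediction, n1, n2 = match.groups()
    # The meaning of the two numbers is not stated in this document.
    return ref_aa, int(pos), alt_aa, prediction, (float(n1), float(n2))


if __name__ == "__main__":
    print(parse_snp_id("snp.10.103.59910.S.1"))
    print(parse_occur_str("BL-"))
    print(parse_sift_result("[A102V TOLERATED 0.27 3.00]"))
```

Each parser raises ValueError on input it does not recognise, so rows that deviate from the layout described here are reported rather than silently misread.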