
Bioinformatics


Subprogramme Leader Graham McLaren,
g.mclaren@cgiar.org


SSR Genotyping Template

Version: 2.0

Template Description: Version 2.0 of the SSR Genotyping template

Instructions for the GCP template for SSR fingerprinting data

This template is suitable for SSR fingerprinting data that comes directly from genotyping software or that is entered manually from traditional gels. You must use a separate file (or set of files for the text-based template) for each study. A study is defined as a set of experiments that rely on the same protocols and conditions, even though they may be carried out as several independent experiments. It is assumed that a fully completed study would contain data points for all possible marker by accession combinations for all markers and genotypes in the study. However, for a number of reasons, not all possible data points may be available. In this case it is important to follow the guidelines for missing data described below. For all other missing data in other spreadsheets and columns, please leave the field blank, either as an empty cell in the case of Excel or as a double <tab> character in the case of <tab> delimited files.

The template is either in Excel format as a single Excel file of 3-6 spreadsheets OR in text format as a set of 3-6 files. The text file(s) format of the template is recommended for large datasets, due to the restrictions on both row number and column number in Excel. The Excel spreadsheets should be named as below, whereas the text file names are just suggested values. The experiment and conditions spreadsheets (or files) and <b>one</b> data spreadsheet (or file) are required, whereas the others provide optional information.

Data

The data can either be in a list format or a matrix. The list format is preferred since it can be produced directly from both ABI and LI-COR genotyping software and contains more information. Please note that the matrix format should <b>only</b> be used for non-bulked and single individuals from diploid species, where the allele frequency can take only two values 1 and 0.5.

Missing Data Points for Allele/Accession combination

Within a study it is assumed that all possible marker by accession combinations for all markers and genotypes in the study would be attempted and data would be available. However, for a number of reasons, not all possible data points may be available. There are two main reasons why a specific marker by accession data point would not be available: the test was carried out but no allele was detected, or the test was never carried out or recorded.

Currently this template only supports these two types of missing data and does not for instance differentiate between the different reasons why no allele was detected. Please see the instructions for denoting missing data in either the list or matrix format.

Mappings for this template

Sections available in this template

Section Name | Description | Conditions
Source | Information on the source of the dataset, the species it concerns and the name and version of the dataset | Mandatory
Experiment | General experiment data | Mandatory
Quality Assessment | Information about the quality measures used | Mandatory
Conditions | Experimental conditions | Mandatory
Data List | The actual data as a list | Mandatory; multiple sheets allowed; excludes Data Matrix
Data Matrix | The actual data as a matrix | Mandatory; multiple sheets allowed; excludes Data List
Markers | Information about markers used in the experiment | Optional; multiple sheets allowed
Maps | | Optional; multiple sheets allowed
Samples | Information about the DNA samples used in the experiment | Mandatory; multiple sheets allowed
Institutions | List of institute codes used in passport data sections and their corresponding decoded name and addresses. | Optional
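
The conditions in the table above can be checked mechanically before a submission is assembled. The following is a minimal sketch in Python, not part of the GCP tooling: the function name check_sections and the wording of the messages are hypothetical, and the set of mandatory sections is taken from this table as printed.

```python
# Sections marked Mandatory in the table above (assumption: taken as printed).
MANDATORY = {"Source", "Experiment", "Quality Assessment", "Conditions", "Samples"}
# The two data sections exclude each other, but one of them is required.
DATA_SECTIONS = {"Data List", "Data Matrix"}


def check_sections(present):
    """Return a list of problems for the given collection of section names."""
    present = set(present)
    problems = [
        f"missing mandatory section: {name}"
        for name in sorted(MANDATORY - present)
    ]
    data = present & DATA_SECTIONS
    if not data:
        problems.append("one of Data List or Data Matrix is required")
    elif len(data) == 2:
        problems.append("Data List and Data Matrix exclude each other")
    return problems
```

An empty list means the combination of sections satisfies the conditions listed in the table.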

Source

Section Description: Information on the source of the dataset, the species it concerns and the name and version of the dataset

see section: source in GCPDataSubmissionTemplate2.0

for the following fields institute, principalInvestigator, projectCode, projectName, emailContact, species, ploidy, datasetName, version, creationDate, remark

Experiment

Section Description: General experiment data

Field Name | Description | Conditions
Operational Taxonomic Unit | What has been assayed? For example, does the sample contain individuals, populations calculated from individuals or bulks of individuals assayed simultaneously. | Mandatory
Purpose of the Study | Description of the reason for the study. | Mandatory
Missing Data | Information about missing data. Must be in the form missing data symbol=description. Multiple missing data symbols can be separated with a semicolon. Example: 9=For each marker there are up to five possible alleles, 9 is used to represent the absence of the 2nd, 3rd, 4th and/or 5th allele. | Mandatory
Remark | none | Optional
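
The symbol=description form of the Missing Data field lends itself to simple parsing. The sketch below is illustrative only: parse_missing_data is a hypothetical helper, and it assumes that descriptions do not themselves contain a semicolon, since that character separates entries.

```python
def parse_missing_data(field):
    """Split 'symbol=description; symbol=description' into a dict."""
    codes = {}
    for entry in field.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        symbol, sep, description = entry.partition("=")
        if not sep or not symbol.strip():
            raise ValueError(f"expected symbol=description, got {entry!r}")
        codes[symbol.strip()] = description.strip()
    return codes
```

For example, parse_missing_data("9=absent allele; X=no allele detected") yields a dictionary with the keys "9" and "X".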

Quality Assessment

Section Description: Information about the quality measures used

see section: qualityAssessment in GCPDataSubmissionTemplate2.0

for the following fields qualityMeasure, standard, control, errorEstimator

Conditions

Section Description: Experimental conditions

Field Name | Description | Conditions
Sampling Strategy | A description of the sampling strategy or reference to published method. | Mandatory
Control Genotypes | A semicolon-separated list of control genotypes present on all gels. If possible please use the accession number or name described in the Accession column of the data for these genotypes. | Mandatory
Size Standards | A description of the size standards used. | Optional
DNA Extraction | A description of the DNA extraction method or reference to published method. | Mandatory
DNA Amplification and Detection | A description of the DNA amplification and detection method or reference to published method. | Mandatory
Genotyping Software | The name and version of the genotyping software used. | Optional
Reference | One or more references to articles in which the genotyping procedures are published. Please place each reference on a separate row in the same column. | Optional

Data List

Section Description: The actual data as a list

List Format

The spreadsheet consists of ten columns, with each row representing a band (or peak) on a gel. If a specific marker by accession data point is missing because the test was carried out but no allele was detected for <b>any reason</b>, then this should be represented in the data list as a single row containing only the SampleID, Accession and Marker fields, and possibly the Gel/Run and Dye fields if in use. No values for Allele, Size, Quality, Height, Volume or Amount should be recorded. If a specific marker by accession data point is missing because the test was never carried out or recorded for <b>any reason</b>, then no record for that SampleID, Accession and Marker combination should be put in the list. This last case is the default for missing data of unknown reason.
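
The two kinds of missing data in the list format can be told apart by whether a row exists at all. The following Python sketch shows one way a reader of the list might classify each marker by sample combination. It assumes rows have already been loaded as dictionaries keyed by column name; the function name classify and the three status labels are hypothetical.

```python
# Fields that stay empty when a test was run but no allele was detected.
MEASUREMENT_FIELDS = ("Allele", "Size", "Quality", "Height", "Volume", "Amount")


def classify(rows, sample_ids, markers):
    """Map each (SampleID, Marker) pair to one of three status labels."""
    # No row at all is the default: the test was never carried out or recorded.
    status = {(s, m): "not tested" for s in sample_ids for m in markers}
    for row in rows:
        key = (row["SampleID"], row["Marker"])
        has_value = any((row.get(f) or "").strip() for f in MEASUREMENT_FIELDS)
        if has_value:
            status[key] = "scored"
        elif status.get(key) != "scored":
            # A row with identifiers only: tested, but no allele detected.
            status[key] = "no allele detected"
    return status
```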

Table of Allele Amounts

Ploidy | # of alleles | Relative Allele Contributions
<b><i>Pure</i></b> | |
2n | 1 | 1
2n | 2 | 0.5/0.5
3n | 1 | 1
3n | 2 | 0.66/0.33
3n | 3 | 0.33/0.33/0.33
4n | 1 | 1
4n | 2 | 0.5/0.5 <b>or</b> 0.75/0.25
4n | 3 | 0.5/0.25/0.25
4n | 4 | 0.25/0.25/0.25/0.25
6n | 1 | 1
6n | 2 | 0.5/0.5 <b>or</b> 0.66/0.33 <b>or</b> 0.84/0.16
6n | 3 | 0.33/0.33/0.33 <b>or</b> 0.5/0.33/0.16
6n | 4 | 0.16/0.16/0.33/0.33 <b>or</b> 0.16/0.16/0.16/0.5
6n | 5 | 0.16/0.16/0.16/0.16/0.33
6n | 6 | 0.16 each
<b><i>Bulk</i></b> | Variable | An arbitrary mixture of alleles in any combination is possible

Field Name | Description | Conditions
Sample ID | A unique identifier of a DNA sample, which can be a sample in a well on a gel or a LIMS entry, or even a unique ID created specifically for this dataset. The SampleID is specific to a lab and is not a universal identifier. If the accession data is provided it must relate to SampleID in the accession sheet or file. | Mandatory; unique
Germplasm ID | A unique alphanumeric value which identifies the germplasm. This global identifier links data across domains. The format proposed is a concatenation of "holdingInstitute:collectionName:localUniqueID". | Mandatory
Marker | The name of the marker used. If the marker data is provided it must relate to the Marker name in the marker sheet or file. | Mandatory
Gel/Run | The name or number of the gel or gel run from which the data was taken. This gel name will be unique for a specific laboratory but is not a universal identifier. In the case of the ABI sequencer this may be the run number. | Optional
Dye | The dye used for detection of the peak. | Optional
Allele | The allele name, which is normally the expected size of the SSR fragment. | Optional
Size | The actual size of the peak, which will be recorded as a real (decimal) value. | Optional
Quality | The quality scale takes values from 1 to 100 attributed by the genotyping software, or 200 if the base is corrected manually by the user. | Optional
Height | Height of the chromatogram peak. | Optional
Volume | The area under the chromatogram peak. | Optional
Amount | The relative allele contribution of this allele to all alleles at this locus. For possible values for known ploidy and bulk data please refer to the Table of Allele Amounts in this section. | Optional
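
Because Amount is a relative contribution, the values recorded for one sample at one marker should add up to roughly 1. The Table of Allele Amounts uses two-decimal values such as 0.66/0.33 and six times 0.16, so an exact comparison would be too strict. The sketch below flags combinations whose total falls outside a tolerance. The function name check_amounts and the tolerance of 0.05 are assumptions, and it presumes one test per sample and marker (replicated tests would need to be grouped by Gel/Run as well).

```python
from collections import defaultdict


def check_amounts(rows, tolerance=0.05):
    """Return the (SampleID, Marker) pairs whose Amount values do not sum to about 1."""
    totals = defaultdict(float)
    for row in rows:
        amount = (row.get("Amount") or "").strip()
        if amount:  # rows without an Amount are skipped, the field is optional
            totals[(row["SampleID"], row["Marker"])] += float(amount)
    return sorted(
        key for key, total in totals.items() if abs(total - 1.0) > tolerance
    )
```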

Data Matrix

Section Description: The actual data as a matrix

Matrix Format

The matrix should <b>only</b> be used for non-bulked and single individuals from diploid species, where the allele frequency can take only two values, 1 and 0.5, and where there are no duplications of tests for a specific Marker by Accession combination. Please use the list format for all other data. The spreadsheet consists of two mandatory columns, which are the same as the first two columns in the list format, and there is one and only one row (record) per sample. The remaining columns contain the alleles for each marker, with the number of columns per marker equal to the ploidy of the species analysed. Therefore there would need to be 2 columns per marker for diploids, 3 columns per marker for triploids, etc. For bulked data there would be a variable number of alleles per marker in each sample. For this reason it is important that each column is labeled with the marker name.

Due to the limit on the number of columns in Excel spreadsheets, it may be necessary to have two or more matrix spreadsheets. In this case the first two columns for SampleID and Accession must be repeated in each spreadsheet, and each spreadsheet name must be suffixed with the number of the data sheet. For example, three sheets would be labeled data_matrix1, data_matrix2 and data_matrix3. There is no column limit in text files, so this step is not necessary.
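
The sheet-splitting rule can be sketched as a small planning function that decides which marker columns go on which sheet. This is an illustration under stated assumptions: plan_matrix_sheets is a hypothetical name, the default limit of 256 columns is an assumed value for the Excel limit and should be adjusted to the version in use, and the unsuffixed name data_matrix for a single sheet is also an assumption.

```python
def plan_matrix_sheets(markers, ploidy=2, max_columns=256):
    """Return {sheet name: column headers}, keeping each marker on one sheet."""
    id_columns = ["SampleID", "Accession"]  # repeated on every sheet
    per_sheet = (max_columns - len(id_columns)) // ploidy  # whole markers per sheet
    if per_sheet < 1:
        raise ValueError("max_columns is too small to hold one marker")
    chunks = [markers[i:i + per_sheet] for i in range(0, len(markers), per_sheet)]
    sheets = {}
    for number, chunk in enumerate(chunks, start=1):
        # Assumption: a single sheet keeps the plain name, several get a suffix.
        name = "data_matrix" if len(chunks) == 1 else f"data_matrix{number}"
        # One column per allele, each labeled with the marker name.
        sheets[name] = id_columns + [m for m in chunk for _ in range(ploidy)]
    return sheets
```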

If a specific marker by accession data point is missing because the test was carried out but no allele was detected for <b>any reason</b>, then this should be represented in the data matrix as an 'X' or 'x' character. If a specific marker by accession data point is missing because the test was never carried out or recorded for <b>any reason</b>, then the data point should be left blank. This last case is the default for missing data of unknown reason.
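
When reading a matrix back in, the same distinction can be recovered cell by cell. A minimal sketch, assuming cells arrive as strings (or None for an empty Excel cell); the function name read_matrix_cell and the status labels are hypothetical.

```python
def read_matrix_cell(value):
    """Return (status, allele) for one cell of the data matrix."""
    text = (value or "").strip()
    if text == "":
        # Blank: the test was never carried out or recorded.
        return ("not tested", None)
    if text in ("X", "x"):
        # X or x: the test was carried out but no allele was detected.
        return ("no allele detected", None)
    return ("scored", text)
```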

Field Name | Description | Conditions
Sample ID | A unique identifier of a DNA sample, which can be a sample in a well on a gel or a LIMS entry, or even a unique ID created specifically for this dataset. The SampleID is specific to a lab and is not a universal identifier. If the accession data is provided it must relate to SampleID in the accession sheet or file. | Mandatory; unique
Germplasm ID | A unique alphanumeric value which identifies the germplasm. This global identifier links data across domains. The format proposed is a concatenation of "holdingInstitute:collectionName:localUniqueID". | Mandatory
Alleles | none | Mandatory

Markers

Section Description: Information about markers used in the experiment

Markers (optional)

The spreadsheet consists of ten columns, with each row representing an SSR marker.

see section: markers in GCPMappingTemplate2.0

for the following fields marker, chromosome, motif, forwardPrimer, reversePrimer, annealingTm, minAllele, maxAllele, genBankAccessionNumber, references

Field Name | Description | Conditions
Chromosome | The name of the chromosome on which the marker has been mapped in this species on this map | Optional

Maps

Section Description: none

see section: maps in GCPMappingTemplate2.0

for the following fields mapID, mapName, chromosome

Samples

Section Description: Information about the DNA samples used in the experiment

Samples (Optional)

The first field in the sample sheet is the SampleID, which relates directly to the SampleID field in the data spreadsheet or file. This SampleID is a unique identifier of a DNA sample, which can be a sample in a well on a gel or a LIMS entry. It could even be a unique identifier developed specifically for this dataset. In the case of multiple extractions from the same material, each sample would have a unique SampleID. Please refer to the section on Multiple Data Points for more details.

The GermplasmID field is an optional field for collections where a new GermplasmID is assigned each time an accession is regenerated, or where a new seed or germplasm sample is taken for some other reason. In this case an accession is a collection of samples with different GermplasmIDs. GermplasmIDs are often unique only within a specific database, so they should be prefixed by the database name or abbreviation. For example, an entry with GermplasmID 2341 in IWIS would be IWIS:2341.

The remaining accession data should be in either multi-crop passport descriptors (MCPD) or EURISCO descriptors format. MCPD defines a total of 28 descriptors for passport data, each of which equates to a column in the template. EURISCO defines an additional 6 descriptors for a total of 33 descriptors. Only a few MCPD or EURISCO descriptors are mandatory, and for the sake of brevity only the mandatory and some recommended optional fields are described here. However, the mandatory descriptors provide sufficient information to allow the accession to be found in the appropriate National Inventory or genebank. For a full description of all MCPD and EURISCO descriptors please refer to the EURISCO_Descriptors.doc file, which is available from the EPGRIS website (http://www.ecpgr.cgiar.org/epgris/) or can be downloaded with the passport template.

see section: generalPassportData in GCPPassportTemplate2.0

for the following fields sampleID, sampleGermplasmID, localUniqueID, holdingInstitute, collectionName, genus, species, countryOfOrigin

Field Name | Description | Conditions
Sample ID | A unique identifier of a DNA sample, which can be a sample in a well on a gel or a LIMS entry. The SampleID will be unique for a specific laboratory but is not a universal identifier. It must relate to SampleID in the data spreadsheet or file. | Mandatory; unique
Germplasm ID | An alphanumeric value which uniquely identifies the germplasm. The format proposed is a concatenation of HoldingInstitute:CollectionName:LocalUniqueID. In case a new Germplasm ID is assigned each time an accession is regenerated or for some reason sub-sampled, use the current germplasm ID prefixed with the system or database name. Example: NGA333:Genebank:252. Example: COL003:CIATBEAN:3542. Example: MEX064:IWIS:2341. | Mandatory; unique
Country of Origin | Code of the country in which the sample was originally collected. Use 3-letter ISO 3166-1 extended country codes. | Optional
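
The proposed HoldingInstitute:CollectionName:LocalUniqueID format can be checked before submission. The Python sketch below is illustrative and not part of the template: GermplasmID and parse_germplasm_id are hypothetical names, and it assumes that none of the three parts contains a colon. It accepts only the three-part form, so a two-part value such as IWIS:2341 is rejected.

```python
from typing import NamedTuple


class GermplasmID(NamedTuple):
    holding_institute: str
    collection_name: str
    local_unique_id: str


def parse_germplasm_id(value):
    """Split 'HoldingInstitute:CollectionName:LocalUniqueID' into its parts."""
    parts = [part.strip() for part in value.split(":")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected HoldingInstitute:CollectionName:LocalUniqueID, got {value!r}"
        )
    return GermplasmID(*parts)
```

For example, parse_germplasm_id("COL003:CIATBEAN:3542") gives a holding institute of COL003, a collection name of CIATBEAN and a local unique ID of 3542.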

Institutions

Section Description: List of institute codes used in passport data sections and their corresponding decoded name and addresses.

see section: institutions in GCPDataSubmissionTemplate2.0

for the following fields faoInstituteCode, organizationName, street, cityState, zipCode, country, institutionalEmail, institutionalTelephone, fax, url, primaryContactName

Copyright (c) 2004-2006 CGN, CIMMYT, CIRAD, IITA, IITA-Nairobi, IPGRI - Rome, IRRI, SCRI, University of Dundee

Developed by Richard Bruskiewich (IRRI), Brigitte Courtois (CIRAD), Guy Davenport (CIMMYT), Tom Hazekamp (IPGRI - Rome), Sarah Hearne (IITA-Nairobi), Jennifer Lee (University of Dundee), Mahalakshmi, Visvanathan (IITA), David Marshall (SCRI), Thomas Metz (IRRI), Francis Moonan (IITA), Manuel Ruiz (CIRAD), Thomas Payne (CIMMYT), Raj Sood (IPGRI - Rome), Theo van Hintum (CGN), Marilyn Warburton (CIMMYT), Genevieve Aquino (IRRI)

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.