Sequenom

GWASpi implements the Sequenom format under a number of preliminary conditions, because the original format is loosely defined. For instance, Sequenom has no predefined annotation file giving chromosome, position, and so on. We therefore had to decide how best to deliver this information to GWASpi, so that all the data needed for GWAS processing is available. We settled on a standard PLINK MAP file as a satisfactory and familiar way of delivering it.

PLINK MAP files

The fields in a MAP file are:

  • Chromosome
  • Marker ID
  • Genetic distance
  • Physical position

Example of a MAP file in the standard PLINK format:

21 rs11511647 0 26765
X rs3883674 0 32380
X rs12218882 0 48172
9 rs10904045 0 48426
9 rs10751931 0 49949
8 rs11252127 0 52087
10 rs12775203 0 52277
8 rs12255619 0 52481
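
For illustration, a MAP file like the one above can be read with a few lines of Python. This is a minimal sketch, not GWASpi's own parser: the names `MapEntry`, `parse_map_line` and `parse_map` are hypothetical, and whitespace-separated fields are assumed.

```python
from typing import List, NamedTuple


class MapEntry(NamedTuple):
    """One line of a PLINK MAP file (hypothetical helper type)."""
    chromosome: str          # kept as text, since values such as "X" occur
    marker_id: str           # full dbSNP ID, e.g. "rs11511647"
    genetic_distance: float
    position: int            # physical position


def parse_map_line(line: str) -> MapEntry:
    # Assumption: exactly four whitespace-separated fields per line.
    chromosome, marker_id, distance, position = line.split()
    return MapEntry(chromosome, marker_id, float(distance), int(position))


def parse_map(text: str) -> List[MapEntry]:
    # Blank lines are skipped; every other line must be a valid entry.
    return [parse_map_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "21 rs11511647 0 26765\nX rs3883674 0 32380\n"
    for entry in parse_map(sample):
        print(entry)
```

Keeping the chromosome as a string avoids special-casing non-numeric chromosomes.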

Another issue with Sequenom files is that the marker names (ASSAY_ID) they contain are in fact truncated dbSNP IDs (also known as rsIDs). In other words, where a dbSNP ID takes the form of the letters “rs” followed by a number, e.g. “rs1999”, the Sequenom equivalent is listed only as “1999”.

This is a bad idea.

In GWASpi we strive to preserve the integrity of standards and formats, and when a SNP nomenclature has been widely accepted, there is no valid reason to tinker with it. This is why we require the annotation MAP file to contain the full dbSNP names (“rs1999” instead of “1999”). Failing to do so will result in a botched import.
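
A quick check along the following lines can catch truncated names in a MAP file before import. It is only a sketch under stated assumptions and is not part of GWASpi: `is_full_rsid` and `to_rsid` are hypothetical helpers, and only purely numeric truncated IDs are handled.

```python
import re

# A full dbSNP ID: the letters "rs" followed by one or more digits.
RS_ID = re.compile(r"rs[0-9]+")
DIGITS_ONLY = re.compile(r"[0-9]+")


def is_full_rsid(marker_id: str) -> bool:
    """True if marker_id looks like a complete dbSNP ID such as "rs1999"."""
    return RS_ID.fullmatch(marker_id) is not None


def to_rsid(assay_id: str) -> str:
    """Turn a truncated Sequenom ASSAY_ID ("1999") into a dbSNP ID ("rs1999")."""
    assay_id = assay_id.strip()
    if is_full_rsid(assay_id):
        return assay_id
    if DIGITS_ONLY.fullmatch(assay_id):
        return "rs" + assay_id
    # Anything else is left to the user to resolve by hand.
    raise ValueError(f"cannot derive a dbSNP ID from {assay_id!r}")


if __name__ == "__main__":
    print(to_rsid("1999"))
    print(is_full_rsid("1999"))
```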

Sequenom data at the sizes typical of GWAS may come as one big, million-line-long file or as a large number of smaller files. GWASpi handles this by loading all genotype files (plates) it finds in a given folder. You should therefore save all your genotype files in the same directory, and ONLY genotype data. This means that no other files, such as the annotation data, may be in this directory!
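
Before importing, it can help to list exactly which files would be picked up from the folder. The sketch below is illustrative only: `list_genotype_files` is a hypothetical helper, and it assumes that only files directly inside the folder count (whether GWASpi also descends into subfolders is not stated here).

```python
from pathlib import Path
from typing import List


def list_genotype_files(directory: str) -> List[Path]:
    """Return every regular file directly inside `directory`, sorted by path."""
    return sorted(p for p in Path(directory).iterdir() if p.is_file())


if __name__ == "__main__":
    import sys

    # Usage: python list_plates.py /path/to/genotype/folder
    for path in list_genotype_files(sys.argv[1]):
        print(path.name)
```

Any file in this listing that is not a genotype file, such as the MAP file, should be moved elsewhere before loading.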

GWASpi will scan this directory for all samples and markers, order them to build an indexed matrix, and then start loading genotypes. This two-step process means that loading Sequenom data is somewhat slower than loading other, more predictable formats. This is the price for achieving coherence.
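
The two-step idea can be sketched as follows. This is a simplified illustration and not GWASpi's implementation: the function names are hypothetical, rows are assumed to be already reduced to (SAMPLE_ID, CALL, ASSAY_ID) tuples, samples and markers are ordered alphabetically purely for the sake of the example, and missing genotypes are represented by `None`.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, str, str]  # (SAMPLE_ID, CALL, ASSAY_ID)


def build_index(rows: Iterable[Row]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Step 1: collect all samples and markers and give each a matrix position."""
    samples, markers = set(), set()
    for sample, _call, assay in rows:
        samples.add(sample)
        markers.add(assay)
    sample_index = {s: i for i, s in enumerate(sorted(samples))}
    marker_index = {m: i for i, m in enumerate(sorted(markers))}
    return sample_index, marker_index


def load_matrix(rows: Iterable[Row]):
    """Step 2: fill a samples x markers matrix using the index from step 1."""
    rows = list(rows)  # the rows are traversed twice, once per step
    sample_index, marker_index = build_index(rows)
    matrix: List[List[Optional[str]]] = [
        [None] * len(marker_index) for _ in sample_index
    ]
    for sample, call, assay in rows:
        if call:  # rows with an empty call leave the cell as missing
            matrix[sample_index[sample]][marker_index[assay]] = call
    return sample_index, marker_index, matrix


if __name__ == "__main__":
    demo = [("SMPL2", "A", "1401"), ("SMPL1", "C", "1101"), ("SMPL1", "", "1401")]
    print(load_matrix(demo))
```

The first pass is what makes the load slower: no genotype can be placed until every sample and marker is known.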

It has been brought to our attention that the genotype files can contain any number of columns, in any order, as the export is quite customizable. For this reason, we have decided to require that the first 5 columns of each file be the ones listed below, in that order.

Sequenom Genotype files

The fields in a Sequenom Genotype file are:

  • SAMPLE_ID
  • CALL
  • ASSAY_ID
  • WELL_POSITION
  • DESCRIPTION

After these 5 columns you may append as many others as you see fit. For now, GWASpi only needs the first 3 columns. Below is an excerpt of a valid Sequenom file for import into GWASpi:

SAMPLE_ID CALL ASSAY_ID WELL_POSITION DESCRIPTION CALIBRATION MASS_SHIFT
SMPL1 C 1101 C01 A.Conservative Yes 1339169
SMPL1 C 1101 D01 A.Conservative Yes 1614159
SMPL1 TC 1102 C01 A.Conservative Yes 920227
SMPL1 TC 1102 D01 A.Conservative Yes 1179755
SMPL1 G 1201 C01 A.Conservative Yes 136587
SMPL1 AG 1201 D01 A.Conservative Yes 1602552
SMPL1 G 1301 C13 A.Conservative No 2143636
SMPL1 1401 D07 0 4 I.Bad Spectrum No 0
SMPL1 1401 D07 0 4 I.Bad Spectrum No 0
SMPL2 A 1401 G07 0 4 I.Bad Spectrum No 0
SMPL2 1401 G07 0 4 I.Bad Spectrum No 0
SMPL3 AT 1501 [1] J07 772852 4 D.Low Probability No -535691
SMPL3 1501 [1] J07 772852 4 D.Low Probability No -535691
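
As a rough illustration of how the first 3 columns might be pulled out of such a file, consider the sketch below. It assumes tab-delimited columns, which this page does not state, so that an empty CALL shows up as an empty field; `parse_genotype_line` is a hypothetical name, not a GWASpi function.

```python
from typing import Tuple


def parse_genotype_line(line: str) -> Tuple[str, str, str]:
    """Return (SAMPLE_ID, CALL, ASSAY_ID) from one data line.

    Assumption: columns are separated by tabs. Columns after the
    third are ignored, whatever they contain.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}")
    sample_id, call, assay_id = (field.strip() for field in fields[:3])
    return sample_id, call, assay_id


if __name__ == "__main__":
    print(parse_genotype_line("SMPL1\tTC\t1102\tC01\tA.Conservative\tYes\t920227"))
```

The header line has the same shape, so a caller would skip the first line of each file before parsing.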

Note that this format contains duplicated entries (rows); we only take the last valid call into account. Missing genotypes are not loaded and remain missing in the final GWASpi matrix.
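
The "last valid call wins" rule can be sketched like this. It is a minimal illustration, not GWASpi's code: `last_valid_calls` is a hypothetical name, and a call is treated as valid simply when it is non-empty, which is an assumption.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, str]  # (SAMPLE_ID, CALL, ASSAY_ID)


def last_valid_calls(rows: Iterable[Row]) -> Dict[Tuple[str, str], str]:
    """Keep, for each (sample, marker) pair, the last non-empty call seen."""
    calls: Dict[Tuple[str, str], str] = {}
    for sample, call, assay in rows:
        if call:  # an empty call never overwrites an earlier valid one
            calls[(sample, assay)] = call
    return calls


if __name__ == "__main__":
    rows = [
        ("SMPL1", "G", "1201"),
        ("SMPL1", "AG", "1201"),  # later duplicate replaces "G"
        ("SMPL2", "A", "1401"),
        ("SMPL2", "", "1401"),    # missing call, earlier "A" is kept
    ]
    print(last_valid_calls(rows))
```

Pairs that never receive a valid call do not appear in the result at all, which corresponds to a missing genotype.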

Further modifications to the processing of Sequenom GWAS data may be introduced as we continue our cooperation with groups using this format. We welcome further input from people using it, so that we can improve or amend the current import solution.