GeneScreen

A Program for Genetic Anomaly Screening

University of Iowa

MORL - Molecular Otolaryngology Research Laboratories

Nicole Meyer Horning, Resident Technologist

Introduction
Using GeneScreen
How GeneScreen Works

Introduction

GeneScreen is a computer program for filtering genetic marker data to identify patterns of genetic inheritance in dominant or recessive traits sometimes expressed in closely intermarried families. It takes genetic marker data extracted as a text file, plus a manual indication of which individuals in the data file are parents and children, and which are affected, then performs the inheritance analysis and exports the results as a tab-delimited spreadsheet text file.

The basic elements of the GeneScreen software are evident in this annotated screen shot:

Using GeneScreen

Clicking on the www.Help button opens this web page in Internet Explorer.

Marker File

First GeneScreen reads marker data. This is expected to be a tab-delimited file, one marker per line, with an optional header line naming the columns. Additional blank lines and/or comments may be included, so long as they cannot be mistaken for marker data. Markers may occur in the file in any order. GeneScreen sorts the data when it is read.

Marker data lines have three fields. The first column (before the first tab) must be a chromosome number (less than 30) or the letter "X", or else empty, in which case the previous good chromosome number is used.

`Chr`	`Marker`	`cM`
`1`	`D1S468`	`4.22`
`1`	`D1S214`	`14.04`
`1`	`...`

The marker name may be in any column after that, but each marker name must start with "D" and contain an "S" followed by a unique (for that chromosome) numeric marker reference number greater than zero. Extra blank spaces are ignored. Acceptable column headers are "Marker" or "STRP", but these are advisory only. If a valid marker name is found in a different column, it still works.

The distance number may also be in any column after the first, but each distance item should be a pure number, typically with a decimal point. Distance numbers must not be negative, nor greater than 9999.99; at most two digits after the decimal point are kept, with any extra digits rounded off. The advisory column header is "cM".

Only the first item recognized respectively as a marker name or distance is accepted. Once both the marker name and the distance have been found on a line, the rest of the line is ignored. If there is only a distance on the line, it is assumed to be the distance for the end of the chromosome.
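The marker file rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GeneScreen's actual code: the function name `parse_marker_lines` and both regular expressions are hypothetical, chromosome numbers are assumed to be greater than zero, lines with a marker name but no distance are simply dropped, and an end-of-chromosome distance is stored with a marker name of None.

```python
import re

# Hypothetical patterns: "D", anything, "S", then a reference number;
# and a plain non-negative number with an optional decimal point.
MARKER_RE = re.compile(r"D\w*?S(\d+)")
NUMBER_RE = re.compile(r"\d+(?:\.\d*)?|\.\d+")

def parse_marker_lines(lines):
    """Return sorted (chromosome, marker_or_None, cM) tuples."""
    records = []
    chrom = None  # last good chromosome seen
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        first = fields[0].upper()
        if first == "X" or (first.isdigit() and 0 < int(first) < 30):
            chrom = first
        elif first != "" or chrom is None:
            continue  # header, comment, or no chromosome known yet
        name = dist = None
        for field in fields[1:]:
            m = MARKER_RE.fullmatch(field)
            if name is None and m and int(m.group(1)) > 0:
                name = field  # first valid marker name wins
            elif dist is None and NUMBER_RE.fullmatch(field):
                value = round(float(field), 2)  # keep two decimals
                if value <= 9999.99:
                    dist = value
            if name is not None and dist is not None:
                break  # rest of the line is ignored
        if dist is not None:
            records.append((chrom, name, dist))
    # Sort by chromosome (X assumed to sort last), then by distance.
    records.sort(key=lambda r: (30 if r[0] == "X" else int(r[0]), r[2]))
    return records
```

Calling `parse_marker_lines(open(path))` on a tab-delimited file (where `path` is whatever file you choose) would yield the markers grouped by chromosome and ordered by distance, regardless of their order in the file.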

GeneScreen remembers the name and location of the last successful marker file, and starts up reading that file again. However, you may direct it to read a different marker file at any time, after which it will remember the new file.

Raw Data File

GeneScreen is a bit more fussy about the format of the raw genetic data file. Again, it must be a tab-delimited file, one data item per line, with a column header line first. GeneScreen looks for the following column headers in any column, and expects to find the respective data in the corresponding columns of subsequent lines:

Marker -- This column should have marker names matching those in the marker file.

Sample Name -- This column should have reference names of the persons from whom this data was collected. GeneScreen assumes the name is in the form Family-person (separated by a single hyphen), but almost any string of characters is acceptable (commas, spaces, and plus signs are not allowed). The family name can be used to separate out family groups (see below). The person name is used in the output. It is recommended that person names be unique, but GeneScreen only requires uniqueness in the family-person combination. Each unique person is added to the Persons field in the main window, from which it can be added to families and/or marked as affected.

Allele 1 / Allele 2 -- These two columns contain the allele data for each person. Alleles are required to be decimal numbers greater than 0 and less than 30.

GeneScreen reads the entire data file into memory, then picks out the significant data. If it encounters a marker name that is not in the marker file, it will complain and offer to ignore that data. It is expected that there will be data lines with no allele data; these are effectively ignored. When there are multiple non-blank data lines for the same person and marker, only the first is retained.
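As a rough sketch of the reading rules above (not the program's own code), the following Python locates the four column headers wherever they occur, skips lines without allele data, collects unknown marker names instead of prompting, and keeps only the first data line per person and marker. The function name `read_raw_data` is hypothetical, and alleles are assumed here to be whole numbers.

```python
REQUIRED = ("Marker", "Sample Name", "Allele 1", "Allele 2")

def read_raw_data(lines, known_markers):
    """Return ({(sample, marker): (allele1, allele2)}, unknown_marker_names)."""
    header = [h.strip() for h in lines[0].rstrip("\n").split("\t")]
    col = {name: header.index(name) for name in REQUIRED}  # any column order
    genotypes = {}
    unknown = set()
    for line in lines[1:]:
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        fields += [""] * (len(header) - len(fields))  # pad short lines
        marker = fields[col["Marker"]]
        sample = fields[col["Sample Name"]]
        if not marker or not sample:
            continue
        if marker not in known_markers:
            unknown.add(marker)  # GeneScreen would complain here
            continue
        try:
            a1 = int(fields[col["Allele 1"]])
            a2 = int(fields[col["Allele 2"]])
        except ValueError:
            continue  # line with no allele data: ignored
        if 0 < a1 < 30 and 0 < a2 < 30:
            # setdefault keeps the first line for each person and marker
            genotypes.setdefault((sample, marker), (a1, a2))
    return genotypes, unknown
```

Keying the result on the full sample name mirrors the requirement that only the family-person combination be unique.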

Additional persons may be added to the Persons list by clicking the New button. GeneScreen chooses a number not already used and adds it to the end of the list. Of course there will be no data for that person, but GeneScreen can sometimes infer the data from the family structure and other family members for which there is data. If there is already a family group name, that name will be prefixed to the new person names.

After the file has been read into memory, the persons are extracted and listed in the Persons field of the window. If there is already family data active, it is reconciled to the data. It is recommended that if you want to start over with new raw data and new families, you should quit and restart GeneScreen to prevent confusion.

After the data file has been read in and displayed, you may designate persons in the file as Affected by selecting them one at a time and clicking the Affect button. This puts a "+" symbol to the left of the person's name. Clicking the Affect button a second time removes the designation.

Family File

Once the raw data has been read in, you may create a family structure by selecting persons one at a time, then clicking in turn the Father, Mother, and Children fields, or the Children Add button, depending on whether that person is a parent or child in the nuclear family. Additional nuclear families may be created at any time by clicking the New button, and you may jump around between nuclear families by selecting them in the popup menu. The father or mother may be added or changed at any time, and additional children may be added; there is no way to remove persons already in the family except by deleting the family and creating a new one. If you designate a new father or mother in a family that already has children, GeneScreen will offer to start a new family, but you can decline the offer and just change it. Adding a person to a family advances the person selection to the next person in the list, so if you start at the front of the list you can build the family quickly by one click per person in the family member slot.

Once constructed, you can Save the current family structure to a "familyname.fam" file. Unless you designate otherwise, the family name part of the first family member you select will be offered as a suggested Family Group Name, but you can change it to anything you like. You may Reload a saved family, make additional changes, and then save the new version for later use.

The saved family file also records whether you have the Dominant or Recessive button checked, and which persons are affected; these are restored when the file is reloaded. If the family being reloaded contains persons not listed in the Persons field, you will be asked if you want them added.

Clicking the Restore button will quickly reload the last data file read in and the last saved (or reloaded) family file, to make it easier to take up where you left off on the previous day.

Output File

The output from GeneScreen is written to a sequence of data files using the initial name in the Output File Name field. To work properly this name should contain a couple of digits, which GeneScreen will increment for each chromosome in the data. If the digits initially end in 00, the output file names will reflect the actual chromosome numbers; otherwise they will be incremented sequentially, one per file.
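One possible reading of this naming rule is sketched below in Python. Several details are assumptions rather than documented behavior: the function name `output_names` is hypothetical, the last run of digits in the name is taken as the counter, the first sequential file keeps the initial number, and only numbered chromosomes are handled (how the X chromosome is numbered is not stated).

```python
import re

def output_names(initial, chromosomes):
    """Build one output file name per chromosome number."""
    matches = list(re.finditer(r"\d+", initial))
    if not matches:
        raise ValueError("output file name should contain digits")
    m = matches[-1]  # assumption: the last run of digits is the counter
    digits = m.group()
    names = []
    for i, chrom in enumerate(chromosomes):
        if digits.endswith("00"):
            number = int(digits) + chrom  # name reflects the chromosome
        else:
            number = int(digits) + i      # plain sequential numbering
        text = str(number).zfill(len(digits))
        names.append(initial[:m.start()] + text + initial[m.end():])
    return names
```

For example, under these assumptions an initial name of `fam00.txt` with chromosomes 1, 2 and 22 gives `fam01.txt`, `fam02.txt` and `fam22.txt`, while `run07.txt` with two chromosomes gives `run07.txt` and `run08.txt`.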

When you open a raw data file, GeneScreen generates a suggested output file name from the data file name. Similarly, when you load in a family data file, its name is offered as an output file name; you can also do this manually by clicking on the Family->File button. If you subsequently make changes to the family group name, the output file name will try to track the changes.

The actual output files are written when you click the Export File button, after doing the processing. The processing consists of first filling in the missing alleles for any persons that GeneScreen is able to infer, then adjusting the allele order to reduce crossovers, then looking for runs of dominant or recessive alleles. The output is in a Tab-Delimited file (or comma-delimited, if the checkbox is unchecked).

How GeneScreen Works

When you click on the Export button, GeneScreen arranges the persons in a spreadsheet, then creates an output file name for each chromosome in the data file, and performs the following operations on the selected family data:

0. Arranging the Persons (once only, for all chromosomes in file)

GeneScreen tries to place all persons in parentless families, and all persons who are parents but not children in any families, into the first row. Persons otherwise eligible for this first line, but also a parent of a family with another person who is not eligible for the first line, will be delayed to be placed next to that other parent.

After either or both parents of a nuclear family have been placed on a row, the children of that family are placed on the next row, if they will fit.

Inbred families (brother+sister, or cousins) will have the respective brother/sister or grandparent pair arranged in their respective family so that the father of the inbred family occurs on the left. Otherwise families are more or less placed in the order they occur in the family file, which is the order they were created.

Persons are not yet aligned vertically with the parents over the children, but that is planned for a future revision.

1. Headers

At the top of each output file are placed a couple of header lines consisting of the marker names and their relative distances (positional differences), followed by one line for each person in the family structure and the unprocessed allele data for that person. Affected persons are indicated, and gender (1 for male, 2 for female) is shown for persons who have been designated either father or mother. There is no other way to enter gender of persons, but non-parent persons can be added to fake one-parent, no-children families as father or mother to accomplish that purpose, if desired.

Markers for which there is no data at all are then removed from further processing.

2. Kids Allele Order

For all children with data in one or both parents, the alleles are ordered to reflect where they came from, or marked "X" if it cannot be determined.
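A minimal Python sketch of this ordering step for a single marker follows. It is an illustration only: the function name `order_child` and the returned (paternal, maternal, mark) triple are hypothetical, and a parent with no data is treated as compatible with any allele.

```python
def order_child(child, father=None, mother=None):
    """Order one child's allele pair as (paternal, maternal, mark)."""
    a, b = child

    def fits(paternal, maternal):
        # A missing parent (None) cannot rule anything out.
        return ((father is None or paternal in father) and
                (mother is None or maternal in mother))

    if a == b:
        return (a, b, "")    # homozygous: order is irrelevant
    as_is, swapped = fits(a, b), fits(b, a)
    if as_is and not swapped:
        return (a, b, "")
    if swapped and not as_is:
        return (b, a, "")
    return (a, b, "X")       # origin cannot be determined
```

With a father of (5, 7) and a mother of (3, 4), a child recorded as (3, 5) is reordered to (5, 3); if both parents are (3, 5), the same child is left as is and marked "X".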

3. Filling Gaps

For all missing alleles in parents, if the missing data can be reasonably inferred from available children's alleles, that data is added in quotes. These inferential choices are not always the best possible, but they sometimes represent a good start.

For each person who is designated a parent of some family, the data is searched for all families that person is parent of, and for which there is also a second parent. Multiple couplings tend to disable the process, or else give unreliable results. All the children of a single pair of parents are collected by marker, and organized into one of the following categories (separate letters 'x', 'y', 'z' etc represent different alleles for that marker):

3a. x,y or x,x -- If there is only one distinct pair, and neither parent has any alleles at all, then each is assigned one of the pair. If this person already has one of the alleles, then nothing further is done (the other parent will be separately considered). If the other parent already has one of the alleles and this parent has neither, then its blank is filled with that second allele. If all children are homozygous on this marker, then both parents are given that allele unless they already have it.

3b. x,x and y,y and possibly x,y -- If there are two different homozygous pairs, then both parents have one each. The extra mixed pair is not determinative and has no effect whether present or not.

3c. x,x and y,z and possibly x,y and/or x,z -- Each parent is known to have an x, and one each y or z. If either parent already has y or z, the decision is forced; otherwise they are assigned arbitrarily and later decided on the basis of crossover counts.

3d. x,y and x,z or x,x and x,y -- The latter being a degenerate form of the first (x=z). One parent is assumed to carry x, and the other parent both y and z. The prior presence of any of these forces the decision; otherwise they are assigned arbitrarily and later decided on the basis of crossover counts.

3e. x,w and x,z and y,w and y,z (or with x=w, x,x and x,y and x,z and y,z) -- One parent has x and y, the other w and z.

3f. x,y and y,z and x,z -- This could be the result of one homozygous parent in any single allele of the three, and the other two in the other parent. The prior presence of any of these forces all or part of the decision; otherwise they are assigned arbitrarily and later decided on the basis of crossover counts.
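The categories above can be checked with a brute-force enumeration. Note that this is a different technique from the case-by-case rules GeneScreen applies: the sketch simply lists every pair of parent genotypes that could have produced all the observed children for one marker. The function name `consistent_parents` is hypothetical, and the parents are assumed to carry only alleles that appear in at least one child.

```python
from itertools import combinations_with_replacement

def consistent_parents(children):
    """List unordered parent genotype pairs compatible with all children."""
    kids = {tuple(sorted(c)) for c in children}
    alleles = sorted({a for c in kids for a in c})
    genotypes = list(combinations_with_replacement(alleles, 2))
    solutions = []
    for p, q in combinations_with_replacement(genotypes, 2):
        # Every child takes one allele from each parent.
        possible = {tuple(sorted((a, b))) for a in p for b in q}
        if kids <= possible:
            solutions.append((p, q))
    return solutions
```

For category 3b, `consistent_parents([(1, 1), (2, 2)])` leaves only the pair where both parents are (1, 2). For category 3d, several pairs remain, which is why the text above describes those assignments as assumed and later settled by crossover counts.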

4. Crossover Minimization

Each heterozygous marker in a parent is evaluated with both allele orders against all the children in that family, looking for the minimum total crossover count. The exception is a parent who is also a child in some other family whose parent data already forces the allele order. In case of equal crossover counts, the minimum sum of inverse distance squares is preferred (this effectively minimizes incidence of double crossovers, which have a fairly large inverse square). Where inferred alleles have been arbitrarily assigned, these are tried in both alternations as well as each order. Where children have been marked as "X" because both parents have the same heterozygous pair, the children's alleles are also measured in both orders. This makes for a lot of separate crossover count analyses, which might take several seconds or longer, depending on the number of different pairs to be tried in all possibilities.
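The scoring idea can be sketched in Python as follows. This is an illustration under assumptions, not the program's implementation: the function names `crossover_score` and `best_order` are hypothetical, only one marker is flipped at a time, each child is represented by the single allele it inherited from this parent at each marker, and marker positions are assumed to be strictly increasing.

```python
def crossover_score(parent, inherited, positions):
    """Return (crossover count, sum of inverse squared distances)."""
    count, penalty = 0, 0.0
    prev_side = prev_pos = None
    for (a, b), got, pos in zip(parent, inherited, positions):
        if a == b or got not in (a, b):
            continue  # uninformative marker
        side = 0 if got == a else 1  # which of the parent's strands
        if prev_side is not None and side != prev_side:
            count += 1
            penalty += 1.0 / (pos - prev_pos) ** 2
        prev_side, prev_pos = side, pos
    return (count, penalty)

def best_order(parent, index, children, positions):
    """Pick the allele order at one marker that minimizes crossovers."""
    a, b = parent[index]
    options = []
    for pair in ((a, b), (b, a)):
        trial = parent[:index] + [pair] + parent[index + 1:]
        scores = [crossover_score(trial, kid, positions) for kid in children]
        total = (sum(s[0] for s in scores), sum(s[1] for s in scores))
        options.append((total, pair))
    # Tuples compare by count first, then by the inverse-square sum.
    return min(options)[1]
```

Comparing (count, penalty) tuples gives the tie-break described above: among orders with equal crossover counts, the one whose crossovers are farther apart wins.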

5. Dominant Runs

If the Dominant checkbox is checked, GeneScreen looks for runs of the same alleles in all affected persons. If NonAffect is also checked, the non-affected (see Affect above) persons are also considered in looking for runs, and only runs of matching alleles appearing in affected persons and not in non-affected persons are identified. Otherwise the non-affected persons are not considered. A run is any sequence of two or more matching alleles in all affected persons. Persons on the first two rows of output are given more weight.
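A simplified Python sketch of the run search is given below. It rests on several assumptions that the text does not confirm: the function name `dominant_runs` is hypothetical, alleles are compared per marker without regard to which strand carries them, a run is taken to be two or more consecutive markers at which some allele is shared, and the extra weight for the first two rows is not modeled.

```python
def dominant_runs(affected, unaffected=(), min_len=2):
    """Return (first, last) marker index pairs for shared-allele runs.

    Each person is a list of (allele1, allele2) pairs in marker order.
    """
    flags = []
    for i in range(len(affected[0])):
        shared = set(affected[0][i])
        for person in affected[1:]:
            shared &= set(person[i])   # must appear in every affected person
        for person in unaffected:
            shared -= set(person[i])   # must not appear in unaffected persons
        flags.append(bool(shared))
    runs, start = [], None
    for i, flag in enumerate(flags + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs
```

Passing an empty `unaffected` list corresponds to leaving NonAffect unchecked.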

6. Recessive Runs

If the Recessive checkbox is checked, GeneScreen looks for runs of homozygous alleles in all persons, without regard to whether they are affected or not.
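For one person, the search for homozygous runs might look like the Python sketch below. The function name `homozygous_runs` is hypothetical, the minimum run length of two is borrowed from the dominant case as an assumption, and a missing genotype (None) is assumed to break a run.

```python
def homozygous_runs(genotypes, min_len=2):
    """Return (first, last) marker index pairs of homozygous runs."""
    runs, start = [], None
    # The trailing None acts as a sentinel that closes a final run.
    for i, pair in enumerate(list(genotypes) + [None]):
        homozygous = pair is not None and pair[0] == pair[1]
        if homozygous and start is None:
            start = i
        elif not homozygous and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs
```

Running this for every person, affected or not, matches the description above that the recessive search ignores affected status.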

7. Annotations

Each person is output in the family tree with their alleles aligned in a vertical column by marker order and the father's allele to the left. When the parentage cannot be readily determined, an "X" is added in the next column. Crossovers, where a child's allele switches from one grandparent to the other, for each sequence are marked by "c" just outside the pair, to the left or right. Runs are then indicated by 1 (for paternal alleles) or 2 (for maternal alleles) or 3 (for both) in the next column to the left.

Epilog

GeneScreen is a program originally developed in Visual Basic and then converted to Qt/C++ by Tom Pittman for Nicole C Meyer Horning at the University of Iowa Molecular Otolaryngology Research Laboratories. Comments or bug reports should be directed to Tom Pittman. Requests for a customized version of this software, or for other Windows-based data manipulation software can be directed to the same email address.

Rev. 2004 November 20