KinshipCorrelationGenerator documentation

Welcome to the KinshipCorrelationGenerator’s documentation!

First time here? Best start at Get started

Contents

Get started

Installing Python

The script was built using Python 3.7.7. Since it doesn’t rely on very extensive Python features, I would expect it to run just fine on any version of Python 3. If you do not already have Python installed, I recommend you download Anaconda.

Installing dependencies

The script uses several dependencies. Once Python is installed, you can install them by running:

pip install --user --upgrade numpy pandas matplotlib statsmodels scipy openpyxl
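To confirm that the installation worked, a small check along the following lines may help. It is only a sketch: the package list is copied from the pip command above, and the helper name missing_packages is made up for this example.

```python
import importlib.util

# Packages named in the pip command above.
REQUIRED = ["numpy", "pandas", "matplotlib", "statsmodels", "scipy", "openpyxl"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```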

Downloading script

Download the latest version of the script from GitHub.

Reformatting pedigree

Start by generating the reformatted pedigree file, if this has not already been done for your sample. Note you will only have to do this once! See Generating a reformatted pedigree file.

Generating correlations

Finally, make sure the reformatted pedigree, the script, and your data file are in the same directory (this is not strictly necessary, but it makes things easier). Then generate correlations, for example:

python KinshipCorrelationGenerator.py --data my_input.csv --outprefix my_output

See Computing correlations.

Generating a reformatted pedigree file

Creating the input pedigree file

Generating a reformatted pedigree file requires your existing pedigree file to have a specific format. First, the file must be uncompressed and comma-separated. The header line must be as follows:

!FamID,ID,Father,Mother,Gender,Twincode,DZtwincode,TwinHousehold3,SibHousehold2,SpouseHousehold3

Each of these columns is explained in more detail below:

  • !FamID: Family identifier, unique for each family (definition of family is up to the user)

  • ID: Personal identifier, unique for each individual

  • Father: Personal identifier (ID) of the father.

  • Mother: Personal identifier (ID) of the mother.

  • Gender: Gender (M=Male, F=Female)

  • Twincode: MZ twin code, unique identifier of each MZ twin pair. (The first MZ twin pair in the data is Twincode 1, the second pair is Twincode 2, etc.)

  • DZtwincode: DZ Twin code, unique identifier of each DZ twin pair.

  • TwinHousehold3: Twin pair code, unique identifier of each MZ or DZ twin pair within a family.

  • SibHousehold2: Sibling household pair, unique identifier for all sets of siblings.

  • SpouseHousehold3: Spouse household pair, unique identifier for each spousal pair.
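Before running the script, it can be useful to check the header of your pedigree file against the column list above. The following is a minimal sketch, not part of the script itself: the function name check_pedigree_header is hypothetical, and treating a different column order as a problem is an assumption based on the header line shown above.

```python
import csv

# Header line as given in the documentation above.
EXPECTED_HEADER = [
    "!FamID", "ID", "Father", "Mother", "Gender", "Twincode",
    "DZtwincode", "TwinHousehold3", "SibHousehold2", "SpouseHousehold3",
]


def check_pedigree_header(path):
    """Return a list of problems found in the header line (empty if none)."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    problems = []
    missing = [c for c in EXPECTED_HEADER if c not in header]
    extra = [c for c in header if c not in EXPECTED_HEADER]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if extra:
        problems.append("unexpected columns: " + ", ".join(extra))
    # Assumption: the column order should match the documented header exactly.
    if not problems and header != EXPECTED_HEADER:
        problems.append("columns are in a different order than expected")
    return problems
```

Calling check_pedigree_header("mypedigree.csv") returns an empty list when the header matches.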

Running the code

Run the following command:

python KinshipCorrelationGenerator.py --pedigree mypedigree.csv

This will generate a reformatted pedigree file, reformatted_pedigree.pickle. This reformatted pedigree file will allow you to generate twin, sibling, spouse, and parent-offspring correlations.

If you would like to generate an extended pedigree, use the following:

python KinshipCorrelationGenerator.py --pedigree mypedigree.csv --extended

This will generate a reformatted extended pedigree file, reformatted_extended_pedigree.pickle. This file also contains twin, sibling, spouse, and parent-offspring pairs. Therefore, if you have a reformatted extended pedigree, you do not need a regular reformatted pedigree.

Note

It is recommended to generate a reformatted pedigree file for the entire pedigree. The script can handle missing data, and reformatting the pedigree takes considerably longer than generating correlations. Therefore this reformatting only needs to be done once for an entire pedigree, and can subsequently be used for any number of correlations.

Warning

Generating a pedigree file will always overwrite any existing file named reformatted_pedigree.pickle or reformatted_extended_pedigree.pickle.

If you have existing reformatted pedigrees in your directory that you would like to continue using, please rename them before generating a new one!
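One way to do the renaming is sketched below. The helper backup_reformatted_pedigrees and the timestamp suffix are illustrative choices for this example, not something the script provides.

```python
from datetime import datetime
from pathlib import Path

# File names the script writes, as listed in the warning above.
PEDIGREE_FILES = ["reformatted_pedigree.pickle", "reformatted_extended_pedigree.pickle"]


def backup_reformatted_pedigrees(directory=".", stamp=None):
    """Rename existing reformatted pedigree files so they are not overwritten.

    Returns the list of new paths.
    """
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    renamed = []
    for name in PEDIGREE_FILES:
        source = Path(directory) / name
        if source.exists():
            # e.g. reformatted_pedigree_20240101-120000.pickle
            target = source.with_name(f"{source.stem}_{stamp}{source.suffix}")
            source.rename(target)
            renamed.append(target)
    return renamed
```

Run backup_reformatted_pedigrees() in the working directory before calling the script with --pedigree.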

Computing correlations

Creating the input data file

First, you will need to create an input data file. This file should be a comma-separated file with a header and the following columns:

  • FISNumber: Personal identifier

  • age: Age

You can also include any additional phenotype columns. Any column other than FISNumber and age will be treated as a phenotype, unless otherwise specified using --exclude.

The following names are not allowed for phenotype columns, as they are used internally: FISNumber, sex, age, Source, index.
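To preview which columns of a data file would count as phenotypes under the rules above, something like the following sketch can be used. The function inspect_data_header is hypothetical, and it only approximates the rules described here; the script's own checks may differ.

```python
import csv

REQUIRED_COLUMNS = ["FISNumber", "age"]
# Names the documentation lists as used internally.
RESERVED_NAMES = {"FISNumber", "sex", "age", "Source", "index"}


def inspect_data_header(path, exclude=()):
    """Return (missing_required, phenotypes) for a comma-separated data file."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    # Everything that is not reserved and not excluded is treated as a phenotype.
    phenotypes = [c for c in header if c not in RESERVED_NAMES and c not in exclude]
    return missing, phenotypes
```

For a file with the header FISNumber,age,height,bmi and exclude=("bmi",), this returns no missing columns and height as the only phenotype.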

Generating weighted correlations

By default, the script assigns a weight to each pair of observations (within kinship, within phenotype) to prevent bias from larger families. The weight depends on the number of occurrences of the personal identifiers of both members of the pair. Pairs where both phenotypic values are missing are excluded from weight calculations. Other options for dealing with larger families are available; see Script arguments.
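The idea of a weighted correlation can be sketched as follows. The weighted_pearson function is the standard weighted Pearson formula. The pair_weights function is only an illustrative assumption of how pairs from larger families could be down-weighted; the exact weights used by the script are not specified here and may well differ.

```python
import math
from collections import Counter


def pair_weights(pairs):
    """Illustrative weights (an assumption, not the script's formula):
    pairs whose members occur in many pairs get a smaller weight."""
    counts = Counter(person for pair in pairs for person in pair)
    return [1.0 / max(counts[a], counts[b]) for a, b in pairs]


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)
```

With these illustrative weights, a person who appears in two pairs contributes half as much per pair as a person who appears in only one.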

To generate the kinship correlations, run the following:

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output

This will generate 2 files: my_output_Fam_correlations.csv, which contains 1 column per phenotype containing all the phenotypic correlations, and my_output_Fam_N.csv which contains the sum of weights by default.

To generate the extended kinship correlations, run the following:

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output --extended

Generating cross-trait correlations

To generate cross-trait (i.e. bivariate) correlations for the combinations of your phenotypes (e.g. twin1 trait1 - twin1 trait2, mother trait1 - son trait2, etc.), use the --bivar option:

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output --bivar

By adding this argument, the script will save two Excel (.xlsx) files, one for the correlations and one for the sum of weights (or N, depending on your other Script arguments), named my_output_bivar_Fam_correlations.xlsx and my_output_bivar_Fam_N.xlsx. These tables have 1 sheet per kinship (1 for MZM, 1 for MZF, etc.), and each sheet contains the full correlation matrix of all phenotypes included in your input file.

The diagonal of each matrix holds the standard within-phenotype kinship correlations (same as the standard output); the off-diagonal elements are the cross-trait cross-kinship correlations. Note that these tables are not symmetrical, as the values on either side of the diagonal represent different correlations. Columns are phenotypes in ID_0, and rows are phenotypes in ID_1. As a specific example: in the MotherSon correlation table, column x, row y represents the correlation of mother’s trait x with son’s trait y. Conversely, column y, row x represents the correlation of son’s trait x with mother’s trait y.
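The orientation of these matrices can be illustrated with a small sketch. It uses an unweighted Pearson correlation for brevity and a nested dictionary instead of a spreadsheet; the function names and the toy data are made up for this example.

```python
import math


def pearson(x, y):
    """Plain (unweighted) Pearson correlation, used here for brevity."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cross_trait_matrix(id0_traits, id1_traits):
    """Build matrix[row][column]: rows are ID_1 phenotypes, columns are ID_0 phenotypes.

    Both arguments map a phenotype name to a list of values, aligned by pair.
    """
    return {
        row: {col: pearson(id0_traits[col], id1_traits[row]) for col in id0_traits}
        for row in id1_traits
    }


# Toy MotherSon example: mothers are ID_0 (columns), sons are ID_1 (rows).
mothers = {"height": [1, 2, 3, 4], "weight": [2, 1, 4, 3]}
sons = {"height": [2, 1, 4, 3], "weight": [1, 3, 2, 4]}
matrix = cross_trait_matrix(mothers, sons)
# matrix["weight"]["height"]: mother's height with son's weight
# matrix["height"]["weight"]: mother's weight with son's height
```

In this toy data the two off-diagonal entries differ, which is why the table is not symmetrical.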

Note

The calculation for values on the diagonal in each sheet of the bivariate table is identical to that of the output without the --bivar option. Therefore, if this option is enabled, no CSV files are stored, only the Excel files.

Other common options

By default, the script calculates a weighted Pearson correlation. You can change this to a weighted Spearman correlation by specifying --method:

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output --extended --method spearman

To linearly correct your phenotypes for age before calculating the correlations, specify --correct:

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output --extended --correct age
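Conceptually, a linear correction for a single covariate such as age replaces each phenotype value by its residual from a straight-line fit. The sketch below shows that idea in plain Python for one covariate without missing values; it is an illustration, not the script's implementation, and the function name residualize is hypothetical.

```python
def residualize(phenotype, covariate):
    """Residuals of an ordinary least-squares fit of phenotype on one covariate."""
    n = len(phenotype)
    mean_x = sum(covariate) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # What remains of the phenotype after removing the linear covariate effect.
    return [y - (intercept + slope * x) for x, y in zip(covariate, phenotype)]
```

The correlations would then be computed on these residuals instead of on the raw phenotype values.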

The --correct argument accepts many combinations of input. For example, to correct for age, sex, and their interaction, use:

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output --extended --correct age+sex+age*sex

You can also use custom covariates as long as they are present with the same name in your input file. If you do so make sure to also add custom covariates to --exclude!

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output --extended --correct age+bmi --exclude bmi

By default, no correlation is computed (and NA is returned) when there are fewer than 30 complete pairs. You can change this to 15 (for example) as follows:

python KinshipCorrelationGenerator.py --data mydata.csv --outprefix my_output --extended --min_n 15
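The effect of this threshold can be sketched as follows. The helper names are hypothetical, the correlation function is passed in by the caller, and treating None and NaN as missing is an assumption made for this example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def correlation_or_na(x, y, corr, min_n=30):
    """Apply corr to the complete pairs, or return NaN if there are fewer than min_n."""
    pairs = [(a, b) for a, b in zip(x, y) if not is_missing(a) and not is_missing(b)]
    if len(pairs) < min_n:
        return float("nan")
    xs, ys = zip(*pairs)
    return corr(list(xs), list(ys))
```

Here corr can be any function that takes two equally long lists and returns a correlation.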

Script arguments

-h, --help

Prints a short help summary with overview of arguments

--morehelp

Print more help on one or more specific functions. Pass the argument name, or several argument names (comma-separated), or all for a more detailed description of the whole script and its arguments.

--data

Path to the datafile. This datafile should at least contain the columns FISNumber and age. See Creating the input data file.

--outprefix

Prefix to use for output files.

--extended

Add calculation of extended-pedigree correlations (UncleAunt, AuntNephew, NephewNiece, etc.) when used to compute correlations. When used together with --pedigree, generates an extended reformatted pedigree file.

--pedigree

Path to the raw pedigree file, only used when generating a new reformatted pedigree file. See Generating a reformatted pedigree file.

--method

Method for computing the correlation: Pearson or Spearman. Default is Pearson.

--bivar

Compute bivariate correlations of all combinations of phenotypes. See Generating cross-trait correlations.

--correct

Formula to correct phenotypes before computing correlations. Defaults to no correction.

--exclude

Phenotypes for which no correlations should be calculated. For example, when custom covariates are used. Identifier column, age and sex columns are always excluded.

--raw_n

Return an additional .csv file containing the raw number of samples in each kinship in addition to the weighted N file.

--randomsample

Use only 1 pair per family, instead of calculating weighted correlations.

--use_repeated_families

Include all participants from larger families, i.e. don’t drop or weigh for duplicate samples or larger families.

--longitudinal

When the input data is longitudinal, the script can perform family-based selection, such that the difference in the year of survey completion among family members is minimal. If you use this option, the input data should contain the columns index (within-subject number of the survey, starting at 1 for the first survey) and Source (string label of the survey). You should also add the argument --surveycompletion.

--surveycompletion

Data file with survey completion years. Should contain the columns FISNumber (identical to datafile), Source (identical to datafile), and invjr (year of survey completion).

Additional non-argument settings

The top of the Python file contains some additional settings you can tweak:

  • upper_boundary: Upper age boundary for inclusion in --longitudinal. default = 110

  • lower_boundary: Lower age boundary for inclusion in --longitudinal. default = 0

  • check_cutoff_drops: Save an Age-cutoff_drops.txt file detailing subjects dropped by the cutoffs. default = False

  • seed: Seed for random selection of subjects with --randomsample. default = 1415926536

  • explore_plot: Generate scatterplots of each correlation. default = False

  • save_separate_data: Store the raw data used for each correlation in separate files. default = False

  • parallel: Generate reformatted pedigree using multiple processing threads (currently not working). default = False
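To illustrate why a fixed seed matters for --randomsample, the sketch below picks one pair per family in a reproducible way. It is an assumption-laden example: the function one_pair_per_family, the input layout (family identifier plus two personal identifiers), and the use of Python's random module are all choices made for this illustration, and the script's own selection logic may differ.

```python
import random


def one_pair_per_family(pairs, seed=1415926536):
    """Pick one pair per family at random, reproducibly for a given seed.

    pairs is a list of (family_id, id_0, id_1) tuples.
    Returns a dict mapping family_id to the chosen (id_0, id_1).
    """
    rng = random.Random(seed)
    by_family = {}
    for family_id, id_0, id_1 in pairs:
        by_family.setdefault(family_id, []).append((id_0, id_1))
    return {family_id: rng.choice(members) for family_id, members in by_family.items()}
```

Running the function twice with the same seed and the same input returns the same selection, so results can be reproduced.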
