  CL-TF 'addon' for
  ADASTRA release BillCipher (July 2022)
-|======================================|-

In addition to the main TF- and CL-ASBs (allele-specific binding events), where data are aggregated across cell types and across TFs, respectively, we present the CLTF-ASBs,
which were built by aggregating data for individual <cell type, transcription factor> pairs.

By doing so, we have identified 327641 ASBs at 5% FDR in total, of which 93664 are 'novel' for the respective TFs compared to the main ADASTRA release BillCipher, i.e. they did not pass 5% FDR in the global TF-centric aggregation.

The results are somewhat more prone to false positives, and the CLTF-ASB distribution is skewed towards <cell type, TF> pairs with multiple datasets. Nonetheless, these data may be valuable in particular usage scenarios: interesting TF- and cell type-specific ASBs are often missed in ADASTRA because aggregating conflicting data across TFs or cell types lowers their statistical significance.

Release files structure:

For each <cell type, TF> pair there is a separate file listing all putative ASB events at eligible SNVs that pass the necessary coverage thresholds.

Each tsv-file is a plain tab-separated text document containing one line per single-nucleotide variant (SNV) with the following columns:

  'chr': SNV chromosome, hg38 genome assembly
  'pos': SNV position, hg38, 1-based
  'ID': rsSNP ID of the SNV according to dbSNP build 151
  'ref': reference allele (A, C, G, or T, according to hg38)
  'alt': alternative allele
  'repeat_type': type of the repetitive region (if any) encompassing the SNV according to the UCSC RepeatMasker track
  'n_peak_calls': total number of ChIP-Seq peak calls (across all GTRD peak callers) overlapping the SNV
  'n_peak_callers': number of unique ChIP-Seq peak callers (from the GTRD list: macs, macs2, sissrs, gem, cpics) that identified a peak overlapping the SNV

  'mean_BAD': mean background allelic dosage (BAD) of the genomic segment encompassing the SNV across all the aggregated experiments. Higher BAD values correspond to a higher contribution of aneuploidy and local copy-number variants. BAD values are taken into account when estimating the statistical significance of individual candidate ASBs (found in different experiments). Mean BAD is computed across all SNVs that were used in the statistical aggregation of the particular ASB call.
  'mean_SNP_per_segment': mean number of SNPs in a region with a constant BAD
  'n_aggregated': the number of datasets in the aggregation
  'total_cover': total read coverage of all aggregated SNVs

  'es_mean_ref', 'es_mean_alt': allele-wise effect size (log2), the weighted average of the log-ratios of observed to expected allelic read counts (the negative logarithms of the individual P-values from each dataset are used as weights).

  'fdrp_bh_ref', 'fdrp_bh_alt': allele-wise logit-aggregated and FDR-corrected P-values

  'novel': 'True' if the ASB did not pass 5% FDR in the primary TF-aggregation in the ADASTRA BillCipher release, 'False' if it passed 5% FDR in TF-aggregation
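
The columns listed so far are enough to select significant ASBs from a single file. The Python sketch below is illustrative only and rests on assumptions that this README does not confirm: that each file starts with a header line carrying the column names above, that "passing 5% FDR" means at least one of fdrp_bh_ref and fdrp_bh_alt is at most 0.05, and that values which cannot be parsed as numbers should be treated as missing. The function names, the 'side' key and the toy rows (including the rs_toy_* IDs) are made up for the example.

```python
import csv
import io


def to_float(text):
    """Parse a numeric field; anything unparsable becomes NaN (assumption)."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return float("nan")


def significant_asbs(handle, fdr=0.05):
    """Return rows where the Ref or Alt allele passes the FDR threshold.

    Each returned row gets an extra 'side' key ('ref' or 'alt') naming the
    allele with the smaller FDR. 'side' is a helper for this sketch only,
    not a column of the release files.
    """
    hits = []
    # Assumes a header line with the column names described above.
    for row in csv.DictReader(handle, delimiter="\t"):
        p_ref = to_float(row["fdrp_bh_ref"])
        p_alt = to_float(row["fdrp_bh_alt"])
        # Comparisons with NaN are False, so missing values never pass.
        if p_ref <= fdr or p_alt <= fdr:
            side = "ref" if (p_ref <= fdr and not p_alt < p_ref) else "alt"
            hits.append(dict(row, side=side))
    return hits


# Synthetic rows for demonstration only; these are not real ADASTRA records.
TOY_TSV = (
    "chr\tpos\tID\tref\talt\tfdrp_bh_ref\tfdrp_bh_alt\tnovel\n"
    "chr1\t1000\trs_toy_1\tA\tG\t0.001\t0.9\tTrue\n"
    "chr1\t2000\trs_toy_2\tC\tT\t0.7\t0.03\tFalse\n"
    "chr2\t3000\trs_toy_3\tG\tA\t0.4\t0.6\tTrue\n"
)

hits = significant_asbs(io.StringIO(TOY_TSV))
# 'novel' is stored as the text 'True' / 'False', so compare strings.
novel_hits = [r for r in hits if r["novel"] == "True"]

for r in hits:
    print(r["ID"], r["side"], r["novel"])
```

For a real file, the io.StringIO object would be replaced by an open file handle, e.g. open(path, newline=""), where path points to one of the per-pair tsv-files.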

For ASBs of transcription factors with motifs available in the HOCOMOCO v.11 (https://hocomoco.autosome.org) core collection, the P-values of the best motif hits were calculated for the Reference and Alternative alleles using SPRY-SARUS (https://github.com/autosome-ru/sarus). The motif position was fixed according to the best hit considering both the Reference and the Alternative alleles on both DNA strands:
  'motif_log_pref': -log10(motif P-value) for the best motif occurrence of the PWM (position weight matrix) for the Ref allele
  'motif_log_palt': -log10(motif P-value) for the Alt allele
  'motif_fc': motif Fold Change (FC), log2-ratio between motif P-values for the Reference and Alternative alleles. Positive values indicate Alt-ASBs (preferred binding to the Alternative allele). Negative values indicate Ref-ASBs. The value of 'None' is assigned in case the PWM was not available.
  'motif_pos': position of the SNV relative to the best PWM hit (taking into account the strand orientation of the motif hit), 0-based
  'motif_orient': '+' or '-', the DNA strand of the best motif PWM hit relative to the chromosome sequence in the genome assembly
  'motif_conc': Motif Concordance indicates whether the allelic read imbalance agrees with the motif Fold Change (FC, predicted from sequence analysis). Concordance is assessed for ASBs passing FDR of 25%. The following notation is used:
    'None': Motif is not available or both fdrp_bh_ref and fdrp_bh_alt are > 0.25
    'No hit': The best hit P-value is higher than 0.0005
    'Weak concordant': The absolute value of FC is less than 2 but consistent with the allelic read imbalance
    'Weak discordant': The absolute value of FC is less than 2 and contrasts with the allelic read imbalance
    'Concordant': The absolute value of FC is greater than or equal to 2 and consistent with the allelic read imbalance
    'Discordant': The absolute value of FC is greater than or equal to 2 and contrasts with the allelic read imbalance
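
The motif annotation rules above can be sketched in Python as follows. This is a reading of the column descriptions, not the code used to build the release, and several points are assumptions: FC is taken as log2(P_ref / P_alt) so that positive values favour the Alternative allele; the "best hit P-value" is taken as the smaller of the two allelic motif P-values; the direction of the allelic read imbalance is taken from the allele with the smaller FDR; the 'None' condition is checked before 'No hit'; and a tie at FC = 0 is not covered by the description and falls into the Ref branch here. The function and constant names are hypothetical.

```python
import math

NO_HIT_P = 0.0005   # best motif hit P-value above this gives 'No hit'
CONC_FDR = 0.25     # concordance is assessed for ASBs passing 25% FDR
FC_CUTOFF = 2.0     # |FC| below this gives the 'Weak ...' labels


def motif_fc(log_pref, log_palt):
    """log2(P_ref / P_alt) computed from the -log10 motif P-values."""
    return (log_palt - log_pref) * math.log2(10)


def motif_conc(fdr_ref, fdr_alt, log_pref, log_palt):
    """Assign a concordance label following the notation listed above."""
    if log_pref is None or log_palt is None:
        return "None"  # motif not available
    if fdr_ref > CONC_FDR and fdr_alt > CONC_FDR:
        return "None"  # ASB does not pass 25% FDR on either allele
    # Assumption: the best hit is the allele with the smaller P-value.
    best_p = 10 ** -max(log_pref, log_palt)
    if best_p > NO_HIT_P:
        return "No hit"
    fc = motif_fc(log_pref, log_palt)
    # Assumption: the imbalance direction is the allele with the smaller FDR.
    asb_side = "ref" if fdr_ref <= fdr_alt else "alt"
    # FC == 0 is not covered by the description; it lands in 'ref' here.
    motif_side = "alt" if fc > 0 else "ref"
    strength = "" if abs(fc) >= FC_CUTOFF else "weak "
    agreement = "concordant" if asb_side == motif_side else "discordant"
    return (strength + agreement).capitalize()


# Made-up inputs: a Ref-ASB whose motif hit is stronger on the Ref allele.
print(motif_conc(0.01, 0.9, 5.0, 3.0))
```

With these inputs the motif P-values are 1e-5 (Ref) and 1e-3 (Alt), so FC is about -6.6 and the label is 'Concordant'.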


For additional details please refer to the primary README of the ADASTRA release dump, and to the information provided at adastra.autosome.org
