Analysing data

In this section we will go through a typical analysis workflow using the castle package:

  1. How to load a set of droplet counts from a QuantaSoft dataset. For this the data needs to be in .csv-format.
  2. Training a model for the CASTLE algorithm. This is done by analyzing the (assay) specific pattern of errors in a set of negative (background) samples.
  3. Analyze some samples of interest for the presence of a mutation, by utilizing the CASTLE algorithm with the model trained above.

The first thing we need to do is load the packages we need for the analysis:

library(castle)
#> Loading required package: readr
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Loading QuantaSoft data

We start by loading a QuantaSoft dataset in .csv-format using the included import_QS_files-function. Here we will use the example data included in the package, where we have both negative samples of different concentrations (training samples) and some test samples available for a KRAS G12D assay. For this assay Ch1 was used for measuring the mutant signal, which is set using the flag Ch1_is_mutation = TRUE. If Ch2 was used this is simply set to FALSE. Data is loaded like this:

path_to_my_data <- c("path/to/a/csv_file", "path/to/a/folder/with/csv_files/")
my_samples <- import_QS_files(path_to_my_data)

For this example we will load the training samples that are bundled in the castle-package:

path_to_my_data <- system.file("extdata/example_data/", "background_data_KRAS_G12D.csv", package = "castle")
background_samples <- import_QS_files(path_to_my_data, Ch1_is_mutation = TRUE)

The first 5 samples/wells in background_samples looks like this:

head(background_samples, 5)
#> # A tibble: 5 × 13
#>   FileName    Sample  Well  Ch1TargetType Ch2TargetType Target  DoubleNegativeD…
#>   <chr>       <chr>   <chr> <chr>         <chr>         <chr>              <dbl>
#> 1 background… NC20_1… A01   Unknown       Unknown       KRAS_G…             2415
#> 2 background… NC21_1… B01   Unknown       Unknown       KRAS_G…             1534
#> 3 background… NC22_1… C01   Unknown       Unknown       KRAS_G…             2041
#> 4 background… NC23_1… D01   Unknown       Unknown       KRAS_G…              183
#> 5 background… NC24_1… E01   Unknown       Unknown       KRAS_G…             1487
#> # … with 6 more variables: WildtypeOnlyDroplets <dbl>,
#> #   MutantOnlyDroplets <dbl>, DoublePositiveDroplets <dbl>,
#> #   TotalDroplets <dbl>, NumberOfMergedWells <dbl>, MergedWells <lgl>

Note that besides some metadata (file name, target etc.) for each sample, the dataset contains the columns “WildtypeOnlyDroplets”, “MutantOnlyDroplets”, “DoubleNegativeDroplets” and “DoublePositiveDroplets”. These are the counts of the four different kinds of droplets for each sample based on the two signals in Ch1 and Ch2. The following model training and statistical tests are based solely on these counts.

Training a model for CASTLE

To train the model we are only interested in the negative samples. Ideally these should cover the range of different concentrations of DNA that the trained model is expected to encounter in future test samples. Before training we start by removing the empty samples (“NTC”) and positive controls from the training dataset we loaded above:

# Remove control samples ("NTC" and "Positive Control")
clean_background_samples <-
  background_samples %>%
  filter(
    !(Ch1TargetType %in% c("NTC", "Positive Control")),
    !(Ch2TargetType %in% c("NTC", "Positive Control"))
  )

The cleaned data is then used for model training using the function train_simple_ddpcr_model:

trained_model <- train_integrated_ddpcr_model(
  background_samples = clean_background_samples
)

The object trained_model is simply a list of parameters used in the statistical model in CASTLE.

Testing some samples with CASTLE

For testing we will use a variety of samples from two different patients, who have undergone curative surgery for colorectal cancer. We start by loading the data from these patients. To load the data we will again use the function import_QS_files. But this time, data from some of the samples is distributed across multiple wells. These can be merged together using the flag merge_wells = 'yes' (see ?import_QS_files for other methods for merging). Furthermore samples across multiple files can also be merged using the flag merge_files = TRUE if necessary.

# Path to example data in the package
path_to_my_test_samples <- system.file(
  "extdata/example_data/test_samples",
  c(
    "patient_1_plasma_KRAS_G12D.csv",
    "patient_2_plasma_KRAS_G12D.csv"
  ),
  package = "castle"
)

# Import test examples
test_samples <-
  import_QS_files(
    path_to_my_test_samples,
    merge_wells = "yes"
  )

To test the samples we use the function test_tumor_sample_integrated and use the model we trained above:

# Make calls using CASTLE
test_res <- test_tumor_sample_integrated(
  test_samples = test_samples,
  integrated_model = trained_model
)

We will start by looking at some of the results for the tumor and PBL (white blood cells) sample from each patient:

test_res %>%
  dplyr::filter(Sample %in% c("Tumor", "PBL")) %>%
  dplyr::select(FileName, Sample, p_val, mutation_detected)
#> # A tibble: 4 × 4
#>   FileName                       Sample p_val mutation_detected
#>   <chr>                          <chr>  <dbl> <lgl>            
#> 1 patient_1_plasma_KRAS_G12D.csv Tumor  0     TRUE             
#> 2 patient_1_plasma_KRAS_G12D.csv PBL    1     FALSE            
#> 3 patient_2_plasma_KRAS_G12D.csv Tumor  0     TRUE             
#> 4 patient_2_plasma_KRAS_G12D.csv PBL    0.455 FALSE

From this, we clearly see that the mutation is present in the tumor from both of the patients and not in PBL. We can therefore use the mutation as a cancer-specific marker, and thus to check if the patient has residual disease after curative intent surgery. As a positive test, we start by looking at the “Plasma_A” sample of both patients, which is a blood sample collected before surgery:

test_res %>%
  dplyr::filter(Sample %in% c("Plasma_A")) %>%
  dplyr::select(FileName, Sample, p_val, mutation_detected)
#> # A tibble: 2 × 4
#>   FileName                       Sample   p_val mutation_detected
#>   <chr>                          <chr>    <dbl> <lgl>            
#> 1 patient_1_plasma_KRAS_G12D.csv Plasma_A     0 TRUE             
#> 2 patient_2_plasma_KRAS_G12D.csv Plasma_A     0 TRUE

We see that these samples indicate that the mutation, and thus the tumor, is present in both patients, as expected. From the result we can also see the following estimates of properties of the ddPCR analysis and the ctDNA in the blood:

test_res %>%
  dplyr::filter(Sample %in% c("Plasma_A")) %>%
  dplyr::select(
    FileName, Sample,
    mutant_molecules_per_droplet,
    wildtype_molecules_per_droplet,
    total_mutant_molecules,
    mutant_molecules_per_droplet_CI_lower, mutant_molecules_per_droplet_CI_upper,
    wildtype_molecules_per_droplet_CI_lower, wildtype_molecules_per_droplet_CI_upper
  )
#> # A tibble: 2 × 9
#>   FileName      Sample  mutant_molecules_… wildtype_molecules… total_mutant_mol…
#>   <chr>         <chr>                <dbl>               <dbl>             <dbl>
#> 1 patient_1_pl… Plasma…           0.000242              0.224               21.3
#> 2 patient_2_pl… Plasma…           0.00386               0.0960             376. 
#> # … with 4 more variables: mutant_molecules_per_droplet_CI_lower <dbl>,
#> #   mutant_molecules_per_droplet_CI_upper <dbl>,
#> #   wildtype_molecules_per_droplet_CI_lower <dbl>,
#> #   wildtype_molecules_per_droplet_CI_upper <dbl>

test_res %>%
  dplyr::filter(Sample %in% c("Plasma_A")) %>%
  dplyr::select(
    FileName, Sample,
    allele_frequency,
    allele_frequency_CI_lower, allele_frequency_CI_upper
  )
#> # A tibble: 2 × 5
#>   FileName        Sample  allele_frequency allele_frequency_… allele_frequency_…
#>   <chr>           <chr>              <dbl>              <dbl>              <dbl>
#> 1 patient_1_plas… Plasma…          0.00108           0.000576            0.00181
#> 2 patient_2_plas… Plasma…          0.0387            0.0339              0.0438

We can then look what happens after the surgery. To do this we look at the blood samples marked as “Plasma_B”:

test_res %>%
  dplyr::filter(Sample %in% c("Plasma_B")) %>%
  dplyr::select(FileName, Sample, p_val, mutation_detected)
#> # A tibble: 2 × 4
#>   FileName                       Sample   p_val mutation_detected
#>   <chr>                          <chr>    <dbl> <lgl>            
#> 1 patient_1_plasma_KRAS_G12D.csv Plasma_B 0.306 FALSE            
#> 2 patient_2_plasma_KRAS_G12D.csv Plasma_B 0.242 FALSE

We now see that the test indicates that the mutation is no longer present in any of the patients, which is consistent with the tumor being removed by surgery.

The full list of information available in the analysis results is:

#>  [1] "FileName"                               
#>  [2] "Sample"                                 
#>  [3] "Well"                                   
#>  [4] "Ch1TargetType"                          
#>  [5] "Ch2TargetType"                          
#>  [6] "Target"                                 
#>  [7] "DoubleNegativeDroplets"                 
#>  [8] "WildtypeOnlyDroplets"                   
#>  [9] "MutantOnlyDroplets"                     
#> [10] "DoublePositiveDroplets"                 
#> [11] "TotalDroplets"                          
#> [12] "NumberOfMergedWells"                    
#> [13] "MergedWells"                            
#> [14] "mutant_molecules_per_droplet"           
#> [15] "wildtype_molecules_per_droplet"         
#> [16] "p_val"                                  
#> [17] "test_statistic"                         
#> [18] "mutation_detected"                      
#> [19] "allele_frequency"                       
#> [20] "total_mutant_molecules"                 
#> [21] "total_wildtype_molecules"               
#> [22] "mutant_molecules_per_droplet_CI_lower"  
#> [23] "mutant_molecules_per_droplet_CI_upper"  
#> [24] "allele_frequency_CI_lower"              
#> [25] "allele_frequency_CI_upper"              
#> [26] "total_mutant_molecules_CI_lower"        
#> [27] "total_mutant_molecules_CI_upper"        
#> [28] "wildtype_molecules_per_droplet_CI_lower"
#> [29] "wildtype_molecules_per_droplet_CI_upper"

The results can be exported to .csv format via the function write_csv(x = test_res, file = "your/file/path).

Note

Alternatively the functions train_integrated_ddpcr_model and test_tumor_sample_simple are used for training and testing respectively. These are based on a faster (and simpler) version of the CASTLE algorithm, but have very similar performance. See [1] for more information.

Simulating data

In the package the function simulate_droplet_counts for simulating droplet counts from the statistical model that CASTLE is based on is also provided. Here we will simulate a positive and negative sample based on the parameters learned above, and test them using test_tumor_sample_integrated:

set.seed(42)
# DNA concentration (molecules per droplet)
l <- 0.5 # WT
r_positive <- 0.01 # M
r_negative <- 0 # M

# Number of droplets
n_drops <- 14000

# Positive sample
test_sample_positive <-
  simulate_droplet_counts(
    l = l, r = r_positive,
    a = trained_model$a_est, b = trained_model$b_est, c = trained_model$c_est,
    n_drops = n_drops
  )

# Negative sample
test_sample_negative <-
  simulate_droplet_counts(
    l = l, r = r_negative,
    a = trained_model$a_est, b = trained_model$b_est, c = trained_model$c_est,
    n_drops = n_drops
  )


simulated_samples <- bind_rows(
  data.frame(Sample = "Positive", test_sample_positive),
  data.frame(Sample = "Negative", test_sample_negative)
)


# Test samples
sim_test_res <- test_tumor_sample_simple(
  test_samples = simulated_samples,
  simple_model = trained_model
)

# Print some relevant results
sim_test_res %>% select(
  Sample, p_val, mutation_detected,
  total_mutant_molecules, total_mutant_molecules_CI_lower, total_mutant_molecules_CI_upper
)
#>     Sample     p_val mutation_detected total_mutant_molecules
#> 1 Positive 0.0000000              TRUE            144.3065735
#> 2 Negative 0.3989545             FALSE              0.9685377
#>   total_mutant_molecules_CI_lower total_mutant_molecules_CI_upper
#> 1                        115.4188                      177.638344
#> 2                          0.0000                        7.088729