Process MS data: clean, normalize and summarize before differential analysis

dataProcess(
  raw,
  logTrans = 2,
  normalization = "equalizeMedians",
  nameStandards = NULL,
  featureSubset = "all",
  remove_uninformative_feature_outlier = FALSE,
  min_feature_count = 2,
  n_top_feature = 3,
  summaryMethod = "TMP",
  equalFeatureVar = TRUE,
  censoredInt = "NA",
  MBimpute = TRUE,
  remove50missing = FALSE,
  fix_missing = NULL,
  maxQuantileforCensored = 0.999,
  use_log_file = TRUE,
  append = FALSE,
  verbose = TRUE,
  log_file_path = NULL
)

Arguments

raw	name of the raw (input) data set.
logTrans	base of logarithm transformation: 2 (default) or 10.
normalization	normalization to remove systematic bias between MS runs. Three methods are supported: 'equalizeMedians' (default) performs constant normalization (equalizing the medians) based on reference signals; 'quantile' performs quantile normalization based on reference signals; 'globalStandards' normalizes using global standard proteins. If FALSE, no normalization is performed.
nameStandards	optional vector of global standard peptide names. Required only for normalization with global standard peptides.
featureSubset	"all" (default) uses all features in the data set. "top3" uses the 3 features with the highest average log-intensity across runs. "topN" uses the N features with the highest average log-intensity across runs; N is set with the n_top_feature option. "highQuality" flags uninformative features and outliers.
remove_uninformative_feature_outlier	optional. Only required if featureSubset = "highQuality". TRUE removes 1) noisy features (flagged in the column feature_quality with "Uninformative") and 2) outliers (flagged in the column is_outlier with TRUE) before run-level summarization. FALSE (default) uses all features and intensities for run-level summarization.
min_feature_count	optional. Only required if featureSubset = "highQuality". Defines a minimum number of informative features a protein needs to be considered in the feature selection algorithm.
n_top_feature	optional. Only required if featureSubset = 'topN'. In that case, it specifies the number of top features to use. Default is 3, which uses the top 3 features.
summaryMethod	"TMP" (default) means Tukey's median polish, which is a robust estimation method. "linear" uses a linear mixed model.
equalFeatureVar	only for summaryMethod = "linear". Logical; controls whether the model accounts for heterogeneous variation among intensities from different features. TRUE (default) assumes equal variance among intensities from features. FALSE does not assume equal variance and accounts for heterogeneous variation across features.
censoredInt	Whether missing values are censored or missing at random. 'NA' (default) assumes that all 'NA's in the 'Intensity' column are censored. '0' treats zero intensities as censored; in this case, NA intensities are missing at random. Output from Skyline should use '0'. NULL assumes that all NA intensities are randomly missing.
MBimpute	only for summaryMethod = "TMP" and censoredInt = 'NA' or '0'. TRUE (default) imputes 'NA' or '0' (depending on the censoredInt option) using an accelerated failure time model. FALSE uses the values assigned by cutoffCensored.
remove50missing	only for summaryMethod = "TMP". TRUE removes runs that have more than 50% missing values. FALSE is the default.
fix_missing	Optional, same as the `fix_missing` parameter in MSstatsConvert::MSstatsBalancedDesign function
maxQuantileforCensored	Maximum quantile for deciding censored missing values, default is 0.999
use_log_file	logical. If TRUE, information about data processing will be saved to a file.
append	logical. If TRUE, information about data processing will be added to an existing log file.
verbose	logical. If TRUE, information about data processing will be printed to the console.
log_file_path	character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If `append = TRUE`, has to be a valid path to a file.
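
The "topN" option of featureSubset described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the toy data frame, its column names (protein, feature, run, log_int) and the helper select_top_n() are hypothetical, and how ties and missing values are handled here (NA rows are simply ignored when averaging) is an assumption.

```r
# Toy long-format data with hypothetical column names; values are log2 intensities.
toy <- data.frame(
  protein = rep("P1", 12),
  feature = rep(c("f1", "f2", "f3", "f4"), each = 3),
  run     = rep(1:3, times = 4),
  log_int = c(10, 11, 12,   # f1, mean 11
              14, 15, 16,   # f2, mean 15
               8,  9, NA,   # f3, mean 8.5 (the NA row is ignored)
              12, 13, 14)   # f4, mean 13
)

# Hypothetical helper: keep, per protein, the n features with the highest
# average log-intensity across runs.
select_top_n <- function(data, n) {
  # The formula interface of aggregate() drops rows with NA by default.
  means <- aggregate(log_int ~ protein + feature, data = data, FUN = mean)
  keep <- do.call(rbind, lapply(split(means, means$protein), function(d) {
    head(d[order(d$log_int, decreasing = TRUE), ], n)
  }))
  key_all  <- paste(data$protein, data$feature)
  key_keep <- paste(keep$protein, keep$feature)
  data[key_all %in% key_keep, ]
}

top2 <- select_top_n(toy, 2)
unique(as.character(top2$feature))
```

With n = 2 the sketch keeps f2 and f4, the two features with the highest mean log-intensity for protein P1.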

Examples

# Consider a raw data set (SRMRawData) from a label-based SRM experiment in a yeast study
# with ten time points (T1-T10) of interest and three biological replicates.
# It is a time course experiment. The goal is to detect protein abundance changes
# across time points.
head(SRMRawData)
#>     ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge
#> 243        IDHC   ATDVIVPEEGELR               2          y7            NA
#> 244        IDHC   ATDVIVPEEGELR               2          y7            NA
#> 245        IDHC   ATDVIVPEEGELR               2          y8            NA
#> 246        IDHC   ATDVIVPEEGELR               2          y8            NA
#> 247        IDHC   ATDVIVPEEGELR               2          y9            NA
#> 248        IDHC   ATDVIVPEEGELR               2          y9            NA
#>     IsotopeLabelType Condition BioReplicate Run   Intensity
#> 243                H         1        ReplA   1 84361.08350
#> 244                L         1        ReplA   1   215.13526
#> 245                H         1        ReplA   1 29778.10188
#> 246                L         1        ReplA   1    98.02134
#> 247                H         1        ReplA   1 17921.29255
#> 248                L         1        ReplA   1    60.47029
# Log2 transformation and normalization are applied (default)
QuantData <- dataProcess(SRMRawData, use_log_file = FALSE)
#> INFO  [2021-07-05 20:05:33] ** Features with one or two measurements across runs are removed.
#> INFO  [2021-07-05 20:05:33] ** Fractionation handled.
#> INFO  [2021-07-05 20:05:33] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO  [2021-07-05 20:05:33] ** Log2 intensities under cutoff = 3.776  were considered as censored missing values.
#> INFO  [2021-07-05 20:05:33] ** Log2 intensities = NA were considered as censored missing values.
#> INFO  [2021-07-05 20:05:33] ** Use all features that the dataset originally has.
#> INFO  [2021-07-05 20:05:33] 
#>  # proteins: 2
#>  # peptides per protein: 2-2
#>  # features per peptide: 3-3
#> INFO  [2021-07-05 20:05:33] 
#>                     1 2 3 4 5 6 7 8 9 10
#>              # runs 3 3 3 3 3 3 3 3 3  3
#>     # bioreplicates 3 3 3 3 3 3 3 3 3  3
#>  # tech. replicates 1 1 1 1 1 1 1 1 1  1
#> INFO  [2021-07-05 20:05:33]  == Start the summarization per subplot...
#> INFO  [2021-07-05 20:05:33]  == Summarization is done.
head(QuantData$FeatureLevelData)
#>   PROTEIN         PEPTIDE TRANSITION               FEATURE LABEL GROUP RUN
#> 1    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   1
#> 2    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   1
#> 3    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   2
#> 4    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   2
#> 5    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   3
#> 6    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   3
#>   SUBJECT FRACTION originalRUN censored  INTENSITY ABUNDANCE newABUNDANCE
#> 1       0        1           1    FALSE 84361.0835 15.855859    15.855859
#> 2       1        1           1    FALSE   215.1353  7.240669     7.240669
#> 3       0        1           2    FALSE 62109.5876 15.801179    15.801179
#> 4       2        1           2    FALSE  1205.2252 10.113738    10.113738
#> 5       0        1           3    FALSE 65114.3646 15.755022    15.755022
#> 6       3        1           3    FALSE  1476.3046 10.292109    10.292109
#>   predicted
#> 1        NA
#> 2        NA
#> 3        NA
#> 4        NA
#> 5        NA
#> 6        NA
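
The summarization step logged above uses summaryMethod = "TMP" by default. As a rough illustration of Tukey's median polish, the sketch below applies stats::medpolish() to a toy feature-by-run matrix of log2 intensities and takes the overall level plus the run (column) effect as the run-level summary. The toy matrix is invented for illustration, and reading the run summary as overall + column effect is an assumption about how such a summary is typically formed, not a statement about the package's internals.

```r
# Rows are features, columns are runs; values are log2 intensities.
# The toy matrix is exactly additive: feature effect + run effect.
feature_effect <- c(10, 12, 11)
run_effect     <- c(0, 1, -1, 2)
x <- outer(feature_effect, run_effect, "+")
dimnames(x) <- list(paste0("feature", 1:3), paste0("run", 1:4))

# Median polish alternately sweeps out row and column medians.
fit <- medpolish(x, na.rm = TRUE, trace.iter = FALSE)

# Run-level summary: overall level plus the run (column) effect.
# For this additive toy matrix it equals the median feature effect (11)
# plus each run effect, i.e. 11, 12, 10, 13.
run_summary <- fit$overall + fit$col
run_summary
```

Because medians are used instead of means, a single outlying feature intensity in one run has limited influence on that run's summary.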
# Log10 transformation and normalization are applied
QuantData1 <- dataProcess(SRMRawData, logTrans = 10, use_log_file = FALSE)
#> INFO  [2021-07-05 20:05:33] ** Features with one or two measurements across runs are removed.
#> INFO  [2021-07-05 20:05:33] ** Fractionation handled.
#> INFO  [2021-07-05 20:05:33] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO  [2021-07-05 20:05:33] ** Log2 intensities under cutoff = 1.1367  were considered as censored missing values.
#> INFO  [2021-07-05 20:05:33] ** Log2 intensities = NA were considered as censored missing values.
#> INFO  [2021-07-05 20:05:33] ** Use all features that the dataset originally has.
#> INFO  [2021-07-05 20:05:33] 
#>  # proteins: 2
#>  # peptides per protein: 2-2
#>  # features per peptide: 3-3
#> INFO  [2021-07-05 20:05:33] 
#>                     1 2 3 4 5 6 7 8 9 10
#>              # runs 3 3 3 3 3 3 3 3 3  3
#>     # bioreplicates 3 3 3 3 3 3 3 3 3  3
#>  # tech. replicates 1 1 1 1 1 1 1 1 1  1
#> INFO  [2021-07-05 20:05:33]  == Start the summarization per subplot...
#> INFO  [2021-07-05 20:05:33]  == Summarization is done.
head(QuantData1$FeatureLevelData)
#>   PROTEIN         PEPTIDE TRANSITION               FEATURE LABEL GROUP RUN
#> 1    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   1
#> 2    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   1
#> 3    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   2
#> 4    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   2
#> 5    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   3
#> 6    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   3
#>   SUBJECT FRACTION originalRUN censored  INTENSITY ABUNDANCE newABUNDANCE
#> 1       0        1           1    FALSE 84361.0835  4.773089     4.773089
#> 2       1        1           1    FALSE   215.1353  2.179659     2.179659
#> 3       0        1           2    FALSE 62109.5876  4.756629     4.756629
#> 4       2        1           2    FALSE  1205.2252  3.044538     3.044538
#> 5       0        1           3    FALSE 65114.3646  4.742734     4.742734
#> 6       3        1           3    FALSE  1476.3046  3.098233     3.098233
#>   predicted
#> 1        NA
#> 2        NA
#> 3        NA
#> 4        NA
#> 5        NA
#> 6        NA
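
Changing logTrans only rescales the abundances: a log10 value equals the corresponding log2 value multiplied by log10(2). The check below uses the six ABUNDANCE values printed above for QuantData and QuantData1, copied by hand, so agreement is limited by the printed precision.

```r
# ABUNDANCE values copied from the printed output above.
log2_abundance  <- c(15.855859, 7.240669, 15.801179,
                     10.113738, 15.755022, 10.292109)
log10_abundance <- c(4.773089, 2.179659, 4.756629,
                     3.044538, 4.742734, 3.098233)

# Change of base: log10(x) = log2(x) * log10(2).
rescaled <- log2_abundance * log10(2)

# Largest discrepancy; expected to be at the level of print rounding.
max_diff <- max(abs(rescaled - log10_abundance))
max_diff
```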
# Log2 transformation is applied without normalization
QuantData2 <- dataProcess(SRMRawData, normalization = FALSE, use_log_file = FALSE)
#> INFO  [2021-07-05 20:05:33] ** Features with one or two measurements across runs are removed.
#> INFO  [2021-07-05 20:05:33] ** Fractionation handled.
#> INFO  [2021-07-05 20:05:33] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO  [2021-07-05 20:05:34] ** Log2 intensities under cutoff = 3.7346  were considered as censored missing values.
#> INFO  [2021-07-05 20:05:34] ** Log2 intensities = NA were considered as censored missing values.
#> INFO  [2021-07-05 20:05:34] ** Use all features that the dataset originally has.
#> INFO  [2021-07-05 20:05:34] 
#>  # proteins: 2
#>  # peptides per protein: 2-2
#>  # features per peptide: 3-3
#> INFO  [2021-07-05 20:05:34] 
#>                     1 2 3 4 5 6 7 8 9 10
#>              # runs 3 3 3 3 3 3 3 3 3  3
#>     # bioreplicates 3 3 3 3 3 3 3 3 3  3
#>  # tech. replicates 1 1 1 1 1 1 1 1 1  1
#> INFO  [2021-07-05 20:05:34]  == Start the summarization per subplot...
#> INFO  [2021-07-05 20:05:34]  == Summarization is done.
head(QuantData2$FeatureLevelData)
#>   PROTEIN         PEPTIDE TRANSITION               FEATURE LABEL GROUP RUN
#> 1    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   1
#> 2    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   1
#> 3    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   2
#> 4    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   2
#> 5    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     H     0   3
#> 6    IDHC ATDVIVPEEGELR_2      y7_NA ATDVIVPEEGELR_2_y7_NA     L     1   3
#>   SUBJECT FRACTION originalRUN censored  INTENSITY ABUNDANCE newABUNDANCE
#> 1       0        1           1    FALSE 84361.0835  16.36429     16.36429
#> 2       1        1           1    FALSE   215.1353   7.74910      7.74910
#> 3       0        1           2    FALSE 62109.5876  15.92253     15.92253
#> 4       2        1           2    FALSE  1205.2252  10.23509     10.23509
#> 5       0        1           3    FALSE 65114.3646  15.99069     15.99069
#> 6       3        1           3    FALSE  1476.3046  10.52777     10.52777
#>   predicted
#> 1        NA
#> 2        NA
#> 3        NA
#> 4        NA
#> 5        NA
#> 6        NA
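
Comparing the printed ABUNDANCE values of QuantData2 (no normalization) and QuantData (default 'equalizeMedians') suggests what constant normalization does: within a run, every log2 intensity is shifted by the same amount. The sketch below checks this on the six rows shown above, with values copied by hand. It covers one feature in three runs only, so it is an illustration rather than a verification of the whole data set.

```r
# ABUNDANCE values copied from the printed output above (rows 1-6).
raw_log2  <- c(16.36429, 7.74910, 15.92253, 10.23509, 15.99069, 10.52777)
norm_log2 <- c(15.855859, 7.240669, 15.801179,
               10.113738, 15.755022, 10.292109)
run <- c(1, 1, 2, 2, 3, 3)

# Per-row adjustment applied by normalization.
shift <- raw_log2 - norm_log2

# Within each run the heavy (H) and light (L) rows share one shift,
# roughly 0.508, 0.121 and 0.236 for runs 1, 2 and 3.
run_shift <- tapply(shift, run, mean)
round(run_shift, 3)

# Spread of the shift within each run; limited by print rounding.
spread <- tapply(shift, run, function(s) diff(range(s)))
spread
```

The shift differs between runs but is constant within a run, which is consistent with a per-run constant adjustment on the log scale.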
