FragPipe TMT QC • prophosqua

Configuration

This section shows the key parameters used for this QC analysis. The report can be parameterized to analyze different datasets by providing custom file paths and identifiers.

project information
name	value
workunit_id	1234
project_id	4321
datasetname	ptm_example-main/qc_example_data/QCmini/psm.tsv
fastasequence	ptm_example-main/qc_example_data/fgcz_3702_UP000006548_AraUniprot_1spg_d_20231024.fasta

QC will only use the first psm file.

using : ptm_example-main/qc_example_data/QCmini/psm.tsv

FASTA Database Summary

The FASTA database contains the protein sequences used for peptide identification. Understanding the composition of your search database is crucial for interpreting identification results and assessing potential biases.

Key metrics to evaluate:

Database size: Typical sizes vary by organism:
- Human: ~20,000 proteins
- Mouse: ~22,000 proteins
- Yeast: ~6,000 proteins
- E. coli: ~4,400 proteins
- Arabidopsis: ~27,000 protein coding genes, ~35,000 proteins
Decoy sequences: Reverse sequences (REV_) are used to estimate false discovery rates
Amino acid composition: Should reflect the expected composition of your sample organism

The FASTA database has 55907 sequences including decoys, and 27953 without decoys. The amino acid frequency distribution below shows the composition of your search database, which should be consistent with the expected proteome composition.

nr sequences:
 27953
 length summary:
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     5.0   198.0   348.0   404.8   520.0  5400.0
 AA frequencies:
      [,1]
 A  712383
 B       5
 C  213335
 D  608123
 E  760476
 F  484702
 G  730135
 H  256377
 I  601507
 K  720098
 L 1078566
 M  277709
 N  498002
 P  543057
 Q  395258
 R  609817
 S 1031163
 T  577392
 V  754323
 W  140645
 X      15
 Y  321973
 Z       3
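
The raw counts above are easier to compare with a reference proteome when expressed as percentages. The following is a minimal sketch in base R that uses the counts exactly as printed; the variable names are illustrative and not part of the report code.

```r
# Amino acid counts copied from the "AA frequencies" output above.
aa_counts <- c(
  A = 712383, B = 5, C = 213335, D = 608123, E = 760476,
  F = 484702, G = 730135, H = 256377, I = 601507, K = 720098,
  L = 1078566, M = 277709, N = 498002, P = 543057, Q = 395258,
  R = 609817, S = 1031163, T = 577392, V = 754323, W = 140645,
  X = 15, Y = 321973, Z = 3
)

# Relative frequency in percent, sorted from most to least common.
aa_percent <- sort(100 * aa_counts / sum(aa_counts), decreasing = TRUE)
round(head(aa_percent, 5), 2)

# Ambiguity codes (B, X, Z) should be rare in a curated database.
ambiguous <- sum(aa_counts[c("B", "X", "Z")])
ambiguous
```

Leucine and serine are the most frequent residues in this database, and the ambiguity codes account for only 23 residues in total.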

Data Processing Overview

This analysis processes peptide-spectrum matches (PSMs) from FragPipe with specific quality filters to ensure reliable quantification.

Quality filters applied:

Peptide Prophet probability > 0.9: Ensures high confidence identifications
Abundance threshold > 0: Removes PSMs without quantitative information
Purity threshold = 0: No isolation purity filtering (all PSMs included)

Rows: 5436 Columns: 55
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (14): Spectrum, Spectrum File, Peptide, Modified Peptide, Prev AA, Next ...
dbl (38): Peptide Length, Charge, Retention, Observed Mass, Calibrated Obser...
lgl  (3): Observed Modifications, Is Unique, Quan Usage

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This analysis uses all PSMs (spectra) reported in the psm.tsv file that have a Peptide Prophet probability greater than 0.9 and an abundance greater than 0 in at least one channel. No other filtering is enabled. This reduces the number of PSMs from 5436 to 5275.

The reduction in PSM count reflects the stringency of the quality filters. A reduction of 1-10% is typical and suggests that the filters are working as intended.
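
The filter logic described above can be sketched on a toy table. This is an illustration only: the data frame and its column names (`probability`, `ch_126`, `ch_127N`) are hypothetical stand-ins, not the headers of psm.tsv and not the code that produced this report. The last lines compute the percent reduction from the two PSM counts reported above.

```r
# Toy PSM table; column names are hypothetical, not the psm.tsv headers.
psm <- data.frame(
  probability = c(0.99, 0.95, 0.85, 0.999, 0.92),
  ch_126      = c(1200, 0, 800, 0, 50),
  ch_127N     = c(1100, 0, 750, 30, 0)
)
channels <- c("ch_126", "ch_127N")

# Keep PSMs with probability > 0.9 and abundance > 0 in at least one channel.
has_signal <- rowSums(psm[, channels] > 0) > 0
kept <- psm[psm$probability > 0.9 & has_signal, ]
nrow(kept)

# Percent of PSMs removed, using the counts reported above (5436 -> 5275).
percent_removed <- 100 * (5436 - 5275) / 5436
round(percent_removed, 1)
```

With the reported counts, about 3% of the PSMs are removed, which falls inside the 1-10% range quoted above.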

INFO [2025-12-14 21:01:32] get_annot : ptm_example-main/qc_example_data/fgcz_3702_UP000006548_AraUniprot_1spg_d_20231024.fasta
INFO [2025-12-14 21:01:33] get_annot : finished reading
INFO [2025-12-14 21:01:34] get_annot : extract headers
INFO [2025-12-14 21:01:34] get_annot : all seq : 55907
INFO [2025-12-14 21:01:34] removing decoy sequences usin patter : ^REV_|^rev_
INFO [2025-12-14 21:01:34] get_annot nr seq after decoy removal: 27954
INFO [2025-12-14 21:01:34] get_annot : isUniprot : TRUE
INFO [2025-12-14 21:01:34] get_annot : extracted gene names
INFO [2025-12-14 21:01:34] get_annot : protein length
INFO [2025-12-14 21:01:35] get_annot : nr of tryptic peptides per protein computed.

Warning in prolfquapp::dataset_protein_annot(psm, c(protein_Id = "Protein"), :
deprecated! use build_protein_annot

uniprot database : TRUE

Warning: Expected 2 pieces. Missing pieces filled with `NA` in 20 rows [54, 59, 92, 99,
178, 287, 288, 307, 357, 379, 475, 476, 554, 611, 754, 892, 893, 940, 967,
1021].

creating sampleName from fileName column

Warning in prolfqua::setup_analysis(psm, config): no isotopeLabel column
specified in the data, adding column isotopeLabel automatically and setting to
'light'.

Warning in prolfqua::setup_analysis(psm, config): no nr_children column
specified in the data, adding column nr_children and setting to 1.

completing cases

completing cases done

setup done

Identification Summary

These metrics provide an overview of the depth and breadth of your proteomic analysis. Higher numbers generally indicate better sample preparation and instrument performance.

Understanding the hierarchy:

Proteins: Unique protein groups identified
Peptides: Unique peptide sequences (without modifications)
Peptidoforms: Unique peptides with specific modifications
Precursors: Unique peptide-charge state combinations
PSMs: Total peptide-spectrum matches

Table: Number of proteins, peptides, peptidoforms, precursors, and peptide-spectrum matches overall.
isotopeLabel	protein_Id	peptide_Id	mod_peptide_Id	precursor	Spectrum
light	1064	3845	3947	4860	5275

The ratio between these levels can indicate data quality:

PSMs/Peptides ratio: Higher ratios suggest good reproducibility
Peptidoforms/Peptides ratio: Indicates the extent of post-translational modifications
Peptides/Proteins ratio: Reflects proteolytic efficiency and peptide diversity
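
As a worked example, the three ratios can be computed directly from the overall counts in the table above. This is a small sketch with illustrative variable names.

```r
# Overall identification counts taken from the table above.
counts <- c(proteins = 1064, peptides = 3845, peptidoforms = 3947,
            precursors = 4860, psms = 5275)

ratios <- c(
  psms_per_peptide         = counts[["psms"]] / counts[["peptides"]],
  peptidoforms_per_peptide = counts[["peptidoforms"]] / counts[["peptides"]],
  peptides_per_protein     = counts[["peptides"]] / counts[["proteins"]]
)
round(ratios, 2)
```

For this dataset the ratios are roughly 1.37 PSMs per peptide, 1.03 peptidoforms per peptide, and 3.61 peptides per protein.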

Figure: Number of proteins, peptides, peptidoforms, ions and precursors per channel.

This plot shows the identification counts across different TMT channels. Ideally, all channels should show similar identification numbers, indicating consistent sample preparation and labeling efficiency.

Table: Number of proteins, peptides, peptidoforms, precursors, and peptide-spectrum matches per channel.
isotopeLabel	sampleName	protein_Id	peptide_Id	mod_peptide_Id	precursor	Spectrum
light	QCmini_126	1064	3845	3947	4859	5274
light	QCmini_127C	1064	3845	3947	4859	5274
light	QCmini_127N	1064	3845	3947	4860	5274
light	QCmini_128C	1064	3845	3947	4860	5275
light	QCmini_128N	1064	3845	3947	4859	5274
light	QCmini_129C	1064	3845	3947	4859	5274
light	QCmini_129N	1064	3845	3947	4860	5275
light	QCmini_130C	1064	3845	3947	4859	5274
light	QCmini_130N	1064	3845	3947	4860	5274
light	QCmini_131C	1064	3844	3946	4858	5272
light	QCmini_131N	1064	3845	3947	4859	5273
light	QCmini_132C	1064	3845	3947	4859	5273
light	QCmini_132N	1064	3845	3947	4860	5274
light	QCmini_133C	1064	3845	3947	4859	5274
light	QCmini_133N	1064	3845	3947	4859	5274
light	QCmini_134C	3	3	3	3	3
light	QCmini_134N	1064	3845	3947	4859	5274
light	QCmini_135N	7	9	9	9	9

Quality indicators:

Consistent counts across channels: Good sample preparation
Large variations: May indicate labeling issues or sample degradation
Outlier channels: Should be investigated for technical problems
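
One simple way to flag outlier channels is to compare each channel's PSM count with the median across channels. The sketch below uses the Spectrum column of the per-channel table above. The 50% cutoff is an arbitrary assumption chosen for illustration, not a threshold used by the report.

```r
# PSM counts per channel, copied from the Spectrum column of the table above.
psm_counts <- c(
  QCmini_126 = 5274, QCmini_127C = 5274, QCmini_127N = 5274,
  QCmini_128C = 5275, QCmini_128N = 5274, QCmini_129C = 5274,
  QCmini_129N = 5275, QCmini_130C = 5274, QCmini_130N = 5274,
  QCmini_131C = 5272, QCmini_131N = 5273, QCmini_132C = 5273,
  QCmini_132N = 5274, QCmini_133C = 5274, QCmini_133N = 5274,
  QCmini_134C = 3, QCmini_134N = 5274, QCmini_135N = 9
)

# Flag channels with fewer than half the median number of PSMs
# (the 0.5 factor is an illustrative assumption).
cutoff <- 0.5 * median(psm_counts)
flagged <- psm_counts[psm_counts < cutoff]
flagged
```

Here channels 134C and 135N are flagged, with 3 and 9 PSMs against a median of 5274. Counts this low could mean the channels were left empty, but the report does not state this, so the experimental design should be checked.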

Modifications Summary

Post-translational modifications (PTMs) and chemical modifications provide insights into sample treatment and biological state. TMT labeling introduces specific modifications that should be monitored for labeling efficiency.

Key modifications to monitor:

TMT labels: N-term(229.1629), K(229.1629) for TMT10plex; N-term(304.2071), K(304.2071) for TMTpro, as observed in this dataset
Oxidation: M(15.9949) - common artifact
Carbamylation: N-term/K modifications from urea treatment

Number and type of modifications observed in the data.
Var1	Freq
C(57.0214)	2
K(304.2071)	2339
M(15.9949)	635
N-term(304.2071)	3904
N-term(42.0106)	10

Number and type of modifications observed in the data.

The modification landscape reflects both intentional chemical treatments (TMT labeling) and unintentional modifications (oxidation, deamidation). High numbers of TMT modifications indicate successful labeling.

Labelling Efficiency

TMT labeling efficiency is critical for accurate quantification. Incomplete labeling leads to quantitative errors and reduced dynamic range.

Monitoring labeling efficiency:

N-terminal labeling: Should approach 100% for most peptides
Lysine labeling: Should be >95% for high-quality samples
Unlabeled peptides: May co-elute and interfere with quantification

N-term

N-terminal labeling efficiency is typically very high (>95%) as it’s kinetically favored.

Peptides

Total number of peptides : 3947
Number of peptides with modified N-term : 3904
Percent peptides with modified N-term: 99 %

Expected range: 95-100%. Values below 95% may indicate:

Suboptimal labeling conditions
Sample degradation
Buffer incompatibility

PSMs

Total number of PSMs : 5275
Number of PSMs with modified N-term : 5227
Percent PSMs with modified N-term: 99 %

PSM-level statistics weight peptides by their identification frequency, providing insight into the most abundant species in your sample.
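
The percentages above are rounded to whole numbers. A short sketch that recomputes them from the reported counts shows the unrounded values; the helper `labeling_percent` is a hypothetical name introduced here for illustration.

```r
# Percent of entries carrying the label (hypothetical helper).
labeling_percent <- function(n_labeled, n_total) {
  100 * n_labeled / n_total
}

# Counts taken from the N-term sections above.
nterm_peptides <- labeling_percent(3904, 3947)
nterm_psms     <- labeling_percent(5227, 5275)
round(c(peptides = nterm_peptides, psms = nterm_psms), 2)

# Both values lie inside the expected 95-100% range.
all(c(nterm_peptides, nterm_psms) >= 95)
```

Unrounded, the N-terminal labeling efficiency is about 98.91% at the peptide level and 99.09% at the PSM level.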

Lysine Modification

Lysine labeling is more challenging than N-terminal labeling and is more sensitive to reaction conditions.

Peptides

Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
ℹ Please use `"modSeq"` instead of `.data$modSeq`

Total number of peptides with Lysine: 2339
Number of peptides with modified Lysine residues : 2339
Percent peptides with modified Lysine residues: 100 %

Expected range: 95-99%. Lower values may indicate:

Insufficient reagent concentration
Competing reactions (e.g., formaldehyde crosslinking)
pH optimization needed

PSMs with Lysine

Total number of PSMs with Lysine: 3200
Number of PSMs with modified Lysine residues : 3200
Percent PSMs with modified Lysine residues: 100 %

Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
ℹ Please use `"n"` instead of `.data$n`

Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
ℹ Please use `"n1"` instead of `.data$n1`

Peptidoforms

Total number of Lysine residues: 2344
Number of modified Lysine residues : 2339
Percent modified Lysine residues: 100 %

Total number of Lysine residues in PSMs

Total number of Lysine residues: 3205
Number of modified Lysine residues : 3200
Percent modified Lysine residues when taking number of PSMs into account: 100 %
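
The residue-level values are printed as 100% because of rounding, although a few lysine residues are unlabeled. The sketch below recomputes them from the counts above with illustrative variable names.

```r
# Residue-level lysine counts from the two sections above.
lysine <- data.frame(
  level    = c("peptidoforms", "PSMs"),
  total    = c(2344, 3205),
  modified = c(2339, 3200)
)

lysine$unmodified <- lysine$total - lysine$modified
lysine$percent    <- 100 * lysine$modified / lysine$total

# Two decimals reveal what the whole-number rounding hides.
lysine$percent_2dp <- round(lysine$percent, 2)
lysine
```

Five lysine residues remain unmodified at each level, which corresponds to about 99.79% labeling for peptidoforms and 99.84% for PSMs.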

Quantitative information per channel

TMT quantification relies on consistent labeling and equal loading across channels. These plots help identify technical issues that could bias quantitative comparisons.

Quality indicators:

Similar total abundances: Good sample preparation and loading
Outlier channels: May indicate pipetting errors or sample loss
Systematic patterns: Could suggest batch effects or labeling issues

Total abundance per channel (Sum of all abundances).

Total abundance reflects both sample amount and ionization efficiency. Large variations (>2-fold) should be investigated.

Total abundance per channel relative to channel 126.

Normalization to a reference channel (126) helps identify systematic biases. Values should typically be within 0.5-2.0 fold of the reference.
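
The normalization can be sketched as follows. The abundance values are invented for illustration only, since the report shows the totals only as a plot, and the 0.5-2.0 bounds are the ones quoted above.

```r
# Hypothetical total abundances per channel (made-up values for illustration).
total_abundance <- c("126" = 1.00e9, "127N" = 1.10e9,
                     "127C" = 0.45e9, "128N" = 2.30e9)

# Express every channel relative to the reference channel 126.
relative <- total_abundance / total_abundance[["126"]]

# Channels outside the 0.5-2.0 fold window deserve a closer look.
outside <- names(relative)[relative < 0.5 | relative > 2.0]
outside
```

In this toy example channels 127C and 128N fall outside the window and would be inspected further.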

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the prolfqua package.
  Please report the issue at <https://github.com/wolski/prolfqua/issues>.

Density of abundance values per channel.

The abundance distribution should be similar across channels. Shifted distributions may indicate:
- Unequal sample loading
- Different sample complexity
- Technical artifacts in specific channels

Missed cleavage

Missed cleavage analysis provides insights into proteolytic efficiency and sample quality. Excessive missed cleavages can reduce identification rates and affect quantification accuracy.

Missed cleavage site: a K or R residue after which trypsin should have cleaved but did not.

Factors affecting missed cleavages:

Protein denaturation: Incomplete unfolding reduces accessibility
Digestion time/temperature: Insufficient conditions lead to incomplete digestion
Enzyme activity: Old or inhibited trypsin reduces efficiency
Chemical modifications: Modified residues may resist cleavage

To determine the number of missed cleavages we compute:
- The number of all potential cleavage sites (i.e., number of K or R residues)
- The number of actual cleavage sites (K or R at peptide C-terminus)
- Modified residues that may resist cleavage

Missed Lysine residues

We first count all K residues, which gives the number of potential K cleavage sites. We then count the K residues at the peptide C-terminus, which gives the number of actual cleavage sites.

The number of K residues 3205
The number of unmodified K at C term 0
The number of modified K at C term 3025
The number of any K at C term 3025
The number of any modified K: 3200
Missed cleavage sites : number of K residues - number of any K at C term = 180,
and in % of number of K residues : 6

Expected range: 5-15% missed cleavages are typical. Higher rates may indicate:

Incomplete protein denaturation
Insufficient digestion time
Trypsin inhibitors present
High protein concentration

Missed Arginine residues

The number of R residues 2401
The number of unmodified R at C term 2189
The number of modified R at C term 0
The number of any R at C term 2189
The number of any modified R: 0
Missed cleavage sites : number of R residues - number of any R at C term = 212,
and in % of number of R residues : 9

Arginine cleavage is generally more efficient than lysine cleavage. Similar missed cleavage rates between K and R suggest consistent digestion conditions.
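
The missed cleavage arithmetic for both residues can be reproduced from the reported counts. This is a minimal sketch; `missed_cleavage` is a hypothetical helper name, not a function from prolfqua or prophosqua.

```r
# Missed sites = all residues minus residues found at a peptide C-terminus
# (hypothetical helper for illustration).
missed_cleavage <- function(n_residues, n_cterm) {
  missed <- n_residues - n_cterm
  c(missed = missed, percent = 100 * missed / n_residues)
}

# Counts taken from the lysine and arginine sections above.
k <- missed_cleavage(n_residues = 3205, n_cterm = 3025)
r <- missed_cleavage(n_residues = 2401, n_cterm = 2189)
round(rbind(K = k, R = r), 1)
```

This gives 180 missed sites (about 5.6%) for lysine and 212 missed sites (about 8.8%) for arginine, so in this dataset the missed cleavage rate is somewhat higher for R than for K.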

Session Info

R version 4.5.2 (2025-10-31)

Platform: aarch64-apple-darwin20

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: here(v.1.0.2), ggseqlogo(v.0.2), lubridate(v.1.9.4), forcats(v.1.0.1), stringr(v.1.6.0), dplyr(v.1.1.4), purrr(v.1.2.0), readr(v.2.1.6), tidyr(v.1.3.1), tibble(v.3.3.0), ggplot2(v.4.0.1), tidyverse(v.2.0.0), prophosqua(v.0.1.0) and prolfquapp(v.0.1.8)

loaded via a namespace (and not attached): AhoCorasickTrie(v.0.1.3), gtable(v.0.3.6), xfun(v.0.54), prozor(v.0.3.1), htmlwidgets(v.1.6.4), ggrepel(v.0.9.6), lattice(v.0.22-7), tzdb(v.0.5.0), vctrs(v.0.6.5), tools(v.4.5.2), generics(v.0.1.4), parallel(v.4.5.2), pkgconfig(v.2.0.3), pheatmap(v.1.0.13), Matrix(v.1.7-4), data.table(v.1.17.8), RColorBrewer(v.1.1-3), S7(v.0.2.1), lifecycle(v.1.0.4), prolfqua(v.1.4.0), compiler(v.4.5.2), farver(v.2.1.2), codetools(v.0.2-20), htmltools(v.0.5.9), lazyeval(v.0.2.2), yaml(v.2.3.11), plotly(v.4.11.0), pillar(v.1.11.1), crayon(v.1.5.3), seqinr(v.4.2-36), MASS(v.7.3-65), cachem(v.1.1.0), tidyselect(v.1.2.1), conflicted(v.1.2.0), digest(v.0.6.39), stringi(v.1.8.7), pander(v.0.6.6), labeling(v.0.4.3), ade4(v.1.7-23), rprojroot(v.2.1.1), fastmap(v.1.2.0), grid(v.4.5.2), cli(v.3.6.5), logger(v.0.4.1), magrittr(v.2.0.4), patchwork(v.1.3.2), withr(v.3.0.2), scales(v.1.4.0), bit64(v.4.6.0-1), timechange(v.0.3.0), httr(v.1.4.7), rmarkdown(v.2.30), bit(v.4.6.0), gridExtra(v.2.3), hms(v.1.1.4), memoise(v.2.0.1), evaluate(v.1.0.5), knitr(v.1.50), viridisLite(v.0.4.2), dtplyr(v.1.3.2), rlang(v.1.1.6), Rcpp(v.1.1.0), docopt(v.0.7.2), glue(v.1.8.0), vroom(v.1.6.7), jsonlite(v.2.0.0) and R6(v.2.6.1)