Example usage

In this notebook, I will demonstrate how to use msions to create MS TIC and ion plots.

Imports

import msions.mzml as mzml
import msions.hardklor as hk
import msions.percolator as perc
import msions.kronik as kro
import msions.msplot as msplot
import msions.encyclopedia as encyclo
import msions.utils as msutils
import matplotlib.pyplot as plt

Create DataFrame from mzML file

tic_df creates a pandas DataFrame of MS1 scan information from an mzML file.

ms1_df = mzml.tic_df("example_files/DIA_file.mzML")

It can also be used to make a DataFrame of MS2 scan information

ms2_df = mzml.tic_df("example_files/DDA_file.mzML", level="2")

or a DataFrame with both sets of information.

ms_df = mzml.tic_df("example_files/short_DDA_file.mzML", level="all", include_ms1_info=True)

peak_df creates a pandas DataFrame containing the m/z, ion current, and retention time for all MS1 peaks.

ms1_peaks = mzml.peak_df("example_files/short_DDA_file.mzML")

Read Hardklor file

hk2df will read a Hardklor tab-delimited file into a pandas DataFrame. After import, all columns that can be converted to a numeric data type will be.

hk_df = hk.hk2df("example_files/DIA_hk.hk")

summarize_df will summarize the TIC in each scan from a Hardklor pandas DataFrame or Hardklor tab-delimited file.

hk.summarize_df(hk_df)
rt scan_num TIC
0 0.0051 1 14409796
1 0.0574 152 15346213
2 0.1091 303 16216937
3 0.1607 454 16422145
4 0.2124 605 15524068
... ... ... ...
2493 99.9866 291311 108058
2494 99.9897 291312 24495
2495 99.9927 291313 51831
2496 99.9958 291314 424145
2497 99.9989 291315 111484

2498 rows × 3 columns

If an additional pandas DataFrame is provided with the MS1 scan information, the ion injection time will be mapped to each scan.

hk.summarize_df(hk_df, ms1_df)
scan_num rt TIC IT ions
0 1 0.0051 14409796.0 50.000000 720489.800000
1 152 0.0574 15346213.0 40.343060 619113.184769
2 303 0.1091 16216937.0 40.586967 658196.294454
3 454 0.1607 16422145.0 43.578297 715649.106626
4 605 0.2124 15524068.0 40.905605 635021.398509
... ... ... ... ... ...
2560 291311 99.9866 108058.0 50.000000 5402.900000
2561 291312 99.9897 24495.0 50.000000 1224.750000
2562 291313 99.9927 51831.0 50.000000 2591.550000
2563 291314 99.9958 424145.0 50.000000 21207.250000
2564 291315 99.9989 111484.0 50.000000 5574.200000

2565 rows × 5 columns

Create a simplified DataFrame from a Kronik file

simple_df can be used to filter a Kronik DataFrame’s rows and columns.

kro_df = kro.simple_df("example_files/DDA_match.kro")

filter_df can be used to filter a Kronik DataFrame within a retention time range.

kro.filter_df(kro_df, start=20, stop=80)
first_scan last_scan num_scans mass charge best_int sum_int best_rt mz best_rt_s
0 38596 44919 225 841.5023 2 1.483303e+10 5.915766e+11 24.8667 421.758430 1492.002
1 43401 45665 76 1313.6580 2 5.567430e+09 5.477088e+10 27.4002 657.836280 1644.012
2 61414 62885 53 1823.9778 2 5.446934e+09 4.207272e+10 37.4799 912.996180 2248.794
3 57080 59175 72 1953.0566 3 4.605176e+09 5.238194e+10 35.0052 652.026147 2100.312
4 56254 59449 110 1459.7095 3 4.408563e+09 5.981484e+10 34.9384 487.577113 2096.304
... ... ... ... ... ... ... ... ... ... ...
97079 98272 98352 5 892.2217 2 3.718200e+04 1.589230e+05 64.5663 447.118130 3873.978
97087 98041 98098 4 876.3758 2 3.487900e+04 1.013320e+05 64.4020 439.195180 3864.120
97100 97404 97463 4 848.5609 2 3.099600e+04 8.306610e+04 63.8228 425.287730 3829.368
97103 98727 98766 3 892.2208 2 3.035100e+04 6.099000e+04 64.9506 447.117680 3897.036
97115 97985 98079 6 904.4069 2 2.226500e+04 1.090820e+05 64.3368 453.210730 3860.208

70078 rows × 10 columns

match_rt_mass can compare a Kronik DataFrame to itself to find redundancies.

redund_df = kro_df.copy()
redund_df["redund"] = redund_df.apply(kro.match_rt_mass, axis=1, other_df=kro_df, rt_diff=0.5) 
# view DataFrame
redund_df
first_scan last_scan num_scans mass charge best_int sum_int best_rt mz best_rt_s redund
0 38596 44919 225 841.5023 2 1.483303e+10 5.915766e+11 24.8667 421.758430 1492.002 0
1 43401 45665 76 1313.6580 2 5.567430e+09 5.477088e+10 27.4002 657.836280 1644.012 0
2 61414 62885 53 1823.9778 2 5.446934e+09 4.207272e+10 37.4799 912.996180 2248.794 0
3 57080 59175 72 1953.0566 3 4.605176e+09 5.238194e+10 35.0052 652.026147 2100.312 0
4 56254 59449 110 1459.7095 3 4.408563e+09 5.981484e+10 34.9384 487.577113 2096.304 0
... ... ... ... ... ... ... ... ... ... ... ...
97120 5904 5946 3 653.6884 1 1.817700e+04 4.936000e+04 4.8106 654.695680 288.636 1
97121 8262 8324 4 614.6566 1 1.761600e+04 6.674839e+04 6.6897 615.663880 401.382 1
97122 1516 1557 3 669.3017 1 1.643800e+04 4.555800e+04 1.2335 670.308980 74.010 3
97123 4463 4547 5 614.3503 1 1.592000e+04 7.553350e+04 3.6536 615.357580 219.216 7
97124 4463 4547 5 614.3503 1 1.479800e+04 7.021000e+04 3.6536 615.357580 219.216 7

97125 rows × 11 columns

Parse XML files from percolator output

psms2df will create a pandas DataFrame from a percolator XML output file.

psm_xml_df = perc.psms2df("example_files/short_DDA_xml.xml")

id_scans creates a column saying whether an MS2 was identified.

perc.id_scans("example_files/DDA_percolator.target.peptides.txt", ms2_df)

match_kro determines if Kronik features were identified in a percolator XML output file.

perc.match_kro(kro_df, psm_xml_df, ms_df)

Analyze EncyclopeDIA output

dia_df creates a pandas DataFrame from an EncyclopeDIA elib output.

encyclo_df = encyclo.dia_df("example_files/DIA_elib.elib")

match_hk matches EncyclopeDIA elib output to Hardklor output.

hk_df["in_encyclo"] = hk_df.apply(encyclo.match_hk, axis=1, other_df=encyclo_df)

Plot TIC and ions

plot_data can be used to plot the TIC and ions per MS1 scan in a pandas DataFrame.

msplot.plot_data(ms1_df)
_images/88854f09e0d50ee24a5f22894617ffe43fe1f54c228c4e65fdf55d04676974e5.png
msplot.plot_data(ms1_df, data_type="ions")
_images/5914dc8ee184699d52f05d5cf4154aacd305386101cd27b0a24b4b61b6ffdab8.png
msplot.plot_data(ms1_df, data_type="both")
_images/a942456ac26ccf1788ea485f91a3c811fadb97f6a14590800cb0fdda7be66898.png

plot_data can also take mzML files as input. plot the amount of identified signal if you provide a feature file/DataFrame and the identifications file/DataFrame.

Miscellaneous utility functions

bin_list creates a list of bin edges for a histogram.

# define arguments
mz_bin_size = 4
mz_bin_mult = 1.0005
mz_start = 399
mz_end = 1005

bin_mz_list = msutils.bin_list(mz_start, mz_end, mz_bin_size, mz_bin_mult)

bin_data bins a pandas DataFrame using list(s) of bin edges.

msutils.bin_data(ms1_peaks, type="mz", bin_mz_list=bin_mz_list)
rt bin_mz ips
0 32.542579 [399.0, 403.002) 1.048150e+08
1 32.542579 [403.002, 407.004) 4.862561e+06
2 32.542579 [407.004, 411.006) 5.346059e+06
3 32.542579 [411.006, 415.008) 5.346583e+06
4 32.542579 [415.008, 419.01) 7.255431e+06
... ... ... ...
5129 33.093167 [983.292, 987.294) 2.364387e+06
5130 33.093167 [987.294, 991.296) 2.474095e+06
5131 33.093167 [991.296, 995.298) 3.682017e+06
5132 33.093167 [995.298, 999.3) 3.545738e+06
5133 33.093167 [999.3, 1003.302) 4.575072e+06

5134 rows × 3 columns

Technical Note Use Cases

# initiate dictionary to store data
note_dict = {}

EV

file_name = "Ecl_2022_0718_FreezeThaw_Q1_EV10_12mz_26"
suf = ""

note_dict[file_name+"_ms1"+suf], note_dict[file_name+"_sumid"+suf], note_dict[file_name+"_id"+suf], note_dict[file_name+"_hk"+suf] = msplot.plot_data(human_dir+"mzMLs/SixQuantitativeIons/"+file_name+".mzML", 
                 human_dir+"hk_files/"+file_name+"_MS1_3sn.hk", 
                 human_dir+"mzMLs/SixQuantitativeIons/"+file_name+".mzML.elib", 
                 method="DIA", data_type="both", stats="title", return_dfs=True)
_images/4ca7f16f0c1b2e89c3bf99552c25c55b490188be9da899488290a06736b6be24.png

Plasma

file_name = "Ecl_2022_0718_FreezeThaw_Total_01_12mz_28"
suf = ""

note_dict[file_name+"_ms1"+suf], note_dict[file_name+"_sumid"+suf], note_dict[file_name+"_id"+suf], note_dict[file_name+"_hk"+suf] = msplot.plot_data(human_dir+"mzMLs/SixQuantitativeIons/"+file_name+".mzML", 
                 human_dir+"hk_files/"+file_name+"_MS1_3sn.hk", 
                 human_dir+"mzMLs/SixQuantitativeIons/"+file_name+".mzML.elib", 
                 method="DIA", data_type="both", stats="title", return_dfs=True)
_images/56de2a15dfd8f5929356d368d5560b3e188a19d811c1c916e459286d08eef5e1.png

Mouse DB

file_name = "Ecl_2022_0718_FreezeThaw_Q1_EV10_12mz_26"
suf = "_mus"
mzml_dir = "mzMLs/mouse_fasta/"

note_dict[file_name+"_ms1"+suf], note_dict[file_name+"_sumid"+suf], note_dict[file_name+"_id"+suf], note_dict[file_name+"_hk"+suf] = msplot.plot_data(human_dir+mzml_dir+file_name+".mzML", 
                 human_dir+"hk_files/"+file_name+"_MS1_3sn.hk", 
                 human_dir+mzml_dir+file_name+".mzML.elib", 
                 method="DIA", data_type="both", stats="title", return_dfs=True)
_images/bd2b9c88aee437bd47ac8c5bad0cd8e45fc732c4e50d1c5c2c1e8f29655caf0d.png

Missing Albumin

file_name = "Ecl_2022_0718_FreezeThaw_Q1_EV10_12mz_26"
suf = "_noalbu"
mzml_dir = "mzMLs/SixQuantitativeIons/"

note_dict[file_name+"_ms1"+suf], note_dict[file_name+"_sumid"+suf], note_dict[file_name+"_id"+suf], note_dict[file_name+"_hk"+suf] = msplot.plot_data(human_dir+mzml_dir+file_name+".mzML", 
                 human_dir+"hk_files/"+file_name+"_MS1_3sn.hk", 
                 human_dir+mzml_dir+file_name+".mzML_noALBU.elib", 
                 method="DIA", data_type="both", stats="title", return_dfs=True)
_images/d7de9e921c15c5468dd1b1f989b5b5af13d6dbaf92d4f92867f82dd2c7c384e8.png
file_name = "Ecl_2022_0718_FreezeThaw_Total_01_12mz_28"
suf = "_noalbu"
mzml_dir = "mzMLs/SixQuantitativeIons/"

note_dict[file_name+"_ms1"+suf], note_dict[file_name+"_sumid"+suf], note_dict[file_name+"_id"+suf], note_dict[file_name+"_hk"+suf] = msplot.plot_data(human_dir+mzml_dir+file_name+".mzML", 
                 human_dir+"hk_files/"+file_name+"_MS1_3sn.hk", 
                 human_dir+mzml_dir+file_name+".mzML_noALB.elib", 
                 method="DIA", data_type="both", stats="title", return_dfs=True)
_images/2b8e19755ea0a1c85b6fa2f5c8ddb576cd4c198ed25828610d2f7d2a92d5928d.png