Example usage

This notebook demonstrates how to use msions to create mass spectrometry (MS) total ion current (TIC) and ion plots.

Imports

import msions.mzml as mzml
import msions.hardklor as hk
import msions.percolator as perc
import msions.kronik as kro
import msions.msplot as msplot
import msions.encyclopedia as encyclo
import msions.utils as msutils

Create DataFrame from mzML file

tic_df creates a pandas DataFrame of MS1 scan information from an mzML file.

ms1_df = mzml.tic_df("example_files/DIA_file.mzML")

It can also be used to make a DataFrame of MS2 scan information

ms2_df = mzml.tic_df("example_files/DDA_file.mzML", level="2")

or a DataFrame with both sets of information.

ms_df = mzml.tic_df("example_files/short_DDA_file.mzML", level="all", include_ms1_info=True)

peak_df creates a pandas DataFrame containing the m/z, ion current, and retention time for all MS1 peaks.

ms1_peaks = mzml.peak_df("example_files/short_DDA_file.mzML")

Read Hardklor file

hk2df reads a Hardklor tab-delimited file into a pandas DataFrame. After import, every column that can be converted to a numeric data type is converted.

hk_df = hk.hk2df("example_files/DIA_hk.hk")
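
The conversion rule described above can be sketched with plain pandas. This is only an illustration of the idea, not the hk2df implementation; the helper name to_numeric_where_possible and the toy column names are hypothetical.

```python
import pandas as pd


def to_numeric_where_possible(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df in which every fully numeric column is converted."""
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Columns holding non-numeric values (e.g. text) are left as they are.
            pass
    return out


# Toy table read as strings, the way a tab-delimited file might arrive.
raw = pd.DataFrame(
    {"mass": ["841.5", "1313.7"], "charge": ["2", "3"], "label": ["a", "b"]}
)
converted = to_numeric_where_possible(raw)
```

Here mass and charge become numeric columns, while label keeps its text values.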

summarize_df will summarize the TIC in each scan from a Hardklor pandas DataFrame or Hardklor tab-delimited file.

hk.summarize_df(hk_df)
           rt  scan_num       TIC
0      0.0051         1  14409796
1      0.0574       152  15346213
2      0.1091       303  16216937
3      0.1607       454  16422145
4      0.2124       605  15524068
...       ...       ...       ...
2493  99.9866    291311    108058
2494  99.9897    291312     24495
2495  99.9927    291313     51831
2496  99.9958    291314    424145
2497  99.9989    291315    111484

2498 rows × 3 columns
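
As a rough picture of what a per-scan summary involves, the sketch below sums feature intensities within each scan using a pandas groupby. It assumes a per-feature table with hypothetical columns scan_num, rt, and intensity; the real Hardklor column names and the internals of summarize_df may differ.

```python
import pandas as pd

# Toy per-feature table; the column names are assumed for illustration only.
features = pd.DataFrame(
    {
        "scan_num": [1, 1, 152, 152, 152],
        "rt": [0.0051, 0.0051, 0.0574, 0.0574, 0.0574],
        "intensity": [100.0, 250.0, 40.0, 60.0, 10.0],
    }
)

# Sum the intensity of all features in each scan to get a per-scan TIC.
summary = (
    features.groupby(["rt", "scan_num"], as_index=False)["intensity"]
    .sum()
    .rename(columns={"intensity": "TIC"})
)
```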

If an additional pandas DataFrame is provided with the MS1 scan information, the ion injection time will be mapped to each scan.

hk.summarize_df(hk_df, ms1_df)
           rt  scan_num       TIC         IT           ions
0      0.0051         1  14409796  50.000000  720489.800000
1      0.0574       152  15346213  40.343060  619113.184769
2      0.1091       303  16216937  40.586967  658196.294454
3      0.1607       454  16422145  43.578297  715649.106626
4      0.2124       605  15524068  40.905605  635021.398509
...       ...       ...       ...        ...            ...
2493  99.9866    291311    108058  50.000000    5402.900000
2494  99.9897    291312     24495  50.000000    1224.750000
2495  99.9927    291313     51831  50.000000    2591.550000
2496  99.9958    291314    424145  50.000000   21207.250000
2497  99.9989    291315    111484  50.000000    5574.200000

2498 rows × 5 columns
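
The ions values printed above are numerically consistent with TIC × IT / 1000; for the first row, 14409796 × 50 / 1000 = 720489.8. That relationship is inferred from the printed numbers and is not taken from the msions source. The sketch below reproduces the arithmetic with pandas, using a hypothetical ms1 lookup table as a stand-in for ms1_df.

```python
import pandas as pd

# Values copied from the first two rows of the output above.
summary = pd.DataFrame({"scan_num": [1, 152], "TIC": [14409796, 15346213]})

# Hypothetical stand-in for the MS1 scan table: one injection time per scan.
ms1 = pd.DataFrame({"scan_num": [1, 152], "IT": [50.0, 40.343060]})

# Map each scan's injection time onto the summary by scan number.
summary["IT"] = summary["scan_num"].map(ms1.set_index("scan_num")["IT"])

# Assumed relationship, inferred from the printed values: ions = TIC * IT / 1000.
summary["ions"] = summary["TIC"] * summary["IT"] / 1000
```

The second row differs from the printed value only in the last digits, because the injection time shown in the table is rounded to six decimal places.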

Create a simplified DataFrame from a Kronik file

simple_df reads a Kronik file into a pandas DataFrame and reduces it to a simplified set of rows and columns.

kro_df = kro.simple_df("example_files/DDA_match.kro")

filter_df can be used to filter a Kronik DataFrame within a retention time range.

kro.filter_df(kro_df, start=20, stop=80)
       first_scan  last_scan  num_scans       mass  charge      best_int       sum_int  best_rt          mz  best_rt_s
0           38596      44919        225   841.5023       2  1.483303e+10  5.915766e+11  24.8667  421.758430   1492.002
1           43401      45665         76  1313.6580       2  5.567430e+09  5.477088e+10  27.4002  657.836280   1644.012
2           61414      62885         53  1823.9778       2  5.446934e+09  4.207272e+10  37.4799  912.996180   2248.794
3           57080      59175         72  1953.0566       3  4.605176e+09  5.238194e+10  35.0052  652.026147   2100.312
4           56254      59449        110  1459.7095       3  4.408563e+09  5.981484e+10  34.9384  487.577113   2096.304
...           ...        ...        ...        ...     ...           ...           ...      ...         ...        ...
97079       98272      98352          5   892.2217       2  3.718200e+04  1.589230e+05  64.5663  447.118130   3873.978
97087       98041      98098          4   876.3758       2  3.487900e+04  1.013320e+05  64.4020  439.195180   3864.120
97100       97404      97463          4   848.5609       2  3.099600e+04  8.306610e+04  63.8228  425.287730   3829.368
97103       98727      98766          3   892.2208       2  3.035100e+04  6.099000e+04  64.9506  447.117680   3897.036
97115       97985      98079          6   904.4069       2  2.226500e+04  1.090820e+05  64.3368  453.210730   3860.208

70078 rows × 10 columns
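
A retention time window of this kind can be sketched in plain pandas with Series.between. The example assumes the filter is applied to the best_rt column seen in the output above and that both bounds are inclusive; neither assumption is confirmed by the source.

```python
import pandas as pd

# Toy Kronik-like table with only the columns needed for the example.
kro_like = pd.DataFrame(
    {
        "mass": [841.5023, 1313.6580, 670.8153],
        "best_rt": [24.8667, 27.4002, 8.4985],
    }
)

start, stop = 20, 80

# Keep features whose best_rt lies inside [start, stop] (inclusive by default).
in_window = kro_like[kro_like["best_rt"].between(start, stop)]
```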

match_rt_mass can compare a Kronik DataFrame to itself to find redundancies.

redund_df = kro_df.copy()
redund_df["redund"] = redund_df.apply(kro.match_rt_mass, axis=1, other_df=kro_df, rt_diff=0.5) 
# view DataFrame
redund_df
        first_scan  last_scan  num_scans       mass  charge      best_int       sum_int   best_rt          mz  best_rt_s  redund
0            87237      93951        477   841.5015       2  1.001262e+10  1.223656e+12   68.2184  421.758030   4093.104       0
1           229767     231799        153  2778.9350       3  4.772458e+09  5.293946e+10  170.6296  927.318947  10237.776       0
2            84967      89520        301  1528.7273       3  4.073246e+09  1.699650e+11   65.2467  510.583047   3914.802       0
3            55811      61307        397  1017.5459       2  3.582841e+09  1.497513e+11   45.0560  509.780230   2703.360       0
4           136654     140227        231  1832.8845       4  3.374072e+09  1.145103e+11  100.1007  459.228405   6006.042       0
...            ...        ...        ...        ...     ...           ...           ...       ...         ...        ...     ...
269690        9803       9817          5   670.8153       1  3.194000e+03  1.340314e+04    8.4985  671.822580    509.910      19
269691        2151       2173          3   625.8080       1  3.131000e+03  7.133000e+03    1.8446  626.815280    110.676       2
269692        9749       9759          3   707.8342       1  2.969000e+03  7.321000e+03    8.4421  708.841480    506.526       3
269693       10130      10153          3   670.8163       1  2.841000e+03  8.072000e+03    8.8624  671.823580    531.744      18
269694        2206       2261          6   773.6507       1  2.602000e+03  1.351250e+04    1.8838  774.657980    113.028       1

269695 rows × 11 columns
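
To make the row-wise pattern concrete, here is a minimal sketch of a function that counts other features close to a given row in both retention time and mass. It is not the match_rt_mass implementation: the function name, the mass_diff tolerance and its default, and the exclusion of the row itself are all assumptions made for illustration.

```python
import pandas as pd


def count_rt_mass_matches(row, other_df, rt_diff=0.5, mass_diff=0.01):
    """Count rows of other_df near `row` in retention time and mass.

    The row itself is excluded by index label, which is only meaningful
    when other_df is the same table the row came from.
    """
    close_rt = (other_df["best_rt"] - row["best_rt"]).abs() <= rt_diff
    close_mass = (other_df["mass"] - row["mass"]).abs() <= mass_diff
    not_self = other_df.index != row.name
    return int((close_rt & close_mass & not_self).sum())


toy = pd.DataFrame(
    {
        "mass": [670.8153, 670.8163, 841.5015],
        "best_rt": [8.4985, 8.8624, 68.2184],
    }
)
# Extra keyword arguments to DataFrame.apply are forwarded to the function.
toy["redund"] = toy.apply(count_rt_mass_matches, axis=1, other_df=toy)
```

In this toy table the first two features fall within both tolerances of each other, so each counts one match, and the third has none.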

Parse XML files from Percolator output

psms2df creates a pandas DataFrame from a Percolator XML output file.

psm_xml_df = perc.psms2df("example_files/short_DDA_xml.xml")

id_scans adds a column indicating whether each MS2 scan was identified.

perc.id_scans("example_files/DDA_percolator.target.peptides.txt", ms2_df)
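
Conceptually, flagging identified scans is a membership test. The sketch below shows that idea in plain pandas; the scan_num and identified column names and the set of scan numbers are hypothetical, and how id_scans extracts scan numbers from the peptide file is not shown here.

```python
import pandas as pd

# Toy MS2 scan table; the column name is assumed for illustration.
ms2 = pd.DataFrame({"scan_num": [10, 11, 12, 13]})

# Hypothetical scan numbers that received a peptide identification.
identified_scans = {11, 13}

# Boolean column: True where the scan number appears in the identified set.
ms2["identified"] = ms2["scan_num"].isin(identified_scans)
```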

match_kro determines whether Kronik features were identified in a Percolator XML output file.

perc.match_kro(kro_df, psm_xml_df, ms_df)

Plot TIC and ions

plot_tic can be used to plot the TIC per MS1 scan in a pandas DataFrame.

msplot.plot_tic(ms1_df)
[Figure: TIC per MS1 scan]

plot_ions can be used to plot the ions per MS1 scan in a pandas DataFrame.

msplot.plot_ions(ms1_df)
[Figure: ions per MS1 scan]

Analyze EncyclopeDIA output

dia_df creates a pandas DataFrame from an EncyclopeDIA elib output.

encyclo_df = encyclo.dia_df("example_files/DIA_elib.elib")

match_hk matches EncyclopeDIA elib output to Hardklor output.

hk_df["in_encyclo"] = hk_df.apply(encyclo.match_hk, axis=1, other_df=encyclo_df)

Miscellaneous utility functions

bin_list creates a list of bin edges for a histogram.

# define arguments
mz_bin_size = 4
mz_bin_mult = 1.0005
mz_start = 399
mz_end = 1005

bin_mz_list = msutils.bin_list(mz_start, mz_end, mz_bin_size, mz_bin_mult)
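
The bin labels printed in the bin_data output later in this section (399.0, 403.002, 407.004, ...) are spaced by 4 × 1.0005 = 4.002, which suggests that each bin width is the bin size times the multiplier. The pure-Python sketch below builds edges on that assumption. The function name make_bin_edges is hypothetical, and stopping at the last edge that does not exceed the end value is a guess about the behavior, not something the source confirms.

```python
def make_bin_edges(start, end, bin_size, bin_mult):
    """Return evenly spaced edges from start, with width bin_size * bin_mult.

    Edges are generated while they do not exceed end (an assumed stopping rule).
    """
    width = bin_size * bin_mult
    edges = []
    i = 0
    while start + i * width <= end:
        # Round to suppress floating-point noise in the printed edges.
        edges.append(round(start + i * width, 6))
        i += 1
    return edges


edges = make_bin_edges(399, 1005, 4, 1.0005)
```

With these arguments the first edges are 399, 403.002, and 407.004, and the last is 1003.302, matching the final bin shown in the bin_data output.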

bin_data bins a pandas DataFrame using list(s) of bin edges.

msutils.bin_data(ms1_peaks, type="mz", bin_mz_list=bin_mz_list)
             rt              bin_mz           ips
0     32.542579    [399.0, 403.002)  1.048150e+08
1     32.542579  [403.002, 407.004)  4.862561e+06
2     32.542579  [407.004, 411.006)  5.346059e+06
3     32.542579  [411.006, 415.008)  5.346583e+06
4     32.542579   [415.008, 419.01)  7.255431e+06
...         ...                 ...           ...
5129  33.093167  [983.292, 987.294)  2.364387e+06
5130  33.093167  [987.294, 991.296)  2.474095e+06
5131  33.093167  [991.296, 995.298)  3.682017e+06
5132  33.093167    [995.298, 999.3)  3.545738e+06
5133  33.093167   [999.3, 1003.302)  4.575072e+06

5134 rows × 3 columns
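
Left-closed interval labels such as [399.0, 403.002) are what pandas.cut produces with right=False, so binning of this kind can be sketched as a cut followed by a grouped sum. This is an illustration under assumptions, not the bin_data implementation: the mz column name is hypothetical, and summing ips within each retention time and m/z bin is inferred from the shape of the output above.

```python
import pandas as pd

# Toy peak table; "mz" is an assumed column name, "rt" and "ips" follow the output.
peaks = pd.DataFrame(
    {
        "rt": [32.5, 32.5, 32.5, 33.1],
        "mz": [400.1, 401.7, 405.0, 400.2],
        "ips": [10.0, 5.0, 2.0, 7.0],
    }
)
edges = [399.0, 403.002, 407.004]

# right=False yields left-closed intervals such as [399.0, 403.002).
peaks["bin_mz"] = pd.cut(peaks["mz"], bins=edges, right=False)

# Sum the ion current within each (retention time, m/z bin) pair.
binned = (
    peaks.groupby(["rt", "bin_mz"], observed=True)["ips"].sum().reset_index()
)
```

Passing observed=True keeps only the bins that actually contain peaks, so empty bins do not appear in the result.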