Re: Avoiding connection on side compounds
Reply #3 –
Dear Maria,
I'm glad the paper was helpful, and sorry if some parts were not clear enough. Please find some answers below.
The last method is more dedicated to compound networks, where reactions are split into several substrate-product transitions. By saying that no atoms are exchanged between glucose and ADP during the glucokinase reaction, we mean that the atoms of the ADP product come only from the ATP substrate, so it may not be relevant to connect glucose to ADP based on that reaction. This is useful when you want to find paths between target compounds without those "side compound shortcuts", even though side compounds remain in the network, as do transitions such as ADP to ATP. Only links between "main" compounds and side compounds are removed, or at least omitted during path search. So, strictly speaking, the chemistry-based method doesn't identify side compounds, but it helps to avoid irrelevant links between unrelated compounds.
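To make the "shortcut" idea concrete, here is a minimal sketch in Python. It is only an illustration, not the implementation from the paper: the toy network, the keep/omit flags and the function name `shortest_path` are all hand-written assumptions, and only a few glycolysis reactions are annotated.

```python
from collections import deque

# Hypothetical, hand-annotated substrate-product transitions.
# The flag is True when the link is kept, False when it is a
# main-to-side link that should be omitted during path search.
TRANSITIONS = [
    # glucokinase: glucose + ATP -> G6P + ADP
    ("glucose", "G6P", True),
    ("ATP", "ADP", True),        # side-to-side transition stays in the network
    ("glucose", "ADP", False),   # no atoms exchanged: omitted
    ("ATP", "G6P", False),       # main-to-side link: omitted
    # main transitions of the following glycolysis steps
    ("G6P", "F6P", True),
    ("F6P", "F16BP", True),
    ("F16BP", "GAP", True),
    ("GAP", "13BPG", True),
    # phosphoglycerate kinase: 13BPG + ADP -> 3PG + ATP
    ("13BPG", "3PG", True),
    ("ADP", "3PG", False),
    ("13BPG", "ATP", False),
]


def shortest_path(source, target, skip_side_links):
    """Breadth-first search on the undirected compound graph."""
    graph = {}
    for a, b, keep in TRANSITIONS:
        if skip_side_links and not keep:
            continue
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


print(shortest_path("glucose", "3PG", skip_side_links=False))
print(shortest_path("glucose", "3PG", skip_side_links=True))
```

With all links, the search jumps from glucose to 3PG through ADP. With the main-to-side links omitted, it follows the glycolysis intermediates instead, while ATP and ADP are still present in the network and still connected to each other.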
However, I agree that using similarity weighting at the reaction level may lead to mistakes. It can help to extract the principal transitions (both GLC-G6P and ADP-ATP, as you said), which can be enough for reactions like "maltose + H2O -> glucose", but it requires manual curation in most cases.
This work has been done for many reactions in the KEGG database. In the RPAIR database, they provided tags for substrate-to-product transitions, which made it possible to distinguish "side" transitions from main ones in a reaction. Unfortunately, this service was discontinued this year and the "side" annotations are no longer available on their website, but I'm pretty sure the data can be found elsewhere.
RPAIR tags or a manually annotated set, combined with atom mapping data, might be used to train an AI. I've never tried it myself, but I guess it would be tough work. If you can find them, mapping the KEGG tags onto your network might be a good start (it is not straightforward, but tools for converting identifiers between different sources exist).
If the coverage of the KEGG data on your reaction list is good enough, you can consider a compound that is not involved in at least one main transition to be a side compound. In my opinion, it is better to define side compounds in the context of a particular reaction: for example, ATP can be considered a "main" compound in reactions from the nucleotide biosynthesis pathway, despite being a side compound in most reactions.
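As a rough sketch of that rule in Python: the tagged transitions below are hypothetical hand-made labels (not actual KEGG annotations), and the function names are just for illustration. The first function applies the rule to the whole network, the second one to a single reaction.

```python
# Hypothetical (reaction, substrate, product, tag) annotations,
# written by hand for this example only.
TAGGED = [
    ("glucokinase", "glucose", "G6P", "main"),
    ("glucokinase", "ATP", "ADP", "side"),
    ("adenylate kinase", "AMP", "ADP", "main"),
    ("adenylate kinase", "ATP", "ADP", "main"),
    ("maltose hydrolysis", "maltose", "glucose", "main"),
    ("maltose hydrolysis", "H2O", "glucose", "side"),
]


def side_compounds(transitions):
    """Compounds that never take part in a transition tagged 'main'."""
    seen, in_main = set(), set()
    for _reaction, substrate, product, tag in transitions:
        seen.update((substrate, product))
        if tag == "main":
            in_main.update((substrate, product))
    return seen - in_main


def side_compounds_in(reaction, transitions):
    """Same rule, restricted to the transitions of one reaction."""
    return side_compounds([t for t in transitions if t[0] == reaction])


print(side_compounds(TAGGED))
print(side_compounds_in("glucokinase", TAGGED))
print(side_compounds_in("adenylate kinase", TAGGED))
```

On this toy set, only H2O comes out as a side compound for the whole network, because ATP and ADP are part of a main transition in the adenylate kinase reaction. Looking at glucokinase alone, ATP and ADP are both reported as side compounds, which is why the per-reaction definition seems preferable.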
I hope this answers your questions. And the paper came out after your thesis, so no worries ;-)
Best wishes
Fabien
PS: this answer was mostly written by Clément Frainay, PhD candidate in my group.