An age-old question in metabolic network reconstruction: how do you identify, in the most efficient way, the side compounds in each reaction (in order to avoid connecting reactions to each other through these side compounds)?
I generally use the known metabolic pathways, as they generally define primary and side compounds relative to the pathway backbone, but with this method I lose all the reactions that are not (yet) described in a metabolic pathway available in a public database.
Thanks for initiating discussion :-)
There are several options to avoid side compounds but unfortunately no gold standard ...
- pre-defining a list of compounds that will be considered as side compounds (a bit like what you do)
PROS : the most reliable since it is manual
CONS : it is arbitrary, and you may want some flexibility since this definition is not obvious for some compounds (e.g. coenzyme A...). Moreover, if you work on a genome-scale network it can be quite time-consuming !!
- computing node degree of all compounds and removing automatically the highly connected compounds. The assumption is that metabolites involved in many reactions are more likely to be side compounds
PROS : it is automatic
CONS : it requires a threshold which is hard to define automatically. You may have highly connected compounds which are of interest (like glutamate)
- Using chemistry. You can use the chemical similarity (using fingerprints) to establish a connection between a given substrate and the product which is more likely to be connected to it.
PROS : automatic and takes chemistry into account
CONS : you need to have the chemical structure (e.g. encoded in InChI) for all substrates and products. Many models and databases don't have them unfortunately...
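To make the second option (node degree) concrete, here is a minimal sketch in Python. The reaction list, the compound identifiers, the function names and the threshold are all assumptions chosen for illustration, not taken from any particular model or tool.

```python
from collections import Counter

# Toy reaction list (hypothetical identifiers): name -> (substrates, products).
reactions = {
    "glucokinase": (["GLC", "ATP"], ["G6P", "ADP", "H+"]),
    "pgi": (["G6P"], ["F6P"]),
    "pfk": (["F6P", "ATP"], ["FBP", "ADP", "H+"]),
    "fba": (["FBP"], ["DHAP", "GAP"]),
}


def compound_degrees(reactions):
    """Count, for each compound, the number of reactions it takes part in."""
    degree = Counter()
    for substrates, products in reactions.values():
        for compound in set(substrates) | set(products):
            degree[compound] += 1
    return degree


def flag_side_compounds(reactions, threshold, keep=()):
    """Return compounds with degree >= threshold, except those listed in keep."""
    degree = compound_degrees(reactions)
    return {c for c, d in degree.items() if d >= threshold and c not in keep}


# 'keep' is a manual whitelist for connected compounds that are of interest.
side = flag_side_compounds(reactions, threshold=2, keep={"G6P", "F6P", "FBP"})
print(sorted(side))
```

On this toy network the threshold alone would also flag G6P, F6P and FBP, which is exactly the drawback mentioned in the CONS: the whitelist is what rescues them here.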
To my mind, the best (automatic) solution is the last one, chemical similarity. But it requires that the modelling community makes the effort to add chemical information on metabolites.
Finally, we wrote a review including this topic. Maybe it can help: http://bib.oxfordjournals.org/content/early/2016/01/27/bib.bbv115.long (sorry it is not free access but drop me a message and I will send it privately).
I just saw your answer, had problems with my mailbox :/
Thank you so much for replying!
I wish I had seen this paper before finishing my PhD thesis... :)
I totally agree on the first two methods, the pros and the cons, but I still don't totally understand the usage of chemical similarity between compounds on a large scale.
I'm totally missing something in the structure similarity comparison to identify side compounds... In the example you put in your paper, (Figure 3) there is a comparison of structural similarity between compounds involved in the glucokinase reaction:
GLC + ATP -> GLC-6-P + ADP + H+
Similarity scores (as from the paper):
GLC - GLC-6-P = 0.85
GLC - ADP = 0.22
But if I do an automated analysis, I will perform a similarity comparison between all compounds involved in the reaction, which will give me these additional scores:
ATP - GLC-6-P = 0.30
ATP - ADP = 0.90
So that will mean that GLC is transformed into GLC-6-P and ATP into ADP.
But how can I use this information (in a global metabolic network, where I don't have predefined pathways or target metabolites) to say that I want to select only the connection between GLC and GLC-6-P and not the connection between ATP and ADP?
The similarity between ATP and ADP is higher than between GLC and GLC-6-P, so weighting on this can lead to huge mistakes, no?
(By the way, still in this paper, there is a sentence I really don't get: "As shown in Figure 3A, atom mapping shows that no atoms are exchanged between glucose and ATP during the glucokinase reaction". But... they exchange the whole phosphate group, no?)
I am thinking more and more about AI: a human with even a small knowledge of biochemistry is able to recognize "main" and "secondary" compounds for most reactions. Do you think it's worth creating such an AI that could learn from human expertise? Or do you think it's an excessively complicated way to solve this problem?
Happy the paper was helpful, and sorry if some parts were not clear enough. Please find below some elements of an answer.
The last method is more dedicated to compound networks, where reactions are split into several substrate-product transitions. By saying that no atoms are exchanged between glucose and ADP during the glucokinase reaction, we mean that the atoms of the ADP product come only from the ATP substrate, so it may not be relevant to connect glucose to ADP based on that reaction. This is useful when you want to find paths between target compounds without those "side compound shortcuts", even if side compounds remain in the network, as do transitions like ADP to ATP. Only links between "main" compounds and side compounds will be removed, or at least omitted during path search. So, strictly speaking, the chemistry-based method doesn't identify side compounds, but it helps to avoid irrelevant links between unrelated compounds.
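As a rough illustration of the "shortcut" problem in a compound network, here is a small Python sketch. The edge list is a hand-written toy (a simplified glycolysis chain), and the `keep` flag on each transition is a hypothetical annotation marking whether the link should be used during path search; it is not the output of any real atom-mapping tool.

```python
from collections import deque

# (substrate, product, keep): keep=False marks a link between a "main"
# compound and a side compound. The flags are hypothetical, for illustration.
edges = [
    ("GLC", "G6P", True), ("GLC", "ADP", False),
    ("ATP", "ADP", True), ("ATP", "G6P", False),
    ("G6P", "F6P", True),
    ("F6P", "FBP", True),
    ("FBP", "GAP", True),
    ("GAP", "BPG", True),
    ("BPG", "3PG", True),
    ("3PG", "2PG", True),
    ("2PG", "PEP", True),
    ("PEP", "PYR", True), ("PEP", "ATP", False),
    ("ADP", "PYR", False), ("ADP", "ATP", True),
]


def shortest_path(edges, source, target, use_all_links=False):
    """Breadth-first search on the directed compound graph.

    Links flagged keep=False are skipped unless use_all_links is True.
    """
    graph = {}
    for substrate, product, keep in edges:
        if keep or use_all_links:
            graph.setdefault(substrate, []).append(product)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path(edges, "GLC", "PYR", use_all_links=True))
print(shortest_path(edges, "GLC", "PYR"))
```

With all links, the shortest path from glucose to pyruvate jumps through ADP; once the flagged links are skipped, the search follows the chain of main transitions instead. Note that ATP and ADP stay in the network, and so do the transitions between them.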
However, I agree that using similarity weighting at the reaction level may lead to mistakes. It can help to extract the principal transitions (both GLC-G6P and ATP-ADP, as you said), which can be enough for reactions like "maltose + H2O -> glucose", but it requires manual curation in most cases.
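Extracting the principal transitions of one reaction from pairwise similarity can be sketched as follows. The scores are the ones quoted earlier in this thread for glucokinase; the function name and the "best product per substrate" rule are assumptions made for illustration, not the procedure used in the paper.

```python
# Pairwise similarity scores quoted in this thread for the glucokinase reaction.
similarity = {
    ("GLC", "G6P"): 0.85,
    ("GLC", "ADP"): 0.22,
    ("ATP", "G6P"): 0.30,
    ("ATP", "ADP"): 0.90,
}


def principal_transitions(substrates, products, similarity):
    """Pair each substrate with its most similar product.

    Missing pairs are treated as similarity 0.0 (an assumption).
    """
    pairs = {}
    for substrate in substrates:
        best = max(products, key=lambda p: similarity.get((substrate, p), 0.0))
        pairs[substrate] = (best, similarity.get((substrate, best), 0.0))
    return pairs


result = principal_transitions(["GLC", "ATP"], ["G6P", "ADP"], similarity)
for substrate, (product, score) in result.items():
    print(f"{substrate} -> {product} ({score:.2f})")
```

This recovers both GLC -> G6P and ATP -> ADP, but nothing in the scores says which of the two is the "main" one, which is why curation or an external annotation is still needed.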
This work has been done for many reactions in the KEGG database. The RPAIR database provided tags for substrate-to-product transitions, allowing "side" transitions to be distinguished from main ones in a reaction. Unfortunately this service has been discontinued this year, and the "side" annotations are no longer available on their website, but I'm pretty sure those data can be found elsewhere.
RPAIR tags or a manually annotated set combined with atom mapping data might be used to train an AI. I've never tried myself, but I guess it will be tough work. If you can find them, mapping the KEGG tags onto your network might be a good start (even if it is not straightforward, tools for converting identifiers between different sources exist).
If the coverage of the KEGG data on your reaction list is good enough, you can consider a compound not involved in at least one main transition as a side compound. In my opinion it is better to define side compounds in the context of a particular reaction: for example, ATP can be considered a "main" compound in reactions from the nucleotide biosynthesis pathway, despite being a side compound in most reactions.
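The per-reaction rule above (a compound is "side" in a reaction if it takes part in no main transition of that reaction) could look like this in Python. The tags below are hypothetical values written by hand for illustration; they are not copied from KEGG, and the function name is an assumption as well.

```python
# reaction -> list of (substrate, product, tag); tags are hypothetical.
tagged_transitions = {
    "glucokinase": [
        ("GLC", "G6P", "main"),
        ("ATP", "ADP", "side"),
    ],
    "adenylate_kinase": [
        ("AMP", "ADP", "main"),
        ("ATP", "ADP", "main"),
    ],
}


def side_compounds_by_reaction(tagged_transitions):
    """For each reaction, return compounds absent from every 'main' transition."""
    result = {}
    for reaction, transitions in tagged_transitions.items():
        all_compounds = set()
        main_compounds = set()
        for substrate, product, tag in transitions:
            all_compounds.update((substrate, product))
            if tag == "main":
                main_compounds.update((substrate, product))
        result[reaction] = all_compounds - main_compounds
    return result


for reaction, side in side_compounds_by_reaction(tagged_transitions).items():
    print(reaction, sorted(side))
```

With these illustrative tags, ATP comes out as a side compound in the glucokinase reaction but not in the adenylate kinase reaction, which is the context-dependent behaviour described above.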
I hope this answers your questions. And the paper came out after your thesis, so no worries ;-)
PS : this answer was mostly written by Clément Frainay, PhD candidate in my group.