Post-genome or transcriptome analysis, especially in cancer and other disease-related studies, usually ends in functional genomics: comparing a list of mutated or otherwise changed (qualitatively or quantitatively) genes with lists of genes from ontologies (GO, GeneGo), canonical pathways (DAVID, KEGG, WikiPathways), gene networks, or genes implicated in disease (COSMIC database).
This gene list comparison is typically done by matching gene-symbols (also known as gene identifiers). Gene-symbol comparison can become tricky because gene-symbols change over time, either through a change in the underlying nomenclature or through re-annotation of gene function. For example, ARHGEF7 can be referred to as KIAA0142, PIXB, DKFZp761K1021, Nbla10314, DKFZp686C12170, BETA-PIX, COOL1, P85SPR, P85, P85COOL1, P50BP, PAK3 or P50.
This problem sounds trivial, but using incorrect or inconsistent gene-symbols can result in data loss (changed genes that go unmatched), which in turn can confound accurate biological interpretation of the data.
The solution to this problem is to always use the current standardized gene-symbols approved and provided by the nomenclature committee, HGNC (HUGO Gene Nomenclature Committee).
A batch conversion tool (http://www.genenames.org/cgi-bin/symbol_checker) converts older gene-symbols and gene aliases to their currently used standard version. Alternatively, gene-symbols can be converted to a consistent set using the DAVID conversion tool (available at: http://david.abcc.ncifcrf.gov/conversion.jsp).
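To make the idea of alias conversion concrete, here is a minimal Python sketch. It assumes a small hand-built alias table: the names `ALIAS_TO_APPROVED` and `to_approved` are hypothetical, and the table only holds a few of the ARHGEF7 aliases mentioned above. It is not how the HGNC or DAVID tools work internally, and a real table would come from an HGNC export rather than being typed by hand.

```python
# Hypothetical alias table: a few ARHGEF7 aliases from the example above.
# Assumption: a real table would be built from an HGNC export, not by hand.
ALIAS_TO_APPROVED = {
    "KIAA0142": "ARHGEF7",
    "PIXB": "ARHGEF7",
    "BETA-PIX": "ARHGEF7",
    "COOL1": "ARHGEF7",
    "P85SPR": "ARHGEF7",
}


def to_approved(symbols):
    """Map each symbol to its approved form; collect the unrecognised ones."""
    # Upper-cased lookup keys so that matching is case-insensitive.
    approved = {s.upper(): s for s in ALIAS_TO_APPROVED.values()}
    aliases = {a.upper(): s for a, s in ALIAS_TO_APPROVED.items()}
    converted, unknown = [], []
    for symbol in symbols:
        key = symbol.strip().upper()
        if key in approved:
            converted.append(approved[key])   # already an approved symbol
        elif key in aliases:
            converted.append(aliases[key])    # known alias, convert it
        else:
            unknown.append(symbol)            # leave for manual review
    return converted, unknown


converted, unknown = to_approved(["PIXB", "ARHGEF7", "cool1", "FOO1"])
print(converted)  # ['ARHGEF7', 'ARHGEF7', 'ARHGEF7']
print(unknown)    # ['FOO1']
```

Keeping the unrecognised symbols in a separate list, rather than silently dropping them, makes it easier to see which genes would otherwise be lost from the comparison.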
Using Excel (yes, the Microsoft Excel) to open gene lists can also change gene-symbols. The supposedly "self-sufficiently smart" Excel confuses some gene-symbols with calendar dates and converts them to dates permanently. For example, MARCH1 becomes 1-March and SEPT2 becomes 2-September. I have gathered a list (available here) of 34 human gene-symbols that are changed by Excel. This problem has also been described previously by Zeeberg et al.
Excel-mediated conversion of gene-symbols can be avoided by selecting the column(s) that will contain gene-symbols, right-clicking on the selection, and setting the cell format to 'Text' before the symbols are entered or pasted.
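If a gene list has already passed through Excel, it can help to check it for date-like entries. The following is a minimal Python sketch, assuming the mangled symbols show up as a day followed by a month name or abbreviation (for example "1-Mar" or "2-September"). The names `DATE_LIKE` and `find_date_like` are hypothetical, and other Excel date formats would need additional patterns.

```python
import re

# Assumption: mangled symbols look like "<day>-<month>", e.g. "1-Mar" or
# "2-September". Other date display formats are not covered by this pattern.
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})[a-z]*$", re.IGNORECASE)


def find_date_like(symbols):
    """Return the entries that look like dates rather than gene-symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


suspects = find_date_like(["TP53", "1-Mar", "2-September", "SEPT2", "BRCA1"])
print(suspects)  # ['1-Mar', '2-September']
```

The sketch only flags suspicious entries. It does not try to restore the original symbol, since that is safer to do by going back to the unconverted source file.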
When comparing gene-symbol lists, there are a few other things to keep in mind. For example, make sure all the gene-symbols are in the same case (C8orf4 vs C8ORF4) and remove any spaces from within or around the gene-symbols.
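The case and whitespace clean-up can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names (`normalize`, `compare_lists`) and made-up example lists. It upper-cases symbols only for the purpose of comparison.

```python
def normalize(symbol):
    """Upper-case a symbol and drop all whitespace inside or around it."""
    return "".join(symbol.split()).upper()


def compare_lists(list_a, list_b):
    """Compare two gene-symbol lists after normalizing case and spaces."""
    a = {normalize(s) for s in list_a}
    b = {normalize(s) for s in list_b}
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


result = compare_lists(["C8orf4", " TP53", "BRCA 1"], ["C8ORF4", "tp53 ", "EGFR"])
print(result["shared"])  # ['C8ORF4', 'TP53']
print(result["only_a"])  # ['BRCA1']
print(result["only_b"])  # ['EGFR']
```

Without the normalization step, C8orf4 and C8ORF4 would be treated as two different genes and neither would appear in the shared set.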
That Zeeberg et al. paper is one of my favorites for so many reasons, mainly because Excel is so widely used but is such a problem that it required a publication (with 31 citations as of now!) describing the pitfalls.
That's true, Lee. That's one of my favorite papers too. This gene-symbol problem looks trivial but has a far greater impact when it comes to analyzing gene sets. I've seen significant changes in my analysis from missing important genes simply because of changed gene-symbols.
Thanks so much for sharing this awesome info! I am looking forward to seeing more posts by you!