Sunday, February 9, 2014

Functional genomics and gene-symbols

Post-genome or post-transcriptome analysis, especially in cancer or other disease-related studies, usually ends in functional genomics: comparing the list of mutated or changed (qualitatively or quantitatively) genes with lists of genes from ontologies (GO, GeneGo), canonical pathways (DAVID, KEGG, WikiPathways), gene networks, or genes implicated in disease (COSMIC database).
This gene-list comparison is typically done by matching gene-symbols (also known as gene identifiers). Gene-symbol comparison can become tricky because gene-symbols change, either through a change in the underlying nomenclature or through re-annotation of gene function. For example, ARHGEF7 can be referred to as KIAA0142, PIXB, DKFZp761K1021, Nbla10314, DKFZp686C12170, BETA-PIX, COOL1, P85SPR, P85, P85COOL1, P50BP, PAK3 or P50.

This problem sounds trivial, but using incorrect or inconsistent gene-symbols can cause data loss (changed genes that fail to match), which in turn can confound the biological interpretation of the data.

The solution to this problem is to always use the current standardized gene-symbols approved and provided by the nomenclature committee, HGNC (HUGO Gene Nomenclature Committee).
Its batch converter (http://www.genenames.org/cgi-bin/symbol_checker ) converts older gene-symbols and gene-aliases to their currently used standard versions. Alternatively, gene-symbols can be converted to a consistent set using the DAVID conversion tool (available at: http://david.abcc.ncifcrf.gov/conversion.jsp).
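
As a rough sketch of what such a conversion does, the snippet below replaces aliases with an approved symbol using a small hand-typed lookup table. The names ALIAS_TO_APPROVED and to_approved are hypothetical, the four entries are taken from the ARHGEF7 example above, and a real table would come from HGNC data rather than being typed by hand.

```python
# Hypothetical alias table for illustration only; a real one would come from HGNC.
ALIAS_TO_APPROVED = {
    "KIAA0142": "ARHGEF7",
    "PIXB": "ARHGEF7",
    "BETA-PIX": "ARHGEF7",
    "COOL1": "ARHGEF7",
}


def to_approved(symbols, alias_table=ALIAS_TO_APPROVED):
    """Return (converted, unmatched) for a list of gene-symbols.

    Approved symbols pass through, known aliases are replaced, and anything
    the table does not know is reported instead of being silently dropped.
    """
    approved = set(alias_table.values())
    converted, unmatched = [], []
    for symbol in symbols:
        if symbol in approved:
            converted.append(symbol)
        elif symbol in alias_table:
            converted.append(alias_table[symbol])
        else:
            unmatched.append(symbol)
    return converted, unmatched


converted, unmatched = to_approved(["COOL1", "ARHGEF7", "TP53"])
```

Here TP53 ends up in unmatched only because this toy table has no entry for it. Keeping the unmatched symbols in a separate list makes it easy to check them by hand.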

Using Excel (yes, Microsoft Excel) to open gene-lists can also change gene-symbols. The supposedly "smart" Excel mistakes some gene-symbols for calendar dates and converts them to dates permanently. For example, MARCH1 becomes 1-March and SEPT2 becomes 2-September. I have gathered a list (available here) of 34 human gene-symbols that are changed by Excel. This problem has also been described previously by Zeeberg et al.
Excel-mediated conversion of gene-symbols can be avoided by selecting the column(s) that will contain gene-symbols and setting their format to 'Text' (right-click, then Format Cells) before entering the data.
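
If a gene-list has already passed through Excel, a quick check for date-like entries can flag the damage. The sketch below assumes the mangled symbols look like a day number, a dash and a month name (for example 1-Mar or 2-September); other date layouts would need a different pattern. DATE_LIKE and find_date_like are hypothetical names.

```python
import re

# Assumption: mangled symbols look like "1-Mar" or "2-September"
# (day number, dash, abbreviated or full month name).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})[a-z]*$", re.IGNORECASE)


def find_date_like(symbols):
    """Return the entries that look like calendar dates rather than gene-symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


suspects = find_date_like(["TP53", "1-Mar", "2-September", "SEPT2"])
```

This only detects suspicious entries. It does not try to restore the original symbol, so the safer fix is to go back to the file as it was before Excel touched it.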

When comparing gene-symbol lists, there are a few other things to keep in mind. For example, make sure all the gene-symbols are in the same case (C8orf4 vs C8ORF4) and remove any spaces from within or around the gene-symbols.
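
A minimal sketch of that clean-up, followed by the comparison itself, could look like this. The function names normalize and compare are made up for illustration, and upper-casing is used only for matching, not as the official way to write the symbol.

```python
def normalize(symbol):
    """Upper-case a symbol and remove all whitespace inside or around it."""
    return "".join(symbol.split()).upper()


def compare(list_a, list_b):
    """Return (shared, only_in_a, only_in_b) as sets of normalized symbols."""
    a = {normalize(s) for s in list_a}
    b = {normalize(s) for s in list_b}
    return a & b, a - b, b - a


shared, only_a, only_b = compare(["C8orf4", " TP53 "], ["C8ORF4", "BRCA1"])
```

Without the normalization step, C8orf4 and C8ORF4 would be counted as two different genes and the overlap between the two lists would come out empty.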