{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Molecular types\n",
    "\n",
    "The ``MolType`` object provides services for resolving ambiguities, or providing the correct ambiguity for recoding. It also maintains the mappings between different kinds of alphabets, sequences and alignments.\n",
    "\n",
    "If your analysis involves handling ambiguous states, or translation via a genetic code, it's critical to specify the appropriate moltype.\n",
    "\n",
    "## Available molecular types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<style>\n",
       "tr:last-child {border-bottom: 1px solid #000;} tr > th {text-align: center !important;} tr > td {text-align: left !important;}\n",
       "</style>\n",
       "<caption style=\"color: rgb(250, 250, 250); background: rgba(30, 140, 200, 1); align=top;\"><span style=\"font-weight: bold;\">Specify a moltype by the string 'Abbreviation' (case insensitive).</span><span></span></caption>\n",
       "<thead style=\"background: rgba(161, 195, 209, 0.75); font-weight: bold; text-align: center;\">\n",
       "<th>Abbreviation</th>\n",
       "<th>Number of states</th>\n",
       "<th>Moltype</th>\n",
       "</thead>\n",
       "<tbody>\n",
       "<tr>\n",
       "<td style=\"background: rgba(161, 195, 209, 0.25); font-weight: 600;\">ab</td>\n",
       "<td style=\"font-family: monospace !important;\">2</td>\n",
       "<td>MolType(('a', 'b'))</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td style=\"background: rgba(161, 195, 209, 0.25); font-weight: 600;\">dna</td>\n",
       "<td style=\"font-family: monospace !important;\">4</td>\n",
       "<td>MolType(('T', 'C', 'A', 'G'))</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td style=\"background: rgba(161, 195, 209, 0.25); font-weight: 600;\">rna</td>\n",
       "<td style=\"font-family: monospace !important;\">4</td>\n",
       "<td>MolType(('U', 'C', 'A', 'G'))</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td style=\"background: rgba(161, 195, 209, 0.25); font-weight: 600;\">protein</td>\n",
       "<td style=\"font-family: monospace !important;\">21</td>\n",
       "<td>MolType(('A', 'C', 'D', 'E', 'F', 'G', ...</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td style=\"background: rgba(161, 195, 209, 0.25); font-weight: 600;\">protein_with_stop</td>\n",
       "<td style=\"font-family: monospace !important;\">22</td>\n",
       "<td>MolType(('A', 'C', 'D', 'E', 'F', 'G', ...</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td style=\"background: rgba(161, 195, 209, 0.25); font-weight: 600;\">text</td>\n",
       "<td style=\"font-family: monospace !important;\">52</td>\n",
       "<td>MolType(('a', 'b', 'c', 'd', 'e', 'f', ...</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td style=\"background: rgba(161, 195, 209, 0.25); font-weight: 600;\">bytes</td>\n",
       "<td style=\"font-family: monospace !important;\">256</td>\n",
       "<td>MolType(('\\x00', '\\x01', '\\x02', '\\x03'...</td>\n",
       "</tr>\n",
       "</tbody>\n",
       "</table>\n",
       "<p>\n",
       "7 rows x 3 columns</p>"
      ],
      "text/plain": [
       "Specify a moltype by the string 'Abbreviation' (case insensitive).\n",
       "===================================================================================\n",
       "     Abbreviation    Number of states                                       Moltype\n",
       "-----------------------------------------------------------------------------------\n",
       "               ab                   2                           MolType(('a', 'b'))\n",
       "              dna                   4                 MolType(('T', 'C', 'A', 'G'))\n",
       "              rna                   4                 MolType(('U', 'C', 'A', 'G'))\n",
       "          protein                  21    MolType(('A', 'C', 'D', 'E', 'F', 'G', ...\n",
       "protein_with_stop                  22    MolType(('A', 'C', 'D', 'E', 'F', 'G', ...\n",
       "             text                  52    MolType(('a', 'b', 'c', 'd', 'e', 'f', ...\n",
       "            bytes                 256    MolType(('\\x00', '\\x01', '\\x02', '\\x03'...\n",
       "-----------------------------------------------------------------------------------\n",
       "\n",
       "7 rows x 3 columns"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cogent3 import available_moltypes\n",
    "\n",
    "available_moltypes()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For statements that have a `moltype` argument, use the entry under the \"Abbreviation\" column. For example:\n",
    "\n",
    "```python\n",
    "from cogent3 import load_aligned_seqs\n",
    "\n",
    "seqs = load_aligned_seqs(\"path/to/data.fasta\", moltype=\"dna\")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Getting a `MolType`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MolType(('T', 'C', 'A', 'G'))"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cogent3 import get_moltype\n",
    "\n",
    "dna = get_moltype(\"dna\")\n",
    "dna"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Using a `MolType` to get ambiguity codes\n",
    "\n",
    "Just using `dna` from above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'?': ('T', 'C', 'A', 'G', '-'),\n",
       " '-': ('-',),\n",
       " 'N': ('A', 'C', 'T', 'G'),\n",
       " 'R': ('A', 'G'),\n",
       " 'Y': ('C', 'T'),\n",
       " 'W': ('A', 'T'),\n",
       " 'S': ('C', 'G'),\n",
       " 'K': ('T', 'G'),\n",
       " 'M': ('C', 'A'),\n",
       " 'B': ('C', 'T', 'G'),\n",
       " 'D': ('A', 'T', 'G'),\n",
       " 'H': ('A', 'C', 'T'),\n",
       " 'V': ('A', 'C', 'G'),\n",
       " 'T': ('T',),\n",
       " 'C': ('C',),\n",
       " 'A': ('A',),\n",
       " 'G': ('G',)}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna.ambiguities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## `MolType` definition of degenerate codes "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'N': ('A', 'C', 'T', 'G'),\n",
       " 'R': ('A', 'G'),\n",
       " 'Y': ('C', 'T'),\n",
       " 'W': ('A', 'T'),\n",
       " 'S': ('C', 'G'),\n",
       " 'K': ('T', 'G'),\n",
       " 'M': ('C', 'A'),\n",
       " 'B': ('C', 'T', 'G'),\n",
       " 'D': ('A', 'T', 'G'),\n",
       " 'H': ('A', 'C', 'T'),\n",
       " 'V': ('A', 'C', 'G'),\n",
       " '?': 'TCAG-'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna.degenerates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Nucleic acid `MolType` and complementing\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'TCC'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna.complement(\"AGG\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Making sequences\n",
    "\n",
    "Use the either the top level `cogent3.make_seq` function, or the method on the `MolType` instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DnaSequence(AGGCTT)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "seq = dna.make_seq(\"AGGCTT\", name=\"seq1\")\n",
    "seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Verify sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rna = get_moltype(\"rna\")\n",
    "rna.is_valid(\"ACGUACGUACGUACGU\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Making a custom ``MolType``\n",
    "\n",
    "We demonstrate this by customising DNA so it allows ``.`` as gaps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DnaSequence(ACG.)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cogent3.core import moltype as mt\n",
    "\n",
    "DNAgapped = mt.MolType(seq_constructor=mt.DnaSequence,\n",
    "                       motifset=mt.IUPAC_DNA_chars,\n",
    "                       ambiguities=mt.IUPAC_DNA_ambiguities,\n",
    "                       complements=mt.IUPAC_DNA_ambiguities_complements,\n",
    "                       pairs=mt.DnaStandardPairs,\n",
    "                       gaps='.')\n",
    "seq = DNAgapped.make_seq('ACG.')\n",
    "seq"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. warning:: At present, constructing a custom ``MolType`` that overrides a builtin one affects the original (in this instance, the ``DnaSequence`` class). All subsequent calls to the original class in the running process that made the change are affected. teh below code is resetting this attribute now to allow the rest of the documentation to be executed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from cogent3 import DNA\n",
    "from cogent3.core.sequence import DnaSequence\n",
    "DnaSequence.moltype = DNA"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:c3dev] *",
   "language": "python",
   "name": "conda-env-c3dev-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
