{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Specifying data for analysis\n",
    "\n",
    "We introduce the concept of a \"data store\". This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files or a single `tinydb` file containing multiple data records.\n",
    "\n",
    "We represent this concept by a `DataStore` class. There are different flavours of these:\n",
    "\n",
    "- directory based\n",
    "- zip archive based\n",
    "- TinyDB based (TinyDB is a NoSQL, JSON-based database)\n",
    "\n",
    "These can be read only or writable. All of these types support indexing, iteration, filtering and so on. The `tinydb` variants have some unique abilities (discussed below)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A read only data store\n",
    "\n",
    "To create one, provide a `path` and the `suffix` of the files within the directory or zip archive that you will be analysing. The `suffix` can be a glob pattern, and `limit` restricts the number of members. (If the path ends with `.tinydb`, no file suffix is required.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5x member ReadOnlyZippedDataStore(source='../data/raw.zip', members=['../data/raw/ENSG00000157184.fa', '../data/raw/ENSG00000131791.fa', '../data/raw/ENSG00000127054.fa'...)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cogent3.app.io import get_data_store\n",
    "\n",
    "dstore = get_data_store(\"../data/raw.zip\", suffix=\"fa*\", limit=5)\n",
    "dstore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data store \"members\"\n",
    "\n",
    "Indexing a data store returns one of its members. A member can read its own raw data using its `read()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'../data/raw/ENSG00000157184.fa'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m = dstore[0]\n",
    "m"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'>Human\\nATGGTGCCCCGCC'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.read()[:20]  # truncating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Showing the last few members\n",
    "\n",
    "Use the `tail()` method to see the last few members. Use the `head()` method to see the first few."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['../data/raw/ENSG00000157184.fa',\n",
      " '../data/raw/ENSG00000131791.fa',\n",
      " '../data/raw/ENSG00000127054.fa',\n",
      " '../data/raw/ENSG00000067704.fa',\n",
      " '../data/raw/ENSG00000182004.fa']\n"
     ]
    }
   ],
   "source": [
    "dstore.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering a data store for specific members"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../data/raw/ENSG00000067704.fa']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dstore.filtered(\"*ENSG00000067704*\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Looping over a data store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "../data/raw/ENSG00000157184.fa\n",
      "../data/raw/ENSG00000131791.fa\n",
      "../data/raw/ENSG00000127054.fa\n",
      "../data/raw/ENSG00000067704.fa\n",
      "../data/raw/ENSG00000182004.fa\n"
     ]
    }
   ],
   "source": [
    "for m in dstore:\n",
    "    print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making a writable data store\n",
    "\n",
    "You do not create a writable data store directly. The writers we provide under `cogent3.app.io` create one for you."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TinyDB data stores are special\n",
    "\n",
    "When you specify a TinyDB data store as your output (by using `io.write_db()`), you get additional features that are useful for dissecting the results of an analysis.\n",
    "\n",
    "One important point is that the process which creates a TinyDB \"locks\" the file. If that process exits abnormally (e.g. the run that was producing it was interrupted), the file may remain in a locked state. While the db is in this state, `cogent3` will not modify it unless you explicitly unlock it.\n",
    "\n",
    "The locked state is reported in the display of `describe`, as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<style>\n",
       "tr:last-child {border-bottom: 1px solid #000;} tr > th {text-align: center !important;} tr > td {text-align: left !important;}\n",
       "</style>\n",
       "<caption style=\"color: rgb(250, 250, 250); background: rgba(30, 140, 200, 1); align=top;\"><span style=\"font-weight: bold;\">Locked db store. Locked to pid=8582, current pid=13845</span><span></span></caption>\n",
       "<thead style=\"background: rgba(161, 195, 209, 0.75); font-weight: bold; text-align: center;\">\n",
       "<th>record type</th>\n",
       "<th>number</th>\n",
       "</thead>\n",
       "<tbody>\n",
       "<tr>\n",
       "<td>completed</td>\n",
       "<td style=\"font-family: monospace !important;\">175</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td>incomplete</td>\n",
       "<td style=\"font-family: monospace !important;\">0</td>\n",
       "</tr>\n",
       "<tr>\n",
       "<td>logs</td>\n",
       "<td style=\"font-family: monospace !important;\">1</td>\n",
       "</tr>\n",
       "</tbody>\n",
       "</table>\n",
       "<p>\n",
       "3 rows x 2 columns</p>"
      ],
      "text/plain": [
       "Locked db store. Locked to pid=8582, current pid=13845\n",
       "=====================\n",
       "record type    number\n",
       "---------------------\n",
       "  completed       175\n",
       " incomplete         0\n",
       "       logs         1\n",
       "---------------------\n",
       "\n",
       "3 rows x 2 columns"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dstore = get_data_store(\"../data/demo-locked.tinydb\")\n",
    "dstore.describe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To unlock, you execute the following:\n",
    "\n",
    "```python\n",
    "dstore.unlock(force=True)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interrogating run logs\n",
    "\n",
    "If you call the `apply_to()` method with `logger=True`, a `scitrack` log file is included in the data store. The log records useful information about the run conditions that produced the contents of the data store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<style>\n",
       "tr:last-child {border-bottom: 1px solid #000;} tr > th {text-align: center !important;} tr > td {text-align: left !important;}\n",
       "</style>\n",
       "<caption style=\"color: rgb(250, 250, 250); background: rgba(30, 140, 200, 1); align=top;\"><span style=\"font-weight: bold;\">summary of log files</span><span></span></caption>\n",
       "<thead style=\"background: rgba(161, 195, 209, 0.75); font-weight: bold; text-align: center;\">\n",
       "<th>time</th>\n",
       "<th>name</th>\n",
       "<th>python version</th>\n",
       "<th>who</th>\n",
       "<th>command</th>\n",
       "<th>composable</th>\n",
       "</thead>\n",
       "<tbody>\n",
       "<tr>\n",
       "<td>2019-07-24 14:42:56</td>\n",
       "<td>load_unaligned-progressive_align-write_db-pid8650.log</td>\n",
       "<td>3.7.3</td>\n",
       "<td>gavin</td>\n",
       "<td>/Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json</td>\n",
       "<td>load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')</td>\n",
       "</tr>\n",
       "</tbody>\n",
       "</table>\n",
       "<p>\n",
       "1 rows x 6 columns</p>"
      ],
      "text/plain": [
       "summary of log files\n",
       "====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================\n",
       "               time                                                     name    python version      who                                                                                                                                                                          command                                                                                                                                                                                                                                                                                                                                                                  composable\n",
       "--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
       "2019-07-24 14:42:56    load_unaligned-progressive_align-write_db-pid8650.log             3.7.3    gavin    /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json    load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')\n",
       "--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
       "\n",
       "1 rows x 6 columns"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dstore.summary_logs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Log files can be accessed via the `logs` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['load_unaligned-progressive_align-write_db-pid8650.log']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dstore.logs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each element in that list is a `DataStoreMember`, which you can use to read the contents of the log."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2019-07-24 14:42:56\tEratosthenes.local:8650\tINFO\tsystem_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64\n",
      "2019-07-24 14:42:56\tEratosthenes.local:8650\tINFO\tpython\n"
     ]
    }
   ],
   "source": [
    "print(dstore.logs[0].read()[:225])  # truncated for clarity"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:c3dev] *",
   "language": "python",
   "name": "conda-env-c3dev-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
