.. _install_geneflow:

Install GeneFlow
================

GeneFlow can be installed and run in any Linux environment, and is pre-installed in the CDC environment. 

Requirements
------------

At a minimum, GeneFlow requires a Linux environment with Python 3. The Python pip installer for GeneFlow handles all Python dependencies.

Agave is required only if you want to run workflows in Agave (see https://agaveapi.co).

Install Dependencies in Ubuntu/Debian Systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Install system-level dependencies in Ubuntu with the following command:

    .. code-block:: bash

        sudo apt install python3 python3-dev git gcc

Install Dependencies in CentOS/RHEL Systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Install system-level dependencies in CentOS with the following command:

    .. code-block:: bash

        sudo yum install python36 python36-devel git gcc

Python Modules
~~~~~~~~~~~~~~

You may also need the following Python modules:

    .. code-block:: bash

        pip install setuptools wheel
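
Before installing, it can help to confirm that the tools listed above are available on your path. The following is a minimal sketch and is not part of GeneFlow; the function name ``check_prereqs`` is hypothetical.

    .. code-block:: bash

        # Hypothetical helper (not part of GeneFlow): report every
        # required command that cannot be found on the current PATH.
        check_prereqs() {
            local missing=0 cmd
            for cmd in "$@"; do
                if ! command -v "$cmd" >/dev/null 2>&1; then
                    echo "missing: $cmd" >&2
                    missing=1
                fi
            done
            return "$missing"
        }

        check_prereqs python3 git gcc && echo "all prerequisites found"

The function returns a non-zero status if any command is missing, so it can also be used as a guard at the top of a setup script.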

Prepare the CDC Environment to run GeneFlow
-------------------------------------------

To use the pre-installed GeneFlow in the CDC environment, use the following instructions to prepare the environment and load the module.

Prepare the CDC Environment and Load the GeneFlow Module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the following steps to set up your CDC Linux environment to run GeneFlow. GeneFlow can be run from Biolinux or from any CDC Linux system that has access to the "modules" environment.

1. In your home directory, create a working directory.

    .. code-block:: bash

        mkdir ~/geneflow_work

2. Although GeneFlow output can be directed to any folder, create an output folder to help organize workflow outputs:

    .. code-block:: bash

        mkdir ~/geneflow_work/output

3. Load the GeneFlow module. Note that older versions of GeneFlow can also be loaded by replacing "latest" with the desired version number:

    .. code-block:: bash

        module load geneflow/latest

Prepare Agave in the CDC Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to run workflows in Agave, you will need to initialize your CDC Agave environment.

Follow these instructions to prepare your Agave environment. Note that these instructions only need to be performed once.

1. Load the Agave CLI tools:

    .. code-block:: bash

        module load cobra-cli/0.1

2. Initialize your client:

    .. code-block:: bash

        cobra-init

    ``cobra-init`` will prompt you for your username and password.

3. Create an execution system:

    .. code-block:: bash

        cobra-systems-create

    Note the name of the new execution system, which will be formatted as ``cobra-hpc-aspen-[USER]-[DATE]``.

    Create GeneFlow output and work directories in your Agave home:

    .. code-block:: bash

        files-mkdir -N geneflow-output /[USER]
        files-mkdir -N geneflow-work /[USER]

4. Prepare the Agave configuration file:

    Create a new file with Agave environment parameters:

    .. code-block:: bash

        cd ~/geneflow_work
        vi ./agave-params.yaml

    Add the following to the file:

    .. code-block:: yaml

       %YAML 1.1
       ---
       agave:
         # prefix for app name. For user apps, use your username.
         # For public apps, use 'public'.
         appsPrefix: [USER]

         # must have publish rights to the execution system
         executionSystem: cobra-hpc-aspen-[USER]-[DATE]

         # location of your agave home directory
         deploymentSystem: cobra-default-public-storage

         # Apps directory where app assets will be uploaded
         # This must be an absolute path
         appsDir: /[USER]/apps-gf

         # location of workflow test data, absolute path
         testDataDir: /[USER]/testdata-gf


    Replace ``[USER]`` with your Agave username.

    ``executionSystem`` should be the system created in step 3 (e.g., ``cobra-hpc-aspen-[USER]-[DATE]``, with ``[USER]`` and ``[DATE]`` replaced by your own values). To see a list of execution systems to which you have access, use:

    .. code-block:: bash

        systems-list -E

    ``deploymentSystem`` should be left at the default value.
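
Because the configuration file is edited by hand, a placeholder is easy to overlook. The following is a minimal sketch of a check for leftover ``[USER]`` or ``[DATE]`` placeholders; the function name ``check_placeholders`` is hypothetical and is not part of GeneFlow or the Agave CLI.

    .. code-block:: bash

        # Hypothetical helper (not part of GeneFlow or the Agave CLI):
        # fail if the file is missing or still contains [USER] or [DATE].
        check_placeholders() {
            if [ ! -f "$1" ]; then
                echo "file not found: $1" >&2
                return 1
            fi
            # grep prints each offending line with its line number
            if grep -n -E '\[(USER|DATE)\]' "$1"; then
                echo "replace the placeholders listed above in $1" >&2
                return 1
            fi
            return 0
        }

        if check_placeholders ~/geneflow_work/agave-params.yaml; then
            echo "agave-params.yaml has no unreplaced placeholders"
        fi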

Install GeneFlow in a General Linux Environment
-----------------------------------------------

Use the following instructions to install GeneFlow in a general Linux environment. 

.. _install-geneflow-venv:

Install GeneFlow using a Python Virtual Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the following instructions to install GeneFlow in a Python virtual environment and prepare the environment for running workflows.

1. Create a working directory in a Linux environment. The remainder of these instructions will assume that the working directory is in ``~/geneflow_work``. If you customize this, be sure to adjust the commands in the instructions accordingly.

    .. code-block:: bash

        mkdir ~/geneflow_work

2. Set up a Python 3 virtual environment and install dependencies with pip3. The virtual environment is optional if you have sudo access and wish to install GeneFlow and dependencies system-wide. The python3 executable must be available in your path.

    Create and activate the Python 3 virtual environment:

    .. code-block:: bash

        cd ~/geneflow_work
        python3 -m venv gfpy
        source gfpy/bin/activate

    You should now see a modified prompt prefixed with "(gfpy)".

3. Clone the GeneFlow source code and install it:

    .. code-block:: bash

        git clone https://github.com/CDCgov/geneflow2
        pip3 install ./geneflow2
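
If you open a new shell later, the virtual environment must be activated again before GeneFlow is usable. The following is a minimal sketch of a check that the environment from step 2 is active; the function name ``in_venv`` is hypothetical, and the check relies on the ``VIRTUAL_ENV`` variable that the ``activate`` script sets.

    .. code-block:: bash

        # Hypothetical helper: succeed only when a virtual environment is
        # active and the python3 found on PATH is the one inside it.
        in_venv() {
            [ -n "${VIRTUAL_ENV:-}" ] &&
                [ "$(command -v python3)" = "$VIRTUAL_ENV/bin/python3" ]
        }

        if in_venv; then
            echo "using virtual environment: $VIRTUAL_ENV"
        else
            echo "no virtual environment active; run:" >&2
            echo "  source ~/geneflow_work/gfpy/bin/activate" >&2
        fi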

Prepare Agave in a General Linux Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to run workflows in Agave, you will need access to an Agave tenant. Follow the instructions at https://agaveapi.co to set up Agave with a client, storage system, and execution system. Alternatively, talk to your system administrator about setting up an Agave tenant.

1. Create an Agave token using the Agave CLI (https://bitbucket.org/tacc-cic/cli):

    .. code-block:: bash

        auth-tokens-create -u [USER]

2. Create GeneFlow output and work directories in your Agave home:

    .. code-block:: bash

        files-mkdir -N geneflow-output /[USER]
        files-mkdir -N geneflow-work /[USER]

    Note that, depending on your Agave default storage system, your home directory may be in a different place (e.g., /home/[USER]). Check with your Agave administrator if you are unsure.

3. Install the Agave Python wrapper:

    .. code-block:: bash

        pip3 install agavepy

4. Prepare the Agave configuration file:

    Create a new file with Agave environment parameters:

    .. code-block:: bash

        cd ~/geneflow_work
        vi ./agave-params.yaml

    Add the following to the file:

    .. code-block:: yaml

        %YAML 1.1
        ---
        agave:
          # prefix for app name. For user apps, use your username.
          # For public apps, use 'public'.
          appsPrefix: [USER]

          # must have publish rights to the execution system
          executionSystem: [execution-system]

          # location of your agave home directory
          deploymentSystem: [default-public-storage]

          # Apps directory where app assets will be uploaded.
          # This must be an absolute path.
          appsDir: /[USER]/apps-gf

          # location of workflow test data, absolute path.
          testDataDir: /[USER]/testdata-gf

    Replace ``[USER]`` with your Agave username.

    Replace ``[execution-system]`` with an Agave execution system for which you have "OWNER" access.

    Replace ``[default-public-storage]`` with an Agave storage system that contains your home directory.

    For public apps, ``appsDir`` and ``testDataDir`` can be set to a global or shared location instead of a user home directory.
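
As an alternative to editing the file by hand, the same settings can be generated from the three values that vary. The following is a minimal sketch; the function name ``write_agave_params`` and the example values ``jdoe``, ``my-exec-system``, and ``my-storage-system`` are hypothetical and must be replaced with your own username and system names.

    .. code-block:: bash

        # Hypothetical helper: print an Agave parameters file to stdout.
        # Arguments: Agave username, execution system, storage system.
        write_agave_params() {
            printf '%%YAML 1.1\n---\nagave:\n'
            printf '  appsPrefix: %s\n' "$1"
            printf '  executionSystem: %s\n' "$2"
            printf '  deploymentSystem: %s\n' "$3"
            printf '  appsDir: /%s/apps-gf\n' "$1"
            printf '  testDataDir: /%s/testdata-gf\n' "$1"
        }

        # Example values only; substitute your own.
        write_agave_params jdoe my-exec-system my-storage-system

Redirect the output to ``~/geneflow_work/agave-params.yaml`` to create the file. The sketch assumes that your Agave home directory is ``/[USER]``; adjust the ``appsDir`` and ``testDataDir`` lines if your home directory is elsewhere or if you are configuring public apps.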