Generic File Synchronizer
==========================

Concept
-------

To synchronize arbitrary folder structures, we created a module that continuously watches
a specified folder and can automatically create datasets from it.

The software identifies datasets by the presence of a ``_QH_dataset_info.yaml`` file in the folder.
This file specifies the minimum amount of information needed to create a dataset.
Every other file in this (sub)directory is considered a data file and will be added to the dataset.

An example of the folder structure:

.. code-block:: none

    main_folder
    ├── 20240101
    │   ├── 20240101-211245-165-731d85-experiment_1
    │   │   ├── _QH_dataset_info.yaml
    │   │   ├── 01-01-2024_01-01-01.json
    │   │   ├── 01-01-2024_01-01-01.hdf5
    ├── 20240102
    │   ├── 20240102-220655-268-455d85-experiment_2
    │   │   ├── _QH_dataset_info.yaml
    │   │   ├── 02-01-2024_02-02-02.json
    │   │   ├── 02-01-2024_02-02-02.hdf5
    │   │   ├── analysis
    │   │   │   ├── 02-01-2024_02-02-02_analysis.json
    │   │   │   ├── 02-01-2024_02-02-02_analysis.hdf5
    ├── some_other_folder
    │   ├── _QH_dataset_info.yaml
    │   ├── 01-01-2024_01-01-01.json

If a file is added to any of these dataset folders, the synchronization agent automatically adds it to the corresponding dataset.
A newly created folder that contains a ``_QH_dataset_info.yaml`` file is picked up as a new dataset.
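
The discovery rule described above can be sketched in a few lines of Python.
This is only an illustration of the convention, not the agent's actual implementation:
``find_dataset_folders`` and ``list_data_files`` are hypothetical helper names,
and treating files in subfolders as part of the dataset is an assumption based on the ``analysis`` folder in the example.

.. code-block:: python

    import pathlib

    INFO_FILE = "_QH_dataset_info.yaml"


    def find_dataset_folders(root: pathlib.Path) -> list[pathlib.Path]:
        """Return every folder below ``root`` that contains a dataset info file."""
        return sorted(info.parent for info in root.rglob(INFO_FILE))


    def list_data_files(dataset_folder: pathlib.Path) -> list[pathlib.Path]:
        """Return all files of a dataset except the info file itself (assumed rule)."""
        return sorted(
            path for path in dataset_folder.rglob("*")
            if path.is_file() and path.name != INFO_FILE
        )

Running ``find_dataset_folders`` on your own data folder is a quick way to check which folders would be recognized as datasets under this convention.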

.. _creating-dataset-info-files:

The ``_QH_dataset_info.yaml`` file
----------------------------------
When performing measurements, we recommend programmatically writing the ``_QH_dataset_info.yaml`` file to the relevant folder.

A minimal example of this file:

.. code-block:: yaml

    version : 0.1

This file can contain the following fields:

* ``version`` (required): The version of the file format. This ensures compatibility with the software. The current version is 0.1.
* ``dataset_name`` (optional): The name of the dataset. If not provided, the name of the folder containing the ``_QH_dataset_info.yaml`` file is used.
* ``created`` (optional): The date the dataset was created, in the format ``YYYY-MM-DDTHH:MM:SS``. If not provided, the creation time of the first file in the dataset is used.
* ``description`` (optional): A description of the dataset.
* ``attributes`` (optional): A dictionary of attributes to be added to the dataset. Values should be of type str or number.
* ``keywords`` (optional): A list of keywords to be added to the dataset.
* ``converters`` (optional): Allows you to specify file converters to automatically convert files within the dataset.

    To create your own converter, you need to create a class that inherits from the ``FileConverter`` class
    and implement the ``convert`` method. For more details, see the section on :ref:`creating-file-converters`.
  
    The syntax for the converters field is as follows:
  
    .. code-block:: yaml

        converters:
            txt_to_csv_converter : #name of the converter
                module : my_library.location.to.module
                class : MyConverterClass

* ``skip`` (optional): A list of patterns to skip files (e.g., [ ``*.json``, ``my_image.png`` ]).

An example of a more complete ``_QH_dataset_info.yaml`` file is:

.. code-block:: yaml

    version : 0.1
    dataset_name : 'my_dataset_name'
    description : "Description of the experiment I want to do."
    attributes:
      'initials' : 'QH'
      'set_up' : 'XLD001'
      'sample' : 'my_sample'
    keywords: ['rabi', 'test']
    skip: ['*.json', 'raw_data/*',]
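
To get a feeling for what a ``skip`` list excludes, the patterns can be tried out locally.
The sketch below assumes that the patterns are shell-style globs matched against the file path relative to the dataset folder.
The agent's exact matching rules may differ, and ``is_skipped`` is a hypothetical helper.

.. code-block:: python

    from fnmatch import fnmatchcase


    def is_skipped(relative_path: str, skip_patterns: list[str]) -> bool:
        """Return True if the path matches at least one skip pattern."""
        return any(fnmatchcase(relative_path, pattern) for pattern in skip_patterns)


    skip = ["*.json", "raw_data/*"]
    for name in ["settings.json", "raw_data/trace.bin", "measurement.hdf5"]:
        print(name, "-> skipped" if is_skipped(name, skip) else "-> kept")

Under these assumptions, ``settings.json`` and ``raw_data/trace.bin`` are skipped, while ``measurement.hdf5`` is kept.
Note that in ``fnmatch`` the ``*`` wildcard also matches path separators, so ``*.json`` would match JSON files in subfolders as well.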


Programmatically creating the ``_QH_dataset_info.yaml`` file
------------------------------------------------------------

The config file can be created programmatically using the following function from the ``qdrive`` package:

.. code-block:: python

    from datetime import datetime

    from qdrive.dataset import generate_dataset_info

    # import a converter
    from etiket_client.sync.backends.filebase.converters.zarr_to_zip import ZarrToZipConverter

    # for example (note that all of these fields are optional):
    path = "/Users/stephan/Desktop/test_2/"
    generate_dataset_info(path, 
        dataset_name = "my_dataset_name",
        creation = datetime.now(),
        description="Description of the experiment I want to do.",
        attributes={"sample" : "my_sample"},
        keywords=["rabi", "test"],
        converters = [ZarrToZipConverter],
        skip=["*.json", "raw_data/*"])
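
If a measurement script cannot import ``qdrive``, a small info file can also be written with the standard library alone.
The following is a minimal sketch that covers only a few of the fields documented above.
``write_dataset_info`` is a hypothetical helper and not part of ``qdrive``, and ``generate_dataset_info`` remains the supported route.

.. code-block:: python

    import json
    import pathlib


    def write_dataset_info(folder, dataset_name=None, description=None, keywords=None):
        """Write a small ``_QH_dataset_info.yaml`` file (hypothetical helper)."""
        lines = ["version : 0.1"]
        if dataset_name is not None:
            # json.dumps yields a double-quoted string, which is also valid YAML
            lines.append(f"dataset_name : {json.dumps(dataset_name)}")
        if description is not None:
            lines.append(f"description : {json.dumps(description)}")
        if keywords:
            # a JSON list is a valid YAML flow sequence
            lines.append(f"keywords : {json.dumps(list(keywords))}")
        target = pathlib.Path(folder) / "_QH_dataset_info.yaml"
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
        return target

Quoting the values through ``json.dumps`` avoids surprises with characters such as ``:`` or ``#`` in names and descriptions.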

Setting Up the synchronization agent
-------------------------------------

The easiest way to configure the synchronization agent is through the GUI.
To add a synchronization source programmatically instead, use the following code snippet:

.. code-block:: python

    from etiket_client.sync.backends.sources import add_sync_source
    from etiket_client.python_api.scopes import get_scope_by_name

    from etiket_client.sync.backends.filebase.filebase_sync_class import FileBaseSync
    from etiket_client.sync.backends.filebase.filebase_config_class import FileBaseConfigData

    import pathlib 

    data_path = pathlib.Path('/Users/user/Desktop') # update path !!!!!!

    if not data_path.exists():
        raise ValueError(f"Data path {data_path} does not exist. Please correct the path.")

    # scope to which the data will be uploaded
    scope4upload = get_scope_by_name('scope_name') # update scope name !!!!!!

    # configuration of the watched folder (set server_folder=True for a network drive)
    config = FileBaseConfigData(root_directory=data_path, server_folder = False)

    # give a name to the sync agent (should be locally unique)
    add_sync_source('my_sync_source_name', FileBaseSync, config, scope4upload)

.. note::
    When your data is stored on a network drive, the ``server_folder`` parameter should be set to ``True``.

.. _creating-file-converters:

Creating your own file converters
---------------------------------

To create your own file converter,
you need to create a class that inherits from the ``FileConverter`` class
and implements the ``convert`` method.


Here is an example of a converter that converts ``.zarr`` files to ``.zip`` files:

.. code-block:: python

    from etiket_client.sync.backends.filebase.converters.base import FileConverter

    import shutil, pathlib

    class ZarrToZipConverter(FileConverter):
        input_type = 'zarr' # Specify the input file type
        output_type = 'zip' # Specify the output file type
        
        def convert(self) -> pathlib.Path:
            folder_name = self.file_path.name
            shutil.make_archive(
                base_name=str(pathlib.Path(self.temp_dir.name) / folder_name),
                format='zip',
                root_dir=str(self.file_path)
            )
            return pathlib.Path(self.temp_dir.name) / f"{folder_name}.zip"

The ``FileConverter`` class requires the following attributes to be set:

- ``input_type`` (required): The file extension of the input file.
- ``output_type`` (required): The file extension of the output file.

The ``FileConverter`` class provides the following attributes for use in the ``convert`` method:

- ``self.file_path``: The path of the input file.
- ``self.temp_dir``: A temporary directory where the output file can be stored.

In the ``convert`` method, you can convert the file to the desired format using the provided paths,
store it in the temporary directory, and return the path to the converted file.
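
It can help to keep the conversion logic in a plain function, so that it can be tested without the synchronization agent.
The sketch below shows this for a hypothetical text-to-CSV conversion.
``txt_to_csv`` is an illustrative name, and the assumption that the input consists of whitespace-separated columns is ours.

.. code-block:: python

    import csv
    import pathlib


    def txt_to_csv(source: pathlib.Path, output_dir: pathlib.Path) -> pathlib.Path:
        """Convert a whitespace-separated text file into a CSV file in ``output_dir``."""
        target = output_dir / f"{source.stem}.csv"
        with source.open(newline="") as src, target.open("w", newline="") as dst:
            writer = csv.writer(dst)
            for line in src:
                if line.strip():  # ignore empty lines
                    writer.writerow(line.split())
        return target

Inside a converter with ``input_type = 'txt'`` and ``output_type = 'csv'``, the ``convert`` method could then simply
``return txt_to_csv(self.file_path, pathlib.Path(self.temp_dir.name))``.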

.. note::
    Don't forget to install the package containing your converter in the environment where the synchronization agent runs (see next section)!

.. tip::

    You can test the converter by running the following code:

    .. code-block:: python

        import pathlib

        from etiket_client.sync.backends.filebase.converters.zarr_to_zip import ZarrToZipConverter

        converter = ZarrToZipConverter(pathlib.Path('/Users/user/Desktop/test.zarr'))

        with converter:
            output_path = converter.convert()
            print(output_path)
            # Here you can test further if the output is correct
            # When exiting this context manager, the temporary directory will be removed

Packaging Your Own Converter
----------------------------

To use your custom file converter with the synchronization agent,
you need to ensure that your converter is installed in the same environment where the sync agent runs.
If you're not familiar with Python packaging, don't worry -- you can just follow these step-by-step instructions.

1. **Organize Your Converter Code**

   Create a new directory for your converter package. Within this directory, create a subdirectory for your converter module. Here's an example structure:

   .. code-block:: none

       my_converter_package/
       ├── my_converters/
       │   ├── __init__.py
       │   └── my_converter.py

   - ``my_converter_package``: This is the root folder of your package.
   - ``my_converters``: This subfolder will contain your converter code.
   - ``__init__.py``: An empty file that tells Python this directory is a package.
   - ``my_converter.py``: The file where your converter class is defined.

   Place your converter class (e.g., ``MyConverterClass``) in ``my_converter.py``, as shown in the previous section.

2. **Add a** ``pyproject.toml`` **File**

   In the root directory (``my_converter_package/``), create a file named ``pyproject.toml`` with the following content:

   .. code-block:: toml

       [project]
       name = "my_converters"
       version = "0.1.0"
       description = "My custom converter package"
       dependencies = ["numpy"] # Add any dependencies here

       [build-system]
       requires = ["setuptools>=64.0"]
       build-backend = "setuptools.build_meta"

   More info on the ``pyproject.toml`` file can be found `here <https://packaging.python.org/en/latest/guides/writing-pyproject-toml/>`_.

3. **Install the Package**

   Open a terminal and navigate to the root directory of your package:

   .. code-block:: bash

       cd /path/to/my_converter_package

   Replace ``/path/to/my_converter_package`` with the actual path to your package directory.

   Install the package using pip:

   .. code-block:: bash

       pip install .

   This command installs your converter package into the currently active Python environment.
   Make sure this is the environment in which the synchronization agent runs.

   .. tip::

        To install the package in editable (development) mode, allowing you to modify the code without reinstalling, use:

        .. code-block:: bash

            pip install -e .

4. **Reference the Converter in** ``_QH_dataset_info.yaml``

   In your dataset info file, specify your converter so the synchronization agent knows how to use it:

   .. code-block:: yaml

       converters:
         my_custom_converter:
           module: my_converters.my_converter
           class: MyConverterClass

   - ``module``: The dotted path to your converter module. In this example, ``my_converters.my_converter`` corresponds to ``my_converters/my_converter.py``.
   - ``class``: The name of your converter class defined in ``my_converter.py``.

5. **Restart the Synchronization Agent**

   If the synchronization agent is already running, restart it to recognize the newly installed converter.

By following these steps, your custom converter will be available to the synchronization agent, allowing it to process files using your converter as specified.
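
Before restarting the agent, you can check from the same environment that the ``module`` and ``class`` values in your dataset info file actually resolve to a class.
The sketch below uses only the standard library.
``load_converter`` is a hypothetical helper, and it is an assumption that the agent resolves converters in a comparable way.

.. code-block:: python

    import importlib


    def load_converter(module: str, class_name: str) -> type:
        """Import ``module`` and return the class named ``class_name``."""
        try:
            mod = importlib.import_module(module)
        except ImportError as exc:
            raise RuntimeError(
                f"Module {module!r} is not importable; is the package installed "
                "in the environment of the synchronization agent?"
            ) from exc
        try:
            return getattr(mod, class_name)
        except AttributeError as exc:
            raise RuntimeError(f"Module {module!r} has no class {class_name!r}.") from exc


    # Example call for the package from the steps above:
    # load_converter("my_converters.my_converter", "MyConverterClass")

A ``RuntimeError`` here usually points to a typo in the dotted path or to a package that was installed into a different environment.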

.. note::

    **Dependencies**: If your converter requires additional Python packages, list them in the ``dependencies`` field of the ``pyproject.toml`` file:

    .. code-block:: toml

        [project]
        name = "my_converters"
        version = "0.1.0"
        dependencies = [
            "numpy",
            "pandas",
        ]

    This ensures that required packages are installed alongside your converter.

.. note::
    
    **Testing Your Converter**: 
    It's a good idea to test your converter before using it with the synchronization agent. You can do this by running:

    .. code-block:: python

        from my_converters.my_converter import MyConverterClass
        import pathlib

        converter = MyConverterClass(pathlib.Path('/path/to/your/file.zarr'))

        with converter:
            output_path = converter.convert()
            print(f'Converted file saved at: {output_path}')