Generic File Synchronizer

Concept

To synchronize arbitrary folder structures, we created a module that continuously watches a specified folder and can automatically create datasets from it.

The software identifies datasets by the presence of a _QH_dataset_info.yaml file in the folder. This file specifies the minimum amount of information needed to create a dataset. Every other file in this (sub)directory is considered a data file and will be added to the dataset.

An example of the folder structure:

main_folder
├── 20240101
│   ├── 20240101-211245-165-731d85-experiment_1
│   │   ├── _QH_dataset_info.yaml
│   │   ├── 01-01-2024_01-01-01.json
│   │   ├── 01-01-2024_01-01-01.hdf5
├── 20240102
│   ├── 20240102-220655-268-455d85-experiment_2
│   │   ├── _QH_dataset_info.yaml
│   │   ├── 02-01-2024_02-02-02.json
│   │   ├── 02-01-2024_02-02-02.hdf5
│   │   ├── analysis
│   │   │   ├── 02-01-2024_02-02-02_analysis.json
│   │   │   ├── 02-01-2024_02-02-02_analysis.hdf5
├── some_other_folder
│   ├── _QH_dataset_info.yaml
│   ├── 01-01-2024_01-01-01.json

If a file is added to any of these dataset folders, the synchronization agent automatically adds it to the corresponding dataset. A newly created folder that contains a _QH_dataset_info.yaml file is picked up as a new dataset.
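The discovery rule described above can be sketched with the standard library alone. This is only an illustration of the concept, not the synchronization agent's actual implementation: the function name `find_datasets` is hypothetical, and the sketch does not treat nested dataset folders specially.

```python
import pathlib

INFO_FILE = "_QH_dataset_info.yaml"


def find_datasets(root):
    """Map each dataset folder under `root` to the data files it contains."""
    root = pathlib.Path(root)
    datasets = {}
    # A folder is a dataset if it contains the info file.
    for info_file in sorted(root.rglob(INFO_FILE)):
        folder = info_file.parent
        # Every other file in the folder or its subfolders counts as a data file.
        datasets[folder] = sorted(
            path for path in folder.rglob("*")
            if path.is_file() and path.name != INFO_FILE
        )
    return datasets
```

Calling `find_datasets("main_folder")` on the structure shown above would return three dataset folders, with the files under `analysis` included in the second one.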

The `_QH_dataset_info.yaml` file

When performing measurements, we recommend programmatically writing the _QH_dataset_info.yaml file to the relevant folder.

A minimal example of this file:

version : 0.1

This file can contain the following fields:

version (required): The version of the file format. This ensures compatibility with the software. The current version is 0.1.
dataset_name (optional): The name of the dataset. If not provided, the name of the folder containing the _QH_dataset_info.yaml file is used.
created (optional): The date the dataset was created, in the format YYYY-MM-DDTHH:MM:SS. If not provided, the creation time of the first file in the dataset is used.
description (optional): A description of the dataset.
attributes (optional): A dictionary of attributes to be added to the dataset. Values should be of type str or number.
keywords (optional): A list of keywords to be added to the dataset.
converters (optional): Allows you to specify file converters to automatically convert files within the dataset. To create your own converter, you need to create a class that inherits from the FileConverter class and implement the convert method. For more details, see the section on Creating your own file converters. The syntax for the converters field is as follows:

converters:
  txt_to_csv_converter: # name of the converter
    module: my_library.location.to.module
    class: MyConverterClass

skip (optional): A list of patterns to skip files (e.g., [ *.json, my_image.png ]).
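To get a feel for how skip patterns select files, the sketch below applies shell-style matching from Python's `fnmatch` module. This is an assumption made for illustration: the exact matching rules used by the synchronization agent are not specified here, and `is_skipped` is a hypothetical helper.

```python
from fnmatch import fnmatch


def is_skipped(relative_path, patterns):
    """Return True if the path (relative to the dataset folder) matches any pattern."""
    return any(fnmatch(relative_path, pattern) for pattern in patterns)


patterns = ["*.json", "raw_data/*"]
for path in ["result.hdf5", "settings.json", "raw_data/trace_001.bin"]:
    print(path, "->", "skipped" if is_skipped(path, patterns) else "kept")
```

With these patterns, `settings.json` and everything under `raw_data/` would be skipped, while `result.hdf5` would be kept.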

An example of a more complete _QH_dataset_info.yaml file is:

version : 0.1
dataset_name : 'my_dataset_name'
description : "Description of the experiment I want to do."
attributes:
  'initials' : 'QH'
  'set_up' : 'XLD001'
  'sample' : 'my_sample'
keywords: ['rabi', 'test']
skip: ['*.json', 'raw_data/*',]
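The created field expects a timestamp in the format YYYY-MM-DDTHH:MM:SS. A minimal sketch of producing such a string with the standard library (the helper name `format_created` is hypothetical):

```python
from datetime import datetime


def format_created(moment):
    """Format a datetime as YYYY-MM-DDTHH:MM:SS (no fractional seconds)."""
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


print(format_created(datetime(2024, 1, 1, 21, 12, 45)))  # 2024-01-01T21:12:45
```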

Note

The YAML file must use spaces for indentation, not tabs. Using tabs will cause parsing errors and the synchronization of the dataset will fail.
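If you write the file by hand or from a template, a quick check for tab indentation can catch this before the synchronization agent does. The following is a small standard-library sketch; `tab_indented_lines` is a hypothetical helper, not part of the qdrive package.

```python
def tab_indented_lines(text):
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged


good = "attributes:\n  'sample' : 'my_sample'\n"
bad = "attributes:\n\t'sample' : 'my_sample'\n"
print(tab_indented_lines(good))  # []
print(tab_indented_lines(bad))   # [2]
```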

Programmatically creating the `_QH_dataset_info.yaml` file

The config file can be programmatically created using the following function from the qdrive package:

from datetime import datetime

from qdrive.dataset import generate_dataset_info

# import a converter
from etiket_client.sync.backends.filebase.converters.zarr_to_zip import ZarrToZipConverter

# for example (note that all these fields are optional):
path = "/Users/stephan/Desktop/test_2/"
generate_dataset_info(path,
    dataset_name = "my_dataset_name",
    creation = datetime.now(),
    description="Description of the experiment I want to do.",
    attributes={"sample" : "my_sample"},
    keywords=["rabi", "test"],
    converters = [ZarrToZipConverter],
    skip=["*.json", "raw_data/*"])

Setting Up the synchronization agent

The easiest way to configure the synchronization agent is through the GUI. In case you want to add it programmatically, you can use the following code snippet:

from etiket_client.sync.backends.sources import add_sync_source
from etiket_client.python_api.scopes import get_scope_by_name

from etiket_client.sync.backends.filebase.filebase_sync_class import FileBaseSync
from etiket_client.sync.backends.filebase.filebase_config_class import FileBaseConfigData

import pathlib

data_path = pathlib.Path('/Users/user/Desktop') # update path !!!!!!

if not data_path.exists():
    raise ValueError(f"Data path {data_path} does not exist. Please correct the path.")

# scope to which the data will be uploaded
scope4upload = get_scope_by_name('scope_name') # update scope name !!!!!!

# configuration of the folder to watch (set server_folder to True for network drives)
config = FileBaseConfigData(root_directory=data_path, server_folder = False)

# give a name to the sync agent (should be locally unique)
add_sync_source('my_sync_source_name', FileBaseSync, config, scope4upload)

Note

When your data is stored on a network drive, the server_folder parameter should be set to True.

Creating your own file converters

To create your own file converter, you need to create a class that inherits from the FileConverter class and implements the convert method.

Here is an example of a converter that converts .zarr files to .zip files:

from etiket_client.sync.backends.filebase.converters.base import FileConverter

import shutil, pathlib

class ZarrToZipConverter(FileConverter):
    input_type = 'zarr' # Specify the input file type
    output_type = 'zip' # Specify the output file type

    def convert(self) -> pathlib.Path:
        folder_name = self.file_path.name
        shutil.make_archive(
            base_name=str(pathlib.Path(self.temp_dir.name) / folder_name),
            format='zip',
            root_dir=str(self.file_path)
        )
        return pathlib.Path(self.temp_dir.name) / f"{folder_name}.zip"

The FileConverter class requires the following attributes to be set:

input_type (required): The file extension of the input file.
output_type (required): The file extension of the output file.

The FileConverter class provides the following attributes for use in the convert method:

self.file_path: The path of the input file.
self.temp_dir: A temporary directory where the output file can be stored.

In the convert method, you can convert the file to the desired format using the provided paths, store it in the temporary directory, and return the path to the converted file.
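As a second illustration of the same pattern, here is a sketch of a converter that turns whitespace-separated .txt files into .csv files. To keep the example runnable without etiket_client installed, it defines a small stand-in FileConverter that mirrors only the attributes described above (file_path, temp_dir, and use as a context manager). That stand-in is an assumption; in a real converter you would import FileConverter from etiket_client instead.

```python
import csv
import pathlib
import tempfile


class FileConverter:
    """Stand-in for the real base class so this sketch runs on its own (assumption)."""

    def __init__(self, file_path):
        self.file_path = pathlib.Path(file_path)
        self.temp_dir = tempfile.TemporaryDirectory()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Remove the temporary directory and everything in it.
        self.temp_dir.cleanup()


class TxtToCsvConverter(FileConverter):
    input_type = 'txt'   # extension of the input file
    output_type = 'csv'  # extension of the output file

    def convert(self) -> pathlib.Path:
        output_path = pathlib.Path(self.temp_dir.name) / f"{self.file_path.stem}.csv"
        with open(self.file_path) as source, open(output_path, "w", newline="") as target:
            writer = csv.writer(target)
            for line in source:
                if line.strip():  # ignore empty lines
                    writer.writerow(line.split())
        return output_path
```

The converted file lives in the temporary directory, so it only exists while the converter's context is open.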

Note

Don’t forget to install the package containing your converter in the environment where the synchronization agent runs (see next section)!

Tip

You can test the converter by running the following code:

import pathlib

from etiket_client.sync.backends.filebase.converters.zarr_to_zip import ZarrToZipConverter

converter = ZarrToZipConverter(pathlib.Path('/Users/user/Desktop/test.zarr'))

with converter:
    output_path = converter.convert()
    print(output_path)
    # Here you can test further if the output is correct
    # When exiting this context manager, the temporary directory will be removed

Packaging Your Own Converter

To use your custom file converter with the synchronization agent, you need to ensure that your converter is installed in the same environment where the sync agent runs. If you’re not familiar with Python packaging, don’t worry – you can just follow these step-by-step instructions.

Organize Your Converter Code

Create a new directory for your converter package. Within this directory, create a subdirectory for your converter module. Here’s an example structure:
```
my_converter_package/
├── my_converters/
│   ├── __init__.py
│   └── my_converter.py
```
- my_converter_package: This is the root folder of your package.
- my_converters: This subfolder will contain your converter code.
- __init__.py: An empty file that tells Python this directory is a package.
- my_converter.py: The file where your converter class is defined.

Place your converter class (e.g., MyConverterClass) in my_converter.py, as shown in the previous section.

Add a pyproject.toml File

In the root directory (my_converter_package/), create a file named pyproject.toml with the following content:

[project]
name = "my_converters"
version = "0.1.0"
description = "My custom converter package"
dependencies = ["numpy"] # Add any dependencies here

[build-system]
requires = ["setuptools>=64.0"]
build-backend = "setuptools.build_meta"

More information on the pyproject.toml file can be found in the Python Packaging User Guide.

Install the Package

Open a terminal and navigate to the root directory of your package:
```
cd /path/to/my_converter_package
```
Replace /path/to/my_converter_package with the actual path to your package directory.

Install the package using pip:
```
pip install .
```
This command installs your converter package into the active Python environment.

Tip

To install the package in editable (development) mode, allowing you to modify the code without reinstalling, use:
```
pip install -e .
```

Reference the Converter in _QH_dataset_info.yaml

In your dataset info file, specify your converter so the synchronization agent knows how to use it:
```
converters:
  my_custom_converter:
    module: my_converters.my_converter
    class: MyConverterClass
```
- module: The dotted path to your converter module. In this example, my_converters.my_converter corresponds to my_converters/my_converter.py.
- class: The name of your converter class defined in my_converter.py.
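
Before restarting the agent, you can check that a module and class pair resolves in the current environment. The sketch below uses `importlib` from the standard library. It illustrates how such a pair can be resolved and is not necessarily how the synchronization agent loads converters; `load_converter_class` is a hypothetical helper.

```python
import importlib


def load_converter_class(module_path, class_name):
    """Import `module_path` and return the attribute named `class_name`."""
    # Raises ModuleNotFoundError if the package is not installed in this environment.
    module = importlib.import_module(module_path)
    # Raises AttributeError if the module does not define the class.
    return getattr(module, class_name)


# Demonstration with a standard-library module, so the sketch runs anywhere.
print(load_converter_class("collections", "OrderedDict"))
```

Once your package is installed, `load_converter_class("my_converters.my_converter", "MyConverterClass")` should return your converter class. An exception points to a wrong module path, a misspelled class name, or a package installed in a different environment.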

Restart the Synchronization Agent

If the synchronization agent is already running, restart it to recognize the newly installed converter.

By following these steps, your custom converter will be available to the synchronization agent, allowing it to process files using your converter as specified.

Note

Dependencies: If your converter requires additional Python packages, you can specify them in the dependencies list of the pyproject.toml file:

[project]
name = "my_converters"
version = "0.1.0"
dependencies = [
    "numpy",
    "pandas",
]

This ensures that required packages are installed alongside your converter.

Note

Testing Your Converter: It’s a good idea to test your converter before using it with the synchronization agent. You can do this by running:

from my_converters.my_converter import MyConverterClass
import pathlib

converter = MyConverterClass(pathlib.Path('/path/to/your/file.zarr'))

with converter:
    output_path = converter.convert()
    print(f'Converted file saved at: {output_path}')
