Generic File Synchronizer
Concept
To synchronize arbitrary folder structures, we created a module that continuously watches a specified folder and can automatically create datasets from it.
The software identifies datasets by the presence of a _QH_dataset_info.yaml file in the folder.
This file specifies the minimum amount of information needed to create a dataset.
Every other file in this (sub)directory is considered a data file and will be added to the dataset.
An example of the folder structure:
main_folder
├── 20240101
│ ├── 20240101-211245-165-731d85-experiment_1
│ │ ├── _QH_dataset_info.yaml
│ │ ├── 01-01-2024_01-01-01.json
│ │ ├── 01-01-2024_01-01-01.hdf5
├── 20240102
│ ├── 20240102-220655-268-455d85-experiment_2
│ │ ├── _QH_dataset_info.yaml
│ │ ├── 02-01-2024_02-02-02.json
│ │ ├── 02-01-2024_02-02-02.hdf5
│ │ ├── analysis
│ │ │ ├── 02-01-2024_02-02-02_analysis.json
│ │ │ ├── 02-01-2024_02-02-02_analysis.hdf5
├── some_other_folder
│ ├── _QH_dataset_info.yaml
│ ├── 01-01-2024_01-01-01.json
If a file is added to any of these folders or a new one is created, the synchronization agent will automatically add it to the dataset.
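The identification rule above can be sketched in a few lines of standard-library Python. This is only an illustration of the rule as described here, not the synchronization agent's actual implementation; the function name find_datasets is hypothetical.

```python
from pathlib import Path

INFO_FILE = "_QH_dataset_info.yaml"


def find_datasets(root: Path) -> dict[Path, list[Path]]:
    """Map each dataset folder under `root` to its data files.

    Illustrative sketch only: a folder is treated as a dataset when it
    contains the info file, and every other file in that folder or its
    subfolders is treated as a data file.
    """
    datasets = {}
    for info_file in sorted(root.rglob(INFO_FILE)):
        folder = info_file.parent
        data_files = sorted(
            p for p in folder.rglob("*")
            if p.is_file() and p.name != INFO_FILE
        )
        datasets[folder] = data_files
    return datasets
```

Calling find_datasets(Path("main_folder")) on the structure above would return three dataset folders, with the analysis files of experiment_2 included in that dataset.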
The _QH_dataset_info.yaml file
When performing measurements, we recommend programmatically writing the _QH_dataset_info.yaml file to the relevant folder.
A minimal example of this file:
version : 0.1
This file can contain the following fields:
version (required): The version of the file format. This ensures compatibility with the software. The current version is 0.1.
dataset_name (optional): The name of the dataset. If not provided, the name of the folder containing the _QH_dataset_info.yaml file is used.
created (optional): The date the dataset was created, in the format YYYY-MM-DDTHH:MM:SS. If not provided, the creation time of the first file in the dataset is used.
description (optional): A description of the dataset.
attributes (optional): A dictionary of attributes to be added to the dataset. Values should be of type str or number.
keywords (optional): A list of keywords to be added to the dataset.
converters (optional): Allows you to specify file converters to automatically convert files within the dataset. To create your own converter, create a class that inherits from the FileConverter class and implements the convert method. For more details, see the section on Creating your own file converters. The syntax for the converters field is as follows:
    converters:
      txt_to_csv_converter: # name of the converter
        module: my_library.location.to.module
        class: MyConverterClass
skip (optional): A list of patterns to skip files (e.g., [*.json, my_image.png]).
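To get a feel for how glob-style skip patterns behave, here is a minimal sketch using Python's fnmatch module. It assumes that patterns are matched against the file path relative to the dataset folder; the exact matching rules of the synchronization agent are not documented here and may differ. The helper name is_skipped is hypothetical.

```python
from fnmatch import fnmatchcase


def is_skipped(relative_path: str, patterns: list[str]) -> bool:
    """Return True if the path matches any of the glob-style patterns.

    Assumption: `relative_path` uses forward slashes and is relative to
    the dataset folder. Note that fnmatch's '*' also matches '/'.
    """
    return any(fnmatchcase(relative_path, pattern) for pattern in patterns)


patterns = ["*.json", "raw_data/*"]
for path in ["settings.json", "raw_data/trace.hdf5", "measurement.hdf5"]:
    print(path, "->", "skipped" if is_skipped(path, patterns) else "kept")
```

With these patterns, settings.json and everything under raw_data/ are skipped, while measurement.hdf5 is kept.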
An example of a more complete _QH_dataset_info.yaml file is:
version : 0.1
dataset_name : 'my_dataset_name'
description : "Description of the experiment I want to do."
attributes:
    'initials' : 'QH'
    'set_up' : 'XLD001'
    'sample' : 'my_sample'
keywords: ['rabi', 'test']
skip: ['*.json', 'raw_data/*']
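Before writing such a file from a measurement script, it can help to check the content against the field descriptions above. The following is a rough sketch that works on an already-parsed mapping (for example, the result of loading the YAML file). The function name check_dataset_info is hypothetical, flagging unknown keys is an assumption made here to catch typos, and the checks performed by the software itself may differ.

```python
from datetime import datetime
from numbers import Number

KNOWN_FIELDS = {"version", "dataset_name", "created", "description",
                "attributes", "keywords", "converters", "skip"}


def check_dataset_info(info: dict) -> list[str]:
    """Return a list of problems found in a parsed dataset info mapping."""
    problems = []
    if "version" not in info:
        problems.append("missing required field 'version'")
    # Assumption: keys outside the documented set are likely typos.
    for key in sorted(set(info) - KNOWN_FIELDS):
        problems.append(f"unknown field {key!r}")
    created = info.get("created")
    if isinstance(created, str):
        try:
            datetime.strptime(created, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            problems.append("'created' is not in the format YYYY-MM-DDTHH:MM:SS")
    for key, value in (info.get("attributes") or {}).items():
        # bool is a Number subclass in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, Number)):
            problems.append(f"attribute {key!r} must be a str or a number")
    for field in ("keywords", "skip"):
        if not isinstance(info.get(field, []), list):
            problems.append(f"{field!r} must be a list")
    return problems
```

An empty list means no problems were found; for the example above, check_dataset_info returns an empty list.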
Programmatically creating the _QH_dataset_info.yaml file
The config file can be programmatically created using the following function from the qdrive package:
from datetime import datetime

from qdrive.dataset import generate_dataset_info
# import a converter
from etiket_client.sync.backends.filebase.converters.zarr_to_zip import ZarrToZipConverter

# for example (note that all these fields are optional):
path = "/Users/stephan/Desktop/test_2/"
generate_dataset_info(path,
                      dataset_name="my_dataset_name",
                      creation=datetime.now(),
                      description="Description of the experiment I want to do.",
                      attributes={"sample": "my_sample"},
                      keywords=["rabi", "test"],
                      converters=[ZarrToZipConverter],
                      skip=["*.json", "raw_data/*"])
Setting Up the synchronization agent
The easiest way to configure the synchronization agent is through the GUI. In case you want to add it programmatically, you can use the following code snippet:
from etiket_client.sync.backends.sources import add_sync_source
from etiket_client.python_api.scopes import get_scope_by_name
from etiket_client.sync.backends.filebase.filebase_sync_class import FileBaseSync
from etiket_client.sync.backends.filebase.filebase_config_class import FileBaseConfigData
import pathlib
data_path = pathlib.Path('/Users/user/Desktop') # update path !!!!!!
if not data_path.exists():
    raise ValueError(f"Data path {data_path} does not exist. Please correct the path.")
# scope to which the data will be uploaded
scope4upload = get_scope_by_name('scope_name') # update scope name !!!!!!
# folder to watch; set server_folder=True when the data is stored on a network drive
config = FileBaseConfigData(root_directory=data_path, server_folder = False)
# give a name to the sync agent (should be locally unique)
add_sync_source('my_sync_source_name', FileBaseSync, config, scope4upload)
Note
When your data is stored on a network drive, the server_folder parameter should be set to True.
Creating your own file converters
To create your own file converter, you need to create a class that inherits from the FileConverter class and implements the convert method.
Here is an example of a converter that converts .zarr files to .zip files:
from etiket_client.sync.backends.filebase.converters.base import FileConverter
import shutil, pathlib
class ZarrToZipConverter(FileConverter):
    input_type = 'zarr' # Specify the input file type
    output_type = 'zip' # Specify the output file type

    def convert(self) -> pathlib.Path:
        folder_name = self.file_path.name
        shutil.make_archive(
            base_name=str(pathlib.Path(self.temp_dir.name) / folder_name),
            format='zip',
            root_dir=str(self.file_path)
        )
        return pathlib.Path(self.temp_dir.name) / f"{folder_name}.zip"
The FileConverter class requires the following attributes to be set:
input_type (required): The file extension of the input file.
output_type (required): The file extension of the output file.
The FileConverter class provides the following attributes for use in the convert method:
self.file_path: The path of the input file.
self.temp_dir: A temporary directory where the output file can be stored.
In the convert method, you can convert the file to the desired format using the provided paths, store it in the temporary directory, and return the path to the converted file.
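As a second illustration of the same pattern, here is a sketch of a converter that turns whitespace-separated .txt files into .csv files. To keep the sketch runnable without etiket_client installed, it defines a small stand-in base class (StandInFileConverter, hypothetical) that only mimics the behaviour described above: a file_path attribute, a temp_dir attribute, and use as a context manager. With the real library you would inherit from FileConverter instead; its internals may differ from this stand-in.

```python
import csv
import pathlib
import tempfile


class StandInFileConverter:
    """Hypothetical stand-in for FileConverter (assumption, not the real class)."""

    def __init__(self, file_path: pathlib.Path):
        self.file_path = pathlib.Path(file_path)
        self.temp_dir = tempfile.TemporaryDirectory()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Remove the temporary directory and everything stored in it.
        self.temp_dir.cleanup()


class TxtToCsvConverter(StandInFileConverter):
    input_type = 'txt'   # extension of the input file
    output_type = 'csv'  # extension of the output file

    def convert(self) -> pathlib.Path:
        output_path = pathlib.Path(self.temp_dir.name) / f"{self.file_path.stem}.csv"
        with open(self.file_path, newline='') as src, \
                open(output_path, 'w', newline='') as dst:
            writer = csv.writer(dst)
            for line in src:
                fields = line.split()
                if fields:  # skip blank lines
                    writer.writerow(fields)
        return output_path
```

The converted file lives in the temporary directory, so it should be read or uploaded before the context manager exits.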
Note
Don’t forget to install the package containing your converter in the environment where the synchronization agent runs (see next section)!
Tip
You can test the converter by running the following code:
import pathlib

from etiket_client.sync.backends.filebase.converters.zarr_to_zip import ZarrToZipConverter

converter = ZarrToZipConverter(pathlib.Path('/Users/user/Desktop/test.zarr'))
with converter:
    output_path = converter.convert()
    print(output_path)
    # Here you can test further if the output is correct
# When exiting this context manager, the temporary directory will be removed
Packaging Your Own Converter
To use your custom file converter with the synchronization agent, you need to ensure that your converter is installed in the same environment where the sync agent runs. If you’re not familiar with Python packaging, don’t worry – you can just follow these step-by-step instructions.
Organize Your Converter Code
Create a new directory for your converter package. Within this directory, create a subdirectory for your converter module. Here’s an example structure:
my_converter_package/
├── my_converters/
│ ├── __init__.py
│ └── my_converter.py
my_converter_package: This is the root folder of your package.
my_converters: This subfolder will contain your converter code.
__init__.py: An empty file that tells Python this directory is a package.
my_converter.py: The file where your converter class is defined.
Place your converter class (e.g., MyConverterClass) in my_converter.py, as shown in the previous section.
Add a pyproject.toml File
In the root directory (my_converter_package/), create a file named pyproject.toml with the following content:
[project]
name = "my_converters"
version = "0.1.0"
description = "My custom converter package"
dependencies = ["numpy"] # Add any dependencies here

[build-system]
requires = ["setuptools>=64.0"]
build-backend = "setuptools.build_meta"
More info on the pyproject.toml file can be found here.
Install the Package
Open a terminal and navigate to the root directory of your package:
cd /path/to/my_converter_package
Replace /path/to/my_converter_package with the actual path to your package directory.
Install the package using pip:
pip install .
This command installs your converter package into the Python environment.
Tip
To install the package in editable (development) mode, allowing you to modify the code without reinstalling, use:
pip install -e .
Reference the Converter in _QH_dataset_info.yaml
In your dataset info file, specify your converter so the synchronization agent knows how to use it:
converters:
  my_custom_converter:
    module: my_converters.my_converter
    class: MyConverterClass
module: The dotted path to your converter module. In this example, my_converters.my_converter corresponds to my_converters/my_converter.py.
class: The name of your converter class defined in my_converter.py.
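A quick way to confirm that the module and class values resolve in the current environment is to import them dynamically. The sketch below uses importlib from the standard library; the helper name load_converter_class is hypothetical, and this is not necessarily how the synchronization agent loads converters.

```python
import importlib


def load_converter_class(module: str, class_name: str) -> type:
    """Import `module` and return the attribute named `class_name`."""
    try:
        mod = importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f"Cannot import {module!r}; is the package installed in this environment?"
        ) from exc
    try:
        return getattr(mod, class_name)
    except AttributeError as exc:
        raise ImportError(f"Module {module!r} has no class {class_name!r}") from exc
```

After installing the package, load_converter_class("my_converters.my_converter", "MyConverterClass") should return your class. An ImportError points to a missing installation or a typo in the dataset info file.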
Restart the Synchronization Agent
If the synchronization agent is already running, restart it to recognize the newly installed converter.
By following these steps, your custom converter will be available to the synchronization agent, allowing it to process files using your converter as specified.
Note
Dependencies: If your converter requires additional Python packages, list them in the dependencies field of pyproject.toml, as shown above. If your package is built with a setup.py file instead, use the install_requires parameter:
from setuptools import setup, find_packages

setup(
    name='my_converter_package',
    version='0.1',
    packages=find_packages(),
    install_requires=[
        'numpy',
        'pandas',
    ],
)
This ensures that required packages are installed alongside your converter.
Note
Testing Your Converter: It’s a good idea to test your converter before using it with the synchronization agent. You can do this by running:
from my_converters.my_converter import MyConverterClass
import pathlib
converter = MyConverterClass(pathlib.Path('/path/to/your/file.zarr'))
with converter:
    output_path = converter.convert()
    print(f'Converted file saved at: {output_path}')