Using datasets with qDrive
==========================

Each qDrive dataset is uniquely identified by a UUID (Universally Unique Identifier).
A dataset also carries metadata that helps you sort and filter your data; some of it is user defined (the `Data Identifiers`) and some is standard (e.g. acquisition time, ranking, ...).
A dataset can hold multiple files where the actual data is stored.
While files can be in any format, the dataQruiser app supports rendering the following file types: netCDF4 files, JSON files, code/text files and pictures.

Log in with Python
------------------

You can log into qDrive using one of two methods: through a graphical user interface (GUI) or via a Python command.

**Through the Graphical User Interface**: from the environment where you installed qdrive, run the following command to launch the synchronization GUI.

.. code-block:: console

    python -c "import qdrive; qdrive.launch_GUI()"

**Via a Python Command**: execute the following command in the Python kernel/console/jupyter-notebook from the environment where ``qdrive`` is installed:

.. code-block:: python

    import qdrive
    qdrive.authenticate_with_console()

.. note::

    Both methods establish a persistent login session, so you won't need to log in again unless you log out. To log out:

    * Use the command ``qdrive.logout()`` in Python, or
    * Click the log-out button (symbol) in the GUI.

Creating new datasets
---------------------

An empty dataset can be created using the following commands:

.. code-block:: python

    from qdrive import dataset

    # Minimal example - creates dataset with just a name
    ds_1 = dataset.create('my_dataset_name')

    # Complete example - creates dataset with extended metadata
    ds_2 = dataset.create(
        'Qubit 2 T2*',
        description='T2* measurement of qubit 2, RF power is also applied to qubit 1',
        scope_name='2Q SC processor A14',
        keywords=['calibration'],
        attributes={
            'set_up': 'Fridge B256',
            'sample_id': 'Q7-R3'
        },
        alt_uid='exp20240115-124501'
    )

Every dataset is automatically assigned a UUID (Universally Unique Identifier) that can be accessed after creation:

.. code-block:: python

    # Print the UUID of the newly created dataset
    print(f"Dataset UUID: {ds_2.uuid}")

    # Print the alternative identifier (if provided)
    if hasattr(ds_2, 'alt_uid') and ds_2.alt_uid:
        print(f"Alternative identifier: {ds_2.alt_uid}")

**Parameters Explained:**

* **name** (required): A descriptive name for your dataset.
* **description** (optional): A longer description of the dataset contents.
* **scope_name** (optional): The scope where the dataset will be stored.
* **keywords** (optional): A list of tags associated with the dataset (e.g. calibration, tuning, ...).
* **attributes** (optional): A dictionary of key-value pairs providing structured metadata.
* **alt_uid** (optional): An alternative identifier that can be used instead of the UUID for accessing the dataset. In most cases this is not needed.

.. note::

    * If no ``scope_name`` is provided, the dataset will be created in the default scope (see :doc:`here `).
    * The ``scope_name`` parameter accepts any of the following (all three forms are illustrated in the sketch at the end of this section):

      * A scope name as a string (e.g., ``'quantum_project'``)
      * A scope object, e.g. returned by the ``get_scopes()`` function in :doc:`scopes `.
      * A scope UUID (e.g., ``uuid.UUID('12345678-1234-5678-1234-567812345678')``)

.. tip::

    Use descriptive names and consistent attributes across related datasets to make them easier to find and filter later. The attributes and keywords are fully searchable in both the Python API and the dataQruiser application.
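As a quick illustration, here is a minimal sketch of the three accepted ``scope_name`` forms; the scope name and UUID below are the placeholder values from the note above, not real scopes:

.. code-block:: python

    import uuid
    from qdrive import dataset

    # by scope name (placeholder name)
    ds_a = dataset.create('sweep 1', scope_name='quantum_project')

    # by scope UUID (placeholder value)
    ds_b = dataset.create(
        'sweep 2',
        scope_name=uuid.UUID('12345678-1234-5678-1234-567812345678')
    )

    # by scope object: pass an object returned by get_scopes()
    # (see the scopes page for how to obtain one)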
Loading existing datasets
-------------------------

Load the data from a dataset using the following commands (you can copy your dataset uuid with one click from the dataQruiser app):

.. code-block:: python

    from qdrive import dataset

    dsq = dataset('d30eec8071014f99b09cc3dfce60187d')  # this is the dataset uuid or the alternative identifier
    print(dsq)

Printing the dataset lets you inspect its contents. A typical dataset will look like this:

.. code-block:: text

    Contents of dataset :: single shot - sensor tuning
    ==================================================
    uuid :: 59c40af3-cef3-49aa-8747-64707a9b080a
    Alternative identifier :: 1695914164228175126
    Scope :: my_scope
    Ranking :: 0

    Attributes ::
        set-up : my_setup
        sample : my_sample

    Files ::
        name            type                  selected version number (version_id)   Maximal version number
        --------------- --------------------  --------------------------------------  ------------------------
        measurement     FileType.HDF5_NETCDF  0 (1719573749579)                       0
        snapshot        FileType.JSON         0 (1719573749579)                       0
        analysis        FileType.HDF5_NETCDF  1 (1719573831702)                       1
        fit_params      FileType.JSON         1 (1719573831822)                       1

Several fields can be observed:

* **uuid**: a universally unique identifier of the dataset. Each dataset that is created has its own unique identifier, which can also be retrieved in the dataQruiser app.
* **Alternative identifier**: the identifier assigned to the dataset by your data-acquisition software, e.g. the core-tools uid or qcodes GUID.
* **Scope**: the scope this dataset is part of. A scope is usually the name of a long-standing project, to which data from several users can belong.
* **Ranking**: an integer value indicating how much you like your dataset, useful for filtering in the dataQruiser app.
* **Attributes**: searchable key-value fields which give further structure to the scope. In this case the set-up and sample are set as attributes.
* **Files**: a dataset can contain several files, e.g. a measurement, raw text, or a Python script. Each file can have several versions (see further below).

Modifying Dataset Metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~

After creating a dataset, you can modify its metadata properties as needed:

.. code-block:: python

    # Load an existing dataset
    ds = dataset("59c40af3-cef3-49aa-8747-64707a9b080a")

    # Update the description
    ds.description = 'T2* measurement of qubit 2 with RF power applied to qubit 1'

    # Update keywords - replacing all existing keywords
    ds.keywords = ['calibration']

    # Add a new keyword without replacing existing ones
    ds.keywords.append('tuning')

    # Update attributes - replacing all existing attributes
    ds.attributes = {
        'set_up': 'Fridge B256',
        'sample_id': 'Q7-R3'
    }

    # Add or update a single attribute
    ds.attributes['sample_id'] = 'Q7-R4'

    # Set the ranking (useful for filtering in dataQruiser or the search_datasets function)
    ds.ranking = 1  # 1 = like, 0 = neutral, -1 = dislike/hidden

.. note::

    All changes made to the dataset are first made locally and then synchronized with the server (it can take a few seconds before the changes become available on other devices).
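As mentioned above, ``dataset()`` accepts either the UUID or the alternative identifier. A quick sketch reusing the identifiers from the example output above:

.. code-block:: python

    from qdrive import dataset

    # both calls load the same dataset shown in the example output above
    ds_by_uuid = dataset('59c40af3-cef3-49aa-8747-64707a9b080a')
    ds_by_alt_id = dataset('1695914164228175126')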
Searching for datasets
----------------------

The following command can be used to search for datasets:

.. code-block:: python

    from qdrive.dataset.search import search_datasets

    search_result = search_datasets(search_query='my_coolest_dataset')

    # iterate over the search result
    for ds in search_result:
        print(ds)

    # get the first dataset
    ds = search_result.first

The ``search_datasets`` function returns a list of datasets that match the search query.
This function can handle the following arguments:

* **search_query** : A string that is used to search for datasets. The search query can be a dataset name, a dataset UUID, or a dataset alternative identifier.
* **attributes** : Additional attributes to filter datasets, for example ``{'set_up' : 'my_setup'}`` will return only datasets with the attribute set-up equal to 'my_setup'. It is also possible to get results from multiple set-ups by using a list of values, for example ``{'set_up' : ['my_setup1', 'my_setup2']}``.
* **ranking** : The ranking score to filter datasets. Defaults to 0. Hidden datasets have a ranking of -1. The search queries for datasets with a ranking greater than or equal to the specified value.
* **start_date** : The start date to filter datasets. Only datasets collected after this date will be included in the results (e.g. ``datetime.datetime(2024, 12, 1)``).
* **end_date** : The end date to filter datasets. Only datasets collected before this date will be included in the results.
* **scopes** : A list of scopes to filter datasets. A scope can be represented by its name (`str`), its UUID (`uuid.UUID`), or the `scope` object. More information on scopes can be found :doc:`here `.

A sketch combining several of these arguments is shown after the warning below.

.. warning::

    The search query can take considerable time when iterating over a large number of datasets (e.g., when iterating over every dataset in the whole scope). Note that only the datasets that are actually needed are loaded.
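Putting several of these arguments together, a minimal sketch; the attribute, scope, and date values are placeholders you would replace with your own:

.. code-block:: python

    import datetime
    from qdrive.dataset.search import search_datasets

    # placeholder filter values - substitute your own set-up, scope and dates
    results = search_datasets(
        search_query='T2*',
        attributes={'set_up': 'my_setup'},
        ranking=1,  # only datasets ranked 1 or higher
        start_date=datetime.datetime(2024, 12, 1),
        end_date=datetime.datetime(2025, 1, 1),
        scopes=['my_scope'],
    )

    for ds in results:
        print(ds)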
Working with Files
------------------

The dataset object allows you to manage files within a dataset. In this section, we will cover:

* Adding new files
* Inspecting and selecting different file versions
* Using file type-specific methods to easily access data within the files

Adding New Files
~~~~~~~~~~~~~~~~

You can add files to a dataset by assigning them directly. The following example demonstrates how to add files from various sources to a dataset:

.. code-block:: python

    from qdrive import dataset
    from pathlib import Path
    import numpy as np
    import xarray as xr

    new_dataset = dataset.create('my_dataset_name')

    # add a file from a path
    new_dataset['my_file'] = Path('C:/location/of/file.extension')

    # add a file from a python object (list, dict, numpy array, xarray)
    new_dataset['my_array'] = np.linspace(0, 10, 100)
    new_dataset['my_json'] = {'a': 1, 'b': [1, 2, 3]}
    new_dataset['my_xarray'] = xr.Dataset({'a': (['x'], np.arange(10))})

    # add the current script file
    new_dataset['my_script'] = __file__

.. note::

    Assigning a file with the same key multiple times will create a new version of that file.

Inspecting and Selecting Different File Versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each file within a dataset can have multiple versions, which can be inspected with the following command:

.. code-block:: python

    print(my_dataset['my_file'].version_info)

This will display information like the following:

.. code-block:: text

    File object information
    =======================
    Name : my_file
    Selected File version : 1720711563075

    File versions (3) :
        1720711517406    (created on 11/07/2024 17:25:17)
        1720711551649    (created on 11/07/2024 17:25:51)
      * 1720711563075    (created on 11/07/2024 17:26:03)

By default, accessing a file returns its latest version. The selected version is marked by an asterisk (*).

To access a specific version of a file by its unique version ID, use:

.. code-block:: python

    my_dataset['my_file'].version_id(1720711517406)

.. tip::

    You can also access file versions by their position in the version history:

    .. code-block:: python

        my_dataset['my_file'].version(0)  # first version
        my_dataset['my_file'].version(1)  # second version
        my_dataset['my_file'].version(2)  # third version
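Tying the note and the versioning commands together, a minimal sketch; ``new_dataset`` is the dataset created above and ``fit_params`` is a hypothetical file key:

.. code-block:: python

    # each assignment to the same key creates a new version (see the note above)
    new_dataset['fit_params'] = {'f0_GHz': 5.123}
    new_dataset['fit_params'] = {'f0_GHz': 5.127}  # creates a second version

    # inspect the version history and select the first version
    print(new_dataset['fit_params'].version_info)
    first_version = new_dataset['fit_params'].version(0)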
Using file type-specific methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Numerical data**

For storing numerical data, we recommend using ``.hdf5`` files formatted according to the ``NETCDF4`` standard. While it's possible to assign an HDF5 file directly, a more user-friendly approach is to work with xarray datasets. When assigning an xarray dataset, it is automatically converted to an HDF5 file and uploaded to the cloud.

.. tip::

    Xarray is a library that allows for easy labeling of NumPy arrays and supports defining relationships between different dimensions. Assigning data in this way also enables automatic plotting of datasets in the dataQruiser app.

Example:

.. code-block:: python

    import xarray as xr
    import numpy as np

    # Create an xarray dataset with two variables, y1 and y2, and a shared coordinate, x
    x_data = np.linspace(0, 30, 100)
    y1_data = np.sin(x_data)
    y2_data = np.cos(x_data)

    xr_ds = xr.Dataset(
        {
            "y1": (["x"], y1_data, {"units": "mV"}),
            "y2": (["x"], y2_data, {"units": "mV"}),
        },
        coords={
            "x": ("x", x_data, {"units": "s"})
        }
    )

    # Optionally, to link y1 and y2 for joint plotting, you can use a temporary solution:
    xr_ds["y1"].attrs["__join_plot"] = "y2"
    xr_ds["y2"].attrs["__join_plot"] = "y1"
    # Note: This is a workaround and will be improved in future releases.

In this example, we create a dataset ``xr_ds`` with two variables, ``y1`` and ``y2``, each associated with the coordinate ``x``. The units of each variable and coordinate can be specified for clarity.

HDF5 files saved in NETCDF4 format can be accessed in multiple ways, allowing for flexibility depending on your analysis needs:

.. code-block:: python

    from qdrive import dataset

    dsq = dataset("my_dataset_uuid_or_alt_id")

    xarray_ds = dsq["measurement"].xarray    # load as an xarray Dataset (recommended option)
    pandas_ds = dsq["measurement"].pandas    # load as a pandas DataFrame
    hdf5_handle = dsq["measurement"].hdf5    # load as an h5py File
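To go from a loaded xarray dataset to plain NumPy arrays (for fitting, plotting, etc.), you can use ``to_numpy()``; a short sketch, assuming the ``measurement`` file contains the ``y1`` variable from the earlier example:

.. code-block:: python

    # 'xarray_ds' is the object loaded via dsq["measurement"].xarray above
    y1 = xarray_ds['y1'].to_numpy()
    print(y1.shape, y1.mean())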
"y2": (["x"], y2_data, {"units": "unit_y2"}), .. }, .. coords={ .. "x": ("x", x_data, {"units": "unit_x"}) .. } .. ) .. # Add the xarray dataset to the qdrive dataset as file called 'my_data' .. dsq['my_data'] = xr_ds .. # A new version of the file `my_data` can be added to the qdrive dataset by assigning a new xarray dataset .. # dsq['my_data'] = xr_ds_new .. This command adds the xarray dataset as an HDF5_NETCDF file to the qdrive dataset. .. The new file is automatically synchronized to the cloud and will be immediately available in the dataQruiser app and through python from all logged in devices. .. JSON files .. ~~~~~~~~~~ .. **Access a JSON file** with the following commands: .. .. code-block:: python .. dsq['fit_params'].json .. JSON files are represented in Python as dictionaries or lists. .. **Add a new JSON file** to a qdrive dataset as follows: .. .. code-block:: python .. dsq['my_json_data'] = {'item1' : "value1"} .. # reassigning a new object, creates a new version of the 'my_json_data' file. .. dsq['my_json_data'] = ['item1', 'item2'] .. # adding items to an existing json file creates a new version of the 'my_json_data' file. .. dsq['my_json_data']['item2'] = 'value2' .. NUMPY raw files .. ~~~~~~~~~~~~~~~ .. While it is recommended to store numerical data in the HDF5 format, .. raw NumPy arrays can also be stored easily in the dataset. For example: .. .. code-block:: python .. my_dataset['new_data'] = np.zeros([100,100]) .. numpy_array = my_dataset['new_data'].raw #access raw numpy array .. # more options of providing numpy arrays .. my_dataset['new_data'] = [np.ones([100,100]), np.zeros([100,100])] .. numpy_array_list = my_dataset['new_data'].raw .. my_dataset['new_data'] = { "x" : np.ones([100,100]), .. "y" : np.zeros([100,100]), .. "z" : np.zeros([100,100])} .. numpy_array_dict = my_dataset['new_data'].raw["x"] .. Script Files .. ~~~~~~~~~~~~ .. To automatically upload a python script file to a qdrive dataset each time the script is run you can add to your script the following: .. .. code-block:: python .. from qdrive import dataset .. dsq = dataset.create('my_dataset') # you can also load an existing dataset with its uuid using: dsq = dataset('uuid') .. dsq['my_script'] = __file__ .. If you want to upload a currently used Jupyter-notebook you can run this in a cell, specifying the notebook name (or path): .. .. code-block:: python .. dsq['my_notebook'] = 'my_notebook_name.ipynb' .. Make sure your file is saved before you upload it, such that all changes are reflected. .. .. _file-versioning: .. File versioning .. ~~~~~~~~~~~~~~~ .. Several version of the same file can be created by: .. .. code-block:: python .. # from file path .. dsq['my_file'] = file_path_1 .. dsq['my_file'] = file_path_2 # creates a new version of my_file .. # from json object (list or dict) .. dsq['my_json'] = {'a': 1, 'b':[1,2]} .. dsq['my_json'] = {'a': 1, 'b':[1,2,3]} # creates a new version of my_json .. # from xarray .. dsq['my_xr'] = xr.DataArray(np.arange(10)) .. dsq['my_xr'] = xr.DataArray(np.arange(20)) # creates a new version of my_xr .. In this cases, two versions of the same files will be created. .. The different versions of a file can be inspected using the following command: .. .. code-block:: python .. print(my_dataset['my_file'].version_info) .. .. TODO remove print statement fom above once this is solved https://github.com/qEncoder/eTiKeT-testing/issues/23 .. which return the following information : .. .. code-block:: .. File object information .. ======================= .. 
**Adding source files**

To automatically upload a Python script to a qDrive dataset each time it's run, add the following to your script:

.. code-block:: python

    from qdrive import dataset
    from pathlib import Path

    # Create a new dataset or load an existing one using its UUID
    dsq = dataset.create('my_dataset')  # or use: dsq = dataset('uuid')

    # Upload the current script
    dsq['my_script'] = Path(__file__)

If you're working in a Jupyter notebook, you can upload the notebook file by running the following in a cell, specifying the notebook's name or path:

.. code-block:: python

    dsq['my_notebook'] = Path('my_notebook_name.ipynb')

.. note::

    Be sure to save your file before uploading to ensure all changes are included.