API documentation

Processing Set and Measurement Set v4 API. Includes functions and classes to open, load, convert, and retrieve information from the Processing Set and Measurement Set nodes of the Processing Set DataTree.

open_processing_set(ps_store: str, scan_intents: list | None = None, array_backend: str = 'dask') → DataTree[source]

Creates a lazy representation of a Processing Set (only meta-data is loaded into memory).

Parameters:

ps_store (str) – String of the path and name of the processing set. For example ‘/users/user_1/uid___A002_Xf07bba_Xbe5c_target.lsrk.vis.zarr’.
scan_intents (list | None, optional) – A list of scan_intents to be opened, for example [‘OBSERVE_TARGET#ON_SOURCE’]. The scan_intents in a processing_set_xdt can be seen by calling processing_set_xdt.xr_ps.summary(). By default None, which includes all scan_intents.
array_backend (str, optional) – The array backend to use for the data variables in the processing set. Options are ‘dask’ and ‘xarray’, both of which are lazy. By default ‘dask’, which creates Dask arrays for the data variables.

Returns:

Lazy representation of the processing set (with the default backend, the data arrays of the datasets are represented by Dask arrays).

Return type:

xarray.DataTree
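A minimal sketch of how the two optional arguments might be prepared before calling open_processing_set. The helper build_open_kwargs is hypothetical (it is not part of the API) and only checks values against the options listed above; the path in the commented call is a placeholder, and the call assumes open_processing_set has already been imported.

```python
# Hypothetical helper, not part of the API: validates the optional
# arguments of open_processing_set against the options documented above.
VALID_BACKENDS = ("dask", "xarray")


def build_open_kwargs(scan_intents=None, array_backend="dask"):
    """Return keyword arguments for open_processing_set."""
    if scan_intents is not None and not isinstance(scan_intents, list):
        raise TypeError("scan_intents must be a list of strings or None")
    if array_backend not in VALID_BACKENDS:
        raise ValueError(f"array_backend must be one of {VALID_BACKENDS}")
    return {"scan_intents": scan_intents, "array_backend": array_backend}


open_kwargs = build_open_kwargs(scan_intents=["OBSERVE_TARGET#ON_SOURCE"])

# With open_processing_set imported, the lazy open would then read:
# ps_xdt = open_processing_set("/path/to/processing_set.zarr", **open_kwargs)
```

Validating up front keeps a mistyped backend name or a bare string intent from reaching the open call.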

load_processing_set(ps_store: str, sel_parms: dict = None, data_group_name: str = None, include_variables: list | None = None, drop_variables: list | None = None, load_sub_datasets: bool = True) → DataTree[source]

Loads a processing set into memory.

Parameters:

ps_store (str) – String of the path and name of the processing set. For example ‘/users/user_1/uid___A002_Xf07bba_Xbe5c_target.lsrk.vis.zarr’ for a file stored on a local file system, or ‘s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr/’ for a file in AWS object storage.
sel_parms (dict, optional) –
A dictionary where the keys are the names of the ms_xdt’s (measurement set xarray data trees) and the values are slice_dicts. slice_dicts: A dictionary where the keys are the dimension names and the values are slices.

For example:
```
{
    'ms_v4_name_1': {'frequency': slice(0, 160, None), 'time': slice(0, 100)},
    ...
    'ms_v4_name_n': {'frequency': slice(0, 160, None), 'time': slice(0, 100)},
}
```
By default None, which loads all ms_xdts.
data_group_name (str, optional) – The name of the data group to select. By default None, which loads all data groups.
include_variables (Union[list, None], optional) – The list of data variables to load into memory, for example [‘VISIBILITY’, ‘WEIGHT’, ‘FLAGS’]. By default None, which loads all data variables into memory.
drop_variables (Union[list, None], optional) – The list of data variables to drop from memory, for example [‘VISIBILITY’, ‘WEIGHT’, ‘FLAGS’]. By default None, which does not drop any data variables from memory.
load_sub_datasets (bool, optional) – If True, sub-datasets (for example weather_xds, antenna_xds, pointing_xds, system_calibration_xds …) are loaded into memory. By default True.

Returns:

In memory representation of processing set using xr.DataTree.

Return type:

xarray.DataTree
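The sel_parms structure described above can be built programmatically when the same slices apply to several MSv4s. The following is a sketch only: build_sel_parms is a hypothetical helper, the ms_xdt names and the path are placeholders, and the commented call assumes load_processing_set has already been imported.

```python
def build_sel_parms(ms_names, **dim_slices):
    """Apply the same per-dimension slices to every named MSv4."""
    for dim, sl in dim_slices.items():
        if not isinstance(sl, slice):
            raise TypeError(f"selection for {dim!r} must be a slice")
    # Each ms_xdt gets its own copy of the slice_dict.
    return {name: dict(dim_slices) for name in ms_names}


sel_parms = build_sel_parms(
    ["ms_v4_name_1", "ms_v4_name_2"],  # placeholder ms_xdt names
    frequency=slice(0, 160),
    time=slice(0, 100),
)

# With load_processing_set imported, the in-memory load would then read:
# ps_xdt = load_processing_set("/path/to/processing_set.zarr", sel_parms=sel_parms)
```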

convert_msv2_to_processing_set(in_file: str, out_file: str, partition_scheme: list = [], partition_filter: Callable[[Dict[str, Any]], bool] | None = None, main_chunksize: Dict | float | None = None, with_pointing: bool = True, pointing_chunksize: Dict | float | None = None, pointing_interpolate: bool = False, ephemeris_interpolate: bool = False, phase_cal_interpolate: bool = False, sys_cal_interpolate: bool = False, use_table_iter: bool = False, compressor: BytesBytesCodec = ZstdCodec(level=2, checksum=False), add_reshaping_indices: bool = False, storage_backend: Literal['zarr', 'netcdf'] = 'zarr', parallel_mode: Literal['none', 'partition', 'time'] = 'none', persistence_mode: str = 'w-')[source]

Convert a Measurement Set v2 into a Processing Set of Measurement Set v4.

Parameters:

in_file (str) – Input MS name.
out_file (str) – Output PS name.
partition_scheme (list, optional) – An MS v4 can only contain a single data description (spectral window and polarization setup) and observation mode. Consequently, the MS v2 is partitioned when converting to MS v4. In addition to data description and observation mode, a finer partitioning is possible by specifying a list of partitioning keys. Any combination of the following keys is possible: “FIELD_ID”, “SCAN_NUMBER”, “STATE_ID”, “SOURCE_ID”, “SUB_SCAN_NUMBER”, “ANTENNA1”. “ANTENNA1” is intended as a single-dish specific partitioning option. For mosaics where the phase center is rapidly changing (such as VLA on-the-fly mosaics), partition_scheme should be set to an empty list []. By default, [].
partition_filter (Callable[[Dict[str, Any]], bool], optional) –
Callable predicate taking a single argument, assumed to be an MS v2 partition dictionary at call time. When provided, only partitions for which the predicate returns True are converted. By default None. Example:
```
>>> partition_filter = lambda p: (
...     "OBSERVE_TARGET#ON_SOURCE" in p["OBS_MODE"]
...     and 6 in p["SCAN_NUMBER"]
... )
```
main_chunksize (Union[Dict, float, None], optional) – Defines the chunk size of the main dataset. If given as a dictionary, defines the sizes of several dimensions, and acceptable keys are “time”, “baseline_id”, “antenna_id”, “frequency”, “polarization”. If given as a float, gives the size of a chunk in GiB. By default, None.
with_pointing (bool, optional) – Whether to convert the POINTING subtable into pointing sub-datasets. By default True.
pointing_chunksize (Union[Dict, float, None], optional) – Defines the chunk size of the pointing dataset. If given as a dictionary, defines the sizes of several dimensions, acceptable keys are “time” and “antenna_id”. If given as a float, defines the size of a chunk in GiB. By default, None.
pointing_interpolate (bool, optional) – Whether to interpolate the time axis of the pointing sub-dataset to the time axis of the main dataset. By default False.
ephemeris_interpolate (bool, optional) – Whether to interpolate the time axis of the ephemeris data variables (of the field_and_source sub-dataset) to the time axis of the main dataset. By default False.
phase_cal_interpolate (bool, optional) – Whether to interpolate the time axis of the phase calibration data variables to the time axis of the main dataset. By default False.
sys_cal_interpolate (bool, optional) – Whether to interpolate the time axis of the system calibration data variables (sys_cal_xds) to the time axis of the main dataset. By default False.
use_table_iter (bool, optional) – Whether to use the table iterator to read the main table of the MS v2. This should be set to True when reading datasets with a large number of rows and few partitions. By default False.
compressor (BytesBytesCodec, optional) – The compressor to use when saving the converted data to disk using Zarr, by default ZstdCodec(level=2, checksum=False).
add_reshaping_indices (bool, optional) – Whether to add the tidxs, bidxs and row_id variables to each partition of the main dataset. These can be used to reshape the data back to the original ordering in the MS v2. This is mainly intended for testing and debugging, by default False.
storage_backend (Literal["zarr", "netcdf"], optional) – The on-disk format to use, by default “zarr”. “netcdf” is not yet implemented.
parallel_mode (Literal["none", "partition", "time"], optional) – Choose whether to use Dask to execute the conversion in parallel. By default “none”, in which case the conversion runs serially. The option “partition” parallelises the conversion over the partitions specified by partition_scheme. The option “time” can only be used for phased array interferometers where there are no partitions in the MS v2; instead, the conversion is parallelised along the time dimension and can be controlled by main_chunksize.
persistence_mode (str, optional) – “w” means create (overwrite if exists); “w-” means create (fail if exists); “a” means override all existing variables, including dimension coordinates (create if does not exist). Use “a” if you want to add to an existing Processing Set. By default “w-”.
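The partition_filter predicate shown above can be generalised with a small factory. This is a sketch under assumptions: make_partition_filter is a hypothetical helper, and the partition dictionaries are invented for illustration, with list-valued “OBS_MODE” and “SCAN_NUMBER” entries as the membership tests in the example above suggest.

```python
from typing import Any, Callable, Dict


def make_partition_filter(intent: str, scan_number: int) -> Callable[[Dict[str, Any]], bool]:
    """Build a predicate that keeps partitions with the given intent and scan."""

    def partition_filter(partition: Dict[str, Any]) -> bool:
        # .get() with an empty default makes a missing key mean "no match"
        # instead of raising KeyError.
        return (
            intent in partition.get("OBS_MODE", ())
            and scan_number in partition.get("SCAN_NUMBER", ())
        )

    return partition_filter


keep_target_scan_6 = make_partition_filter("OBSERVE_TARGET#ON_SOURCE", 6)

# Hypothetical partition dictionaries, for illustration only.
partitions = [
    {"OBS_MODE": ["OBSERVE_TARGET#ON_SOURCE"], "SCAN_NUMBER": [6]},
    {"OBS_MODE": ["CALIBRATE_PHASE#ON_SOURCE"], "SCAN_NUMBER": [5]},
    {"OBS_MODE": ["OBSERVE_TARGET#ON_SOURCE"], "SCAN_NUMBER": [8]},
]
kept = [p for p in partitions if keep_target_scan_6(p)]
```

The resulting callable could be passed as partition_filter=keep_target_scan_6; trying it on a few sample dictionaries first is a cheap way to check the predicate before starting a long conversion.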

estimate_conversion_memory_and_cores(in_file: str, partition_scheme: list = []) → tuple[float, int, int][source]

Given an MSv2 and a partition_scheme to use when converting it to MSv4, estimates:

memory (in the sense of the amount expected to be enough to convert)
cores (in the sense of the recommended/optimal number of cores to use to convert)

Note: this function does not currently try to estimate the memory required for sub-datasets such as pointing_xds and system_calibration_xds; instead, it adds a small percentage of the main_xds size to account for them. This can lead to underestimation, especially for MSv2s with small partitions but large pointing or syscal tables. This should not typically be a concern for sufficiently large partitions (from a few GiB up to tens or hundreds of GiB).

Parameters:

in_file (str) – Input MS name.
partition_scheme (list) – Partition scheme as used in the function convert_msv2_to_processing_set()

Returns:

estimated maximum memory required for one partition, maximum number of cores it makes sense to use (number of partitions), suggested number of cores to use (maximum/4 as a rule of thumb)

Return type:

tuple
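One way the returned tuple might be used is to cap the number of workers by the memory that is actually available. The sketch below is illustrative only: plan_conversion is a hypothetical helper, the estimate values are made up, and it assumes each worker converts one partition at a time and that available_memory is expressed in the same unit as the memory estimate.

```python
def plan_conversion(estimate, available_memory):
    """Pick a worker count from a (memory, max_cores, suggested_cores) estimate.

    available_memory must be in the same unit as the memory estimate.
    """
    mem_per_partition, max_cores, suggested_cores = estimate
    if mem_per_partition <= 0:
        raise ValueError("memory estimate must be positive")
    # Number of partitions that fit in memory at the same time.
    fit_in_memory = int(available_memory // mem_per_partition)
    # Never exceed the suggested count or the number of partitions;
    # fall back to one worker if even a single partition may not fit.
    return max(1, min(suggested_cores, max_cores, fit_in_memory))


# Made-up estimate: 12.5 memory units per partition, 16 partitions, 4 suggested cores.
example_estimate = (12.5, 16, 4)
n_workers = plan_conversion(example_estimate, available_memory=40.0)
```

In practice the tuple would come from estimate_conversion_memory_and_cores(in_file, partition_scheme) rather than being written by hand.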

ProcessingSetXdt API

Custom accessor that provides additional Processing Set functionality. Given a Processing Set xarray.DataTree named ps_xdt, the accessor can be used as ps_xdt.xr_ps (xr for xradio and ps for Processing Set).

class ProcessingSetXdt(datatree: DataTree)[source]
Accessor to Processing Set DataTree nodes. Provides Processing Set specific functionality such as producing a summary of the processing set (with information from all its MSv4s), or retrieving combined antenna or field_and_source datasets.

get_combined_antenna_xds() → Dataset[source]

Combine the antenna_xds datasets from all Measurement Sets into a single dataset.

This method concatenates the antenna datasets from each Measurement Set along the ‘antenna_name’ dimension.

Returns:

A combined xarray.Dataset containing antenna information from all Measurement Sets.

Return type:

xarray.Dataset

Raises:

ValueError – If antenna datasets are missing required variables or improperly formatted.

get_combined_field_and_source_xds(data_group_name: str = 'base') → Dataset[source]

Combine all non-ephemeris field_and_source_xds datasets from a Processing Set for a data group into a single dataset.

Parameters:

data_group_name (str, optional) – The data group to process. Default is “base”.

Returns:

combined_field_and_source_xds: Combined dataset for standard (non-ephemeris) fields.

Return type:

xarray.Dataset

Raises:

ValueError – If the field_and_source_xds attribute is missing or improperly formatted in any Measurement Set.

get_combined_field_and_source_xds_ephemeris(data_group_name: str = 'base') → Dataset[source]

Combine all ephemeris field_and_source_xds datasets from a Processing Set for a data group into a single dataset.

Parameters:

data_group_name (str, optional) – The data group to process. Default is “base”.

Returns:

combined_ephemeris_field_and_source_xds: Combined dataset for ephemeris fields.

Return type:

xarray.Dataset

Raises:

ValueError – If the field_and_source_xds attribute is missing or improperly formatted in any Measurement Set.

get_freq_axis() → DataArray[source]

Combine the frequency axes of all Measurement Sets in the Processing Set.

This method aggregates the frequency information from each Measurement Set to create a unified frequency axis for the entire Processing Set.

Returns:

The combined frequency axis of the Processing Set.

Return type:

xarray.DataArray

get_max_dims() → dict[str, int][source]

Determine the maximum dimensions across all Measurement Sets in the Processing Set.

This method examines each Measurement Set’s dimensions and computes the maximum size for each dimension across the entire Processing Set.

For example, if the Processing Set contains two MSs with dimensions (50, 20, 30) and (10, 30, 40), the maximum dimensions will be (50, 30, 40).

Returns:

A dictionary containing the maximum dimensions of the Processing Set, with dimension names as keys and their maximum sizes as values.

Return type:

dict
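The per-dimension maximum described above can be reproduced with plain dictionaries. This is a stand-alone sketch of the idea, not the accessor's implementation; max_dims is a hypothetical function, and the dimension names are placeholders attached to the sizes from the example above.

```python
def max_dims(dim_sizes_per_ms):
    """Return the largest size seen for each dimension name."""
    result = {}
    for dims in dim_sizes_per_ms:
        for name, size in dims.items():
            result[name] = max(size, result.get(name, 0))
    return result


# Dimension names are placeholders; sizes follow the example above.
ms_dims = [
    {"time": 50, "baseline_id": 20, "frequency": 30},
    {"time": 10, "baseline_id": 30, "frequency": 40},
]
combined = max_dims(ms_dims)
```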

get_ms_xdt()[source]

Returns the Measurement Set associated with this Processing Set if there is only a single Measurement Set.

Returns:

The Measurement Set Data Tree object.

Return type:

xr.DataTree

plot_antenna_positions(label_all_antennas: bool = False)[source]

Plot the antenna positions of all antennas in the Processing Set.

This method generates and displays a figure with three scatter plots, displaying the antenna positions in different planes:

X vs Y

X vs Z

Y vs Z

The antenna names are shown on hovering their positions, unless label_all_antennas is enabled.

Parameters:

label_all_antennas (bool, optional) – If True, annotations are shown with the names of every antenna next to their positions. By default False.

Return type:

None

Raises:

ValueError – If the combined antenna dataset is empty or missing required coordinates.

plot_antenna_positions_2d(add_antenna_labels: bool = True, add_antenna_stations: bool = False, add_elevation_plot: bool = True, add_continent_outlines: bool = True, figure_size: tuple = (12, 8))[source]

Plot the antenna positions of all antennas in all measurement sets onto 2D grids.

For connected arrays with a known array center, the antenna positions are plotted in an east-west (X axis) north-south (Y axis) grid in meters, centered on the array center. For disconnected arrays (usually VLBI) or arrays with no known array center, antenna positions are plotted in a longitude and latitude grid in a quasi-Mercator projection. If cartopy is available, non-connected arrays are plotted along with continental outlines. A plot of antenna elevations above sea level is also produced together with the 2D array configuration.

Parameters:

add_antenna_labels (bool, optional) – If ‘True’, annotations are shown with a descriptive label for each antenna, default is ‘True’.

add_antenna_stations (bool, optional) – If ‘True’, add antenna station information to the antenna labels, default is ‘False’.

add_elevation_plot (bool, optional) – If ‘True’, add a plot of the elevations above sea level for each antenna, default is ‘True’.

add_continent_outlines (bool, optional) – If ‘True’ and cartopy is available, add continental outlines for the longitude and latitude plots for disconnected arrays or arrays for which no array center is known.

figure_size (tuple, optional) – Controls the size of the plot in inches.

Return type:

None

Raises:

ValueError – If antenna positions are not in the Geocentric ITRS frame.

plot_phase_centers(label_all_fields: bool = False, data_group_name: str = 'base')[source]

Plot the phase center locations of all fields in the Processing Set.

This method is primarily used for visualizing mosaics. It generates scatter plots of the phase center coordinates for both standard and ephemeris fields. The central field is highlighted in red based on the closest phase center calculation.

Parameters:

label_all_fields (bool, optional) – If True, all fields will be labeled on the plot. Default is False.

data_group_name (str, optional) – The data group to use for processing. Default is “base”.

Return type:

None

Raises:

ValueError – If the combined datasets are empty or improperly formatted.

query(string_exact_match: bool = True, query: str = None, **kwargs) → DataTree[source]

Select a subset of the Processing Set based on specified criteria.

This method allows filtering the Processing Set by matching column names and values or by applying a Pandas query string. The selection criteria can target various attributes of the Measurement Sets such as scan_intents, polarization, spectral window names, etc.

A data group can be selected by name by using the data_group_name parameter. This is applied to each Measurement Set in the Processing Set.

Note

This selection does not modify the actual data within the Measurement Sets. For example, if a Measurement Set has field_name=[‘field_0’, ‘field_10’, ‘field_08’] and ps_xdt.xr_ps.query(field_name=‘field_0’) is invoked, the resulting subset will still contain the original list [‘field_0’, ‘field_10’, ‘field_08’]. The exception is data group selection using data_group_name, which selects only the data variables associated with the specified data group in the Measurement Set.

Parameters:

string_exact_match (bool, optional) – If True, string matching will require exact matches for string and string list columns. If False, partial matches are allowed. Default is True.

query (str, optional) – A Pandas query string to apply additional filtering. Default is None.

**kwargs (dict) – Keyword arguments representing column names and their corresponding values to filter the Processing Set.

Returns:

A new Processing Set DataTree instance containing only the Measurement Sets that match the selection criteria.

Return type:

xr.DataTree
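The difference between exact and partial string matching can be illustrated with a small stand-alone sketch. This is not the library's implementation: matches is a hypothetical function, and reading “partial match” as substring containment is an assumption.

```python
def matches(column_value, wanted, string_exact_match=True):
    """Check one summary cell (a string or list of strings) against wanted values."""
    cell = [column_value] if isinstance(column_value, str) else list(column_value)
    targets = [wanted] if isinstance(wanted, str) else list(wanted)
    if string_exact_match:
        return any(c == t for c in cell for t in targets)
    # Assumed meaning of a partial match: the wanted text is a substring.
    return any(t in c for c in cell for t in targets)


field_names = ["field_0", "field_10", "field_08"]
exact = matches(field_names, "field_1", string_exact_match=True)
partial = matches(field_names, "field_1", string_exact_match=False)
```

Under these assumptions, “field_1” matches nothing exactly but partially matches “field_10”, which is why the default of exact matching is the safer choice when field names share prefixes.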

Examples

>>> # Select all MSs with scan_intents 'OBSERVE_TARGET#ON_SOURCE' and polarization 'RR' or 'LL'
>>> selected_ps_xdt = ps_xdt.xr_ps.query(scan_intents='OBSERVE_TARGET#ON_SOURCE', polarization=['RR', 'LL'])
>>> # Select all MSs with start_frequency greater than 100 GHz and end_frequency less than 200 GHz
>>> selected_ps_xdt = ps_xdt.xr_ps.query(query='start_frequency > 100e9 and end_frequency < 200e9')

summary(data_group_name: str | None = None, first_columns: list[str] = None) → DataFrame[source]

Generate and retrieve a summary of the Processing Set as a data frame.

The summary includes information such as the names of the Measurement Sets, their intents, polarizations, spectral window names, field names, source names, field coordinates, start frequencies, and end frequencies.

To prioritize certain columns depending on the context, the order in which the columns of the data frame are sorted can be modified from the default (first_columns parameter).

Parameters:

data_group_name (str, optional) – The data group to summarize. By default the “base” group is used (if found), or otherwise the first group found.

first_columns (list[str], optional) – List of columns to be sorted first. Currently, the columns included in the summary frame are, in this order: “name”, “scan_intents”, “shape”, “execution_block_UID”, “polarization”, “scan_name”, “spw_name”, “spw_intents”, “field_name”, “source_name”, “line_name”, “field_coords”, “session_reference_UID”, “scheduling_block_UID”, “project_UID”, “start_frequency”, “end_frequency”. For example, with first_columns=[“spw_name”, “scan_name”] one can print these two columns first, followed by all the other columns in their usual order.

Returns:

A DataFrame containing the summary information of the specified data group.

Return type:

pandas.DataFrame
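The effect of first_columns on the column order can be sketched without a Processing Set. reorder_columns below is a hypothetical stand-in for the reordering described above, not the library's code; raising an error for unknown column names is an assumption made for the sketch.

```python
# Column order as listed in the first_columns description above.
DEFAULT_COLUMNS = [
    "name", "scan_intents", "shape", "execution_block_UID", "polarization",
    "scan_name", "spw_name", "spw_intents", "field_name", "source_name",
    "line_name", "field_coords", "session_reference_UID",
    "scheduling_block_UID", "project_UID", "start_frequency", "end_frequency",
]


def reorder_columns(columns, first_columns=None):
    """Move first_columns to the front, keeping the rest in their usual order."""
    first = list(first_columns or [])
    # Assumption for this sketch: unknown names are rejected.
    unknown = [c for c in first if c not in columns]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    return first + [c for c in columns if c not in first]


ordered = reorder_columns(DEFAULT_COLUMNS, ["spw_name", "scan_name"])
```

With a real Processing Set, the equivalent call would be ps_xdt.xr_ps.summary(first_columns=["spw_name", "scan_name"]).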

MeasurementSetXdt API

Custom accessor that provides additional MSv4 functionality. Given an MSv4 xarray.DataTree named ms_xdt, the accessor can be used as ms_xdt.xr_ms (xr for xradio and ms for Measurement Set).

class MeasurementSetXdt(datatree: DataTree)[source]
Accessor to the Measurement Set DataTree node. Provides MSv4 specific functionality such as:

get_partition_info(): produce an info dict with a general MSv4 description including intents, SPW name, field and source names, etc.

get_field_and_source_xds(): retrieve the field_and_source_xds for a given data group.

sel(): select data by dimension labels, for example by data group and polarization.

add_data_group(new_data_group_name: str, new_data_group: dict = {}, data_group_dv_shared_with: str = None) → DataTree[source]

Adds a data group to the MSv4 DataTree, grouping the given data, weight, flag, etc. variables and field_and_source_xds.

Parameters:

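new_data_group_name (str) – Name of the data group to add.

new_data_group (dict) – Dictionary describing the new data group, that is, the data, weight, flag, etc. variables and field_and_source_xds it groups together. By default {}.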

data_group_dv_shared_with (str, optional) – _description_, by default “base”

Returns:

MSv4 DataTree with the new group added

Return type:

xr.DataTree

delete_data_variables(variables: list[str]) → DataTree[source]

Delete data variables from the MSv4 dataset and all data groups.

Parameters:

variables (list of str) – List of data variable names to delete.

Returns:

MSv4 DataTree with specified data variables deleted.

Return type:

xarray.DataTree

get_field_and_source_xds(data_group_name: str = None) → Dataset[source]

Get the field_and_source_xds associated with data group data_group_name.

Parameters:

data_group_name (str, optional) – The data group to process. By default the “base” group is used if found, otherwise the first data group.

Returns:

field_and_source_xds associated with the data group.

Return type:

xarray.Dataset

get_partition_info(data_group_name: str = None) → dict[source]

Generate a partition info dict for an MSv4, with general MSv4 description including information such as field and source names, SPW name, scan name, the intents string, etc.

The information is gathered from various coordinates, secondary datasets, and info dicts of the MSv4. For example, the SPW name comes from the attributes of the frequency coordinate, whereas field and source related information such as field and source names come from the field_and_source_xds (base) dataset of the MSv4.

Parameters:

data_group_name (str, optional) – The data group to process. By default the “base” group is used if found, otherwise the first data group.

Returns:

Partition info dict for the MSv4

Return type:

dict

sel(indexers: Mapping[Any, Any] | None = None, method: str | None = None, tolerance: int | float | Iterable[int | float] | None = None, drop: bool = False, **indexers_kwargs: Any) → DataTree[source]

Select data along dimension(s) by label. This is an alternative to xarray.Dataset.sel that also allows a data group to be selected by name through the data_group_name parameter. For more information on data groups, see the Data Groups section. See xarray.Dataset.sel for parameter descriptions.

Returns:

xarray DataTree with MeasurementSetXdt accessors

Return type:

xarray.DataTree
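Because data_group_name is passed alongside ordinary dimension indexers, a wrapper around sel has to separate the two. The sketch below shows one way that separation could work; split_data_group is hypothetical and is not taken from the library, and rejecting a mix of dict and keyword indexers is an assumption.

```python
def split_data_group(indexers=None, **indexers_kwargs):
    """Separate data_group_name from the dimension indexers."""
    if indexers and indexers_kwargs:
        # Assumption: the dict and keyword styles are not mixed.
        raise ValueError("pass indexers either as a dict or as keywords, not both")
    merged = dict(indexers or indexers_kwargs)
    data_group_name = merged.pop("data_group_name", None)
    return data_group_name, merged


from_kwargs = split_data_group(data_group_name="corrected", polarization="XX")
from_dict = split_data_group({"data_group_name": "corrected", "polarization": "XX"})
```

Both calling styles yield the same pair: the data group name and the remaining label-based indexers.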

Examples

>>> # Select data group 'corrected' and polarization 'XX'.
>>> selected_ms_xdt = ms_xdt.xr_ms.sel(data_group_name='corrected', polarization='XX')
>>> # Select data group 'corrected' and polarization 'XX' using a dict.
>>> selected_ms_xdt = ms_xdt.xr_ms.sel({'data_group_name': 'corrected', 'polarization': 'XX'})