Overview
The Measurement Set v4 (MS v4) schema defines how correlated data
(interferometer visibilities and single-dish spectra) can be represented
in memory using datasets that consist of n-dimensional arrays labeled
with coordinates and meta-information contained in attributes (see
Introduction). The MS v4 implementation differs
from the MS v2 implementation in
casacore primarily in its use
of Python and off -the-shelf packages for data structures and data
serialization, contrasting with casacore’s bespoke C++ approach.
Reference documents consulted for the MS v4 schema design:
MeasurementSet VLBI (Very Long Baseline Interferometry) extensions
ASDM (Astronomy Science Data Model): SDM Tables Short Description
The current MS v4 schema focuses on offline processing capabilities and does not encompass all information present in the ASDM. However, its design allows for future expansion to incorporate additional data as needed (see schema versioning section). It’s important to note that MS v4 is not backward compatible with either MS v2 or MS v3, representing a significant evolution in the data model.
Schema Layout
An xarray dataset conforming to the MSv4 schema contains data for a
single observation, spectral window, polarization setup, observation
mode, processor and beam per antenna (though finer partitioning, such as
splitting by scan or antenna, is allowed if desired). This structure
simplifies the MS v4 data representation relative to the MS v2, enabling
it to be stored as n-dimensional arrays with consistent shapes over time
(rare baseline dropouts are handled by NaN padding). Related datasets
can be grouped together into a Processing Set (ps), which is useful
for processing them together. Importantly, each MS v4 is fully
self-describing. As shown in Figure 1 (a simplified diagram; for full
details, see the Data Model Schema), the MS v4 is
structured as a dataset (xds) of datasets comprising the
correlated_xds along with antenna_xds, pointing_xds,
phase_calibration_xds, weather_xds, system_calibration_xds,
gain_curve_xds, and phased_array_xds, all stored in the
attribute section. The correlated_xds contains the VISIBILITY
(for interferometer data) or SPECTRUM (for single dish data),
UVW, WEIGHT, and FLAGS data variables, along with info
dictionaries in the attributes. The field_and_source_xds is
specifically stored within the attributes of the
VISIBILITY/SPECTRUM data variable.
Figure 1: Processing Set and MS v4 schema layout. Optional datasets are indicated by round brackets. Data variables are capitalized. The suffix ‘_xds’ denotes an xarray Dataset, while the suffix ‘xdt’ denotes an xarray DataTree. ‘_info’ indicates metadata dictionaries.
Data Groups
The correlated_xds can contain multiple copies of
VISIBILITY/SPECTRUM, FLAG, WEIGHT, and UVW data
variables. To maintain the relationship between a set of data variables,
the data_groups dictionary can contain one or more data group
definitions. A data_group dictionary is used with fixed lowercase
keys: ‘correlated_data’, ‘flag’, ‘weight’, ‘uvw’, and other metadata.
The data_groups dictionary is stored as an attribute of the
correlated (main) dataset (ms_xds) and can contain one or more
data_groups. Data variables can be shared between
data_groups or be unique to a specific group. For example, you
might have a ‘base’ group and an ‘imaging’ group:
Example: ms_xds.attrs['data_groups'] = {
'base': {'correlated_data': 'VISIBILITY', 'flag': 'FLAG', 'weight': 'WEIGHT', 'uvw': 'UVW'},
'imaging': {'correlated_data': 'VISIBILITY_CORRECTED', 'flag': 'FLAG', 'weight': 'WEIGHT_IMAGING', 'uvw': 'UVW'}
}
The ‘base’ and ‘imaging’ data_groups share the same ‘flag’ and
‘uvw’ data variables but have different ‘correlated_data’ and ‘weight’
data variables. When creating new versions of data variables, they must
contain the standard name followed by an underscore and a description
chosen by the developer. For example, a new version of VISIBILITY
could be VISIBILITY_PHASE_SHIFTED.
The ms_xds also allows for selecting a data_group. For example:
sub_ms_xds = ms_xds.xr_ms.sel(data_group_name='imaging')
Use Cases
The MS v4 has been designed to satisfy the following use cases:
Radio Interferometry
Single Dish Observations
On-the-fly (OTF) Mosaic Observations
Ephemeris Observations
Heterogeneous Antenna VLBI
Phased Array Stations (PAS)
Phased Array Feeds (PAF)
To ensure that these use cases are met, we have created stakeholder tests that convert MS v2 to MS v4 using the convert_msv2_to_processing_set function. These tests cover a wide selection of instruments:
Interferometer Stakeholder Tests
Dataset |
Telescope |
Mosaic |
Ephemeris |
VLBI |
PAS |
PAF |
|---|---|---|---|---|---|---|
Antennae_North.cal.lsrk.split.ms |
ALMA |
x |
||||
AA2-Mid-sim_00000.ms |
Simulated SKA Mid |
|||||
ska_low_sim_18s.ms |
Simulated SKA Low |
x |
||||
small_meerkat.ms |
MeerKAT |
|||||
small_lofar.ms |
LOFAR |
x |
||||
global_vlbi_gg084b_reduced.ms |
VLBA+EVN |
x |
||||
VLBA_TL016B_split.ms |
VLBA |
x |
||||
ngEHT_E17A10.0.bin0000.source0000_split.ms |
Simulated ngEHT |
x |
||||
venus_ephem_test.ms |
ALMA |
x |
x |
|||
ALMA_uid___A002_X1003af4_X75a3.split.avg.ms |
ALMA |
x |
x |
|||
SNR_G55_10s.split.ms |
VLA |
x |
||||
VLASS3.2.sb45755730.eb46170641.60480.16266136574.split.v6.ms |
VLA (VLASS) |
x |
||||
gmrt.ms |
GMRT |
|||||
askap_59750_altaz_2settings.ms |
ASKAP |
x |
||||
askap_59754_altaz_2weights_0.ms |
ASKAP |
x |
||||
askap_59754_altaz_2weights_15.ms |
ASKAP |
x |
||||
askap_59755_eq_interleave_0.ms |
ASKAP |
x |
||||
askap_59755_eq_interleave_15.ms |
ASKAP |
x |
Single Dish Stakeholder Tests
Dataset |
Telescope |
Multi-Target |
OFF Position |
Ephemeris |
Scan Pattern |
|---|---|---|---|---|---|
sdimaging |
GBT |
relative |
raster |
||
uid___A002_Xced5df_Xf9d9.small |
ALMA |
horizontal |
raster |
||
uid___A002_X1015532_X1926f.small |
ALMA |
relative |
x |
raster |
|
uid___A002_Xe3a5fd_Xe38e.small |
ALMA |
x |
absolute |
raster |
|
uid___A002_Xae00c5_X2e6b.small |
ALMA |
relative |
x |
fast |
OFF position column, possible values:
absolute: OFF position has its own field entry in the FIELD table.
relative: OFF is a fixed position specified as an offset coordinate relative to the ON position.
horizontal: OFF is also an offset position relative to the ON position but specified so that its elevation value is always close to the ON position (meaning that the OFF position is moving with time).
Scan pattern column: raster scan/fast scan.
raster: Scan is, as the name indicates, a raster pattern with a relatively long integration time (0.1-1 sec) and channelized spectral window.
fast: Scan is either a Lissajous or double-circle pattern with short integration time (1 msec) and single-channel spectral window.
Translating between MS v2 and MS v4
MS v2 table |
MS v4 Access Pattern |
Xarray Type |
|---|---|---|
MAIN |
ms_xdt |
datatree |
SCAN |
ms_xdt.scan_name |
coord |
POLARIZATION |
ms_xdt.polarization |
coord |
SPECTRAL_WINDOW |
ms_xdt.frequency |
coord |
DOPPLER |
ms_xdt.frequency |
coord |
FIELD |
ms_xdt.field_and_source_base_xds |
dataset |
EPHEMERIDES |
ms_xdt.field_and_source_base_xds |
dataset |
SOURCE |
ms_xdt.field_and_source_base_xds |
dataset |
OBSERVATION |
ms_xdt.observation_info |
dict |
PROCESSOR |
ms_xdt.processor_info |
dict |
ANTENNA |
ms_xdt.antenna_xds |
dataset |
FEED |
ms_xdt.antenna_xds |
dataset |
PHASED_ARRAY |
ms_xdt.phased_array_xds |
dataset |
POINTING |
ms_xdt.pointing_xds |
dataset |
SYSCAL |
ms_xdt.system_calibration_xds |
dataset |
WEATHER |
ms_xdt.weather_xds |
dataset |
GAIN_CURVE |
ms_xdt.gain_curve_xds |
dataset |
PHASE_CAL |
ms_xdt.phase_calibration_xds |
dataset |
EARTH_ORIENTATION |
ms_xdt.earth_orientation_xds |
dataset |
Summary of Key Changes
The following changes and deprecations are relative to MS v2. Some of the changes from MS v3 are adopted.
An MS v4 is fully self-describing and contains the information for a single observation, spectral window, polarization setup, observation mode, processor, and beam per antenna.
Data is stored in Datasets of labeled n-dimensional arrays (called data variables) instead of tables.
The table concept of rows has been replaced by relevant dimensions. For example, the
VISIBILITYcolumn in theMAIN tableof MS v2 is now an n-dimensional array with dimensions time x baseline x frequency x polarization (row has been split into time x baseline).Most keys that used to be implicit numbered indices have been changed to descriptive names. This improves code readability, allows for sub-selecting data without reindexing, and enables easy data combination. For example,
antenna_nameis used instead ofantenna_id.The concept of data description (DDI) is deprecated and replaced by
spectral_window_nameandpolarization_setup.Versioning of the
VISIBILITY/SPECTRUM,WEIGHT,UVW, andFLAGdata variables is done using data groups.The JPL Horizons ephemerides was used to create the ephemeris schema instead of the MS v2.
Redefinition of
WEIGHTto beWEIGHT_SPECTRUM. Contemporary interferometers are spectral machines; consequently, weights can be distinctive per channel. TheWEIGHTdata variable has the same shape as theVISIBILITY/SPECTRUMdata variable.Field, source, and ephemeris data have been combined into a single dataset.
Antenna and feed data have been combined into a single dataset (an MS v4 can only have one feed type per antenna).
Deprecation List MS v2
MAIN TableWEIGHT_SPECTRUM(Now calledWEIGHT).FLAG_CATEGORY:FLAGversioning can now be done with data groups, and FlagArray supports bit fields nowFLAG_ROW: The purpose ofFLAG_ROWwas to avoid having to renumber rows when e.g. removing an antenna. This is not necessary with data variables that are labeled since implicit indices are no longer used.SIGMA: This is redundant sinceWEIGHTis always considered to be1/SIGMA^2.BASELINE_REFERENCE: This is no longer used in VLBI.
FIELD TableFIELD_DELAY_CENTER: No use case has been found.EPHEMERIS_ID: Only a single ephemeris dataset per MS v4 is permitted.
ANTENNA Table:OFFSET: No use case has been found.
FEED Table:POL_RESPONSE,BEAM_OFFSET,POSITION: Not effectively used and will be replaced by a beam model dataset which is under development.
POINTING Table:TARGET,SOURCE_OFFSET,ON_SOURCE: Redundant information contained infield_and_source_xds.TRACKING: No use case has been found.
Delving further
To delve further into the Measurement Set v4,
The tutorial section demonstrates the schema and API usage.
This Jupyter notebook (.ipynb) can be run interactively via the Google Colab link at the top.
You can also download and run notebooks locally after installing XRADIO via pip or conda.
There are multiple guides
Examples show how different telescopes’ data can be represented.
If your telescope isn’t represented, open an issue and attach an example Measurement Set v2 (preferably 10MB or smaller).
Examine the MSv4 schema and API documentation.
The MSv4 schema went through a review process in late 2024 / early 2025. More details and discussion on the schema design are in the MSv4 Review Panel report and the discussion issues on the XRADIO GitHub repository (with label “MSv4 Review”).
Sub-package Layout
The XRADIO architecture with the measurement set sub-package expanded is
shown in Figure 2 (overview of XRADIO
architecture). The public API is contained in
the top-level .py files of the measurement_set sub-package (API
documentation).
Figure 2: XRADIO Architecture.
Figure 3 summarizes the available and planned future functionality. The sub-package currently allows direct opening of data from zarr and will support ASDM pre- and post- WSU (ALMA Wide Band Sensitivity Upgrade) and possibly NetCDF in the future. The WSU ASDM is an update of the ASDM implementation required by ALMA WSU where the spectral windows are no longer interleaved. Both zarr and NetCDF support cloud storage, and open_processing_set and load_processing_set support accessing data on Amazon S3. There are no plans to support direct access to MS v2, since the repeated reordering of data to MS v4 has been found to be computationally prohibitively expensive. Rather, data should be converted using the convert_msv2_to_processing_set.
Figure 3: Measurement Set Sub-package Current and Planned Functionality.
Future work
Schema-related:
Earth orientation dataset (VLBI)
Interferometer model dataset (VLBI)
Beam model dataset
Software-related:
Expand and refine Processing Set methods
Add ASDM and WSU-ASDM backend
Possibly add a NetCDF backend
Replace python-casacore backend with a more lightweight (possibly pure Python) package