How to extract the Gaia ancillary data using datalink - Gaia Users
DataLink Service
Authors: Héctor Cánovas and Jos de Bruijne
In each Gaia data release, the key parameters of all sources are stored in the gaia_source table, which contains the (mean) astrometric, photometric, and radial-velocity data as well as astrophysical parameters. This table, along with complementary tables for, for instance, variable stars or solar system objects, is accessible by means of the IVOA-compliant TAP+ table access protocol, which allows users to explore astronomical datasets stored in relational databases using the ADQL query language. In addition to the gaia_source table, Gaia DR3 includes vast amounts of non-tabular data such as mean spectra, epoch photometry, and Monte Carlo Markov Chain samples for millions of sources (while Gaia DR4 will include epoch astrometry and epoch photometry for the whole sample plus billions of mean and epoch spectra). Storing these non-tabular datasets as plain tables in a monolithic, relational database is impractical. Instead, these products are hosted by a dedicated service, designed to handle massive data requests, that is accessible via the DataLink protocol. DataLink is a data access protocol compliant with the IVOA architecture that provides a linking mechanism between datasets offered by different services. In practice, it can be seen and used as a web service providing the list of additional data products available for each object outside the main catalogue(s).
Since the Archive upgrade to version 2.14, the VOTables generated by the Archive contain a new resource that facilitates access to the DataLink products for IVOA-compliant clients (like TOPCAT):
</RESOURCE>
<RESOURCE type="meta" utype="adhoc:service" name="ancillary">
<DESCRIPTION>Retrieve DataLink file containing ancillary data for source</DESCRIPTION>
<PARAM name="standardID" datatype="char" arraysize="*" value="ivo://ivoa.net/std/DataLink#links-1.0"/>
<PARAM name="accessURL" datatype="char" arraysize="*" value="https://gea.esac.esa.int/data-server/datalink/links"/>
<PARAM name="contentType" datatype="char" arraysize="*" value="application/x-votable+xml;content=datalink"/>
<GROUP name="inputParams">
<PARAM name="ID" datatype="char" arraysize="*" value="" ref="DESIGNATION"/>
</GROUP>
</RESOURCE>
The entry point to the DataLink server is indicated by the "accessURL" parameter. To invoke the service and find out which resources are associated with a given source, the entry point must be combined with the target ID as follows:
https://gea.esac.esa.int/data-server/datalink/links?ID=Gaia+DR3+30343944744320
The output is an XML file intended not for humans but for IVOA-compliant clients. Opening this file with TOPCAT reveals the following content:
Figure 1: Content of the xml file generated by the DataLink service when invoked as explained above.
This file contains the URLs that give access to the DataLink products associated with the target source. For information about how to access these products from the Gaia Archive web interface and programmatically, please see the DataLink: Access from the Archive GUI and the Command line access: DataLink tutorials, respectively. The structure and content of the DataLink products are described in the Datamodel Chapter in the Gaia DR3 documentation and in the DataLink products serialisation tutorial.
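The way the entry point and the target ID are combined can be sketched in a few lines of Python. This is only an illustration: the helper name `datalink_url` is hypothetical, the XML below is a well-formed copy of the resource shown above (without the leading closing tag), and a real VOTable would normally declare an XML namespace that the lookup would have to take into account.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Well-formed copy of the "ancillary" resource shown above (no namespace, for simplicity).
RESOURCE_XML = """
<RESOURCE type="meta" utype="adhoc:service" name="ancillary">
  <DESCRIPTION>Retrieve DataLink file containing ancillary data for source</DESCRIPTION>
  <PARAM name="standardID" datatype="char" arraysize="*" value="ivo://ivoa.net/std/DataLink#links-1.0"/>
  <PARAM name="accessURL" datatype="char" arraysize="*" value="https://gea.esac.esa.int/data-server/datalink/links"/>
  <PARAM name="contentType" datatype="char" arraysize="*" value="application/x-votable+xml;content=datalink"/>
  <GROUP name="inputParams">
    <PARAM name="ID" datatype="char" arraysize="*" value="" ref="DESIGNATION"/>
  </GROUP>
</RESOURCE>
""".strip()


def datalink_url(resource_xml, designation):
    """Hypothetical helper: read 'accessURL' from the resource and append the ID."""
    root = ET.fromstring(resource_xml)
    access_url = None
    # Only the PARAM elements that are direct children of RESOURCE are inspected.
    for param in root.findall("PARAM"):
        if param.get("name") == "accessURL":
            access_url = param.get("value")
    if access_url is None:
        raise ValueError("no accessURL parameter found in the resource")
    # urlencode() encodes the blanks of the designation as '+'.
    return f"{access_url}?{urlencode({'ID': designation})}"


url = datalink_url(RESOURCE_XML, "Gaia DR3 30343944744320")
print(url)
```

The printed URL should match the one given in the text above; it can then be opened with any IVOA-compliant client.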
DataLink: Access from the Archive web interface
Authors: Héctor Cánovas, Jos de Bruijne, and Alcione Mora
Gaia DR3 includes vast amounts of non-tabular data such as high- and low-resolution (mean) spectra, epoch photometry, and Monte Carlo Markov Chain samples for millions of sources. These products are hosted by a dedicated service designed to handle massive data requests that is accessible via the DataLink protocol. This intermediate-level tutorial introduces the concepts needed to access and retrieve these products using the Gaia Archive web interface via its Advanced (ADQL) form. The complementary DataLink: command line access and DataLink: Python Access tutorials describe the programmatic access to these products using the Unix curl command-line utility and the Python package Astroquery.Gaia, respectively, while the DataLink products serialisation tutorial describes the structure of these products. In case of difficulties following this tutorial, please consult the DataLink service and Advanced (ADQL) tab tutorials.
Tutorial content:
1. How it works
The DataLink service searches for the DataLink product(s) associated with a list of Gaia designation(s) or, alternatively, a combination of Gaia source ID(s) and Gaia data release. By default, the server searches for the products associated with Gaia DR3. Users interested in searching the DataLink products from Gaia DR2 (which only contains epoch photometry data) simply have to select "Gaia DR2" in the "Data release" dropdown menu that is indicated by the inclined arrow in Fig. 1.
2. Basic use cases
One of the simplest use cases that can be defined is: "I want to search for the DataLink products associated with the output of this (Gaia DR3) query". In a first attempt, we may run a simple ADQL cone search with a radius of 0.25 degrees, similar to the first example directly accessible from the ADQL query editor (see also the Query examples):
SELECT DISTANCE(POINT(266.41683, -29.00781), POINT(ra, dec)) AS separation, *
FROM gaiadr3.gaia_source
WHERE DISTANCE(POINT(266.41683, -29.00781),POINT(ra, dec)) < 0.25
ORDER BY separation ASC
The first step to retrieve the DataLink products associated with this sample is to click on the double chain ("paperclip") icon available in the job list area of the Advanced (ADQL) form (see Fig. 1). In this case, we will receive an error message explaining that the result of the query above contains 19,758 sources, which exceeds the threshold (5000 sources) imposed to avoid overloading the DataLink server. To avoid this problem, we may reduce the cone search radius or, even better, make use of the "has_<datalink_product>" fields available in the gaiadr3.gaia_source table. Filtering the previous query as follows:
SELECT DISTANCE(POINT(266.41683, -29.00781),POINT(ra, dec)) AS separation, *
FROM gaiadr3.gaia_source
WHERE DISTANCE(POINT(266.41683, -29.00781),POINT(ra, dec)) <0.25
AND
-- Retrieve only sources with associated DataLink products
has_epoch_photometry ='True' AND
has_xp_sampled = 'True'
ORDER BY separation ASC
reduces the query output to just 26 sources, all of them having epoch photometry and XP sampled (as well as XP continuous) spectra. Note: advanced users may use the job_upload mechanism to apply these filters to their old queries without having to re-run the entire query. Clicking again on the double chain icon in the job list area will launch the DataLink wizard as shown in Fig. 1 below. This window lists all the available products associated with the sample generated by the previous query. It is possible to retrieve only selected products (e.g., just RVS mean spectra) or download all the DataLink products at once (by simply clicking on the "Save All Data" button). Note that, depending on the amount of data to be retrieved, the preparation of the dataset prior to starting the download may take up to several minutes. From the DataLink wizard, it is also possible to select different combinations of data structures and download formats (see the DataLink: products serialisation tutorial for details).
Figure 1: Gaia ESA Archive web interface DataLink wizard that appears when clicking on the DataLink icon (double chain link encompassed by a red circle above) in the job list area.
The vertical arrows point to the drop-down menus used to select the data structure (output serialisation) and the file format of the files (see the DataLink: Products serialisation tutorial).
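For programmatic workflows, one way to stay below the 5000-source threshold mentioned above is to split a long list of source IDs into batches before requesting the DataLink products. The sketch below is only an illustration: the helper name `split_into_batches` is hypothetical, the IDs are placeholders, and it is an assumption that the same limit applies to requests made outside the web interface.

```python
def split_into_batches(source_ids, batch_size=5000):
    """Yield consecutive slices of 'source_ids' with at most 'batch_size' elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(source_ids), batch_size):
        yield source_ids[start:start + batch_size]


# Placeholder IDs standing in for the 19,758 sources returned by the first query.
placeholder_ids = list(range(19758))
batches = list(split_into_batches(placeholder_ids))
print([len(batch) for batch in batches])  # [5000, 5000, 5000, 4758]
```

Each batch could then be processed independently, although filtering with the "has_<datalink_product>" fields, as done above, is usually the better first step.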
The "phot_variable_flag" field in the main Gaia DR2 catalogue (gaiadr2.gaia_source) makes it possible to select the sources that have (DR2) epoch photometry. The DR2 equivalent of the last ADQL query would then be:
SELECT DISTANCE(
POINT(266.41683, -29.00781),
POINT(ra, dec)) AS separation, *
FROM gaiadr2.gaia_source
WHERE DISTANCE(POINT(266.41683, -29.00781), POINT(ra, dec)) < 0.25
AND
-- Retrieve only sources with associated DataLink products
phot_variable_flag = 'VARIABLE'
ORDER BY separation ASC
3. Advanced use cases
From the DataLink wizard, it is also possible to 1) select the input ID columns to be used by the massive data server when searching for the DataLink products associated with a given sample, and 2) select the associated data release (see the dropdown menus highlighted by the inclined arrows in Fig. 1). These two options become relevant for users aiming to search for DataLink products in external or user-provided tables (see the Upload a user table and Download data from an external TAP server tutorials to learn how to upload a table to the user space). The example shown in Fig. 2 illustrates the case of a table uploaded by a (registered) user that contains three different ID fields (and right ascension and declination). If the selected column contains Gaia source IDs but not designations, then the user must also indicate the desired data release.
Figure 2: Same as Fig.1, but illustrating what happens when searching for DataLink products in catalogues containing several fields that could be used as input ID columns for the DataLink service.
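The relation between a source ID plus data release and a designation can be illustrated with a short Python sketch. The helper names are hypothetical, and the "<release> <source_id>" pattern is inferred from the designation used in this tutorial ("Gaia DR3 30343944744320"); it is not a description of how the server parses its input.

```python
def to_designation(source_id, data_release="Gaia DR3"):
    """Combine a data release and a source ID into a designation string."""
    return f"{data_release} {int(source_id)}"


def split_designation(designation):
    """Split a designation into (data release, source ID)."""
    release, _, source_id = designation.rpartition(" ")
    if not release or not source_id.isdigit():
        raise ValueError(f"unexpected designation format: {designation!r}")
    return release, int(source_id)


print(to_designation(30343944744320))                # Gaia DR3 30343944744320
print(split_designation("Gaia DR2 30343944744320"))  # ('Gaia DR2', 30343944744320)
```

A column holding bare source IDs carries no release information, which is why the data release has to be selected explicitly in that case.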
DataLink products serialisation
Authors: Héctor Cánovas, María Henar, Jos de Bruijne, Elena Racero, and Alcione Mora
The DataLink IVOA protocol implemented in the Gaia ESA Archive gives access to six different products (epoch photometry, medium- and low-resolution spectra, and probability density distributions for the different astrophysical parameters) available for a significant fraction of the sources included in the main Gaia DR3 table (gaiadr3.gaia_source). These products are serialised according to different data models, and all of them can be retrieved in multiple file formats as well as multiple data structures. This document describes the contents (both data and metadata) of the DataLink products in the various serialisations generated by the Archive. The implementation of the DataLink protocol in the Archive is briefly described in the DataLink Service tutorial, while the DataLink: Access from the web interface and the DataLink: Command line access tutorials describe how to retrieve the DataLink products through the Archive web interface and programmatically, respectively. This other tutorial shows how to use the Astroquery.Gaia Python package to download these products.
The DataLink products served by the Archive (and the data models applied to serialise them) are listed below:
Product | Retrieval type | Short description | Data model | Data release |
---|---|---|---|---|
 | EPOCH_PHOTOMETRY | Each row in this table contains the light curve for a given object in the G, BP, and RP bands as stored in the DataLink massive data base. | | DR3 & DR2 |
 | MCMC_GSPPHOT | Monte-Carlo Markov Chain (MCMC) samples for the posterior probability distribution of all parameters derived from the General Stellar Parametrizer from Photometry (GSP-Phot). Some 2000 random MCMC samples are provided for (1) all sources brighter than G=12 mag, (2) a random subset of 1% of the sources fainter than G=12 mag. For all other sources fainter than G=12, the sample size is 100 (the last 100 samples in the MCMC). | | DR3 |
 | MCMC_MSC | Monte-Carlo Markov Chain (MCMC) samples for the posterior probability distribution of all parameters derived from the Multiple Star Classifier (MSC). Some 100 random MCMC samples are provided for each source. | | DR3 |
 | XP_CONTINUOUS | Time-averaged (mean) BP/RP spectra based on the continuous representation in basis functions (see this Chapter). | | DR3 |
 | XP_SAMPLED | Time-averaged (mean) BP/RP externally-calibrated and sampled spectra are provided for a subset of all sources. All spectra are sampled to the same set of absolute wavelength positions, which can be found in the xp_merge table. | | DR3 |
 | RVS | Time-averaged (mean) RVS normalised and sampled spectra are provided for a subset of all sources. | IVOA spectrum | DR3 |
The serialisation of each product is detailed in the following sections.
Tutorial content:
1. Retrieval parameters
1.1 Data Structure
This parameter defines the structure of the file that is being prepared for download. There are three possible options:
- INDIVIDUAL (default): one single file per product per selected source(s), with the data serialised in tabular format (one element per table cell).
- COMBINED: one single file per product, with the data for multiple sources serialised in a tabular format (one element per table cell).
- RAW: one single file per product, with the data for multiple sources serialised in a tabular format (one or more elements per table cell).
The latter format is the one used internally by the DPAC consortium, and it is documented in the Gaia Data Release 3 documentation (see the Datamodel description chapter).
1.2 Download file format
Available file download formats are:
- VOTable (both binary and plain-text formats, .xml extension)
- FITS
- CSV
- *ECSV (Enhanced Character Separated Values)
*Note: It is not possible to download XP mean sampled spectra or RVS mean spectra in a COMBINED data structure using ECSV as file format. This format fundamentally does not support storing multiple tables (with their associated metadata) in a single file.
The VOTable, FITS, and ECSV file formats provide the table fields and metadata with column descriptions, UCDs, UTYPEs, and units when applicable. The CSV file format only includes the column names.
1.3 Output file naming
The data structure and download format define the names of the retrieved files as follows:
Data structure | File name | Example |
---|---|---|
INDIVIDUAL (one file per source) | <RETRIEVAL_TYPE>-<DESIGNATION>.<xml/fits/csv/ecsv> | RVS-Gaia DR3 30343944744320.<xml/fits/csv/ecsv> |
COMBINED (one file with all sources) | <RETRIEVAL_TYPE>_COMBINED.<xml/fits/csv/ecsv> | EPOCH_PHOTOMETRY_COMBINED.<xml/fits/csv/ecsv> |
RAW (one file with all sources) | <RETRIEVAL_TYPE>_RAW.<xml/fits/csv/ecsv> | XP_SAMPLED_RAW.<xml/fits/csv/ecsv> |
By default, the output data is downloaded as a compressed .gzip file. However, some internet browsers (for instance, Safari) automatically expand these files without asking the user.
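The naming rules above can be summarised in a small Python sketch, which may help when predicting or locating the downloaded files. The function name `datalink_file_name` is hypothetical and simply encodes the three patterns listed in the table; it is not part of any Archive tool.

```python
def datalink_file_name(retrieval_type, data_structure, file_format, designation=None):
    """Build the expected file name for a DataLink product (illustrative helper)."""
    extension = file_format.lower()  # 'xml', 'fits', 'csv', or 'ecsv'
    if data_structure == "INDIVIDUAL":
        # One file per source: the designation is part of the name.
        if designation is None:
            raise ValueError("INDIVIDUAL files need a designation")
        return f"{retrieval_type}-{designation}.{extension}"
    if data_structure in ("COMBINED", "RAW"):
        # One file with all sources: the data structure is part of the name.
        return f"{retrieval_type}_{data_structure}.{extension}"
    raise ValueError(f"unknown data structure: {data_structure!r}")


print(datalink_file_name("RVS", "INDIVIDUAL", "fits", "Gaia DR3 30343944744320"))
print(datalink_file_name("EPOCH_PHOTOMETRY", "COMBINED", "csv"))
print(datalink_file_name("XP_SAMPLED", "RAW", "xml"))
```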
2. Data Models
The data model for the products serialised in the INDIVIDUAL and COMBINED data structures is described in the following subsections. In all the tables below, the fields that are added by the Archive when serialising the different products (i.e., those that are not included in the DPAC data model) are highlighted in bold.
2.1 EPOCH PHOTOMETRY
The serialisation of the epoch photometry is based on the IVOA Time series cube data model. No metadata is added to the file header, and all the information is repeated throughout the output table so that each row is self-contained. Column names therefore remain identical in the INDIVIDUAL and COMBINED data structures.
Field | Unit | Data type | UCD | UTYPE |
---|---|---|---|---|
source_id | | long | meta.id;meta.main | |
transit_id | | long | meta.id | |
band | | string | instr.bandpass | ssa:DataID.Bandpass |
time | d | double | time.epoch | |
mag | mag | double | phot.mag;em.opt | |
flux | 'electron'.s**-1 | double | phot.flux;stat.mean | |
flux_error | 'electron'.s**-1 | double | stat.error;phot.flux;em.opt | |
flux_over_error | | double | stat.snr;phot.flux;em.opt | |
rejected_by_photometry | | boolean | meta.code.status | |
rejected_by_variability | | boolean | meta.code.status | |
other_flags | | long | meta.code.status | |
solution_id | | long | meta.version | |
Long descriptions for the added fields:
band: Photometric band. Values: G (per-transit combined SM-AF flux), BP (blue photometer integrated flux), and RP (red photometer integrated flux).
rejected_by_photometry: Rejected by DPAC photometric processing. Unavailable or rejected by DPAC photometric processing, or negative (unphysical) flux.
other_flags: Additional processing flags. This field contains extra information on the data used to compute the fluxes and their quality. It provides debugging information that may be safely ignored for many general purpose applications. The field is a collection of binary flags, whose values can be recovered by applying bit shifting and masking operations. Each band has different binary flags in different positions, as shown below. Bit numbering is as follows: least significant bit = 1 and most significant bit = 64.
- G band:
- Bit 1: SM transit rejected by photometric processing.
- Bit 2: AF1 transit rejected by photometric processing.
- Bit 3: AF2 transit rejected by photometric processing.
- Bit 4: AF3 transit rejected by photometric processing.
- Bit 5: AF4 transit rejected by photometric processing.
- Bit 6: AF5 transit rejected by photometric processing.
- Bit 7: AF6 transit rejected by photometric processing.
- Bit 8: AF7 transit rejected by photometric processing.
- Bit 9: AF8 transit rejected by photometric processing.
- Bit 10: AF9 transit rejected by photometric processing.
- Bit 13: G band flux scatter larger than expected (all CCDs considered).
- Bit 14: SM transit unavailable by photometric processing.
- Bit 15: AF1 transit unavailable by photometric processing.
- Bit 16: AF2 transit unavailable by photometric processing.
- Bit 17: AF3 transit unavailable by photometric processing.
- Bit 18: AF4 transit unavailable by photometric processing.
- Bit 19: AF5 transit unavailable by photometric processing.
- Bit 20: AF6 transit unavailable by photometric processing.
- Bit 21: AF7 transit unavailable by photometric processing.
- Bit 22: AF8 transit unavailable by photometric processing.
- Bit 23: AF9 transit unavailable by photometric processing.
- BP band:
- Bit 11: BP transit rejected by photometric processing.
- Bit 24: BP transit photometry rejected by variability processing.
- RP band:
- Bit 12: RP transit rejected by photometric processing.
- Bit 25: RP transit photometry rejected by variability processing.
Figure 1: excerpt from the Epoch Photometry table as shown by TOPCAT.
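The bit shifting and masking mentioned in the description of the other_flags field can be sketched in Python as follows. The function and dictionary names are hypothetical; the bit positions and messages are copied from the list above, with bit 1 being the least significant bit.

```python
def _g_band_flags():
    """Bit number -> meaning for the G band, following the list above."""
    flags = {1: "SM transit rejected by photometric processing"}
    for ccd in range(1, 10):  # AF1..AF9 occupy bits 2..10
        flags[1 + ccd] = f"AF{ccd} transit rejected by photometric processing"
    flags[13] = "G band flux scatter larger than expected (all CCDs considered)"
    flags[14] = "SM transit unavailable by photometric processing"
    for ccd in range(1, 10):  # AF1..AF9 occupy bits 15..23
        flags[14 + ccd] = f"AF{ccd} transit unavailable by photometric processing"
    return flags


FLAGS_BY_BAND = {
    "G": _g_band_flags(),
    "BP": {11: "BP transit rejected by photometric processing",
           24: "BP transit photometry rejected by variability processing"},
    "RP": {12: "RP transit rejected by photometric processing",
           25: "RP transit photometry rejected by variability processing"},
}


def decode_other_flags(value, band):
    """Return the meanings of all bits set in 'value' for the given band."""
    flags = FLAGS_BY_BAND[band]
    # Bit n (1-based) is tested by shifting right by n - 1 and masking with 1.
    return [flags[bit] for bit in sorted(flags) if (int(value) >> (bit - 1)) & 1]


print(decode_other_flags(4097, "G"))     # bits 1 and 13 are set
print(decode_other_flags(1 << 10, "BP")) # bit 11 is set
```

Bits that are not documented for a given band are ignored by this sketch.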
2.2 MCMC GSP-PHOT samples
The serialisation of the MCMC GSP-Phot samples follows the DPAC data model, with the only exceptions being the "nsamples" field that is not included and the array serialisation (the array columns are flattened to one value per entry). No metadata is added into the file header, and all the information is repeated through the output table so each row is self-contained. Column names therefore remain identical in the INDIVIDUAL and COMBINED data structures. The table metadata does not contain any UTYPEs.
Field | Unit | Data type | UCD |
---|---|---|---|
source_id | | long | meta.id |
solution_id | | long | meta.version |
teff | K | float | phys.temperature.effective |
azero | mag | float | phys.absorption;em.opt |
logg | log(cm.s**-2) | float | phys.gravity |
mh | 'dex' | float | phys.abund |
ag | mag | float | phys.absorption;em.opt |
mg | mag | float | phys.magAbs;em.opt |
distancepc | pc | float | pos.distance |
abp | mag | float | phys.absorption;em.opt.B |
arp | mag | float | phys.absorption;em.opt.R |
ebpminrp | mag | float | phot.color.excess;em.opt |
log_pos | | float | stat.probability |
log_lik | | float | stat.likelihood |
radius | solRad | float | phys.size.radius |
Figure 2: excerpt from the MCMC GSP-Phot table as shown by TOPCAT.
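Because the array columns are flattened and every row is self-contained, the samples of a given source can be recovered by grouping the rows by source_id. The sketch below illustrates this with plain Python dictionaries and synthetic values (they are not real Gaia data); the helper name `median_per_source` is hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_per_source(rows, field):
    """Group flattened MCMC rows by source_id and return the median of 'field'."""
    samples = defaultdict(list)
    for row in rows:
        samples[row["source_id"]].append(row[field])
    return {source_id: median(values) for source_id, values in samples.items()}


# Synthetic rows mimicking a COMBINED table: one MCMC sample per row.
rows = [
    {"source_id": 1, "teff": 5000.0},
    {"source_id": 1, "teff": 5100.0},
    {"source_id": 1, "teff": 5200.0},
    {"source_id": 2, "teff": 4000.0},
    {"source_id": 2, "teff": 4200.0},
]
print(median_per_source(rows, "teff"))  # {1: 5100.0, 2: 4100.0}
```

The same grouping applies to any other sampled parameter in the table.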
2.3 MCMC MSC SAMPLES
The serialisation of the MCMC MSC samples follows the DPAC data model, with the only exceptions being the "nsamples" field that is not included and the array serialisation (the array columns are flattened to one value per entry). No metadata is added into the file header, and all the information is repeated through the output table so each row is self-contained. Column names therefore remain identical in the INDIVIDUAL and COMBINED data structures. The table metadata does not contain any UTYPEs.
Field | Unit | Data type | UCD |
---|---|---|---|
source_id | | long | meta.id |
solution_id | | long | meta.version |
teff1 | K | float | phys.temperature.effective |
teff2 | K | float | phys.temperature.effective |
logg1 | log(cm.s**-2) | float | phys.gravity |
logg2 | log(cm.s**-2) | float | phys.gravity |
azero | mag | short | phys.absorption;em.opt |
mh | 'dex' | float | phys.abund.Fe |
distancepc | pc | float | pos.distance |
log_pos | | float | stat.probability |
log_lik | | float | stat.likelihood |
Figure 3: excerpt from the MCMC MSC table as shown by TOPCAT.
2.4 XP CONTINUOUS spectra
The XP Continuous mean spectra are serialised without deviations from the original DPAC data model for any data structure. The table metadata contains neither units nor UTYPEs.
Field | Data type | UCD |
---|---|---|
source_id | long | meta.id;meta.main |
solution_id | long | meta.version |
bp_basis_function_id | short | meta.id |
bp_degrees_of_freedom | short | stat.fit.dof |
bp_n_parameters | short | stat.fit.param |
bp_n_measurements | short | meta.number |
bp_n_rejected_measurements | short | meta.number |
bp_standard_deviation | float | stat.stdev |
bp_chi_squared | float | stat.fit.chi2 |
bp_coefficients | double[] | stat.fit.param |
bp_coefficient_errors | float[] | stat.error |
bp_coefficient_correlations | float[] | stat.correlation |
bp_n_relevant_bases | short | meta.number |
bp_relative_shrinking | float | stat.fit.param |
rp_basis_function_id | short | meta.number |
rp_degrees_of_freedom | short | stat.fit.dof |
rp_n_parameters | short | stat.fit.param |
rp_n_measurements | short | meta.number |
rp_n_rejected_measurements | short | meta.number |
rp_standard_deviation | float | stat.stdev |
rp_chi_squared | float | stat.fit.chi2 |
rp_coefficients | double[] | stat.fit.param |
rp_coefficient_errors | float[] | stat.error |
rp_coefficient_correlations | float[] | stat.correlation |
rp_n_relevant_bases | short | meta.number |
rp_relative_shrinking | float | meta.number |
Figure 4: excerpt from the XP Continuous Spectrum table as shown by TOPCAT.
2.5 XP SAMPLED spectra
The serialisation of the XP sampled mean spectra follows the IVOA Spectra Data Model. Both the INDIVIDUAL and COMBINED data structures contain the same data and metadata. However, due to the 8-character length limit that the FITS format imposes on keyword names, the (added) table metadata parameters are renamed when serialising this product in FITS format. In the table below, the rows in white and green background indicate the fields included in the table metadata and data, respectively. None of the metadata fields is included in the files generated in .csv format, which follows a particular serialisation (similar to the DPAC RAW serialisation but including a wavelength column).
Field (VOTable) | Field (FITS) | Unit | Data type | UCD | UTYPE |
---|---|---|---|---|---|
source_id | SOURCEID | | long | meta.id;src | spec:Target.Name |
solution_id | SOLUTION | | long | meta.version | |
spatialLocation | POS | deg | double[] | pos.eq | spec:Char.SpatialAxis.Coverage.Location.Value |
SpatialExtent | APERTURE | deg | double | instr.fov | spec:Spectrum.Char.SpatialAxis.Coverage.Bounds.Extent |
TimeAxisCoverageLocation | REFEPOCH | yr | double | time.epoch | spec:Char.TimeAxis.Coverage.Location.Value |
TimeAxisCoverageBoundsExtent | EPOCHEXT | yr | double | time.duration | spec:Char.TimeAxis.Coverage.Bounds.Extent |
spectralLocation | - | nm | double | instr.bandpass | spec:Char.SpectralAxis.Coverage.Location.Value |
spectralCoverageBoundsExtent | WAVEEXTE | nm | double | instr.bandwidth | spec:Char.SpectralAxis.Coverage.Bounds.Extent |
spectralCoverageBoundsStart | WAVESTAR | nm | double | stat.min | spec:Char.SpectralAxis.Coverage.Bounds.Start |
spectralCoverageBoundsStop | WAVEEND | nm | double | stat.max | spec:Char.SpectralAxis.Coverage.Bounds.Stop |
spectralAccuracyStatError | WAVEERRO | nm | double | stat.error;em.wl | spec:Char.SpectralAxis.Accuracy.StatError |
DataModel | DATAMODE | nm | string | | spec:Spectrum.DataModel |
Publisher | PUBLISHE | | string | meta.curation | spec:Curation.Publisher |
Title | TITLE | | string | | spec:DataID.Title |
SpectralAxisUcd | - | | string | | spec:Spectrum.Char.SpectralAxis.Ucd |
SpectralAxisUnit | SPECTRAL | | string | | spec:Spectrum.Char.SpectralAxis.Unit |
FluxAxisUcd | - | | string | | spec:Spectrum.Char.FluxAxis.Ucd |
FluxAxisUnit | FLUXAXIS | | string | | spec:Spectrum.Char.FluxAxis.Unit |
wavelength | | nm | double | em.wl | spec:Data.SpectralAxis.Value |
flux | | W.m**-2.nm**-1 | float | spect | |
flux_error | | W.m**-2.nm**-1 | float | stat.error;spect | |
Figure 5: excerpt from the XP Sampled Spectrum table and metadata as shown by TOPCAT.
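When the same script has to read both VOTable and FITS serialisations, the renaming listed in the table above can be kept in a simple lookup table. The sketch below copies the VOTable-to-FITS pairs from that table; the names `VOTABLE_TO_FITS` and `fits_keyword` are hypothetical, and fields marked with "-" in the FITS column are simply left out.

```python
# VOTable metadata name -> FITS keyword, as listed for the XP sampled spectra.
VOTABLE_TO_FITS = {
    "source_id": "SOURCEID",
    "solution_id": "SOLUTION",
    "spatialLocation": "POS",
    "SpatialExtent": "APERTURE",
    "TimeAxisCoverageLocation": "REFEPOCH",
    "TimeAxisCoverageBoundsExtent": "EPOCHEXT",
    "spectralCoverageBoundsExtent": "WAVEEXTE",
    "spectralCoverageBoundsStart": "WAVESTAR",
    "spectralCoverageBoundsStop": "WAVEEND",
    "spectralAccuracyStatError": "WAVEERRO",
    "DataModel": "DATAMODE",
    "Publisher": "PUBLISHE",
    "Title": "TITLE",
    "SpectralAxisUnit": "SPECTRAL",
    "FluxAxisUnit": "FLUXAXIS",
}


def fits_keyword(votable_name):
    """Return the FITS keyword for a VOTable metadata name, or None if not listed."""
    return VOTABLE_TO_FITS.get(votable_name)


# Every listed keyword respects the 8-character FITS limit.
print(all(len(keyword) <= 8 for keyword in VOTABLE_TO_FITS.values()))
print(fits_keyword("TimeAxisCoverageLocation"))  # REFEPOCH
```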
2.6 RVS SPECTRA
The serialisation of the RVS mean spectra follows the IVOA Spectra Data Model. Both the INDIVIDUAL and COMBINED data structures contain the same data and metadata. However, due to the 8-character length limit that the FITS format imposes on keyword names, the (added) table metadata parameters are renamed when serialising this product in FITS format. In the table below, the rows in white and green background indicate the fields included in the table metadata and data, respectively. None of the metadata fields is included in the files generated in .csv format, which follows a particular serialisation (similar to the DPAC RAW serialisation but including a wavelength column).
Field (VOTable) | Field (FITS) | Unit | Data type | UCD | UTYPE |
---|---|---|---|---|---|
source_id | SOURCEID | | long | meta.id;src | spec:Target.Name |
solution_id | SOLUTION | | long | meta.version | |
combined_transits | NTRANSIT | | int | | |
combined_ccds | NCCDS | | int | | |
deblended_ccd | NDEBLEND | | int | | |
spatialLocation | POS | deg | double[] | pos.eq | spec:Char.SpatialAxis.Coverage.Location.Value |
TimeAxisCoverageLocation | REFEPOCH | yr | double | time.epoch | spec:Char.TimeAxis.Coverage.Location.Value |
TimeAxisCoverageBoundsExtent | EPOCHEXT | yr | double | time.duration | spec:Char.TimeAxis.Coverage.Bounds.Extent |
spectralAccuracyStatError | WAVEERRO | nm | double | stat.error;em.wl | spec:Char.SpectralAxis.Accuracy.StatError |
spectralLocation | - | nm | double | instr.bandpass | spec:Char.SpectralAxis.Coverage.Location.Value |
spectralCoverageBoundsExtent | WAVEEXTE | nm | double | instr.bandwidth | spec:Char.SpectralAxis.Coverage.Bounds.Extent |
spectralCoverageBoundsStart | WAVESTAR | nm | double | stat.min | spec:Char.SpectralAxis.Coverage.Bounds.Start |
spectralCoverageBoundsStop | WAVEEND | nm | double | stat.max | spec:Char.SpectralAxis.Coverage.Bounds.Stop |
SpatialExtent | APERTURE | deg | double | instr.fov | spec:Spectrum.Char.SpatialAxis.Coverage.Bounds.Extent |
DataModel | DATAMODE | | string | | spec:Spectrum.DataModel |
Publisher | PUBLISHE | | string | meta.curation | spec:Curation.Publisher |
Title | TITLE | | string | | spec:DataID.Title |
SpectralAxisUcd | - | | string | | spec:Spectrum.Char.SpectralAxis.Ucd |
SpectralAxisUnit | SPECTRAL | | string | | spec:Spectrum.Char.SpectralAxis.Unit |
FluxAxisUcd | FLUXAXIS | | string | | spec:Spectrum.Char.FluxAxis.Ucd |
FluxAxisUnit | | | string | | spec:Spectrum.Char.FluxAxis.Unit |
wavelength | | nm | double | em.wl | spec:Data.SpectralAxis.Value |
flux | | | float | phot.flux;em.opt.I | |
flux_error | | | float | stat.error;phot.flux;em.opt.I | |
Figure 6: excerpt from the RVS Spectra table and metadata as shown by TOPCAT.
DataLink: Python access
Authors: Héctor Cánovas and Jos de Bruijne
The main goal of the Jupyter Notebook displayed below is to teach how to retrieve and inspect the DataLink products using the Astroquery.Gaia Python package. This code has been tested in Python >= 3.8. The Jupyter notebook is included in this .zip file that also contains complementary notebooks, supplementary files, and a "tutorials.yml" environment file that can be used to create a conda environment with all dependencies needed to execute it (as explained in the official conda documentation).
Tutorial: Retrieve (all) the DataLink products associated with a sample
Release number: v1.0.1 (2022-12-06)
Applicable Gaia Data Releases: Gaia EDR3, Gaia DR3
Author: Héctor Cánovas Cabrera; hector.canovas@esa.int
Summary:
This code shows how to retrieve the different DataLink products from an input list of Gaia DR3 sources. These products are serialised in three different data structures:
- INDIVIDUAL
- COMBINED, and
- RAW
Although all data structures contain virtually the same information, the RAW format (the internal format used by the Gaia collaboration) is not intended for end users (see the DataLink: Products serialisation tutorial for details). This notebook shows the content of the INDIVIDUAL and COMBINED products, whose serialisation follows different IVOA data model recommendations and makes it easy to inspect the product content. We recommend selecting the COMBINED format when downloading DataLink products for large numbers (>1000) of sources to reduce the total download time.
Useful URLs:
- Questions or suggestions
- Tutorials, documentation, and more
- Known issues in the Gaia data
- Gaia data credits and acknowledgements
- GaiaXPy: GaiaXPy is a Python library to facilitate handling Gaia BP/RP spectra as distributed from the Gaia archive.
from astroquery.gaia import Gaia
import matplotlib.pyplot as plt


def extract_dl_ind(datalink_dict, key, figsize = [15,5], fontsize = 12, linewidth = 2, show_legend = True, show_grid = True):
    """
    Extract individual DataLink products and export them to an Astropy Table
    """
    dl_out = datalink_dict[key][0].to_table()
    if 'time' in dl_out.keys():
        plot_e_phot(dl_out, colours = ['green', 'red', 'blue'], title = 'Epoch photometry', fontsize = fontsize, show_legend = show_legend, show_grid = show_grid, figsize = figsize)
    if 'wavelength' in dl_out.keys():
        title = 'Sampled spectrum'               # Fallback title, used if the table length matches neither product
        if len(dl_out) == 343:  title = 'XP Sampled'
        if len(dl_out) == 2401: title = 'RVS'
        plot_sampled_spec(dl_out, color = 'blue', title = title, fontsize = fontsize, show_legend = False, show_grid = show_grid, linewidth = linewidth, legend = '', figsize = figsize)
    return dl_out


def plot_e_phot(inp_table, colours = ['green', 'red', 'blue'], title = 'Epoch photometry', fontsize = 12, show_legend = True, show_grid = True, figsize = [15,5]):
    """
    Epoch photometry plotter. 'inp_table' MUST be an Astropy-table object.
    """
    fig     = plt.figure(figsize=figsize)
    xlabel  = f'JD date [{inp_table["time"].unit}]'
    ylabel  = f'magnitude [{inp_table["mag"].unit}]'
    gbands  = ['G', 'RP', 'BP']
    colours = iter(colours)

    plt.gca().invert_yaxis()                     # Brighter (smaller) magnitudes are plotted upwards
    for band in gbands:
        phot_set = inp_table[inp_table['band'] == band]
        plt.plot(phot_set['time'], phot_set['mag'], 'o', label = band, color = next(colours))
    make_canvas(title = title, xlabel = xlabel, ylabel = ylabel, fontsize= fontsize, show_legend=show_legend, show_grid = show_grid)
    plt.show()


def plot_sampled_spec(inp_table, color = 'blue', title = '', fontsize = 14, show_legend = True, show_grid = True, linewidth = 2, legend = '', figsize = [12,4], show_plot = True):
    """
    RVS & XP sampled spectrum plotter. 'inp_table' MUST be an Astropy-table object.
    """
    if show_plot:
        fig = plt.figure(figsize=figsize)
    xlabel = f'Wavelength [{inp_table["wavelength"].unit}]'
    ylabel = f'Flux [{inp_table["flux"].unit}]'
    # The 'color' argument is passed on to the plot call so that it takes effect
    plt.plot(inp_table['wavelength'], inp_table['flux'], '-', color = color, linewidth = linewidth, label = legend)
    make_canvas(title = title, xlabel = xlabel, ylabel = ylabel, fontsize= fontsize, show_legend=show_legend, show_grid = show_grid)
    if show_plot:
        plt.show()


def make_canvas(title = '', xlabel = '', ylabel = '', show_grid = False, show_legend = False, fontsize = 12):
    """
    Create generic canvas for plots
    """
    plt.title(title,    fontsize = fontsize)
    plt.xlabel(xlabel,  fontsize = fontsize)
    plt.ylabel(ylabel , fontsize = fontsize)
    plt.xticks(fontsize = fontsize)
    plt.yticks(fontsize = fontsize)
    if show_grid:
        plt.grid()
    if show_legend:
        plt.legend(fontsize = fontsize*0.75)


Gaia.login()
Download data sample¶
The query below retrieves a random sample of Gaia (E)DR3 sources having all types of DataLink products.
query = f"SELECT source_id, ra, dec, pmra, pmdec, parallax \
FROM gaiadr3.gaia_source \
WHERE has_epoch_photometry = 'True' \
AND has_xp_sampled = 'True'\
AND has_rvs = 'True' \
AND has_mcmc_msc = 'True' \
AND has_mcmc_gspphot = 'True' \
AND random_index between 0 and 200000"
job = Gaia.launch_job_async(query)
results = job.get_results()
print(f'Table size (rows): {len(results)}')
results
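The WHERE clause of the query above is a list of has_* flags joined by AND. As a minimal sketch, the same query string can be assembled from a list of flag names, which makes it easier to drop the products you do not need. The helper name build_query and its max_random_index argument are hypothetical and not part of the original notebook; the column and table names are the ones used in the query above.

```python
# Hypothetical helper (not part of the original notebook): assemble the ADQL
# query from a list of has_* flag columns of gaiadr3.gaia_source.
def build_query(flags, max_random_index=200000):
    columns = "source_id, ra, dec, pmra, pmdec, parallax"
    # One condition per flag, written as in the notebook query.
    conditions = [f"{flag} = 'True'" for flag in flags]
    # Random sub-sample selection, as in the notebook query.
    conditions.append(f"random_index BETWEEN 0 AND {max_random_index}")
    where = " AND ".join(conditions)
    return f"SELECT {columns} FROM gaiadr3.gaia_source WHERE {where}"

flags = ["has_epoch_photometry", "has_xp_sampled", "has_rvs",
         "has_mcmc_msc", "has_mcmc_gspphot"]
query = build_query(flags)
print(query)
```

The resulting string can be passed to Gaia.launch_job_async in the same way as the hand-written query.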
Download DataLink Products: INDIVIDUAL¶
The example below retrieves ALL available DataLink products for the input sample of Gaia source IDs. This option significantly increases the total download time and is selected here only for teaching purposes. If you do not need all products, we recommend specifying the desired DataLink product in retrieval_type.
The downloaded files can be stored locally by specifying the output file via the output_file option of the load_data method below. Note that:
- The DataLink products are stored in a .gz compressed directory. To avoid errors, this should be considered when naming the output file, e.g., output_file = 'datalink_output.gz'
- The individual files will also be saved in the same directory from where this notebook is being launched. This is a known bug and we are working to fix it.
- Finally, the metadata of some of the products raises an Astropy units warning. This is a known issue and we are also working on it.
retrieval_type = 'ALL' # Options are: 'EPOCH_PHOTOMETRY', 'MCMC_GSPPHOT', 'MCMC_MSC', 'XP_SAMPLED', 'XP_CONTINUOUS', 'RVS', 'ALL'
data_structure = 'INDIVIDUAL' # Options are: 'INDIVIDUAL', 'COMBINED', 'RAW'
data_release = 'Gaia DR3' # Options are: 'Gaia DR3' (default), 'Gaia DR2'
datalink = Gaia.load_data(ids=results['source_id'], data_release = data_release, retrieval_type=retrieval_type, data_structure = data_structure, verbose = False, output_file = None)
dl_keys = [inp for inp in datalink.keys()]
dl_keys.sort()
print()
print(f'The following Datalink products have been downloaded:')
for dl_key in dl_keys:
    print(f' * {dl_key}')
Detailed content¶
The DataLink products are stored inside a Python dictionary, where each element (key) contains a one-element list. In addition:
- The epoch photometry, MCMC's, and XP continuous products consist of a table that includes a "source_id" field.
- The XP sampled and RVS products consist of a table that is serialised following the IVOA Spectrum Data Model (for details, see the DataLink: Products serialisation tutorial). As a result, a number of parameters (including the source_id) associated with these files are stored in the table metadata. The cell below shows how to extract these parameters, and how to export the table content to an Astropy Table object.
dl_key = 'RVS-Gaia DR3 6196457933368101888.xml' # Try out using other XP_Sampled or RVS products (e.g., 'XP_SAMPLED-Gaia DR3 4911590910260264960.xml')
product = datalink[dl_key][0]
items = [item for item in product.iter_fields_and_params()]
if 'RVS' in dl_key or 'XP_SAMPLED' in dl_key:
    for item in items:
        print(item)
    print()
    print(f'Showing data for source_id: {product.get_field_by_id("source_id").value}')
prod_tab = product.to_table()
prod_tab[0:5]
The code below creates a plot if the downloaded product contains epoch photometry or a sampled spectrum (RVS or XP). Try it yourself and examine the content of the different products by commenting/uncommenting the dl_key variable below. The table displayed below only shows the first 5 elements to shorten this Notebook.
dl_key = 'EPOCH_PHOTOMETRY-Gaia DR3 4911590910260264960.xml'
# dl_key = 'MCMC_MSC-Gaia DR3 5924045608237672448.xml'
# dl_key = 'MCMC_GSPPHOT-Gaia DR3 5924045608237672448.xml'
# dl_key = 'XP_CONTINUOUS-Gaia DR3 4911590910260264960.xml'
# dl_key = 'RVS-Gaia DR3 6196457933368101888.xml'
# dl_key = 'XP_SAMPLED-Gaia DR3 6196457933368101888.xml'
dl_out = extract_dl_ind(datalink, dl_key, figsize=[20,7]) # Change the figsize to e.g. figsize=[30,7] to increase the size of the displayed image.
dl_out[0:5] # Remove the '[0:5]' to display the entire table.
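The dictionary keys used above appear to follow the pattern '<PRODUCT>-<data release> <source_id>.xml'. As a minimal sketch, and assuming that pattern holds (it is inferred only from the example keys shown in this tutorial and may differ in other Astroquery versions), a key can be split into its three parts. The function name parse_individual_key is hypothetical.

```python
# Hypothetical helper: split an INDIVIDUAL-structure key into its parts.
# Assumes the pattern '<PRODUCT>-<release> <source_id>.xml' seen in this tutorial.
def parse_individual_key(dl_key):
    stem = dl_key[:-4] if dl_key.endswith('.xml') else dl_key
    # The product type never contains '-', so split on the first one only.
    product_type, rest = stem.split('-', 1)
    # The source_id is the last space-separated token; the release may contain spaces.
    release, source_id = rest.rsplit(' ', 1)
    return product_type, release, int(source_id)

parsed = parse_individual_key('EPOCH_PHOTOMETRY-Gaia DR3 4911590910260264960.xml')
print(parsed)
```

This can be handy for grouping the downloaded keys by product type or by source before plotting them.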
Download COMBINED DataLink Products¶
As with the INDIVIDUAL example above, the following example retrieves ALL available DataLink products for the input sample of Gaia source IDs. If you do not need all products, we recommend specifying the desired DataLink product in retrieval_type (e.g., retrieval_type = 'RVS').
retrieval_type = 'ALL' # Options are: 'EPOCH_PHOTOMETRY', 'MCMC_GSPPHOT', 'MCMC_MSC', 'XP_SAMPLED', 'XP_CONTINUOUS', 'RVS', 'ALL'
data_structure = 'COMBINED' # Options are: 'INDIVIDUAL', 'COMBINED', 'RAW'
data_release = 'Gaia DR3' # Options are: 'Gaia DR3' (default), 'Gaia DR2'
datalink = Gaia.load_data(ids=results['source_id'], data_release = data_release, retrieval_type=retrieval_type, data_structure = data_structure, verbose = False, output_file = None)
dl_keys = [inp for inp in datalink.keys()]
dl_keys.sort()
print()
print(f'The following Datalink products have been downloaded:')
for dl_key in dl_keys:
    print(f' * {dl_key}')
Detailed content¶
The DataLink products are stored inside a Python dictionary, where each element (key) contains a one- or a multi-element list, depending on the product:
- The epoch photometry, MCMC's, and XP continuous products consist of a single-element list, which is a table that includes a "source_id" field.
- The XP sampled and RVS products consist of a multi-element list, where each element is a table serialised following the IVOA Spectrum Data Model.
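The difference between the two cases can be checked by looking at the length of each list. The sketch below uses a plain dictionary with placeholder strings as a stand-in for the load_data output, so it runs without an Archive connection; the function name describe_products is hypothetical.

```python
# Stand-in for the dictionary returned by load_data with the COMBINED structure.
# The strings are placeholders for the real table objects.
datalink_mock = {
    'EPOCH_PHOTOMETRY_COMBINED.xml': ['one table holding all sources'],
    'RVS_COMBINED.xml': ['spectrum 1', 'spectrum 2', 'spectrum 3'],
}

def describe_products(products):
    """Map each key to 'single table' or 'N tables' based on the list length."""
    summary = {}
    for key in sorted(products):
        n_items = len(products[key])
        summary[key] = 'single table' if n_items == 1 else f'{n_items} tables'
    return summary

summary = describe_products(datalink_mock)
for key, description in summary.items():
    print(f'{key}: {description}')
```

Applied to the real dictionary, the same length check tells you whether to index the list with [0] or to loop over its elements.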
Extract data for individual sources (epoch photometry, MCMC's, and XP continuous)¶
The table displayed below only shows the first 5 elements to shorten this Notebook.
dl_key = 'EPOCH_PHOTOMETRY_COMBINED.xml' # Try also with 'XP_CONTINUOUS_COMBINED.xml', 'MCMC_MSC_COMBINED.xml', 'MCMC_GSPPHOT_COMBINED.xml'
product = datalink[dl_key][0]
product_tb = product.to_table() # Export to Astropy Table object.
source_ids = list(set(product_tb['source_id'])) # Detect source_ids.
print(f' There is data for the following Source IDs:')
for source_id in source_ids:
    print(f'* {source_id}')
inp_source = source_ids[0] # Replace "0" by "1" or "2" to show the data for the other sources.
product_tb = product_tb[product_tb['source_id'] == inp_source]
print()
print(f'Showing data for source_id {inp_source}')
product_tb[0:5] # Remove the '[0:5]' to display the entire table.
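Note that list(set(...)) does not guarantee any particular order, so source_ids[0] is not necessarily the first source in the table. If the order of appearance matters, one possible alternative is sketched below with placeholder values standing in for the product_tb['source_id'] column.

```python
# Placeholder values standing in for the product_tb['source_id'] column.
column = [30, 10, 30, 20, 10, 10]

# dict.fromkeys keeps the first occurrence of each value, in order of appearance.
source_ids_ordered = list(dict.fromkeys(column))
print(source_ids_ordered)  # [30, 10, 20]

# Alternatively, sort the unique values to get a reproducible order.
source_ids_sorted = sorted(set(column))
print(source_ids_sorted)  # [10, 20, 30]
```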
Extract data for individual sources (XP sampled and RVS)¶
dl_key = 'RVS_COMBINED.xml' # Try also with 'XP_SAMPLED_COMBINED.xml'
product = datalink[dl_key][0] # Replace "0" by "1" or "2" to show the data for the other sources.
items = [item for item in product.iter_fields_and_params()]
for item in items:
    print(item)
print()
print(f'Showing data for source_id: {product.get_field_by_id("source_id").value}')
prod_tab = product.to_table()
prod_tab[0:5]
dl_key = 'XP_SAMPLED_COMBINED.xml' # Try also with 'RVS_COMBINED.xml'
source_ids = [product.get_field_by_id("source_id").value for product in datalink[dl_key]]
tables = [product.to_table() for product in datalink[dl_key]]
fig = plt.figure(figsize=[20,7]) # Change the figsize to e.g. figsize=[30,7] to increase the size of the displayed image.
source_ids_i = iter(source_ids)
for inp_table in tables:
    plot_sampled_spec(inp_table, title=dl_key.replace('_COMBINED.xml', ''), legend = f'source ID = {next(source_ids_i)}', show_plot=False)
plt.show()
Tutorial - Programmatic download of large datasets through DataLink
Authors: Héctor Cánovas and Jos de Bruijne
The main goal of the Jupyter Notebook displayed below is to teach how to retrieve large amounts (data for more than 5000 sources) of DataLink products using the Astroquery.Gaia Python package. This code has been tested in Python >= 3.8. The Jupyter notebook is included in this .zip file that also contains complementary notebooks, supplementary files, and a "tutorials.yml" environment file that can be used to create a conda environment with all dependencies needed to execute it (as explained in the official conda documentation).
Tutorial: Download DataLink products for >5000 sources¶
Release number: v1.0 (2022-07-06)
Applicable Gaia Data Releases: Gaia EDR3, Gaia DR3
Author: Héctor Cánovas Cabrera; hector.canovas@esa.int
Summary:
This Jupyter Notebook shows how to overcome the Gaia Archive DataLink download threshold by first splitting an input source list into multiple chunks, each containing $\leq$ 5000 sources. The chunks are then downloaded sequentially and the outputs are finally merged. As explained in the DataLink: products serialisation tutorial, DataLink products can be retrieved in various data structures and formats. We suggest retrieving the DataLink products in the COMBINED data structure (as shown in all the examples below) because our tests indicate that this is the most efficient data structure for downloading large amounts of products. For simplicity, all the products in the following examples are downloaded in VOTable format, which makes it easy to export them to several other formats using the tools available within the Astropy.table module. This complementary tutorial shows how to download and inspect all the different DataLink products via Astroquery.Gaia for a small sample of sources. Finally, while executing this notebook you may receive a few warnings about the units included in the product metadata. These are known issues and we are working on them.
Useful URLs:
from astropy.table import Table, vstack
from astroquery.gaia import Gaia
import numpy as np
def chunks(lst, n):
    """
    Split an input list into multiple chunks of size <= n
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
Gaia.login()
Execute ADQL Query¶
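The query below retrieves data for 5100 sources (see the TOP 5100 clause) that have all the DataLink products requested in the WHERE clause. This is just above the 5000-source download threshold, so the input list will be split into two chunks.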
query = "SELECT TOP 5100 source_id, ra, dec, parallax from gaiadr3.gaia_source \
WHERE has_epoch_photometry = 'True' AND \
has_mcmc_gspphot = 'True' AND \
has_mcmc_msc = 'True' AND \
has_xp_sampled = 'True' AND \
has_rvs = 'True'"
job = Gaia.launch_job_async(query)
results = job.get_results()
results[0:5]
Download Datalink Products¶
Warning: The load_data method can retrieve all types of DataLink products (epoch photometry, MCMC's, and spectra) in a single call (see below). However, selecting this option when retrieving DataLink products for a large number (>1000) of sources can severely delay the dataset preparation on the server side, and even result in a download error. Therefore, we strongly recommend selecting one product at a time in this case.
Split the input list into several chunks containing <= 5000 elements each¶
dl_threshold = 5000 # DataLink server threshold. It is not possible to download products for more than 5000 sources in one single call.
ids = results['source_id']
ids_chunks = list(chunks(ids, dl_threshold))
datalink_all = []
print(f'* Input list contains {len(ids)} source_IDs')
print(f'* This list is split into {len(ids_chunks)} chunks of <= {dl_threshold} elements each')
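As a quick offline check of the chunking logic, the sketch below applies the same chunks generator to a plain list of 5100 integers (placeholders for real source IDs) and confirms that it yields one chunk of 5000 elements and one of 100, without losing or reordering any entry.

```python
def chunks(lst, n):
    """Split an input list into multiple chunks of size <= n."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

ids_mock = list(range(5100))              # placeholder IDs, not real Gaia sources
ids_chunks_mock = list(chunks(ids_mock, 5000))
sizes = [len(chunk) for chunk in ids_chunks_mock]
print(sizes)  # [5000, 100]

# Joining the chunks back together must reproduce the input list.
rejoined = [source_id for chunk in ids_chunks_mock for source_id in chunk]
print(rejoined == ids_mock)  # True
```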
retrieval_type = 'RVS' # Options are: 'EPOCH_PHOTOMETRY', 'MCMC_GSPPHOT', 'MCMC_MSC', 'XP_SAMPLED', 'XP_CONTINUOUS', 'RVS'
data_structure = 'COMBINED' # Options are: 'INDIVIDUAL', 'COMBINED', 'RAW' - but as explained above, we strongly recommend to use COMBINED for massive downloads.
data_release = 'Gaia DR3' # Options are: 'Gaia DR3' (default), 'Gaia DR2'
dl_key = f'{retrieval_type}_{data_structure}.xml'
ii = 0
for chunk in ids_chunks:
    ii = ii + 1
    print(f'Downloading Chunk #{ii}; N_files = {len(chunk)}')
    datalink = Gaia.load_data(ids=chunk, data_release = data_release, retrieval_type=retrieval_type, format = 'votable', data_structure = data_structure)
    datalink_all.append(datalink)
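Since a large request can end in a download error, it may be convenient to let the loop continue when one chunk fails and to record which chunks need to be retried. The sketch below is one possible way to do this and is not part of the original notebook: download_in_chunks and fake_fetch are hypothetical names, and fake_fetch is only a stand-in that mimics a failing call so that the example runs offline.

```python
def download_in_chunks(id_chunks, fetch):
    """Call fetch(chunk) for every chunk; return the outputs and the failed chunk numbers."""
    outputs, failed = [], []
    for index, chunk in enumerate(id_chunks, start=1):
        try:
            outputs.append(fetch(chunk))
        except Exception as error:
            # Keep going, but remember which chunk has to be downloaded again.
            print(f'Chunk #{index} failed: {error}')
            failed.append(index)
    return outputs, failed

def fake_fetch(chunk):
    """Stand-in for the real download call; fails on an empty chunk."""
    if len(chunk) == 0:
        raise ValueError('empty chunk')
    return {'RVS_COMBINED.xml': [f'product for {source_id}' for source_id in chunk]}

outputs, failed = download_in_chunks([[1, 2], [], [3]], fake_fetch)
print(f'{len(outputs)} chunks downloaded, failed chunks: {failed}')
```

In the notebook, fetch could be a small function that wraps the Gaia.load_data call of the loop above with the same arguments.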
Concatenate the DataLink outputs into one single table¶
The sampled spectra (XP and RVS) are serialised following the IVOA Spectrum Data Model and as a result a number of parameters, including the associated source_id, are stored in the table metadata. This is taken into account in the cells below.
Epoch Photometry, MCMC, or XP Continuous¶
In this case, the merged product is one single table that includes the source_id in one of the table fields. The code below includes an example showing how to write the entire table using the Astropy.table module.
Warning: the written table can have a size >1 Gb.
if 'RVS' not in dl_key and 'XP_SAMPLED' not in dl_key:
    temp = [inp[dl_key][0].to_table() for inp in datalink_all]
    merged = vstack(temp)
    file_name = f"{dl_key}_{data_release.replace(' ','_')}.vot"
    print(f'Writing table as: {file_name}')
    merged.write(file_name, format = 'votable', overwrite = True)
    display(merged)
XP sampled or RVS¶
In this case, the merged product is one Python list whose elements are all the individual products. The code below includes an example showing how to write an individual table using the Astropy.table module.
if 'RVS' in dl_key or 'XP_SAMPLED' in dl_key:
    product_list_tb = [item for sublist in datalink_all for item in sublist[dl_key]]
    product_list_ids = [item.get_field_by_id("source_id").value for sublist in datalink_all for item in sublist[dl_key]]
    ii = 12 # Try different values to display the content of the individual products.
    source_id = product_list_ids[ii]
    product_tab = product_list_tb[ii].to_table()
    file_name = f"{dl_key.replace('_COMBINED.xml', '')}_{data_release.replace(' ','_')}_{source_id}.vot"
    print(f'Writing table as: {file_name}')
    product_tab.write(file_name, format = 'votable', overwrite = True)
    print()
    print(f'Showing {retrieval_type} for source_id = {source_id}')
    display(product_tab[:5])
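The nested list comprehension used above flattens the per-chunk outputs into one list of products. The sketch below reproduces that step with placeholder strings instead of real tables, and shows an equivalent formulation with itertools.chain.from_iterable that some readers may find easier to follow. The names with the _mock suffix are hypothetical stand-ins.

```python
from itertools import chain

dl_key_mock = 'RVS_COMBINED.xml'
# Stand-in for datalink_all: one dictionary per downloaded chunk.
datalink_all_mock = [
    {dl_key_mock: ['spectrum a', 'spectrum b']},
    {dl_key_mock: ['spectrum c']},
]

# Same pattern as in the notebook: loop over the chunks, then over their products.
flat_comprehension = [item for sublist in datalink_all_mock for item in sublist[dl_key_mock]]

# Equivalent formulation using the standard library.
flat_chain = list(chain.from_iterable(sublist[dl_key_mock] for sublist in datalink_all_mock))

print(flat_comprehension)  # ['spectrum a', 'spectrum b', 'spectrum c']
print(flat_comprehension == flat_chain)  # True
```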