Metadata Request - CSA Guide

METADATA REQUESTS

While the data requests (direct, asynchronous and streaming) have been developed to mimic the behaviour of the old CAIO system, to give continuity and minimal changes needed to scripts, the metadata requests have changed, but allow much more powerful searches by relying on ADQL (similar to SQL, a database access language) and Table Access Protocol (TAP).

Perhaps the most important aid to being able to make effective searches of the CSA database, is to understand the structure of the database, plus the syntax to query it.

Briefly, the database is arranged in tables of information with columns. The tables refer to a level of the archive, such as missions (csa.mission), experiments (csa.experiment), datasets (csa.dataset), and parameters (csa.parameter), although there are a few more. The most useful columns have been collected into 'views', which begin with 'v_': csa.v_mission, csa.v_experiment, csa.v_dataset, csa.v_parameter - these views collect information from the original tables, plus some additional columns from other tables, minus internal information of no use to the user (e.g., internal_xml_file_path).

The syntax to query the database is ADQL, based on SQL.

Simple, step-by-step instructions for formulating a metadata request are given on the How-to page, but further details are given on this page, including extensive examples. Inventory information is also available via metadata requests - see Inventory.

tl;dr

Expand the query in

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+*+FROM+csa.v_dataset

TAP - Table Access Protocol: Documentation Links

Table Access Protocol (TAP: http://www.ivoa.net/documents/TAP/) specified by the International Virtual Observatory Alliance (IVOA: http://www.ivoa.net) defines a service protocol for accessing general table data.

TAP+ is the ESAC Space Data Centre (ESDC: http://www.cosmos.esa.int/web/esdc/) extension of the Table Access Protocol. TAP+ adds more capabilities like authenticated access and persistent user storage area.

The TAP query language is Astronomical Data Query Language (ADQL: http://www.ivoa.net/documents/ADQL/2.0), which is similar to Structured Query Language (SQL), widely used to query databases.

The default returned data format is VOTable. Further details may be found in the VOTable format documentation and the astropy VOTable page.

ADQL

The tables are queried using Astronomical Data Query Language. The syntax is summarised in Section 2.2: Query Syntax:

SELECT
    [ ALL | DISTINCT ]
    [ TOP unsigned_decimal ]
    {
         * |
         { value_expression [ [AS] column_name ] }, ...
     }
    FROM {
            {
            table_name [ [AS] identifier ] |
            ( SELECT ....) [ [AS] identifier ] |
            table_name [NATURAL]
                [ INNER | { LEFT | RIGHT | FULL [OUTER] } ]
                JOIN table_name
                [ON search_condition | USING ( column_name,...) ]
            },
        ...
        }
    [ WHERE search_condition ]
    [ GROUP BY group_by_term, ... ]
    [ HAVING search_condition ]
    [ ORDER BY
        { order_by_expression } [ ASC | DESC],
        ...
        ]
    [ OFFSET unsigned_decimal ]

SELECT and FROM are mandatory parts of the query; SELECT <this> FROM <tableX>. The other syntax is for refining the search.

This page on SQL pattern matching may also be very useful: https://www.postgresql.org/docs/9.3/functions-matching.html

Tables and views

Metadata requests are directed at tables of information. To see all of the tables and all their columns in XML format, you can call the tap command
https://csa.esac.esa.int/csa-sl-tap/tap/tables

For users' ease of use, the most useful information has been collected together in custom tables called 'views' . These provide commonly requested information along with some contents of other tables and without database information which is of no use to the user. For the CSA, there are 11 different views containing different levels of information relating to e.g., experiments, datasets or parameters (much like the resource_class used in the old CAIO queries). A list of the view tables and their columns, including examples and their equivalent fields in the header metadata and in the old CAIO system are listed on the TAP Tables and Views page.

EXAMPLES

The easiest way to demonstrate the possibilities and details of this system is using examples. The ADQL syntax itself is simple, but requires some encoding to be accepted directly from the command line (using wget) or through the browser.

These examples may be requested directly using the hyperlink provided, but are also given as plain text which is more easily adaptable for your own use, and may be copied and pasted into a browser address bar to make the request.

The list of views and their content is given on the TAP Tables and Views page.

Quickly find an example:

Select all data from a view
Select one column from a view
Select more than one column from a view
Order (sort) the results
Use search criteria including examples of different search syntax
- URL encoding for browser and command line
- Database syntax - '%' and '_' wildcards
- Using 'LIKE'
- Requests involving time - counter-intuitive order
Inventory requests
Note on case sensitivity

This Jupyter notebook CSA-TAP_Metadata.ipynb (right click and 'Save link as') makes querying the tables and views, table contents, and queries clearer and easier to visualise.

Select all data from a view

The first example below queries the table view called csa.v_dataset which contains all dataset information for active (available) datasets. We shall start by asking for the whole table. This request is fairly large, since there are 2037 rows and 44 columns, but this is not too large to download.

The ADQL query to access metadata for all column in the csa.v_dataset view is

SELECT * FROM csa.v_dataset

In order to encode this for direct use, we need to replace the spaces with a plus ('+') and add the URL and some required information to the front. For this demonstration, we will request the results in CSV format using FORMAT=CSV, so we can view them easily in a simple text editor:

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+*+FROM+csa.v_dataset

This will work as a hyperlink:

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+*+FROM+csa.v_dataset

This whole table is only 2.7 MB and contains all information in the v_dataset view, but it's too much to be useful...

Select one column from a view

We can refine it by asking for only the dataset IDs - the column name is 'dataset_id':

SELECT dataset_id FROM csa.v_dataset

with the URL and necessary encoding, the syntax becomes:

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset

as a hyperlink:

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset

Select more than one column from a view

To request more than one column, we simply add a comma between the column we want:

SELECT dataset_id,gui_name FROM csa.v_dataset

with the URL and necessary encoding, the syntax becomes:

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id,gui_name+FROM+csa.v_dataset

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id,gui_name+FROM+csa.v_dataset

Order (sort) the results - ORDER BY

This request will download a CSV file containing a list of all 2037 active dataset IDs. They are not in the order you might expect, but can be ordered by the dataset_id using ORDER BY 1:

SELECT dataset_id FROM csa.v_dataset ORDER BY 1

Note that the space in the middle of the syntax ORDER BY also needs a '+'

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset+ORDER+BY+1

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset+ORDER+BY+1

or any other column in the table, e.g., experiments:

SELECT dataset_id FROM csa.v_dataset ORDER BY experiments

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset+ORDER+BY+experiments

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset+ORDER+BY+experiments

Refine request with search criteria - WHERE

We can request all dataset IDs belonging to a particular experiment, in this case, FGM, where the contents of the field are a string, and so need to be in single quotes:

SELECT dataset_id FROM csa.v_dataset WHERE experiments='FGM'

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset+WHERE+experiments='FGM'

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id+FROM+csa.v_dataset+WHERE+experiments='FGM'

Other search criteria syntax:

This page on SQL pattern matching may also be very useful: https://www.postgresql.org/docs/9.3/functions-matching.html

For multiple search criteria, use AND:

SELECT * FROM csa.v_parameter WHERE entity='Electron' AND field_name='PITCH_SPIN'

Find a list of all possible values for a column (use on small groups!), using DISTINCT:

SELECT DISTINCT entity FROM csa.v_parameter

fields are empty (NULL), so to search for rows where a field is empty (for example):

SELECT * FROM csa.v_dataset WHERE dataset_reference IS NULL

To find rows where the field is not empty:

SELECT * FROM csa.v_dataset WHERE dataset_reference IS NOT NULL

Compare two columns, use = (equal to) or != (not equal to):

SELECT * FROM csa.v_dataset WHERE title!=gui_name

To find text within a field, use like. For example, find all dataset IDs (dataset_id) where the sole measurement type (measurement_types) is 'Electric_Field':

SELECT dataset_id FROM csa.v_dataset WHERE measurement_types LIKE 'Electric_Field'

Similarly, to do a 'negative' search to find records where that text is not used, simply use NOT LIKE:

SELECT dataset_id FROM csa.v_dataset WHERE measurement_types NOT LIKE 'Electric_Field'

However, (for example) 'Electric_Field' may not be the only measurement type (hence the column name measurement_types), so ask for rows where there may be text either side using the wildcard for multiple characters '%' (percentage sign, see Database Syntax, below), and return dataset_id and measurement_types:

SELECT dataset_id,measurement_types FROM csa.v_dataset WHERE measurement_types like '%Electric_Field%'

This can also be used to request metadata for more than one spacecraft:

SELECT file_name,file_start_date,file_end_date,caa_ingestion_date FROM csa.v_file WHERE dataset_id like '%_CP_FGM_5VPS'

URL Encoding

Important complications with encoding: the syntax on the line above is the ADQL syntax for the query, but when encoding the line above for a browser or command line, as well as replacing the spaces with '+', the % must also be replaced with it's URL encoded equivalent '%25', so that the whole command becomes:

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id,measurement_types+FROM+csa.v_dataset+WHERE+measurement_types+like+'%25Electric_Field%25'

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id,measurement_types+FROM+csa.v_dataset+WHERE+measurement_types+like+'%25Electric_Field%25'

Requests involving time

Metadata requests involving time should be given in quotes, leading with the year:

SELECT file_name FROM v_file WHERE csa_ingestion_date>'2021-08-23T00:00:00'

Intervals need to be given in 'reverse order', that is, counter to what one might automatically think, in order to capture all qualifying records. So the given start time is less than/equal to (<=) the desired end time and vice versa, e.g.:

start_time<='2002-05-01T00:00:00' AND end_time>='2002-04-01T00:00:00'

as illustrated here:

images/socciwiki/download/attachments/77248999/image2021-8-24_11-44-30.png

Database Syntax - '%' and '_' wildcards:

Within the database, % is a wildcard symbol meaning 'any characters of length 1 or more', like an * (asterisk) in a filename. Another wildcard is _ (underscore) which means 'any single character', like a ? (question mark) in a filename. This means that if we put an underscore in an ADQL query, it will mean 'any single character', so this query will not cause an error:

SELECT dataset_id FROM csa.v_quick_look_dataset WHERE dataset_id like '%_DWP_%'

but it will return results that we may not want - in this case 'D1_CG_STA-DWP_PSDP_PNG' is one of the results.

If we want it to be taken literally, like as text in a string, we need to 'escape' it, using a backslash:

SELECT dataset_id FROM csa.v_quick_look_dataset WHERE dataset_id like '%\_DWP\_%'

Inventory information

To find inventory information, i.e., gaps in data, missing files or version numbers of stretches of data, query v_dataset_inventory. The CAIO had a built-in command for this with specified fields, but now it's a normal view, and can be queried for whichever columns you wish. However, to get the same results as before, the query syntax is:

SELECT dataset_id,start_time,end_time,num_instances,inventory_version FROM csa.v_dataset_inventory WHERE dataset_id='C1_CP_FGM_SPIN' AND start_time<='2002-05-01T00:00:00' AND end_time>='2002-04-01T00:00:00' ORDER BY start_time

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id,start_time,end_time,num_instances,inventory_version+FROM+csa.v_dataset_inventory+WHERE+dataset_id='C1_CP_FGM_SPIN'+AND+start_time<='2002-05-01T00:00:00'+AND+end_time>='2002-04-01T00:00:00'+ORDER+BY+start_time

https://csa.esac.esa.int/csa-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+dataset_id,start_time,end_time,num_instances,inventory_version+FROM+csa.v_dataset_inventory+WHERE+dataset_id='C1_CP_FGM_SPIN'+AND+start_time<='2002-05-01T00:00:00'+AND+end_time>='2002-04-01T00:00:00'+ORDER+BY+start_time

Remember that the times need to be given this way round to include all qualifying results (see Requests involving time above).

Note on case sensitivity

Within the query, the keywords (e.g., SELECT, FROM, WHERE) are case insensitive, but it can be useful to use these in upper case to identify them easily. The names of the views are case insensitive, as are the column names, but the contents of the tables are case sensitive and must be given exactly. For example, there is no dataset_id called 'c1_CP_FGM_SPIN', or measurement_types 'electric field' - the search must be for 'Electric_Field').

Jupyter Notebook

This Jupyter notebook CSA-TAP_Metadata.ipynb (right click and 'Save link as') makes querying the tables and views, table contents, and queries clearer and easier to visualise.