Metadata Request - Solar Orbiter Archive
METADATA REQUESTS
Metadata requests allow powerful searches by accessing the tables of the database using ADQL (similar to SQL, a database access language) and Table Access Protocol (TAP).
Perhaps the most important aid to being able to make effective searches of the SOAR database, is to understand the structure of the database, plus the syntax to query it.
Briefly, the database is arranged in tables of information with columns. The tables refer to different data, like FITS (soar.fits_dataset) and CDF (soar.cdf_dataset) descriptors, and the most useful columns of these tables have been gathered into 'views' (which begin with 'v_') for Low Latency (soar.v_ll_data_item and soar.v_ll_repository_file) or Science (soar.v_sc_data_item and soar.v_sc_repository_file), spice kernels (soar.v_auxiliary_file) or telemetry (soar.v_telemetry_item - with restricted access).
The syntax to query the database is ADQL, based on SQL.
Simple, step-by-step instructions for formulating a metadata request are given on the How-to page, but further details are given on this page, including extensive examples. Some links to TAP, ADQL, VOTable documentation are provided first, followed by a simplified version of the ADQL "SELECT" syntax.
Documentation
Table Access Protocol (TAP: https://www.ivoa.net/documents/TAP/) specified by the International Virtual Observatory Alliance (IVOA: https://www.ivoa.net) defines a service protocol for accessing general table data.
TAP+ is the ESAC Space Data Centre (ESDC: https://www.cosmos.esa.int/web/esdc/) extension of the Table Access Protocol. TAP+ adds more capabilities like authenticated access and persistent user storage area.
The TAP query language is Astronomical Data Query Language (ADQL: https://www.ivoa.net/documents/ADQL/2.0), which is similar to Structured Query Language (SQL), widely used to query databases.
The default returned data format is VOTable. Further details may be found in the VOTable format documentation and the astropy VOTable page.
ADQL
The tables are queried using ADQL, see reference document here. The full syntax is summarised in the Section 2.2: Query Syntax of that document, but there are extensive examples provided below:
SELECT
[ ALL | DISTINCT ]
[ TOP unsigned_decimal ]
{
* |
{ value_expression [ [AS] column_name ] }, ...
}
FROM {
{
table_name [ [AS] identifier ] |
( SELECT ....) [ [AS] identifier ] |
table_name [NATURAL]
[ INNER | { LEFT | RIGHT | FULL [OUTER] } ]
JOIN table_name
[ON search_condition | USING ( column_name,...) ]
},
...
}
[ WHERE search_condition ]
[ GROUP BY group_by_term, ... ]
[ HAVING search_condition ]
[ ORDER BY
{ order_by_expression } [ ASC | DESC],
...
]
[ OFFSET unsigned_decimal ]
SELECT and FROM are mandatory parts of the query; SELECT <this> FROM <tableX>. The other syntax is for refining the search.
This page on SQL pattern matching may also be very useful: https://www.postgresql.org/docs/9.3/functions-matching.html
Tables and views
Metadata requests are directed at tables of information. To see all of the available tables and all their columns in XML format, you can call the tap command
https://soar.esac.esa.int/soar-sl-tap/tap/tables
For users' ease of use, the most useful information has been collected together in custom tables called 'views' . These provide commonly requested information along with some contents of other tables and without database information which is of no use to the user. For the SOAR, there are a few different views containing different levels of information relating to e.g., instruments, datasets or parameters.
The format of that page is not very user friendly, but can be used to find if a column exists and to which table. A more user-friendly interface is offered by TOPCAT
EXAMPLES OF ADQL
The easiest way to demonstrate the possibilities and details of this system is using examples. The ADQL syntax itself is simple, but requires some encoding to be accepted directly from the command line (using wget) or through the browser.
These examples may be requested directly using the hyperlink provided, but are also given as plain text which is more easily adaptable for your own use, and may be copied and pasted into a browser address bar to make the request.
The contents of the main Science files view (v_sc_data_item) is given on the How-To page page.
Quickly find an example:
-
Use search criteria including examples of different search syntax
-
URL encoding for browser and command line
-
Database syntax - '%' and '_' wildcards
-
Using 'LIKE'
-
Requests involving time - counter-intuitive order
-
-
Note on case sensitivity
Select all data from a TABLE
The first example below queries the table called fits_dataset which contains information on all of the different types (descriptors) of FITS files, both Low Latency and Science. As a demonstration, this request asks for the whole table, but this table is fairly small, with only 56 rows and 8 columns.
The ADQL query to access metadata for all column in the fits_dataset view is
SELECT * FROM fits_dataset
In order to encode this for direct use, we need to replace the spaces with a plus ('+') and add the URL and some required information to the front - note that this is a synchronous request (sync). For this demonstration, we will request the results in CSV format using FORMAT=CSV, so we can view them easily in a simple text editor:
https://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+*+FROM+fits_dataset
This will work as a hyperlink (maybe not in Chrome) or copied and pasted into the address bar of a browser:
The table will download in CSV format, with a name containing the job number, like 1666969861633OPE-result.csv
Select one column from a view
Probably the most useful table in the database is the view called v_sc_data_item and contains all the information about the latest versions of Science files, and as such, it's a big table. The most useful column in this table is 'data_item_id' and is the piece of information needed to download the files.
SELECT data_item_id FROM v_sc_data_item
with the URL and necessary encoding, the syntax becomes:
https://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+data_item_id+FROM+v_sc_data_item
as a hyperlink:
This file will be about 90 MB.
Select more than one column from a view
To request more than one column, for example, the data_item_id, the filename and the filesize, we simply add a comma between the two columns we want:
SELECT data_item_id,filename,filesize FROM v_sc_data_item
with the URL and necessary encoding, the syntax becomes:
https://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+data_item_id,filename,filesize+FROM+v_sc_data_item
This will download a 210 MB CSV file. Note that very large metadata requests may time out - in that case, use an Asynchronous Request
Refine request with search criteria - WHERE
Since these requests are returning large files, we can reduce the request by imposing conditions using WHERE
. So, for example, we can request all data_item_id's belonging to a particular instrument, in this case, MAG, where the contents of the field are a string, and so need to be in single quotes:
SELECT data_item_id FROM v_sc_data_item WHERE instrument='MAG'
https://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+data_item_id+FROM+v_sc_data_item+WHERE+instrument='MAG'
This file is much smaller.
Order (sort) the results - ORDER BY
The results aren't automatically sorted - we need to tell the request how to order it using ORDER BY
. To sort it by the first column requested, just add ORDER BY 1
SELECT data_item_id,filename FROM v_sc_data_item ORDER BY 1
Note that the space in the middle of the syntax ORDER BY also needs a '+'
https://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+data_item_id,filename+FROM+v_sc_data_item+ORDER+BY+1
or any other column in that table, e.g., filesize:
SELECT data_item_id,filename FROM v_sc_data_item ORDER BY filesize
https://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+data_item_id,filename+FROM+v_sc_data_item+ORDER+BY+filesize
and ascending (ASC) or descending (DESC)
SELECT data_item_id,filename FROM v_sc_data_item ORDER BY filesize DESC
Other search criteria syntax:
This page on SQL pattern matching may also be very useful: https://www.postgresql.org/docs/9.3/functions-matching.html
For multiple search criteria, use AND:
SELECT * FROM v_sc_data_item WHERE instrument='MAG' AND level='L2'
Or an OR in brackets:
SELECT * FROM v_sc_data_item WHERE (instrument name = 'EPD' OR instrument name = 'MAG')
Find a list of all possible values for a column (use on small groups!), using DISTINCT:
SELECT DISTINCT instrument FROM v_sc_data_item
We can also just do a COUNT:
SELECT COUNT(data_item_id) FROM v_sc_data_item
Or just get the top 20 results:
SELECT TOP 20 data_item_id FROM v_sc_data_item
Find fields that are empty (NULL), so to search for rows where a field is empty (for example):
SELECT * FROM v_sc_data_item WHERE soop_type IS NULL
To find rows where the field is not empty:
SELECT * FROM v_sc_data_item WHERE soop_type IS NOT NULL
Compare two columns, use = (equal to) or != (not equal to):
SELECT * FROM v_sc_data_item WHERE begin_time!=end_time
To find text within a field, use LIKE. For example, find all dataset IDs (dataset_id) where the sole measurement type (measurement_types) is 'Electric_Field':
SELECT * FROM v_sc_data_item WHERE descriptor LIKE 'mag-rtn-burst'
Similarly, to do a 'negative' search to find records where that text is not used, simply use NOT LIKE:
SELECT * FROM v_sc_data_item WHERE descriptor NOT LIKE 'mag-rtn-burst'
However, (for example) 'mag-rtn-burst' is not be the only descriptor, so ask for rows where there may be text either side using the ADQL wildcard for multiple characters in a string, the percentage sign: '%' (see Database Syntax, below; the usual equivalent is *), and return data_item_id and file_format:
SELECT data_item_id,file_format FROM v_sc_data_item WHERE descriptor LIKE '%_mag-rtn%'
Note that the ADQL wildcard for a single character is the underscore '_' which in most cases won't matter since there are single underscores in the descriptors/filenames. However, it might mean that if the underscore is needed as a literal character, it will need to be escaped with a backslash:
SELECT data_item_id,file_format FROM v_sc_data_item WHERE descriptor LIKE 'solo\_L2\_mag-rtn%'
URL Encoding
Important complications with encoding: the syntax on the line above is the ADQL syntax for the query, but when encoding the queries for a browser or command line, as well as replacing the spaces with '+', the % must also be replaced with its URL encoded equivalent '%25', so that the whole command becomes:
https://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY=SELECT+data_item_id,file_format+FROM+v_sc_data_item+WHERE+descriptor+LIKE+'%25_mag-rtn%25'
There is a way to avoid having to URL encode the ADQL query if CURL is used on the command line:
curl -o output_results.csv "https://soar.esac.esa.int/soar-sl-tap/tap/sync" --data-urlencode "REQUEST=doQuery" --data-urlencode "LANG=ADQL" --data-urlencode "FORMAT=CSV" --data-urlencode "QUERY=SELECT filename, filesize FROM v_sc_data_item WHERE instrument='EUI' AND level='L2' AND begin_time>'2023-01-01'"
The metadata query results of this request will be saved in the specified file: output_results.csv
Python provides a function for encoding or un-encoding the ADQL requests for the URL:
# URL encoding
import urllib.parse
src_str = 'https://soar.esac.esa.int/soar-sl-tap/tap'
ADQL = ("SELECT filename FROM v_sc_data_item "
"WHERE instrument='SPICE' "
"AND (insertion_time>='2023-12-19 00:00:00+00:00') "
"AND (insertion_time<='2024-01-14 00:00:00+00:00')")
# This request needs the full http and URL encoding
extra_url_elements = "/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=CSV&QUERY="
browser_url = src_str + extra_url_elements + urllib.parse.quote_plus(ADQL)
print(browser_url)
# This can also work the other way around:
url = {'https://soar.esac.esa.int/soar-sl-tap/tap//sync?REQUEST=doQuery&LANG=ADQL&FORMAT=json&'
'QUERY=SELECT+filename+FROM+v_sc_data_item+WHERE+instrument%3D%27SPICE%27+'
'AND+%28insertion_time%3E%3D%272023-12-19+00%3A00%3A00%2B00%3A00%27%29+'
'AND+%28insertion_time%3C%3D%272023-12-20+00%3A00%3A00%2B00%3A00%27%29')
decoded_url = unquote(url)
print(decoded_url)
Requests involving time
Metadata requests involving time should be given in quotes, leading with the year:
SELECT file_name FROM v_sc_data_item WHERE insertion_time>'2023-08-23T00:00:00'
Intervals need to be given in 'reverse order', that is, counter to what one might automatically think, in order to capture all qualifying records. So the given start time is less than/equal to (<=) the desired end time and vice versa, e.g.:
start_time<='2023-05-01T00:00:00' AND end_time>='2023-04-01T00:00:00'
as illustrated here:
Database Syntax - '%' and '_' wildcards:
Within the database, % is a wildcard symbol meaning 'any characters of length 1 or more', like an * (asterisk) in a filename. Another wildcard is _ (underscore) which means 'any single character', like a ? (question mark) in a filename. This means that if we put an underscore in an ADQL query, it will mean 'any single character', so this query will not cause an error:
SELECT DISTINCT descriptor FROM v_sc_data_item WHERE data_item_id like '%_mag-rtn_%'
but it will return results that we may not want - in this case 'solo_L2_mag-rtn-burst' is one of the results.
If we want it to be taken literally, as text in a string, we need to 'escape' it, using a backslash:
SELECT DISTINCT descriptor FROM v_sc_data_item WHERE data_item_id like '%\_mag-rtn\_%'
Note on case sensitivity
Within the query, the keywords (e.g., SELECT, FROM, WHERE) are case insensitive, but it can be useful to use these in upper case to identify them easily. The names of the views are case insensitive, as are the column names, but the contents of the tables are case sensitive and must be given exactly. For example, there is no data_item_id called 'SOLO_L2_mag-rtn-normal-1-minute_20220112', or level 'l2' - the search must be for 'solo_L2_mag-rtn-normal-1-minute_20220112' or 'L2', resp.).
SYNCHRONOUS QUERY
For step-by-step instructions, see the How-To page.
A synchronous query (as opposed to an asynchronous query) provides the results as a VOTable, but can also return a CSV or JSON file.
Some examples:
- Low Latency products of instrument 'EPD' in JSON format:
- First two science products of instrument 'EPD' in JSON format:
- Low latency products in VOTABLE that fulfill the criteria:
(instrument name = 'EPD' OR instrument name = 'MAG') AND begin time > '2020-08-29T00:00:28'
Equivalent URL: - Science products in VOTABLE that fulfill the criteria:
(instrument name = 'EPD' OR instrument name = 'MAG') AND begin time > '2020-08-29+00:00:28'
Equivalent URL: - First 5 spice kernel products:
ASYNCHRONOUS QUERY
In some of the ADQL examples, some of the tables returned can be very large and the request may timeout, in which case, an asynchronous request would need to be used. An asynchronous metadata query does not require an account and authentication.
To specify it, it is only necessary to change 'sync?' to 'async?PHASE=run'
Some examples:
- Low latency products of instrument 'EUI' in VOTABLE format:
- Second science product of instrument 'EUI' in VOTABLE format:
- First 5 spice kernel products:
- Same as previous request, but this time launched via curl:
curl -i -X POST --data "PHASE=run&REQUEST=doQuery&LANG=ADQL&FORMAT=json&QUERY=SELECT+TOP+5+*+FROM+v_auxiliary_file" "https://soar.esac.esa.int/soar-sl-tap/tap/async?"