diff --git a/README.md b/README.md index 0e3aaba49fd0b276ef3f0ca68e72d21bb06f15c2..887f282d446ddc6c6c6d80cc64fadc2c14af5d67 100644 --- a/README.md +++ b/README.md @@ -18,12 +18,12 @@ WoSIS stands for 'World Soil Information Service', a large database based on Pos The source data come from different types of surveys ranging from systematic soil surveys (i.e., full profile descriptions) to soil fertility surveys (i.e., mainly top 20 to 30 cm). Further, depending on the nature of the original surveys the range of soil properties can vary greatly (see [https://essd.copernicus.org/articles/12/299/2020/](https://essd.copernicus.org/articles/12/299/2020/)). -The quality-assessed and standardised data are made available freely to the international community through several web services, this in compliance with the conditions (licences) specified by the various data providers. This means that we can only serve data with a so-called 'free' licence to the international community ([https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?any=wosis_latest](https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?any=wosis_latest)). A larger complement of geo-referenced data with a more restrictive licence can only be used by ISRIC itself for producing SoilGrids maps and similar products (i.e. output as a result of advanced data processing). Again, the latter map layers are made freely available to the international community ([https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?resultType=details&sortBy=relevance&any=soilgrids250m%202.0&fast=index&_content_type=json&from=1&to=20](https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?resultType=details&sortBy=relevance&any=soilgrids250m%202.0&fast=index&_content_type=json&from=1&to=20)). 
+Upon their standardisation, the quality-assessed data are made freely available to the international community through several web services, in compliance with the conditions (licences) specified by the various data providers. This means that we can only serve data with a so-called 'free' licence to the international community ([https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?any=wosis_latest](https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?any=wosis_latest)). A larger complement of geo-referenced data with a more restrictive licence can only be used by ISRIC itself for producing SoilGrids maps and similar products (i.e. output as a result of advanced data processing). The latter map layers are made freely available to the international community ([https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?resultType=details&sortBy=relevance&any=soilgrids250m%202.0&fast=index&_content_type=json&from=1&to=20](https://data.isric.org/geonetwork/srv/eng/catalog.search#/search?resultType=details&sortBy=relevance&any=soilgrids250m%202.0&fast=index&_content_type=json&from=1&to=20)).  _WoSIS workflow for ingesting, processing and disseminating data._ -During this master class, you will first learn what GraphQL and API (application programming interface) are. Next, using guided steps, we will explore the basics of WoSIS and GraphQL via a graphical interface. From that point onwards we will slowly increase complexity and use WoSIS data. Building upon this, we will show you how to create code that uses soil data from WoSIS. +During this master class, you will first learn what GraphQL and an API (application programming interface) are. Next, using guided steps, we will explore the basics of WoSIS and GraphQL via a graphical interface. From that point onwards we will slowly increase complexity and use WoSIS data. Building on this, we will show you how to create code that uses soil data from WoSIS.
The workshop requires no previous knowledge of WoSIS or GraphQL. However, it is advisable to have basic coding knowledge of the Python or R languages. @@ -33,14 +33,14 @@ The aim of this master-class is to provide clear instructions and documentation WoSIS data can be accessed via **OGC web services** and a **GraphQL API**. -Until recently, OGC web services provided the main entry point to download and access WoSIS. You can find more information on how to access WoSIS using the SOAP based OGC web services at [https://www.isric.org/explore/wosis/accessing-wosis-derived-datasets](https://www.isric.org/explore/wosis/accessing-wosis-derived-datasets). +Until recently, OGC web services provided the main entry point to download and access WoSIS. You can find more information on how to access WoSIS using the SOAP-based OGC web services at [https://www.isric.org/explore/wosis/accessing-wosis-derived-datasets](https://www.isric.org/explore/wosis/accessing-wosis-derived-datasets). In 2023, we developed a GraphQL API tool to easily access the data. The aim of this master-class is to show and describe how this tool can be used to explore and download WoSIS data. ---------- ## What is GraphQL? -*GraphQL is a query language for API's. GraphQL isn't tied to any specific database or storage engine and is instead backed by your existing code and data.* +*GraphQL is a query language for APIs. GraphQL isn't tied to any specific database or storage engine. Instead, it is backed by your existing code and data.* If you are new to GraphQL it might be good to check the official documentation: [https://graphql.org/learn/](https://graphql.org/learn/). @@ -76,7 +76,7 @@ https://graphql.isric.org/wosis/graphql This is the main GraphQL root endpoint. This is the endpoint to be used directly by applications and/or code scripts. If you are an advanced GraphQL user and you use a custom script or a GraphQL client this is what you should use.
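+As a minimal sketch of calling this root endpoint from Python (a hedged example using the third-party `requests` library; the helper names here are illustrative, not part of the API):

```python
import json
import requests

# Root GraphQL endpoint described above.
WOSIS_GRAPHQL = "https://graphql.isric.org/wosis/graphql"

def graphql_payload(query: str) -> str:
    """Serialise a GraphQL query into the JSON body the endpoint expects."""
    return json.dumps({"query": query})

def run_query(query: str) -> dict:
    """POST a query to the root endpoint; plain GET requests are rejected."""
    response = requests.post(
        WOSIS_GRAPHQL,
        data=graphql_payload(query),
        headers={"Content-Type": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

+For example, `run_query("{ __typename }")` sends the standard GraphQL type probe as a `POST` request instead of the browser `GET` that triggers the error message shown next.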
-Nonetheless, if you click on the above link using a web browser you'll probably get the following error message: +Nonetheless, if you click on the above link using a web browser you will probably get the following error message: ```json {"errors":[{"message":"Only `POST` requests are allowed."}]} @@ -98,7 +98,7 @@ For the exercises in this master-class we will use **graphiql**, but you are fre ## Explore current schema -The current WoSIS GraphQL schema is composed of __*Sites*__ that contain __*Profiles*__ that contain __*Layers*__ and for each layer several __*measurementValues*__ can be found per soil observations (e.g., PH assessed in aqueous solution). +The current WoSIS GraphQL schema is composed of __*Sites*__ that contain __*Profiles*__, which in turn have __*Layers*__; for each layer, several __*measurementValues*__ can be found per soil observation (e.g., pH assessed in aqueous solution). __For a given property, each layer can have one or more measurements (e.g., one layer with several samples.)__ 1) Site A @@ -113,20 +113,20 @@ __For a given property, each layer can have one or more measurements (e.g., one 2) measurementValues R 3) measurementValues T -For more information on WoSIS data model please check WoSIS Scientific data paper [WoSIS snapshot - December 2023](https://data.isric.org/geonetwork/srv/eng/catalog.search#/metadata/e50f84e1-aa5b-49cb-bd6b-cd581232a2ec). +For more information on the WoSIS data model please check the WoSIS scientific data paper [WoSIS snapshot - December 2023](https://data.isric.org/geonetwork/srv/eng/catalog.search#/metadata/e50f84e1-aa5b-49cb-bd6b-cd581232a2ec). Please explore the current schema using **graphiql** IDE. For this, follow this link [https://graphql.isric.org/wosis/graphiql](https://graphql.isric.org/wosis/graphiql). You will be at the root: -- __*wosisLatesObservations*__ - All current observations distributed by WoSIS with the total number of sites; profiles and respective layers.
-- __*wosisLatestLayers*__ - WoSIS layers, at this level you'll get all layers and respective measurements. -- __*wosisLatestProfiles*__ - WoSIS profiles, it contains all Profiles and respective down levels of WoSIS products (Profiles, Layers and measurements) - __*wosisLatestSites*__ - WoSIS sites, this is probably were you want to start since it contains all levels of WoSIS product (Sites, Profiles, Layers, measurements) +- __*wosisLatestObservations*__ - All current observations served from WoSIS (i.e., _wosis_latest_) with the total number of sites, profiles and respective layers. +- __*wosisLatestLayers*__ - WoSIS layers, at this level you will get all layers and respective measurements. +- __*wosisLatestProfiles*__ - WoSIS profiles, contains all Profiles and respective 'lower' levels of WoSIS products (Profiles, Layers and measurements) +- __*wosisLatestSites*__ - WoSIS sites, this is probably where you want to start since it contains all levels of the WoSIS product (Sites, Profiles, Layers and measurements) Use the **graphiql** interface to spend some time exploring the WoSIS schema. - While expanding __*wosisLatestProfiles*__ we'll get the following: + While expanding __*wosisLatestProfiles*__ we will get the following: {width=25%} @@ -134,7 +134,7 @@ You will be at the root: ## Explore the documentation -One of the advantages of GraphQL is the automatically generated documentation. In order to access the documentation in GraphQL click on the __*DOCS*__ button marked in red in image below. +One of the advantages of GraphQL is the automatically generated documentation. In order to access the documentation in GraphQL click on the __*DOCS*__ button marked in red in the image below.  @@ -201,7 +201,7 @@ query MyQuery { In practice this query will return all WoSIS Latest Observations because currently we have less than 100 observations.
-- Get the first 10 __*wosisLatestSites*__ 10 random sites +- Get the first 10 __*wosisLatestSites*__ sites ```graphql query MyQuery { @@ -239,7 +239,7 @@ query MyQuery { } ``` -- Get the __first 10 *wosisLatestProfiles* profiles__ with all available classification records (FAO, USDA, WRB). +- Get the __first 10 *wosisLatestProfiles* profiles__ with all available classification records (i.e., FAO, USDA and WRB). ```graphql query MyQuery { @@ -339,9 +339,9 @@ query MyQuery { } } ``` -Note that the deeper you go in the structure the slowest the query will be. +Note that the deeper you go in the dataset structure the slower query execution will be. -Note that if we need to retrieve `profiles` we are not forced to start with the `sites`. We can retrieve `profiles` without querying `sites`. Same for the `layers`, if we only need specific layers we can retrieve some `layers` without querying `profiles`. In the next queries we will show that. +Note that if we need to retrieve `profiles` we are not forced to start with the `sites`. We can retrieve `profiles` without querying `sites`. The same applies to `layers`: if we only need specific layers we can retrieve these `layers` without querying `profiles`. In the next queries we will show how this is done.
- Get __first 10 profiles__ and for each profile get also the __first 10 layers__: @@ -368,7 +368,7 @@ query MyQuery { } ``` -- Get __first 10 profiles__ and for each profile get also the __first 10 layers__ and also the __first 10 values of silt__: +- Get __first 10 profiles__ and for each profile get also the __first 10 layers__ and also the __first 10 values for silt__: ```graphql query MyQuery { @@ -398,7 +398,7 @@ query MyQuery { } ``` -- Get __first 10 profiles__ and for each profile get also the __first 10 layers__ and for each layer also get the __first 10 values of silt__ and the __first 10 values of organic carbon__: +- Get __first 10 profiles__ and for each profile get also the __first 10 layers__ and for each layer also get the __first 10 values for silt__ and the __first 10 values for organic carbon__: ```graphql @@ -433,15 +433,15 @@ query MyQuery { } ``` -Probably at this point you have some empty results in the `orgcValues` field due to the fact that some layers do not have any organic carbon measurement. +At this point you will probably see some empty results in the `orgcValues` field. This is because for some layers there are no organic carbon measurements in the source datasets. As exemplified, we can request all types of values (Silt; Sand; Organic carbon; pH etc.) but the more data we request the slower the query will be. -Exploratory queries without any filtering can be important as a first contact with the data, but at some point it is recommended to apply filters. +Exploratory queries without any filtering can be useful to get acquainted with the data, but at some point it is recommended to apply filters. ## Filtering -Perhaps the main advantage of this GraphQL API is the ability to filter data. In the majority of cases, a user may want to extract specific data; for this we will make use of Filtering capabilities. +Perhaps the main advantage of this GraphQL API is the ability to easily filter and explore data.
In the majority of cases, however, a user may want to extract specific data. For this, we will make use of Filtering capabilities. Before we start performing queries please spend some time exploring the filter object inside `wosisLatestProfiles` as shown in the image below: @@ -515,7 +515,7 @@ query MyQuery { } ``` Please note that some operators (`AND`, `OR` etc.) expect an array as input (`[]`). -- Get __first 5 profiles__ with the respective __first 10 layers__ from country __Netherlands__ `AND` __with at least one layer__. In other words, we do not want profiles without layers in this query. +- Get __first 5 profiles__ with the respective __first 10 layers__ from country __Netherlands__ `AND` __with at least one layer__. In other words, we do not want any profiles without layers in this query. ```graphql @@ -641,7 +641,7 @@ query MyQuery { } ``` ## Using variables -In GraphQL we are able to use variables in our queries. Variables are important for: +In GraphQL we can also use variables in our queries. Variables are important for: - Scripting, in order to be able to interact with our script variables - Ingest complex JSON objects into our query @@ -681,7 +681,7 @@ query MyQuery($first:Int, $continent:String) { } ``` -In your **graphiql** you should have something as bellow image: +In your **graphiql** you should have something as shown below:  @@ -716,15 +716,15 @@ query MyQuery($first:Int, $continent:[String!]) { } ``` -In the next chapter we'll make use of variables to better provide JSON components to our queries. +In the next chapter we will make use of variables to better provide JSON components to our queries. ## Spatial queries This API has spatial capabilities. It is possible to perform several __spatial queries__ and apply __spatial filters__. Spatial components are GeoJSON-based. -In order to use spatial queries, we'll use 2 geometries of Gelderland in GeoJSON format.
+In order to use spatial queries, we will use two geometries of Gelderland, a province in the Netherlands, in GeoJSON format as examples. -You can use https://geojson.io to visualize, create and update GeoJSON geometries. +You can use https://geojson.io to visualise, create and update GeoJSON geometries. 1) Simplified geometry of Gelderland region in Geojson format: @@ -769,7 +769,7 @@ You can use https://geojson.io to visualize, create and update GeoJSON geometrie  -In order to simplify and make a more easy-to-read query we'll make use of `variables` in our spatial queries. +To simplify the query and make it easier to read, we will make use of `variables` in our spatial queries. - Get __first 3 profiles__ that fall inside Gelderland using the MultiPolygon geometry. In this query we also make sure all __profiles have at least one layer__. @@ -803,12 +803,12 @@ Inside `Query variables` add the `geomGelderland` variable: } ``` -Example on what you should see in **graphiql**: +Example of what you should see in **graphiql**:  -The __GEOM object__ corresponds to the geometry. Please spend some time exploring this object in the **graphiql** interface. Make sure you explore the `Filter` capabilities. +The __GEOM object__ corresponds to the geometry. Please spend some time exploring this object in the **graphiql** interface. Make sure you explore the `Filter` capabilities too. @@ -854,13 -854,13 +854,13 @@ query MyQuery($geomGelderland: GeoJSON!) { ## Pagination concepts -Depending on the way you create your query it can evolve high computational resources. Besides, if not using pagination you could easily create a query that returned a huge number of records, with all the problems that brings. +Depending on how you create your query, it can demand considerable computational resources. Moreover, without pagination you could easily create a query that returns a huge number of records, with all the problems that brings.
-To solve this problem __we enforce pagination in this GraphQL API__. +To solve this issue __we enforce pagination in this GraphQL API__. For the moment, in order to make things easier, we propose a simpler list interface for the connections based on __Offset-based Pagination__. This means we *temporary disabled* [Relay Cursor Connections](https://relay.dev/graphql/connections.htm). -__If you are an advanced user and would like to have `Relay Cursor Connections` please contact us.__ +__If you are an advanced user and would like to have access to `Relay Cursor Connections` please contact us.__ __The `First:` argument__ @@ -872,7 +872,7 @@ __The `Offset:` argument__ The arguments `first` and `offset` are extremely important when you need to extract and download data. -We'll make use of pagination on our scripts. We'll show how to use pagination and extract a considerable amount of data from WoSIS using this GraphQL API. +We will make use of pagination in our scripts to show how to extract a considerable amount of data from WoSIS using this GraphQL API. ## Scripting ### Python examples @@ -921,7 +921,7 @@ The result will be: Using variables in our script: -- Get the __first 3 profiles__ that are __inside Gelderland region__ and add it to a Pandas dataframe: +- Get the __first 3 profiles__ that are __inside the Gelderland region__ and add them to a Pandas dataframe: ```python import requests @@ -1046,9 +1046,9 @@ The CSV result file can be found [here](./scripts/python/wosis_gelderland.csv) ### R examples -The simplest way to perform a GraphQL request in r is to use {httr}. +The simplest way to perform a GraphQL request in R is to use {httr}.
-- Get the __first 5 profiles__ and add it to a Pandas dataframe: +- Get the __first 5 profiles__ and add them to a data frame: ```r library(httr) @@ -1095,7 +1095,7 @@ The result will be: Using variables in our script: -- Get the __first 3 profiles__ that are __inside Gelderland region__ and add it to a Pandas dataframe: +- Get the __first 3 profiles__ that are __inside the Gelderland region__ and add them to a data frame: ```r library(httr) @@ -1154,7 +1154,7 @@ The result will be:  -- Get __all WoSIS profiles with layers that exist in Gelderland__ and also __export it to CSV__. +- Get __all WoSIS profiles with layers that exist in Gelderland__ and also __export these to CSV__. ```r @@ -1235,9 +1235,9 @@ CSV result file can be found [here](./scripts/r/wosis_gelderland.csv) ## Soil data validation and ingest into WoSIS -The process of ingesting data into Wosis involves a so-called Extract, Transform and Load (ETL) which is a standardised, semi-automatic process that guides the data processor during the ingestion of new datasets. +The process of ingesting data into WoSIS involves a so-called Extract, Transform and Load (ETL) workflow: a standardised, semi-automatic process that guides the data processor during the ingestion of new datasets. -This process is assisted by this API and the fist part is the mapping the different attributes from the original source data into WoSIS elements such as Observation measurements; site; profile and layer data. +This process is assisted by this API; the first part is mapping the different attributes from the original source data onto WoSIS elements such as observation measurements and site, profile and layer data. Endpoint __*etlMappingFeatures*__ contains available features that can be used for this process. @@ -1267,4 +1267,4 @@ query MyQuery { } ``` -Note that in the above example API only returns 4 results because we dont have more.
\ No newline at end of file +Note that in the above example the API only returns four results because we don't have more in the dataset.
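+As a closing sketch tying the pagination and scripting sections together, the `first` and `offset` arguments can drive a simple download loop in Python. This is a hedged example: the `wosisLatestProfiles` field comes from the schema described above, but the selected fields (`profileId`, `country`) are illustrative assumptions and should be checked against the schema in **graphiql**.

```python
import requests

WOSIS_GRAPHQL = "https://graphql.isric.org/wosis/graphql"

def build_page_query(first: int, offset: int) -> dict:
    """Build a paginated profiles request using the `first` and `offset` arguments."""
    query = """
    query Profiles($first: Int, $offset: Int) {
      wosisLatestProfiles(first: $first, offset: $offset) {
        profileId   # illustrative field name, check the schema in graphiql
        country     # illustrative field name
      }
    }
    """
    return {"query": query, "variables": {"first": first, "offset": offset}}

def fetch_profiles(page_size: int = 100, max_pages: int = 10) -> list:
    """Page through profiles until an empty page (or max_pages) is reached."""
    profiles = []
    for page in range(max_pages):
        payload = build_page_query(page_size, page * page_size)
        response = requests.post(WOSIS_GRAPHQL, json=payload, timeout=60)
        response.raise_for_status()
        batch = response.json()["data"]["wosisLatestProfiles"]
        if not batch:
            break
        profiles.extend(batch)
    return profiles
```

+Capping `max_pages` keeps an exploratory run from accidentally downloading the entire dataset; raise it deliberately when you really do want a full extract.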