Interact with data repositories
Introduction
In this section a crawler tool is introduced which lets you interact with the metadata in a file-based data repository. For this exercise we have prepared a minimal data repository containing a number of Excel, Shape and TIFF files. Unzip the repository to a location on disk.
The root folder of the repository already contains a minimal MCF file, index.yml. This file holds some generic metadata properties which are used whenever a file within the repository does not provide them itself. The tool we use is able to inherit metadata properties from this index.yml file through the file hierarchy.
Open index.yml and customise the contact details. Later you will notice that these details are applied to all datasets which do not provide contact details themselves.
Consider adding additional index.yml files in other folders to override the values of the top-level index.yml.
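To give an idea of what such a contact block could look like, here is a minimal sketch loosely following the pygeometa MCF conventions used by the crawler; the organisation and person shown are placeholders to replace with your own details:

```yaml
# index.yml (sketch) - generic properties inherited by datasets in this folder and below
contact:
  pointOfContact:
    organization: Example Soil Institute   # placeholder organisation
    individualname: Jane Doe               # placeholder contact person
    email: jane.doe@example.org            # placeholder email address
    country: Netherlands
```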
Setup environment
The tools we will use are based on Python. For this workshop we offer two approaches:
- If you are familiar with Python or are interested in learning more about it, you can set up a Conda or Python virtual environment and run the scripts in that environment.
- Otherwise, we recommend running the Python tools in Docker containers.
Each of the exercises indicates an option to run the Python script directly, or from a container.
Conda
The tools have some specific dependencies which are best installed via Conda. Conda creates a virtual environment for each project, so any activity will not interfere with the base environment of your machine. Conda manages both the Python libraries and any other dependencies, such as GDAL.
If you don't use Conda yet, consider installing mamba, a lighter alternative to Conda. Alternatively, run the scripts in a Python virtual environment.
Start a command line or PowerShell with mamba enabled (or add mamba to your PATH); on Windows, look for the Mamba prompt in the start menu. First navigate to the folder in which you unzipped the sample data repository. Make sure you are not in the data directory itself but one level above it.
cd {path-where-you-unzipped-zipfile}
We will create a virtual environment pgdc (using Python 3.10) for our project and activate it.
mamba create --name pgdc python=3.10
mamba activate pgdc
Notice that you can deactivate this environment with mamba deactivate, which returns you to the main Python environment. The tools we install below will not be available in the main environment.
Install the dependencies for the tool:
mamba install -c conda-forge gdal
mamba install -c conda-forge pysqlite3
Install the crawler tool, GeoDataCrawler. The tool is under active development at ISRIC and facilitates many of our data workflows. It is powered by some popular metadata and transformation libraries: OWSLib, pygeometa and GDAL. If you want to learn more about these libraries, consider reading their tutorials (OWSLib, pygeometa, GDAL), or enjoy the Jupyter-based geopython workshop.
pip install pyGeoDataCrawler
Verify the different crawling options by typing:
crawl-metadata --help
Python/GDAL via Docker
In case you have difficulties setting up Python with GDAL on your local machine (or just want to try it out), an alternative approach is available using Python via Docker. Docker is a virtualization technology which runs isolated containers on your computer.
- First install Docker.
- Start the Docker Desktop tool.
- Now navigate to the folder where you unzipped the data repository and use the Docker image to run the crawler:
docker run -it --rm pvgenuchten/geodatacrawler crawl-metadata --help
For more advanced Docker statements there are some differences between the Windows command line, Windows PowerShell and Linux bash. Use the syntax relevant for your system.
Initial MCF files
The initial task for the tool is to create, for every data file in our repository, a sidecar file based on the metadata embedded in the resource.
crawl-metadata --mode=init --dir=data
docker run -it --rm -v $(pwd):/tmp \
  pvgenuchten/geodatacrawler crawl-metadata --mode=init --dir=/tmp/data
docker run -it --rm -v ${PWD}:/tmp `
pvgenuchten/geodatacrawler crawl-metadata `
--mode=init --dir=/tmp/data
Notice that for each resource a {dataset}.yml file has been created. Open one of the .yml files in a text editor and review the content; a sketch of what such a file may contain is shown after the listing commands below. Note that in the Docker statements we mount the local folder into the container before running commands against it. To verify that the correct folder was mounted, run an ls command to list the folder contents.
docker run -it --rm -v $(pwd):/tmp pvgenuchten/geodatacrawler ls /tmp
docker run -it --rm -v ${PWD}:/tmp pvgenuchten/geodatacrawler ls /tmp
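As a rough illustration, a generated sidecar file may look something like the sketch below. The exact properties depend on what metadata can be read from the resource; the values shown are placeholders loosely following the pygeometa MCF layout, so expect your files to differ:

```yaml
# {dataset}.yml (sketch) - placeholder values, the generated content will differ
mcf:
  version: 1.0
metadata:
  identifier: soilmap_sample          # typically derived from the file name
identification:
  title: soilmap_sample
  language: en
  extents:
    spatial:
      - bbox: [3.3, 50.7, 7.2, 53.6]  # read from the file where available
        crs: 4326
```

Any dataset that does not define its own contact details will later pick up the contact block from index.yml, as described in the introduction.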
Update MCF
The update mode is meant to be run at intervals; it will update the MCF files if changes have been made to a resource.
crawl-metadata --mode=update --dir=data
docker run -it --rm -v $(pwd):/tmp \
  pvgenuchten/geodatacrawler crawl-metadata --mode=update --dir=/tmp/data
docker run -it --rm -v ${PWD}:/tmp `
pvgenuchten/geodatacrawler crawl-metadata `
--mode=update --dir=/tmp/data
In certain cases the update mode will also import metadata from remote URLs. This happens, for example, if the dataset-uri is a DOI; the update mode will then fetch the metadata of the DOI and push it into the MCF.
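For example, a DOI could be provided as the dataset-uri of a dataset. The sketch below uses the dataseturi property from the pygeometa MCF conventions; the identifier and DOI shown are placeholders:

```yaml
# fragment of a dataset MCF (sketch) - identifier and DOI are placeholders
metadata:
  identifier: my-dataset
  dataseturi: https://doi.org/10.5281/zenodo.1234567   # placeholder DOI
```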
Export MCF
Finally we want to export the MCFs to actual ISO 19139 metadata records, to be loaded into a catalogue such as pycsw, GeoNetwork or CKAN.
crawl-metadata --mode=export --dir=data --dir-out=export --dir-out-mode=flat
docker run -it --rm -v $(pwd):/tmp \
  pvgenuchten/geodatacrawler crawl-metadata --mode=export --dir=/tmp/data \
  --dir-out=/tmp/export --dir-out-mode=flat
docker run -it --rm -v ${PWD}:/tmp `
pvgenuchten/geodatacrawler crawl-metadata `
--mode=export --dir=/tmp/data `
--dir-out=/tmp/export --dir-out-mode=flat
Open one of the XML files and evaluate whether the contact information from step 1 is available.
Summary
In this section you have been introduced to the pyGeoDataCrawler library. In the next section we will look at catalogue publication.