Catalogue publication

Authors

Paul van Genuchten

Tom Kralidis

Published

June 24, 2025

Catalogues facilitate data discovery in 3 ways:

Note

An important aspect is proper setup of authorisations for general public, partners and co-workers to access metadata as well as the actual data files behind the metadata. A general rule-of-thumb is that metadata can usually be widely shared, but data services with sensitive content should be properly protected. In some cases organisations even remove the data url from the public metadata, to prevent abuse of those urls. If a resource is not available to all, this can be indicated in metadata as ‘access-constraints’.


pycsw catalogue

Various catalogue frontends exist to facilitate dataset search, such as Geonetwork OpenSource, dataverse, CKAN. Selecting a frontend depends on metadata format, target audience, types of data, maintenance aspects, and personal preference.

For this workshop we are going to use pycsw. It is a catalogue software supporting various standardised query APIs, as well as providing a basic easy-to-adjust html web interface.

For this exercise we assume you have Docker Desktop installed on your system and running. Visit the docker get started tutorials in case you’re new to docker.

pycsw is available as Docker image at the github container registry, including an embedded SQLite database. In a production situation you will instead use a dedicated Postgres or MariaDB database for record storage. Notice that when you destroy the container, the SQLite database will be set to its default content.

Pull and run the pycsw container locally using this command in a command line client (cmd, powershell, bash):

docker pull ghcr.io/geopython/pycsw
docker run -p8000:8000 ghcr.io/geopython/pycsw

Open your browser and browse to http://localhost:8000 to see pycsw in action.

Return to the command line, press ctrl-C to stop the docker container process.

Docker Compose

Compose is a utility of docker, enabling setup of a set of containers using a composition script. A composition script can automate the manual startup operations of the previous paragraph. We’ve prepared a composition script for this workshop. The script includes, besides the pycsw container, other containers from next paragraphs.

Clone the workshop repository to a local folder (You don’t have git installed? You can also download the repository as a zip file).

git clone https://github.com/pvgenuchten/training-gitops-sdi.git

On the cloned repository in the docker folder there are 2 alternatives:

On both orchestrations a library is used called Traefik to facilitate path-routing to the relavant containers.

Also notice that some layout templates are mounted into the pycsw container. These templates override the default layout of pycsw.

Some environment variables should be set in a .env file. Rename the .env-template file to .env.

Then open a shell and navigate to the docker folder in the cloned repository and run:

docker compose -f docker-compose-sqlite.yml up

A lot of logs are produced by the various containers. You can also run in the background (-d or –detach) using:

docker compose -f docker-compose-sqlite.yml up -d

When running in the background, use docker compose down, docker ps, docker logs pycsw to stop, see active containers and see the logs of a container. Or interact with the containers from docker desktop.

Load some records

Make sure the docker setup is running in the background (-d), or open a second shell window.

Much of the configuration of pycsw (title, contact details, database connection, url) is managed in a config file. You will find a copy of this file in ./docker/pycsw. In this file, adjust the catalogue title and restart the orchestration. Notice the updated title in your browser.

For administering the contents of the catalogue a utility called pycsw-admin.py is available in the pycsw container. You can either open a shell in the container (via docker desktop) and type the commands directly, or use docker exec to run the commands from the host.

First clear the existing database:

pycsw-admin.py delete-records -c /etc/pycsw/pycsw.yml
docker exec -it pycsw bash -c "pycsw-admin.py delete-records -c /etc/pycsw/pycsw.yml"

Notice at http://localhost:8000/collections/metadata:main/items that all records are removed.

We exported mcf records as iso19139 in the previous section. Copy iso-xml documents to the ./docker/data/export folder in the docker project. This folder will be mounted into the container, so the records can be loaded into the pycsw database.

Use pycsw-admin.py to load the records into the catalogue database:

pycsw-admin.py load-records -p /etc/data/export -c /etc/pycsw/pycsw.yml -y -r
docker exec -it pycsw bash -c `
 "pycsw-admin.py load-records -p /etc/data/export -c /etc/pycsw/pycsw.yml -y -r"

Validate at http://localhost:8000/collections/metadata:main/items if the records are loaded, else check logs to identify a problem.

Customise the catalogue skin

pycsw uses jinja templates to build the web frontend. These are html documents including template language to substitute parts of the page.

You find 2 template files in ./docker/pycsw/. Notice in the orchestration file how the files are mounted into the container:

  • landing_page.html represents the home page of pycsw
  • _base.html is a main layout template which contains page header, footer and menu and wraps around all other templates

Open a template file and make some changes (colors, text, logo’s).

Restart the orchestration and view the result at http://localhost:8000

Have a look at the other templates available in pycsw, which can be tailored in a similar way.

Summary

In this paragraph you learned how datasets can be published into a catalogue. In the next paragraph, we’ll look at importing metadata from external sources.