generate-sitemap
Overview
The generate-sitemap GitHub Action generates a sitemap for a website hosted on GitHub Pages, and has the following features:
- Support for both XML and txt sitemaps.
- When generating an XML sitemap, it uses the last commit date of each file to generate the <lastmod> tag in the sitemap entry. If the file was created during that workflow run, but not yet committed, then it instead uses the current date.
- Supports URLs for html and pdf files in the sitemap, and has inputs to control the included file types.
- Also supports including URLs for a user specified list of additional file extensions in the sitemap.
- Checks the content of html files for noindex directives, excluding any files that contain them from the sitemap.
- Parses a robots.txt, if present at the root of the website, excluding any URLs from the sitemap that match disallow rules.
- Sorts the sitemap entries in a consistent order: URLs are sorted first by depth in the directory structure (i.e., pages at the website root appear first), and pages at the same depth are then sorted alphabetically.
- Assumes that, for files named index.html, the preferred URL for the page ends with the enclosing directory (i.e., index.html is omitted from the URL).
- Provides an option to exclude the .html extension from URLs listed in the sitemap (GitHub Pages automatically serves the corresponding html file).
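The URL and ordering rules in the last three items can be illustrated with a minimal Python sketch. This is not the action's actual source code: the function names `relative_url` and `sitemap_urls`, the `drop_html_extension` parameter, and the `https://example.com/` base URL are all hypothetical, and the sketch assumes file paths are given relative to the website root.

```python
def relative_url(rel_path, drop_html_extension=False):
    """Map a file path (relative to the site root) to its URL path.

    Hypothetical helper: index.html collapses to its enclosing directory,
    and the .html extension is optionally dropped from other pages.
    """
    if rel_path == "index.html" or rel_path.endswith("/index.html"):
        return rel_path[: -len("index.html")]
    if drop_html_extension and rel_path.endswith(".html"):
        return rel_path[: -len(".html")]
    return rel_path


def sitemap_urls(rel_paths, base_url, drop_html_extension=False):
    """Return full URLs sorted by directory depth, then alphabetically."""
    base = base_url.rstrip("/") + "/"
    ordered = sorted(
        rel_paths,
        # Depth is the number of "/" separators in the file path;
        # ties are broken alphabetically on the resulting URL path.
        key=lambda p: (p.count("/"), relative_url(p, drop_html_extension)),
    )
    return [base + relative_url(p, drop_html_extension) for p in ordered]


if __name__ == "__main__":
    files = ["blog/post.html", "index.html", "blog/2022/a.html",
             "about.html", "blog/index.html", "cv.pdf"]
    for url in sitemap_urls(files, "https://example.com/"):
        print(url)
```

In this sketch the root page comes first, followed by the other root-level files, then the pages one directory down, and so on. Only .html pages are affected by `drop_html_extension`; a pdf file keeps its extension.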
The generate-sitemap GitHub Action is implemented in Python. Its source code repository is hosted on GitHub, and it is licensed under the MIT License. The GitHub repository also includes detailed usage instructions, including several sample GitHub workflows.
The generate-sitemap GitHub Action is developed by Vincent A. Cicirello. I originally implemented it for my own use, but have decided to share it with others.
Live Examples
Website | Workflow | Sitemap Produced |
---|---|---|
This website | sitemap-generation.yml | sitemap.xml |
Documentation site for Chips-n-Salsa | docs.yml | sitemap.xml |
Documentation site for JavaPermutationTools | docs.yml | sitemap.xml |
Tech Stack
The generate-sitemap GitHub Action utilizes the following:
- Python 3 (implemented almost entirely in Python);
- The pyaction Docker container, which is designed to support GitHub Actions development in the Python language (also see pyaction's GitHub repository);
- The cicirello/python-github-action-template repository, a template that we maintain to help developers get started developing a GitHub Action in Python;
- git to use last commit dates as last modified dates in the sitemap;
- Docker since it is implemented as a container action; and
- GitHub Container Registry, which is where we pull pyaction from at runtime for faster action loading (e.g., pulling the base Docker container from GitHub while an action is running on a GitHub server should be faster than pulling from Docker Hub).
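As a rough illustration of how git can supply last modified dates, the following Python sketch asks git for a file's last commit date and falls back to the current date when the file has no commits yet. The function name `last_modified` is hypothetical, and the exact git command used by the action is an assumption here; the sketch also assumes it runs from inside the repository with git installed.

```python
import subprocess
from datetime import datetime, timezone


def last_modified(path):
    """Return an ISO 8601 timestamp suitable for a sitemap <lastmod> tag.

    Hypothetical helper: uses the committer date of the file's last commit,
    or the current UTC time if git reports no commit for the file.
    """
    try:
        result = subprocess.run(
            # %cI is the committer date in strict ISO 8601 format.
            ["git", "log", "-1", "--format=%cI", "--", path],
            capture_output=True,
            text=True,
            check=False,
        )
        committed = result.stdout.strip()
    except OSError:
        # git is not available; treat the file as uncommitted.
        committed = ""
    if committed:
        return committed
    # File was created during this run and is not yet committed.
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat()
```

A file that is new in the current workflow run produces empty output from `git log`, so the fallback branch mirrors the behavior described in the feature list above.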
Blog Posts About generate-sitemap
As the author of generate-sitemap, I occasionally post on DEV.to about this and other software that I maintain. See my DEV.to profile for a full list of such posts. Below is a selection of blog posts specifically about generate-sitemap.
Deploy a Documentation Website for a Java Library Using GitHub Actions, posted on DEV on November 30, 2022.
This post explains how to use GitHub Actions to automate deployment of a documentation website for a Java library whenever a new release is available. This workflow builds the javadocs of the library, post-processes them to insert things like a referrer policy and the website's favicon, updates an XML sitemap, and finally deploys to GitHub Pages.
Generate an XML Sitemap for a Static Website in GitHub Actions, posted on DEV on November 23, 2022.
This post explains the functionality and usage of the generate-sitemap GitHub Action, which I developed and maintain, and which generates an XML sitemap for GitHub Pages sites entirely within GitHub Actions.