generate-sitemap
Overview
The generate-sitemap GitHub Action generates a sitemap for a website hosted on GitHub Pages, and has the following features:
- Support for both XML and txt sitemaps.
- When generating an XML sitemap, it uses the last commit date of each file to generate the <lastmod> tag in the sitemap entry. If the file was created during that workflow run, but not yet committed, then it instead uses the current date.
- Supports URLs for html and pdf files in the sitemap, and has inputs to control the included file types.
- Also supports including URLs for a user specified list of additional file extensions in the sitemap.
- Checks the content of html files for noindex directives, excluding any files that contain them from the sitemap.
- Parses a robots.txt, if present at the root of the website, excluding any URLs from the sitemap that match disallow rules.
- Enables specifying a list of directories and/or specific files to exclude from the sitemap.
- Sorts the sitemap entries in a consistent order, such that the URLs are first sorted by depth in the directory structure (i.e., pages at the website root appear first, etc), and then pages at the same depth are sorted alphabetically.
- Assumes that, for files named index.html, the preferred URL for the page ends with the enclosing directory.
- Provides an option to exclude the .html extension from URLs listed in the sitemap (GitHub Pages automatically serves the corresponding html file).
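The URL and ordering rules in the list above can be illustrated with a minimal Python sketch. This is not the action's actual source code; the function names `to_url` and `sort_key`, the `drop_html_extension` parameter, and the example.com base URL are hypothetical, chosen only for illustration. It assumes repository-relative paths with no leading "./", and it measures depth by counting slashes in the URL.

```python
def to_url(path: str, base_url: str, drop_html_extension: bool = False) -> str:
    """Map a repository-relative file path to the URL listed in the sitemap.

    Hypothetical helper: assumes `path` has no leading "./".
    """
    if path == "index.html" or path.endswith("/index.html"):
        # The preferred URL for an index.html ends with the enclosing directory.
        path = path[: -len("index.html")]
    elif drop_html_extension and path.endswith(".html"):
        # Optionally list the page without its .html extension.
        path = path[: -len(".html")]
    return base_url.rstrip("/") + "/" + path


def sort_key(url: str) -> tuple:
    """Order by depth (assumed here to be the number of slashes), then alphabetically."""
    return (url.count("/"), url)


files = ["docs/api.html", "index.html", "about.html", "docs/index.html", "a/b/c.pdf"]
urls = sorted((to_url(f, "https://example.com") for f in files), key=sort_key)
for url in urls:
    print(url)
```

With these assumptions, pages at the website root are listed first, followed by pages one directory down, and so on, with ties at the same depth broken alphabetically.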
The generate-sitemap action is for GitHub Pages sites where the repository contains the html and other files of the site itself, regardless of whether the html was generated by a static site generator or written by hand. For example, I use it for multiple Java project documentation sites, where most of the site is generated by javadoc. I also use it with my personal website, which is generated with a custom static site generator. As long as the repository for the GitHub Pages site contains the site as served (e.g., html files, pdf files, etc), the generate-sitemap action is applicable.
The generate-sitemap action is not for GitHub Pages Jekyll sites (unless you generate the site locally and push the html output instead of the markdown, but why would you do that?). In the case of a GitHub Pages Jekyll site, the repository contains markdown, and not the html that is generated from the markdown. The generate-sitemap action does not support that use-case. If you are looking to generate a sitemap for a Jekyll website, there is a Jekyll plugin for that.
The generate-sitemap GitHub Action is implemented in Python, its source code repository is hosted on GitHub, and it is licensed under the MIT License. In the GitHub repository, you will also find detailed instructions for use, including several sample GitHub workflows.
The generate-sitemap GitHub Action is developed by Vincent A. Cicirello. It was originally implemented for my own use, but I have decided to share it with others.
Live Examples
Website | Workflow | Sitemap Produced |
---|---|---|
This website | sitemap-generation.yml | sitemap.xml |
Documentation site for Chips-n-Salsa | docs.yml | sitemap.xml |
Documentation site for JavaPermutationTools | docs.yml | sitemap.xml |
Tech Stack
The generate-sitemap GitHub Action utilizes the following:
- Python 3 (implemented almost entirely in Python);
- The pyaction Docker container, which is a Docker container designed to support GitHub Actions development in the Python language (also see pyaction's GitHub repository);
- The cicirello/python-github-action-template repository, a template that we maintain to assist developers getting started developing a GitHub Action in Python;
- git to use last commit dates as last modified dates in the sitemap;
- Docker since it is implemented as a container action; and
- GitHub Container Registry, which is where we pull pyaction from at runtime for faster action loading (e.g., pulling the base Docker container from GitHub while an action is running on a GitHub server should be faster than pulling from Docker Hub).
Blog Posts About generate-sitemap
As the author of generate-sitemap, I occasionally post on DEV.to about this and other software that I maintain. See my DEV.to profile for a full list of such posts. Below is a selection of blog posts specifically about generate-sitemap.
Deploy a Documentation Website for a Java Library Using GitHub Actions, posted on DEV on November 30, 2022.
This post explains how to use GitHub Actions to automate deployment of a documentation website for a Java library whenever a new release is available. This workflow builds the javadocs of the library, post-processes them to insert things like a referrer policy, the website's favicon, etc, updates an XML sitemap, and finally deploys to GitHub Pages.
Generate an XML Sitemap for a Static Website in GitHub Actions, posted on DEV on November 23, 2022.
This post explains the functionality and usage of the generate-sitemap GitHub Action that I've developed and maintain, and which is used to generate an XML sitemap for GitHub Pages sites entirely within GitHub Actions.