Automatic sitemap generation with Jekyll
14 Jan 2021 - tsp
Last update 14 Jan 2021
4 mins
This is a short summary of how to use the jekyll-sitemap plugin to automatically
generate a sitemap.xml, how to reference it from your robots.txt and how
to exclude specific directories - for example ones containing static PDFs - from the
sitemap when building a static webpage with Jekyll - the same way this page is built.
What is a sitemap anyways
Basically a sitemap is just a list of all pages that make up a website. There
are various supported formats - most commonly a simple plain text format
that just contains the fully qualified URI of every page, one per line. Stored
with UTF-8 encoding such a file would look like the following:
https://www.example.com/page1.html
https://www.example.com/page2.html
https://www.example.com/page3.html
The second major format is an XML based format that's also stored in a UTF-8
encoded text file. Such a file would look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2021-01-14</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.95</priority>
    </url>
    <url>
        <loc>http://example.com/page1.html</loc>
        <lastmod>2020-06-05</lastmod>
        <changefreq>monthly</changefreq>
        <priority>1.0</priority>
    </url>
</urlset>
The idea behind sitemaps is to allow better indexing of websites - usually a web
crawler tries to follow links inside a website. In case there are pages that
are not linked correctly it's possible for a crawler to miss some pages. The sitemap
usually does not magically add pages to a search engine's index - the crawler
still works as usual - but it helps with debugging problems. In Google's Search
Console for example one can verify how many pages have been submitted via the sitemap
and how many of these pages are actually indexed. The sitemap also gives search
engines an indication of which content one considers a good landing page oneself
- for example one should not list index pages without high quality content.
Additionally the XML sitemap allows one to specify the change interval - in addition to the
classic HTTP Expires header - to somewhat control crawling frequency. The
mentioned Jekyll plugin by default only generates loc and lastmod entries.
In case one wants to add additional information it's usually the best idea to
build a custom sitemap.xml template.
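As a rough idea of what such a template could look like, here is a minimal sketch - not the plugin's own template. It only iterates site.pages (posts and collections would have to be iterated the same way), and the changefreq and priority front matter keys are hypothetical values one would have to set on each page oneself. As far as I know the plugin leaves an existing sitemap.xml in the source tree alone, so a template placed at the site root takes precedence:

---
layout: null
---
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  {% for page in site.pages %}
  {% unless page.sitemap == false %}
  <url>
    <loc>{{ page.url | absolute_url }}</loc>
    {% if page.last_modified_at %}<lastmod>{{ page.last_modified_at | date: "%Y-%m-%d" }}</lastmod>{% endif %}
    {% if page.changefreq %}<changefreq>{{ page.changefreq }}</changefreq>{% endif %}
    {% if page.priority %}<priority>{{ page.priority }}</priority>{% endif %}
  </url>
  {% endunless %}
  {% endfor %}
</urlset>

The last_modified_at value is supplied by the jekyll-last-modified-at plugin described below.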
Installing required plugins
The sitemap plugin itself is contained in the jekyll-sitemap gem. In addition
it's advisable to install jekyll-last-modified-at to supply the correct
last modified date to the sitemap. On FreeBSD one would install the gems
system wide by using:
$ gem install jekyll-sitemap
$ gem install jekyll-last-modified-at
On other systems - or when using a Gemfile - one would simply add the plugins
there and re-run bundler.
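For the Gemfile based setup the relevant lines would look something like this - versions omitted, using the jekyll_plugins group that Jekyll loads automatically:

# Gemfile
group :jekyll_plugins do
  gem "jekyll-sitemap"
  gem "jekyll-last-modified-at"
end

followed by running bundle install.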
After that one can add the plugins to the _config.yml:
plugins:
  - jekyll-sitemap
  - jekyll-last-modified-at
Including and excluding pages
First off the plugin defaults to including all resources that have been processed
by Jekyll in the sitemap. This can be controlled using the sitemap front matter
variable. If one only has a small number of files that should not be included
in the sitemap one can simply set sitemap: false on those pages - as shown
below. For larger areas it might be more convenient to set the defaults inside _config.yml.
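For a single page this would look like the following - the title is of course just a placeholder:

---
title: "Some internal page"
sitemap: false
---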
If one - for example - wants to exclude all static PDF asset files contained inside
the /assets/pdf directory as well as up to three levels of sub directories one
could do so with the following configuration:
defaults:
  -
    scope:
      path: "/assets/pdf/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*/*.pdf"
    values:
      sitemap: false
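To verify the exclusion works one can rebuild the site and check that no PDF URLs ended up in the generated sitemap - the grep invocation is of course just one way of doing this and should produce no output:

$ bundle exec jekyll build
$ grep "assets/pdf" _site/sitemap.xml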
Publishing the sitemap
The sitemap can then either be submitted manually to all desired search engines
or it can be referenced from robots.txt. The latter is of course the
preferred way of linking the sitemap. Just as a reminder: robots.txt allows
one to give search engines a hint which resources one would like to have
indexed and which ones one doesn't want to be crawled - this can even be done
on a per crawler basis, but of course honoring it is totally voluntary for
the crawlers. One can simply add a sitemap reference - the sitemap is by default
generated at /sitemap.xml - to allow search engines to discover the sitemap
automatically:
User-agent: *
Disallow:
Allow: /
Sitemap: https://www.tspi.at/sitemap.xml
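After deploying one can quickly check that both files are actually served - the hostname obviously has to be replaced with one's own:

$ curl -s https://www.tspi.at/robots.txt
$ curl -s https://www.tspi.at/sitemap.xml | head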