Automatic sitemap generation with Jekyll
14 Jan 2021 - tsp
Last update 14 Jan 2021
4 mins
This is a short summary of how to use the jekyll-sitemap plugin to automatically
generate a sitemap.xml, how to reference it from your robots.txt and how
to exclude specific directories - for example ones containing static PDFs - from the
sitemap when building a static webpage with Jekyll - the same way this page is built.
What is a sitemap anyways
Basically a sitemap is just a list of all pages that make up a website. There
are various supported formats - most commonly a simple plain text format
that just contains the fully qualified URI of every page, one per line. Stored
with UTF-8 encoding such a file would look like the following:
https://www.example.com/page1.html
https://www.example.com/page2.html
https://www.example.com/page3.html
The second major format is an XML based format that's also stored in a UTF-8
encoded text file. Such a file would look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2021-01-14</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.95</priority>
    </url>
    <url>
        <loc>http://example.com/page1.html</loc>
        <lastmod>2020-06-05</lastmod>
        <changefreq>monthly</changefreq>
        <priority>1.0</priority>
    </url>
</urlset>
The idea behind sitemaps is to allow better indexing of websites - usually a web
crawler tries to follow links inside a website. In case there are pages that
are not linked correctly it's possible for a crawler to miss some pages. The sitemap
usually does not magically add pages to a search engine's index - the crawler
still works as usual - but it helps with debugging problems. In Google's Search
Console for example one can verify how many pages have been submitted via the sitemap
and how many of these pages are actually indexed. The sitemap also gives search
engines an indication of which content one considers a good landing page oneself
- for example one should not list index pages without high quality content.
Additionally the XML sitemap allows one to specify the change interval - in addition to the
classic HTTP Expires header - to somewhat control crawling frequency. The
mentioned Jekyll plugin by default only generates loc and lastmod entries.
In case one wants to add additional information it's usually the best idea to
build a custom sitemap.xml template.
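As a rough idea of what such a template could look like, here is a minimal sketch - not the plugin's own template. It only iterates site.pages (posts and collections would have to be iterated the same way), and the changefreq and priority front matter keys are hypothetical values one would have to set on each page oneself. As far as I know the plugin leaves an existing sitemap.xml in the source tree alone, so a template placed at the site root takes precedence:

---
layout: null
---
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  {% for page in site.pages %}
  {% unless page.sitemap == false %}
  <url>
    <loc>{{ page.url | absolute_url }}</loc>
    {% if page.last_modified_at %}<lastmod>{{ page.last_modified_at | date: "%Y-%m-%d" }}</lastmod>{% endif %}
    {% if page.changefreq %}<changefreq>{{ page.changefreq }}</changefreq>{% endif %}
    {% if page.priority %}<priority>{{ page.priority }}</priority>{% endif %}
  </url>
  {% endunless %}
  {% endfor %}
</urlset>

The last_modified_at value is supplied by the jekyll-last-modified-at plugin described below.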
Installing required plugins
The sitemap plugin itself is contained in the jekyll-sitemap gem. In addition
it's advisable to install jekyll-last-modified-at to supply the correct
last modified date to the sitemap. On FreeBSD one would install the gems
system wide by using:
$ gem install jekyll-sitemap
$ gem install jekyll-last-modified-at
On other systems - or when using a Gemfile - one would simply add the plugins
there and re-run bundler.
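For the Gemfile based setup the relevant lines would look something like this - versions omitted, using the jekyll_plugins group that Jekyll loads automatically:

# Gemfile
group :jekyll_plugins do
  gem "jekyll-sitemap"
  gem "jekyll-last-modified-at"
end

followed by running bundle install.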
After that one can add the plugins to the _config.yml:
plugins:
  - jekyll-sitemap
  - jekyll-last-modified-at
Including and excluding pages
First off the plugin defaults to including all resources that have been processed
by Jekyll in the sitemap. This can be controlled using the sitemap front matter
variable. If one only has a small number of files that should not be included
in the sitemap one can simply set sitemap: false on those pages - as shown
below. For larger areas it might be more convenient to set the defaults inside _config.yml.
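For a single page this would look like the following - the title is of course just a placeholder:

---
title: "Some internal page"
sitemap: false
---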
If one - for example - wants to exclude all static PDF asset files contained inside
the /assets/pdf directory as well as up to three levels of sub directories one
could do so with the following configuration:
defaults:
  -
    scope:
      path: "/assets/pdf/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*/*.pdf"
    values:
      sitemap: false
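To verify the exclusion works one can rebuild the site and check that no PDF URLs ended up in the generated sitemap - the grep invocation is of course just one way of doing this and should produce no output:

$ bundle exec jekyll build
$ grep "assets/pdf" _site/sitemap.xml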
Publishing the sitemap
The sitemap can then either be submitted manually to all desired search engines
or it can be referenced from robots.txt. The latter is of course the
preferred way of linking the sitemap. Just as a reminder: robots.txt allows
one to give search engines a hint which resources one would like to have
indexed and which ones one doesn't want to be crawled - this can even be done
on a per crawler basis, but of course honoring it is totally voluntary for
the crawlers. One can simply add a sitemap reference - the sitemap is by default
generated at /sitemap.xml - to allow search engines to discover the sitemap
automatically:
User-agent: *
Disallow:
Allow: /
Sitemap: https://www.tspi.at/sitemap.xml
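After deploying one can quickly check that both files are actually served - the hostname obviously has to be replaced with one's own:

$ curl -s https://www.tspi.at/robots.txt
$ curl -s https://www.tspi.at/sitemap.xml | head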