14 Jan 2021 - tsp
Last update 14 Jan 2021
4 mins
This is a short summary of how to use the jekyll-sitemap plugin to automatically generate a sitemap.xml, how to reference it from your robots.txt and how to exclude specific directories - for example those containing static PDFs - from the sitemap when building a static webpage with Jekyll, like this page is built.
Basically a sitemap is just a list of all pages that make up a website. There are various supported formats - most commonly a simple plain text format that just contains the fully qualified URIs of all pages, one per line, in a UTF-8 encoded file:
https://www.example.com/page1.html
https://www.example.com/page2.html
https://www.example.com/page3.html
The second major format is an XML based format that is also stored in a UTF-8 encoded text file. Such a file would look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2021-01-14</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.95</priority>
    </url>
    <url>
        <loc>http://example.com/page1.html</loc>
        <lastmod>2020-06-05</lastmod>
        <changefreq>monthly</changefreq>
        <priority>1.0</priority>
    </url>
</urlset>
The idea behind sitemaps is to allow better indexing of websites. Usually a web crawler tries to follow links inside a website; in case there are pages that are not linked correctly it's possible for the crawler to miss some of them. The sitemap does not magically add pages to a search engine's index - the crawler still works as usual - but it helps with debugging problems. In Google's Search Console for example one can verify how many pages have been submitted via the sitemap and how many of them are actually indexed. The sitemap also gives search engines an indication of which content one considers a good landing page oneself - for example one should not list index pages without high quality content.
Additionally the XML sitemap allows one to specify a change frequency - in addition to the classic HTTP Expires header - to give crawlers a hint about how often to revisit a page. The mentioned Jekyll plugin by default only generates loc and lastmod entries. In case one wants to add additional information it's usually best to build a custom sitemap.xml template, as sketched below.
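A custom template can simply be placed as sitemap.xml in the root of the Jekyll source; the plugin should then refrain from generating its own file. The following is only a minimal sketch - the fixed weekly change frequency and the restriction to site.pages are assumptions to be adapted to the actual site:
---
layout: null
sitemap: false
---
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    {% for page in site.pages %}{% unless page.sitemap == false %}
    <url>
        <loc>{{ page.url | absolute_url }}</loc>
        {% if page.last_modified_at %}<lastmod>{{ page.last_modified_at | date_to_xmlschema }}</lastmod>{% endif %}
        <changefreq>weekly</changefreq>
    </url>
    {% endunless %}{% endfor %}
</urlset>
The last_modified_at value is supplied by the jekyll-last-modified-at plugin mentioned below; without it one would fall back to page.date or drop the lastmod element.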
The sitemap plugin itself is contained in the jekyll-sitemap gem. In addition it's advisable to install jekyll-last-modified-at to supply the correct last modified date to the sitemap. On FreeBSD one would install the gems system wide by using:
$ gem install jekyll-sitemap
$ gem install jekyll-last-modified-at
On other systems - or when using a Gemfile - one would simply add the plugins there and re-run bundler, for example as shown below.
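A minimal sketch of such a Gemfile - grouping the plugins under :jekyll_plugins is the usual convention so Jekyll loads them automatically:
source "https://rubygems.org"

gem "jekyll"

# Plugins in this group are loaded by Jekyll automatically
group :jekyll_plugins do
  gem "jekyll-sitemap"
  gem "jekyll-last-modified-at"
end
After editing the Gemfile a bundle install fetches the gems.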
After that one can add the plugins to the _config.yml:
plugins:
- jekyll-sitemap
- jekyll-last-modified-at
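To verify that everything is wired up correctly one can rebuild the site and look at the generated file - assuming the default _site output directory (prefix the build with bundle exec when the gems are managed via a Gemfile):
$ jekyll build
$ head _site/sitemap.xml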
By default the plugin includes every resource that has been processed by Jekyll in the sitemap. This can be controlled using the sitemap front matter key. If only a small number of files should not be included in the sitemap one can simply set sitemap: false on those pages (see the example below). For larger areas it might be more convenient to set defaults inside of _config.yml.
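For an individual page the front matter would simply look like this - the title is of course just a placeholder:
---
title: "Some page that should stay out of the sitemap"
sitemap: false
---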
If one wants - for example - to exclude all static PDF asset files contained inside the /assets/pdf directory as well as in up to three levels of subdirectories, one could do so with the following configuration:
defaults:
  -
    scope:
      path: "/assets/pdf/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*.pdf"
    values:
      sitemap: false
  -
    scope:
      path: "/assets/pdf/*/*/*/*.pdf"
    values:
      sitemap: false
The sitemap can then either be submitted manually to all desired search engines or it can be referenced from robots.txt. The latter is of course the preferred way of linking the sitemap. Just as a reminder: robots.txt allows one to give search engines a hint which resources one would like to have indexed and which ones should not be crawled - this can even be done on a per crawler basis, but honoring it is entirely voluntary for the crawlers. One can simply add a reference to the sitemap - which is by default generated at /sitemap.xml - to allow search engines to discover it automatically:
user-agent: *
disallow:
allow: /
sitemap: https://www.tspi.at/sitemap.xml