10 Feb 2020 - tsp
Last update 22 Mar 2020
Update: In addition an implementation in Python has been added to show a short draft on how one can get started with Selenium in Python as well. This can be found at the end of the article.
This blog entry is a short description on how to get started using Selenium with chromedriver on FreeBSD with a Java application. This can be used to develop automatic test applications for web applications or simple bots that scrape content from webpages or automate actions on the web using a full browser capable of running JavaScript, running browser plugins, etc.
Note that this is just a short tutorial on how to set up your IDE and write a first simple program that accesses the webpage content and executes a click on a single link identified by an XPath expression. It’s not a complete introduction to Selenium or its Java interface. If one wants a detailed step-by-step tutorial on how to use Selenium to build web application tests one can, for example, refer to Test Automation using Selenium WebDriver with Java: Step by Step Guide by Navneesh Garg (note: Amazon affiliate link; this page’s author profits from qualified purchases).
First one needs a working Chromium installation. This is usually done via packages
pkg install www/chromium
or via ports
cd /usr/ports/www/chromium
make install clean
This automatically installs the chromedriver binary at /usr/local/bin/chromedriver.
Now one only needs to fetch the Selenium Java libraries. They can be found on the Selenium webpage. Just fetch the Selenium Java package (ZIP file) and save it at a convenient location. Unzipping the archive yields:
client-combined-*.jar: a file that should be added to your project’s classpath
libs: a folder containing various JARs. These should also be added to the project’s classpath
When using the Eclipse IDE, simply start a new Java project, right-click your project and select Properties. Select Java Build Path and use the Add External JARs function to add both the client-combined-*.jar file (not the -sources version) and all JARs from the libs folder to your project’s classpath. This takes effect during the build and also when launching from the Eclipse IDE.
When distributing your application you have to use the method mentioned below, reference the JARs in your own JAR’s manifest, install the JARs into a system-wide known location or (beware of licensing problems!) merge the JARs into a single one.
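The manifest option works through the Class-Path attribute, which holds a space-separated list of JAR locations relative to the application JAR. The following is only a minimal sketch of how such a manifest could be assembled with the JDK's java.util.jar classes; the class name ManifestSketch and the listed JAR paths are hypothetical placeholders, not part of Selenium or of this article's program.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestSketch {
    // Build a manifest that names the main class and the dependency JARs.
    // Class-Path entries are separated by spaces and are resolved relative
    // to the location of the JAR that carries this manifest.
    static Manifest build(String mainClass, String... jars) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest-Version has to be present, otherwise nothing is written
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        attrs.put(Attributes.Name.CLASS_PATH, String.join(" ", jars));
        return manifest;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical layout: dependency JARs sit in a libs folder next to the application JAR
        Manifest manifest = build("TestProg",
            "libs/client-combined-3.141.59.jar",
            "libs/byte-buddy-1.8.15.jar");
        manifest.write(System.out);
    }
}
```

The printed text could be saved to a file and handed to the jar tool when packaging; every JAR from the libs folder still has to be listed.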
In case you’re not running from your IDE you can simply configure your classpath either by setting the CLASSPATH environment variable in your shell’s init script or by using env CLASSPATH= on each command invocation (or while launching a subshell). This might be done in a wrapper script if desired. Do not forget to add the location of your own classes (JAR or directory tree) to your classpath though.
For example, one might use the following invocation:
env CLASSPATH=.:~/selenium/client-combined-3.141.59.jar:~/selenium/byte-buddy-1.8.15.jar:... javac MyTestclass.java
Note that one has to list each and every dependency from the libs folder in this case, so specifying them on the command line is rather inconvenient.
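Since listing every JAR by hand is tedious, a wrapper could generate the list instead. The following is a rough sketch of such a helper using only the JDK; the class name ClasspathBuilder and its methods are hypothetical, and it assumes that the JARs sit directly inside the directories passed to it.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ClasspathBuilder {
    // Collect all JAR files directly inside dir (sorted), skipping source JARs
    static List<String> jarsIn(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                .map(Path::toString)
                .filter(name -> name.endsWith(".jar"))
                .filter(name -> !name.endsWith("-sources.jar"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Join the current directory and the JARs of all given directories
    // using the platform's path separator (":" on FreeBSD)
    static String build(Path... dirs) throws IOException {
        List<String> parts = new ArrayList<>();
        parts.add(".");
        for (Path dir : dirs) {
            parts.addAll(jarsIn(dir));
        }
        return String.join(File.pathSeparator, parts);
    }

    public static void main(String[] args) throws IOException {
        Path[] dirs = new Path[args.length];
        for (int i = 0; i < args.length; i++) {
            dirs[i] = Paths.get(args[i]);
        }
        System.out.println(build(dirs));
    }
}
```

Called with the Selenium folder and its libs folder as arguments, it prints a string that can be used as the value of CLASSPATH. Java also accepts a wildcard entry such as libs/* on the classpath, which may already be sufficient when no JARs have to be filtered out.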
Now for a simple application that fetches the Slashdot webpage, accepts the cookie banner if present and fetches a list of stories together with their links.
First we create a source file named after our test program (in this example called TestProg) containing our basic skeleton.
Note that the style applied in this example is not suited for a real application. One should almost never use catch Exception, for example, but implement proper exception handling.
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class TestProg {
    public static void main(String[] args) {
        try {
            // Set path of chromedriver binary
            System.setProperty("webdriver.chrome.driver", "/usr/local/bin/chromedriver");

            // Create the driver
            WebDriver driver = new ChromeDriver();

            // Let the user see the final state for 10 seconds
            Thread.sleep(10000);
            driver.quit();
        } catch(Exception e) {
            e.printStackTrace();
        }
        return;
    }
}
As one can see we have set the webdriver.chrome.driver system property. This is not exactly good style either - it should be set (if at all possible) from an external launcher script. The property has to point to our chromedriver binary, which has been installed automatically together with our www/chromium package. Then we create the driver using new ChromeDriver(). This creates the browser instance which is remotely controlled by WebDriver. This should also be indicated on your standard error output:
Starting ChromeDriver 78.0.3904.108 (4b26898a39ee037623a72fcfb77279fce0e7d648-refs/branch-heads/3904@{#889}) on port 47736
Only local connections are allowed.
Please protect ports used by ChromeDriver and related test frameworks to prevent access by malicious code.
Feb 10, 2020 10:03:26 PM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: W3C
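To illustrate the remark about setting the property from a launcher script: the JVM accepts system properties on the command line, for example java -Dwebdriver.chrome.driver=/usr/local/bin/chromedriver TestProg. The program can then respect an externally supplied value and only fall back to a default. The following is just a sketch of that idea; the class DriverPathConfig and its method are hypothetical helpers, not part of Selenium.

```java
import java.io.File;

public class DriverPathConfig {
    static final String PROPERTY = "webdriver.chrome.driver";
    // Location used by the www/chromium package, as described above
    static final String DEFAULT_PATH = "/usr/local/bin/chromedriver";

    // Keep a value passed by the launcher (-Dwebdriver.chrome.driver=...)
    // and only fall back to the default if none was supplied
    static String ensureDriverPath() {
        String path = System.getProperty(PROPERTY);
        if (path == null || path.isEmpty()) {
            path = DEFAULT_PATH;
            System.setProperty(PROPERTY, path);
        }
        return path;
    }

    public static void main(String[] args) {
        String path = ensureDriverPath();
        if (!new File(path).canExecute()) {
            System.err.println("Warning: " + path + " is not an executable file");
        }
        System.out.println("Using chromedriver at " + path);
    }
}
```

In TestProg the call to System.setProperty could be replaced by a call to such a helper, so that the path is no longer hard-wired whenever a launcher provides it.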
Now - before we can fetch some data - we have to accept the cookie banner presented by Slashdot. To do that we first have to determine how we can locate the button. Luckily that’s easy on Slashdot - we use the Inspect feature of Chromium in an incognito tab (to start without any cookies or other session information present).
Now we simply copy the XPath of the element.
With the known XPath of the link to accept the conditions - in this case it’s luckily a link inside a uniquely identified element, so the path expression is unambiguous and simple ("//*[@id=\"cmpwelcomebtnyes\"]/a") - we can simply locate the required element using findElement with the By.xpath method and trigger a click() event on the webpage:
try {
    // Fetch a webpage. For this example we use Slashdot
    driver.get("https://slashdot.org/");

    // Locate the "Accept" button
    WebElement bannerElem = driver.findElement(By.xpath("//*[@id=\"cmpwelcomebtnyes\"]/a"));

    /*
        Click the element (if it's not present the NoSuchElementException
        would already have been thrown)
    */
    bannerElem.click();

    // Display a message and provide some time for the user to see the action
    System.out.println("Clicked the cookie banner ...");
    Thread.sleep(250);
} catch(NoSuchElementException e) {
    System.out.println("Didn't have to click the cookie banner ...");
}
As one can see, the findElement function raises a NoSuchElementException in case the banner is not present. This already provides a (not so clean) solution to detect the presence of the cookie banner.
Now to our main task - fetching the titles and links. For this we use the method findElements and supply a class name that we’ve also determined using the Inspect feature of Chromium as an interactive user. This method returns a list of elements that are tagged with the given class name. After that we can iterate through the elements, locate the link (a) element contained inside each story-title element, fetch the title, which is simply the text contained inside the link, as well as the href attribute and output them to the command line:
List<WebElement> titles = driver.findElements(By.className("story-title"));
for(WebElement elem : titles) {
    WebElement titleLink = elem.findElement(By.tagName("a"));
    String strTitle = titleLink.getText();
    String strHref = titleLink.getAttribute("href");
    System.out.println(strTitle + " + " + strHref);
}
Now we’ve created a complete scraper for Slashdot headlines and their links.
If you intend to use Selenium to create a bot, beware that some bot detection scripts scan for modifications made by Selenium to the browser (injected JavaScript, added properties inside the DOM, etc.). There are ways to prevent this injection and the detection by anti-bot scripts, but as soon as you’re blacklisted you might have trouble getting unlisted, depending on the service. Remember that Selenium was basically created for testing webpages by supplying the input that a real user would provide. You’ll encounter such Selenium detection scripts when accessing webpages like your bank’s online presence, payment portals and big merchant portals. Be sure to check whether they block your account before using your main credentials (at least try some test credentials first so that your main account does not get banned, or use a separate set of accounts on a day-to-day basis as well). Also beware that using automated bots might violate terms of service, so web services have a right to block your accounts and deny any further business with you …
In any case - please don’t write a spambot. There’s already enough spam on the web. No one likes that. There are of course many valid reasons to write bots that scrape information from webpages which make direct fetching and processing hard because they build their pages using JavaScript without any fallback to plain HTML - that’s the worst web design practice in my opinion (and normally I simply do not use such pages any more).
The full source is available as GitHub GIST
Because I’ve been asked by a student how to achieve the same effect with Python - that’s pretty easy. First one again requires the www/chromium package and the Selenium Python libraries (installed via pip install selenium).
Now one can use the webdriver module from the selenium package:
from selenium import webdriver
from time import sleep
driver = webdriver.Chrome()
driver.get("https://slashdot.org/")
Access to elements works similarly to Java, using the corresponding element lookup functions. Accessing attributes uses get_attribute, and the text contained in an element is read through the text property.
One can assemble this into the following short program hosted on a GitHub GIST
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/