03 May 2022 - tsp
Last update 03 May 2022
21 mins
This is a project I had wanted to do for a long time. I have some previous experience with social network analysis and web scraping, and I wanted to see how easy and how privacy invading it is when one just hacks this together in the fastest and most dirty way (not with sophisticated data management as I've done before for different applications). My initial thought was that it's way too easy - I basically oppose exposing online status on chat systems where one invites their whole telephone book or everyone they know, and I totally oppose read receipts and mail delivery notifications - and that it provides way more insight than one would expect. I was not disappointed, only somewhat surprised that it was even easier than I thought: writing this blog post took way longer than implementing the scraping and basic analysis, and reading it will surely take longer too.
Usually the reaction when talking about this problem is a comment that one doesn't have anything to hide, that one cannot infer much from a simple online status, or that it requires extensive skills to perform such an analysis - or even that platforms provide protective measures against this kind of data gathering. Unfortunately such protection is not possible: as soon as someone can see data, one can automatically scrape and process it. A clean API just makes the job a little easier and less frustrating when one is doing honest work, but no one with malicious intentions will ever care about the minimal extra work required to circumvent any protection.
First off a short but important disclaimer: The people who were monitored with these tools were notified and asked for permission. I scraped the whole service that I used but immediately applied a whitelist and discarded the data of anyone who didn't consent. But beware that anyone with dishonest or stalking intentions can do this without asking and without filtering.
The first question is how to gather data from one of the large closed source chat services (I decided I didn't want to monitor something like my XMPP network but a service that the masses are using while thinking they're protected by it being a proprietary island). I chose a service that offers a web based interface in addition to the mobile application since this makes life much easier - which is the case for most mass popular messaging solutions these days anyway.
The first idea was to inspect the HTTP(S) transactions during usage of the service to figure out how notifications work and to replay this monitoring from my own custom scripts. This would be the ideal case, but the service I used had measures in place where scripts and tokens changed on a regular basis - which, by the way, is the biggest stopper for third party clients or transports to open chat networks that would provide a huge gain in usability of such services. Long term monitoring was therefore not possible without reimplementing the whole browser transaction flow, and the login also used some more complicated client features. So why not reuse the client? The first idea was to use Selenium and host the whole browser session inside the scraping application - but since I wanted to use the Chromium browser this turned out to be more challenging: the client side scripts tried to detect a page running inside an automated session, and since Selenium is not a hacking tool it happily exposes its presence. Since I didn't really want to spend more than a few minutes on extracting the data the decision was clear: just use the browser and access the page content using a content script from a quickly hacked Chromium extension that reads the data readily available inside the DOM of the page. This turned out to be rather static and reliable and allowed me to use the default login flow to prepare the page. The session also never timed out while keeping the messaging webpage open due to the regular transfers happening, which made this approach stable enough for data gathering.
So the basic idea was to hack together a small Chromium extension, load it unpacked via chrome://extensions and let it read the presence information directly out of the page. The first file I created was the manifest in manifest.json:
{
    "name" : "Status scraper",
    "description" : "Playing with statistical analysis",
    "version" : "0.1",
    "manifest_version" : 3,
    "action" : {
        "default_popup" : "popup.html"
    },
    "background" : {
        "service_worker" : "background.js"
    },
    "permissions" : [ "activeTab", "scripting" ],
    "host_permissions": [
        "http://www.example.com/*"
    ]
}
As one can see the manifest declares some basic information about the browser extension and then:

* A popup.html that will be displayed in the browser toolbar as soon as the extension is ready. I use this popup to start scraping on demand in a non automated fashion since I also wanted to use other tabs on the same domain that should not be scraped. The popup.html also references a popup.js that will be included inside the popup; this script file also contains the content script function that gets injected into the webpage.
* A background.js that accepts messages from the content script. Those messages contain the presence information which is forwarded to a Node-Red instance that shoves it into a simple SQL database.
* The activeTab permission which is required to get a reference to the foreground tab when launching the scraper function.
* The scripting permission which is required to inject content scripts into the foreground tab.
* A host_permissions entry that tells Chrome which endpoints the background script is allowed to contact. This is the address of the Node-Red instance (one could even list the exact endpoint itself).

My popup.html is pretty basic since I didn't care about it being pretty or expressive - it was sufficient to provide a launch method for injecting the scraping script.
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p> <button id="scraperStart">Start</button> </p>
<script src="popup.js"></script>
</body>
</html>
The popup.js script, which also contains the handler for the button with id scraperStart, is the main workhorse on the scraping side. In case one wants to start the script automatically and inject the content script without human interaction, a nice way is to simply use the manifest and put the content function that's currently contained in popup.js into contentscript.js:
{
    // ...
    "content_scripts": [
        {
            "matches": ["https://www.example.com/chatsession/*"],
            "js": ["contentscript.js"],
            "run_at" : "document_idle"
        }
    ],
    // ...
}
Before I could implement this script I had to determine what to scrape. So I searched for a way inside the messaging application to display only the online users. Luckily this existed (in three different ways). Then I used the inspection feature of Chrome to locate the wrapping element and used the Copy / XPath feature inside the inspection utility to determine the XPath for the element. Even though the page layout was pretty complex due to the framework that had been used, there was a simple list item (li) element wrapping one entry after another, each hosting a single link (a) that I used to extract a unique user ID as well as a nested span element that hosts the plain text human readable screen name of the user.
I won't put the exact XPath for the page below but substitute it with two example values in which I already replaced the running index (starting from 1) with the variable i:

* "//*/div[1]/ul/li["+i+"]/div/a/div/span" for the user name
* "//*/div[1]/ul/li["+i+"]/div/a" for the link that I wanted to extract the user ID from.

The basic idea was to locate the two elements using document.evaluate, check if they really exist and, if so, extract the inner text from the user name element as well as the user ID from the split link target. If everything turns out to work I simply append the ID and the user's screen name to a list of seen users. After the iteration the whole structure is passed to the background script using chrome.runtime.sendMessage. The whole scraping function is then executed every 15 seconds so it records which users are seen online every 15 seconds - this also allows some kind of monitoring due to the periodic heartbeat.
document.getElementById("scraperStart").addEventListener("click", async () => {
    // Grab the currently active tab and inject the scraper function into it
    let [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
    chrome.scripting.executeScript({
        target : { tabId : tab.id },
        function : runContactScraper
    });
});

/*
    In case one wants to start the script automatically via the content script
    instead of the action mechanism one simply uses the content of runContactScraper
    inside the contentscript.js script
*/
function runContactScraper() {
    window.setInterval(() => {
        let i = 1;
        let tsTimestamp = Date.now();
        let activeData = {
            "ts" : tsTimestamp,
            "users" : [ ]
        };
        for(;;) {
            // XPath of the i-th entry in the online list: screen name and profile link
            let pathName = "//*/div[1]/ul/li["+i+"]/div/a/div/span";
            let pathLink = "//*/div[1]/ul/li["+i+"]/div/a";
            let elementName = document.evaluate(
                pathName,
                document,
                null,
                XPathResult.FIRST_ORDERED_NODE_TYPE,
                null
            ).singleNodeValue;
            let elementLink = document.evaluate(
                pathLink,
                document,
                null,
                XPathResult.FIRST_ORDERED_NODE_TYPE,
                null
            ).singleNodeValue;
            if((elementName != null) && (elementLink != null)) {
                let contactName = elementName.innerText;
                if(contactName == '') {
                    // An empty screen name is treated as an invalid scrape - discard this run
                    activeData = false;
                    break;
                }
                let contactLink = elementLink.getAttribute("href");
                let contactId = (contactLink.split("/"))[1];
                activeData.users.push({
                    "id" : contactId,
                    "screenName" : contactName
                });
            } else {
                // No more entries in the list - stop iterating
                break;
            }
            i = i + 1;
        }
        // Hand the gathered presence snapshot over to the background script
        chrome.runtime.sendMessage({
            "message" : "activeData",
            "payload" : activeData
        }, response => { console.log(response); });
    }, 15000);
}
Note that there is no way for the page to determine that this script is running and scraping its DOM. It runs in a separate and isolated scripting environment - the only resource it shares with the page itself is the DOM. So there will never be real protection against this kind of scraping, except when one rebuilds the page layout randomly - and even then it's not that hard to locate the information one wants to scrape. Please don't try to evade scraping - people build really useful tools on top of it and add value to your webservices.
The background.js
script then only has to accept this JSON and pass
it to the fetch API:
chrome.runtime.onMessage.addListener(function (request, sender, sendResponse) {
    fetch("http://www.example.com/noderedendpoint", {
        method: 'post',
        headers: {
            "Content-type": "application/json;charset=UTF-8"
        },
        body: JSON.stringify(request.payload)
    }).then(function (data) {
        console.log('Request succeeded with JSON response', data);
    }).catch(function (error) {
        console.log('Request failed', error);
    });
    sendResponse(request.payload);
    return true;
});
After finishing the scripts I simply loaded them into Chrome's extension space by using the load unpacked extension feature on chrome://extensions. There one is also able to access the error console for the background script as well as the error messages emitted while processing the manifest. It's also the place where one reloads the extension after changes.
The next part of the processing chain was realized using Node-Red. Usually I wouldn't recommend Node-Red for any production stuff, but this is just a quick hack, the setup was already there - and it's nice to play around with. So I simply added an HTTP in node to a flow and configured it for POST requests and an arbitrarily chosen URI (/dataana/examplestatus). The payload is then deserialized by a JSON node into a JavaScript object. Since I wanted to write into a MySQL database I added - at the end - a mysql node and configured database, username and password. The SQL queries are pushed via the topic field of the messages, while the payload contains the bound parameters for the statements.
Then I prepared the database:
USE exampledb;
CREATE TABLE presenceAnalysisUserNames (
userid BIGINT UNSIGNED NOT NULL,
screenname VARCHAR(256) NOT NULL,
CONSTRAINT pk_presenceAnalysisUserNames_id PRIMARY KEY (userid)
);
CREATE TABLE presenceAnalysisSeen (
userid BIGINT UNSIGNED NOT NULL,
ts BIGINT UNSIGNED NOT NULL,
CONSTRAINT pk_presenceAnalysisSeen PRIMARY KEY (userid, ts),
CONSTRAINT fk_presenceAnalysisSeen_userid FOREIGN KEY (userid) REFERENCES presenceAnalysisUserNames (userid) ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE INDEX presenceAnalysisSeenIndexTS ON presenceAnalysisSeen (ts);
GRANT SELECT ON exampledb.* TO 'grafana'@'localhost';
GRANT SELECT,INSERT,UPDATE ON exampledb.* TO 'nodered'@'localhost';
Now I used a simple JavaScript function to transform the incoming payload into a sequence of SQL insert statements that are then passed in sequence to the MySQL node.
// Node-Red function node: turn one scraped snapshot into a sequence of
// parameterized INSERT statements for the mysql node
let msgs = [];
let ts = Math.floor(msg.payload.ts / 1000);   // the scraper sends milliseconds, the table stores seconds
msg.payload.users.forEach(element => {
    msgs.push({
        "topic" : "INSERT INTO presenceAnalysisUserNames (userid, screenname) VALUES (:uid, :scrname) ON DUPLICATE KEY UPDATE userid = userid",
        "payload" : {
            "uid" : parseInt(element.id),
            "scrname" : element.screenName
        }
    });
    msgs.push({
        "topic" : "INSERT INTO presenceAnalysisSeen (userid, ts) VALUES (:uid, :ts) ON DUPLICATE KEY UPDATE userid = userid",
        "payload" : {
            "uid" : parseInt(element.id),
            "ts" : ts
        }
    });
});
return [ msgs ];
Now that the data is available let’s first do some basic visualizations:
The basic query that I'm using is just a simple select on the presenceAnalysisSeen table that uses integer division (realized via SQL's ROUND) to do basic binning of the timestamp values (this is also what Grafana's $__timeGroup macro would do), groups by these bins and the screen name that's fetched via a simple INNER JOIN on the user ID, and is then filtered by the currently selected time range of the Grafana dashboard using the $__unixEpochFilter macro. The measure for activity is simply the number of occurrences of each user inside a bin; the bin size is the number of seconds one divides and multiplies by - for example 300 seconds for a 5 minute bin size:
SELECT
ROUND(ts / 300, 0) * 300 AS "time",
screenname AS metric,
COUNT(ts) AS activity
FROM presenceAnalysisSeen
INNER JOIN presenceAnalysisUserNames ON presenceAnalysisUserNames.userid = presenceAnalysisSeen.userid
WHERE $__unixEpochFilter(ts)
GROUP BY time, screenname
ORDER BY time;
Unfortunately, with this simple query and graph setup, I did not figure out how to introduce NULL values for the times when people are not present.
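One thing that might help here - I have not verified this, so treat it as an assumption rather than a tested solution - is Grafana's own grouping macro for SQL data sources, which takes an optional fill argument. A hypothetical variant of the query above would then look like this:

SELECT
    $__unixEpochGroup(ts, '5m', NULL) AS "time",
    screenname AS metric,
    COUNT(ts) AS activity
FROM presenceAnalysisSeen
INNER JOIN presenceAnalysisUserNames ON presenceAnalysisUserNames.userid = presenceAnalysisSeen.userid
WHERE $__unixEpochFilter(ts)
GROUP BY 1, screenname
ORDER BY 1;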
As it turns out even these simple graphs provide quite a lot of insight into the daily behavior of people and allow one to separate different groups of people.
As a first test I checked on the first hours of gathered data. First a summary of the stacked 5 minute binned activity of the test group:
As one can see this whole group shared some common behavior - they were much less active before around 6 PM, most likely due to work. Then one can see a drop in activity before the news and prime time TV hours started, with a short increase in activity during the advertising break between those two TV blocks. Note that this is collective behavior. Individual (non stacked) behavior looks much more diverse:
If one looks at individual behavior one can see some people just checked in for about half an hour:
While other people had been active over a longer period of time:
The series plot also gives immediate information about the activity of individuals on the service's webpage or mobile app:
The next time I checked back was when the script had been running for nearly a full week. The first thing that one immediately sees in the collective behavior is the daily pattern. This worked best in the stacked 1 hour binned view:
What I found most interesting about the collective patterns is:
Then I took a look at the distribution of activity levels:
As one can see a single individual turned out to be way more active than anyone else (after asking around this turned out to be someone writing up a PhD thesis). But even this person showed the typical activity pattern that one sees for more active people, so it was not a client just left running 24/7 - it really was the service's usage pattern.
On the other hand I found one person who (also asked afterwards and got confirmation for this theory) used the mobile application version of this communication solution. Every time the phone went out of standby the application indicated available presence. This exposed - in addition to the daily usage pattern - the daily charging pattern of the mobile phone, so one could assume that this is the usual time of being home and most likely being asleep.
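Just to illustrate how little effort such an inference takes, here is a small SQL sketch on the tables above (timestamps interpreted as UTC epoch seconds) that lists the first and last presence of every user per day - the gap between lastSeen and the next day's firstSeen is essentially the candidate sleeping or charging window:

SELECT
    userid,
    FROM_UNIXTIME(ts - MOD(ts, 86400)) AS day,        -- start of the day (UTC)
    SEC_TO_TIME(MIN(MOD(ts, 86400))) AS firstSeen,    -- first presence hit that day
    SEC_TO_TIME(MAX(MOD(ts, 86400))) AS lastSeen      -- last presence hit that day
FROM presenceAnalysisSeen
GROUP BY userid, day
ORDER BY userid, day;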
After gathering data for a few weeks I then decided to take a deeper look. The basic ideas I wanted to tackle:
Since I had changed some details of how I gathered data in between, I first had to limit myself to a time span that used the same gathering method so I didn't have to compensate for those effects.
To get a feeling for whether there is a huge difference in how often people use the given service I first simply counted the events where they had been seen online. As one can see there are of course already some people who are way more active than others. This can already be used to define a normal range by calculating the usual five number summary and segmenting by quartiles:
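Counting how often each user has been seen online boils down to a single aggregate query against the tables above (just a sketch - the five number summary and the quartile segmentation can then be derived from these counts):

SELECT
    u.screenname,
    COUNT(*) AS timesSeen
FROM presenceAnalysisSeen s
INNER JOIN presenceAnalysisUserNames u ON u.userid = s.userid
GROUP BY u.userid, u.screenname
ORDER BY timesSeen DESC;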
Now let's get to more interesting stuff. Let's look at the average usage time of day by segmenting the day into quarter hours and collecting counts per bin:
To compensate for extremes one can simply use only the central quartiles and discard the users who use the service extremely often or extremely rarely:
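Such a time-of-day profile can be sketched directly in SQL as well (assuming the stored timestamps are epoch seconds and interpreting them in UTC; restricting the result to the central quartiles would additionally require filtering on the per-user counts from above):

SELECT
    FLOOR(MOD(ts, 86400) / 900) AS quarterOfDay,   -- 0..95, 900 seconds = 15 minutes
    COUNT(*) AS activity
FROM presenceAnalysisSeen
GROUP BY quarterOfDay
ORDER BY quarterOfDay;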
The next idea was of course to see how well people fit that average activity by normalizing the activity levels and comparing them to the average, yielding a score of how average a person is (or in other words how closely they adhere to the majority's common behavior). Note that this is of course already biased by excluding the extreme users from the baseline before - it somewhat marginalizes the patterns of around 50% of all users in this case, but that's usually not much of a problem since the bandwidth of normal behavior is usually pretty large.
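One way to compute such a score with nothing but SQL - a sketch that assumes MySQL 8 (CTEs and window functions), does not exclude the extreme users from the baseline as described above, and ignores the subtlety that quarter hours in which a user was never seen simply produce no rows - could look like this:

WITH profile AS (
    -- raw per-user counts per quarter hour of the day
    SELECT userid,
           FLOOR(MOD(ts, 86400) / 900) AS q,
           COUNT(*) AS c
    FROM presenceAnalysisSeen
    GROUP BY userid, q
), normalized AS (
    -- normalize every user's profile so it sums to 1
    SELECT userid, q,
           c / SUM(c) OVER (PARTITION BY userid) AS p
    FROM profile
), baseline AS (
    -- average normalized profile over all users
    SELECT q, AVG(p) AS pAvg
    FROM normalized
    GROUP BY q
)
SELECT n.userid,
       SUM(ABS(n.p - b.pAvg)) AS deviationScore   -- larger value = less "average" behavior
FROM normalized n
INNER JOIN baseline b ON b.q = n.q
GROUP BY n.userid
ORDER BY deviationScore DESC;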
Checking the real life behavior of the people marked red in the plot above, most of them indeed had some pretty obvious deviations from the behavior of the remaining group (being deceased with just a lingering profile, being retired while the majority was not, being unemployed in a crisis with strange sleeping habits, having major health problems, etc.). The people marked green are usually within one standard deviation of the original baseline data; the people marked blue seem to be over-compliant (or represent the seemingly ideal average behavior).
The next step was some correlation analysis. Since the chat service supports real time communication, people who communicate primarily with each other are likely to be active at the same (or similar) times. So I looked at the correlations between people's activity time series:
As one can see the dataset used is not really suitable for such an analysis since it mainly contains a single social group - which is visible from the structure of the matrix. The most interesting features are found in the less populated regions: some isolated symmetric local maxima there really do indicate people who most likely communicate with each other - for example couples who don't use the service often, but when they do they use it with each other.
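Whatever tool one uses for the actual correlation matrix, a rough SQL-only proxy is to count how many 15 second scrape slots two users share (a sketch; exact timestamp equality works here because all users seen during one scrape run are stored with the same ts):

SELECT
    a.userid AS userA,
    b.userid AS userB,
    COUNT(*) AS sharedSlots
FROM presenceAnalysisSeen a
INNER JOIN presenceAnalysisSeen b
    ON a.ts = b.ts AND a.userid < b.userid
GROUP BY a.userid, b.userid
ORDER BY sharedSlots DESC;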
The last thing I want to try with this dataset is anomaly detection - calculating an individual baseline behavior and spread for each person inside a sliding window and checking when they deviate more than usual from their own typical behavior. The sliding window has been chosen to be about one week so this should still yield useful results even when people go on holidays, etc. Hopefully I'll find some time (and enough additional collected data) in the next few weeks to add that ...
This article is tagged: Basics, Web, JavaScript, Data Mining
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/