Stealth collection of evidence using WGET

From time to time, I find myself needing to collect evidence against some of the bad guys on the Internet, including the numerous scammers and other shady characters who sell fake or malicious software to innocent victims.  It is an unbelievably tedious process, and that tedium surely contributes to the growing epidemic of cyber crime: in purely monetary terms it is often easier to let the perpetrators get away with it than to invest the time needed to pursue the matter further.

When I do come across a site that I need to collect evidence from, one of the most valuable tools is GNU's WGET, which is readily available on multiple platforms; my personal platform of choice is Ubuntu Server.

Now, on to the command-line options I use for some stealth evidence collection.

First of all, WGET honours the Robots Exclusion Standard by default, which is not what we want, because the bad guys usually set up a robots.txt file with a “Disallow: /” to tell spiders (like Googlebot) not to fetch any of their content.  To disable this in WGET, we use:

-e robots=off
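For reference, the kind of robots.txt you will typically come across on these sites looks something like this (the bad guys’ version of a keep-out sign):

User-agent: *
Disallow: /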

Secondly, and most importantly, we do NOT want them to know we’re using WGET to grab pages from their webserver, since this is a dead giveaway. So we need to use something like this:

--user-agent="Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)"

This sets the HTTP User-Agent header value so that WGET looks like just another web browser.  I usually pretend to be Internet Explorer, since at the time of writing it represents 80%+ of typical web surfer traffic; there are plenty of published lists of User-Agent strings to choose from if you want to impersonate something else.
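If you want to confirm exactly what WGET is sending, you can run it with the --debug flag and watch the request headers go by.  A quick sketch (http://example.com/ is just a stand-in target):

wget --debug --user-agent="Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)" -O /dev/null http://example.com/ 2>&1 | grep "User-Agent"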

Next, we need to be careful about how we connect to the bad guy’s web server.  We can’t just suddenly start collecting everything we see as quickly as possible, so we need some settings that will make our requests look more natural (human-like) and a bit slower than usual with the following:

--limit-rate=50k
--wait=10
--random-wait
--no-http-keep-alive

These settings limit the download speed to 50 KB/s (which I find is a workable speed), set a wait time of 10 seconds between requests, and then randomize that wait even further (--random-wait varies it between 0.5 and 1.5 times the base value, so between 5 and 15 seconds).  Finally, we disable HTTP keep-alive so that we do NOT do all of this over the one socket connection, because that would also be a giveaway that we are doing all of this in one hit.
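As a rough back-of-envelope example of what this pacing costs you: the randomized delays still average about 10 seconds, so a site of, say, 300 pages means roughly 300 × 10 s = 50 minutes of waiting alone, before any transfer time at 50 KB/s is counted.  Slow, certainly, but it looks far more like a human browsing than a script scraping.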

Lastly, we (usually) need to mirror the entire site, since the evidence we collect should be as complete as possible.

--mirror
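For what it’s worth, --mirror is currently just shorthand for a handful of other options (recursive retrieval, timestamping, infinite recursion depth, and keeping FTP directory listings):

-r -N -l inf --no-remove-listing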

When this is all put together, you should have something like this:

wget -e robots=off --user-agent="Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)" --limit-rate=50k --wait=10 --random-wait --no-http-keep-alive --mirror http://somewebsite.null/
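One optional extra: since the whole point of the exercise is evidence, you may also want a transcript of the session itself.  WGET will write its log to a file if you add the -o option; for example (the filename is just a suggestion):

wget -o evidence.log -e robots=off --user-agent="Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)" --limit-rate=50k --wait=10 --random-wait --no-http-keep-alive --mirror http://somewebsite.null/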

A good alternative, if you do this regularly, is to use a .wgetrc file (usually saved in your home directory), which saves these options to be used by default every time you run WGET.  It can look like this:

# Example .wgetrc file
user-agent = Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
robots = off
limit-rate = 50k
wait = 10
random-wait = on
http-keep-alive = off

Then you’ll only need to do:
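wget --mirror http://somewebsite.null/

Everything except --mirror is now picked up automatically from the .wgetrc.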

Good luck with your evidence collecting!  And take care out there.
