Lab 1 2020: Catching a Hacker: Statistics in Networks


The goal of this lab is to apply your knowledge of summary statistics and methods for displaying and understanding data to a task that you you are likely to face in a future job.

You are working as a Network Engineer for a small company and part of your job involves managing the company networks.

The company network has a dedicated server that is configured to allow incoming ssh connections to all staff.

This allows any staff member to be able to access the server from anywhere in the world using programs such as PuTTY (Windows) or ssh (OSX, unix variants).

Misuse of this service is protected against by requiring a user attempting access to enter their staff username and password. Failure to provide the correct password held on file for the given username will cause the connection attempt to be rejected and details of the failure to written to a system log.

Over the last past month you have noticed that overnight jobs like the payroll program have been taking much longer to complete than usual indicating that the server must being used for other tasks at the same time. However, no other programs are scheduled to run at that time and staff claim that they are not logging in at night to run unscheduled ones.

You suspect that a hacker has gained access to your server by guessing the username and password combination.

Your task in this lab is to carry out to determine if this is the case and to determine where in the world these hackers might be based.

You will be using a spreadsheet to do analysis of the ssh program system log. We assume you are using LibreOffice Help but you are welcome to use your preferred spreadsheet or other tools.

The Source Data

You have been provided the system log generated by the sshd program.

It contains data from two periods.

The first is from March 16th to March 30th, while the second is from April 15th to April 26th.

Each line in the file is a log message, only three types of log message appear in this file.

Successful login

The log message below is generated by a successful login event:

Apr-15 12:02:55 app-1 sshd[9418]: Accepted password for valid_user user1 from port 33553 ssh2

This tells us that on April 15th at 12:02:55 that a staff member with the username user1 successfully logged into the system from a machine somewhere in the world that has the IP address

Unsuccessful login

There are two cases here. In the first case, the log message is generated by an unsuccessful login event where the wrong password was used:

Apr-15 14:47:52 app-1 sshd[10174]: Failed password for valid_user user1 from port 33737 ssh2

This tells us that on April 15th at 12:02:55 that a staff member with the username user1 provided an incorrect password when trying to login from a machine with the IP address

This could be accidental or someone trying the guess the correct password.

In the second case, the log message is generated by an unsuccessful login event due to attempting to access the system using a username that doesn't exist:

Apr-19 05:19:10 app-1 sshd[7169]: Failed password for invalid_user admin from port 41483 ssh2

This tells us that on April 15th at 12:02:55 that a staff member with the username admin provided an incorrect password when trying to login from a machine with the IP address

This could be someone mistyping their username or a hacker trying to guess the right username to use.

Dictionary Attacks

You will be looking for evidence of a dictionary attack on our server. A dictionary attack is a method of breaking into a password-protected computer or server by systematically entering every word in a dictionary as a password.

Hacking tools exist to automate this process and are widely available.

Some characteristics of a dictionary attack are:

  • Unusually large numbers of failed logins due to incorrect username
  • Unusually large numbers of failed logins due to incorrect password

An attacker who has successfully guessed password will subsequently use that to get access to the target machine. Note that it is possible that a user who is just poor at remembering password might generate a similar pattern of access.

The video below explains in more detail how a dictionary attack works and a tool for doing it.

DISCLAIMER Note that we are not suggesting you go out and try this yourself. It is illegal unless you are doing it on a computer system that you own. Want to know more? The good folks at Kiwicon put it this way.

CORE (50%)

Part 1: Importing the Data into LibreOffice Calc

We want to turn the data from a text file into a spreadsheet file so that we can use the column organisation to help us manage the data.

The idea is to use the SPACE character to separate the parts of each log message into columns.

Follow the steps below to create your spreadsheet:

  1. Start the application LibreOffice Calc.
  2. Choose File - Open.... The Open dialog appears.
  3. Choose sshd.log and confirm with OK. The Text Import dialog appears.
  4. Check that only Space is checked under Separated by and confirm the import with OK.

Your spreadsheet should now contain all of the log data with the different fields separated into columns.

Note that the first row are the field names we have assigned to the different parts of the log messages to help you with the analysis.

Part 2: Using a Pivot Table for Analysis

We are going to use a tool called a Pivot Table to do most of our analysis. According to Wikipedia this is "... a data summarization tool found in data visualization programs such as spreadsheets or business intelligence software. Among other functions, a pivot table can automatically sort, count total or give the average of the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data. Pivot tables are also useful for quickly creating unweighted cross tabulations. The user sets up and changes the summary's structure by dragging and dropping fields graphically. This "rotation" or pivoting of the summary table gives the concept its name."

Click on a cell somewhere within your spreadsheet, then choose Insert - Pivot Table. The Select Source dialog appears. Accept the default options by confirming with OK. You should get a window like this:


Note that along the right side of this window all of your fields are listed.

To set your different field in the Pivot Table you just drag-and-drop each field to the appropriate area in the window.

  • Page Fields – This is a place to potentially limit the data to one value in one of your columns. You won't need to use this option for the lab.
  • Data Fields – This area must contain at least one variable. Fields in this area are aggregated in some way (sums, counts, maximums etc.). The idea is that you are going aggregate each data field for some set of values to be defined in the Row and Column fields.
  • Column Fields – Whatever you put here will be a column in the resulting Pivot Table.
  • Row fields – And these will be rows in the resulting Pivot Table

We want to count how many accepted and failed logins contained in the data set.

Drag the ACCEPTED-FAILED field into the Row Fields box and the IP-ADDRESS field into the Data Fields box. Note that the Data Fields box now contained Sum - IP-ADDRESS. This means that it will add the values in this column together instead of counting them. Change this by double-clicking on Sum - IP-ADDRESS to bring up the Data Field dialog. Choose Count and choose OK to confirm your selection.

Choose OK to create the Pivot Table. This will appear in a new tab.

You should now see the following result:


You can also do a lot of other useful things with Pivot Tables:
  • Sort the results.
  • Filter the output.
  • Group by more than one field at a time,

Now, use a Pivot Table to answer the following question.

Q1. How many logins failed because an incorrect password was used?
Q2. How many logins failed because an incorrect username was used?

Part 3: Evidence of Attack

In the previous section you should have noticed that the number of failed attempts is much higher than those accepted.

You now need to build a baseline to see if there have been any changes between March and April.

When computing averages, remember that March covers a 15 day period and April covers a 12 day period.

Q3. How many successful logins were there in March and how many in April?
Q4. What were the average number of successful logins per day for each month?
Q5. How many failed login attempts were there in March and how many in April?
Q6. What were the average number of failed login attacks attempts per day for each month?
Q7. Refer back to the characteristics of a dictionary attack and discuss whether these apply to the observed change in login activity between March and April?


We do believe that we have been under the attack.

We now want to understand the pattern of attack in terms of change over time, what usernames did they target and identify if any of our staff accounts may have been compromised.

Part 1: Number of Attempts

Using a graphic representation is a useful way to look for patterns in the number of attempts over time.

Make sure that the X axis is ordered in increasing date order.

Q8. Create a bar chart showing the number of successful logins per day as recorded in the log.
Q9. Create a bar chart showing the number of failed login attempts per day as recorded in the log.

Part 2: Choice of Usernames

We will use frequency counts to determine the top five usernames targeted by our attacker.

Q10 What are the top five usernames used for failed login attempts? Note that some of these will be real staff accounts while others will just be guesses by the attacker.
Q11 Which of those top five usernames are real staff accounts?
Q12 Consider your answers to Q9 and Q10. Why do you think certain usernames were chosen for special attention and how successful were the attackers at guessing actual staff accounts?

Part 3: Origin of Attack

Anyone attempting to login into our server leaves the IP addresses of their machine recorded in the event log. This information is sufficient to make fairly accurate guesses about the geographical location of that machine. This is done by using vast reverse lookup tables that associate IP addresses (or ranges of them) with physical locations. These tables have been developed by harvesting publicly available information by companies that offer paid geoip location services.

We can use this approach to determine the most popular country of origin of our attackers.

To lookup the physical location of a given IP address you can make use of a free service provided by Simply go to their webpage and type in an IP address in the upper right hand corner. For example, looking up the IP address is associated with a machine located in the USA.

Given that you have to do manual lookups we are restricting the scope of our investigation, a full analysis would require resolving every IP address in the data set and we would exceed our free quota for use of the Geo IP service.

Q13 What are the top three IP addresses in terms of number of failed login attempts?
Q14 Identify which countries each of those IP addresses is hosted within and what was the country from where the most attacks came from?


We now know that we were attacked and have some knowledge of who the attackers were and their methods.

However we still do not know which of our staff accounts might have been successfully compromised by the attacker and is now being used by them for their own nefarious ends.

Q15 Identify all of the staff accounts that may have been successfully compromised by an attacker. How did you do this?
Q16 Rank the potentially compromised staff accounts in terms of how likely they are to have been real compromises as opposed to case where are user occasionally forgets their password. Justify your ranking process.


You can download the sshd log file here:

Please submit your answers as a spreadsheet, we suggest using a separate tab for part and put the answers to the questions in textboxes or similar in upper right corner of each tab.
Topic attachments
I Attachment Action Size Date Who Comment
pivot-table.pngpng pivot-table.png manage 37 K 13 Sep 2016 - 09:44 Main.sarcher  
result.pngpng result.png manage 6 K 13 Sep 2016 - 09:44 Main.sarcher  
sshd.loglog sshd.log manage 2 MB 13 Sep 2016 - 09:44 Main.sarcher