XMUT103
Assignment 4: Analysing and Measuring Sets, Binary Trees, Sorting and lambdas

days Due 11 May , 19 pm

Download zip file of necessary code and data and extract it to an Assig4 folder (eg, in a COMP103 folder in your home folder). It should contain templates for the Java programs that will be marked.

ALERT! Important note about the style of your code.

To submit

  • Report.pdf
  • DecisionTree.java
  • EarthquakeSorter.java and Earthquake.java

ALERT! Do not rename these files.

Remember to submit all of these files. When you have submitted them, check that you can read the files listed on the submission page, and complete the submission process.

Part 1: Analysing and Measuring Sets (Weight: 15%)

We want to find the number of unique words within a given file of words. You can use three different types of collections to do this: ArrayList, HashSet and TreeSet.

The VocabularyMeasure.java file has an example code for measuring the cost of finding the number of unique words.

Read and understand the code, write a brief report (submit as Report.pdf) to:
  1. Give your analysis of the costs from the following parts using O(...) notation.
  2. Draw a graph to show how efficient each of the collections (i.e., ArrayList, HashSet and TreeSet) is with an increasing number of words. The x-axis (the dependent variable) of the graph should be the number of words and the y-axis (the independent variable) should be time (in seconds).
  3. Document your conclusions, and how you arrived at these conclusions.

(a): ArrayList

If you used an ArrayList with n items, what would be the cost? (Express the cost in big-O notation in terms of n).

HELPHint: For each line, work out the cost of doing that line once, and the number of times (worst case) that the line will be repeated. Then fill up the total cost line.


int count = 0;                                          // cost= O(  ) times= 
List<String> vocab = new ArrayList<String>();           // cost= O(  ) times= 
    try{
        Scanner sc = new Scanner (Path.of(fname));      // cost= O(  ) times= 
        while (sc.hasNext()){
            String word = sc.next();                    // cost= O(  ) times= 
            count++;                                    // cost= O(  ) times= 
            if(!vocab.contains(word)){                  // cost= O(  ) times= 
                vocab.add(word);                        // cost= O(  ) times= 
            }
        }
        sc.close();                                     // cost= O(  ) times= 
    }catch(IOException e){UI.println("bad"+e);}
UI.printf("Words = %,10d  Vocab = %,d\n", count, vocab.size());

                                                        // Total Cost = O( )

(b): HashSet
If you used an HashSet with n items, what would be the cost? (Express the cost in big-O notation in terms of n).

int count = 0;                                          // cost= O(  ) times= 
Set<String> vocab = new HashSet<String>();              // cost= O(  ) times=
    try{
        Scanner sc = new Scanner (Path.of(fname));      // cost= O(  ) times=
        while (sc.hasNext()){   
            String word = sc.next();                    // cost= O(  ) times=
            count++;                                    // cost= O(  ) times=
            vocab.add(word);                            // cost= O(  ) times=
        }
        sc.close();                                     // cost= O(  ) times=
    }catch(IOException e){UI.println("bad"+e);}
UI.printf("Words = %,10d  Vocab = %,d\n", count, vocab.size());

                                                        // Total Cost = O( )

(c): TreeSet
If you used a TreeSet with n items, what would be the cost? (Express the cost in big-O notation in terms of n).

int count = 0;                                          // cost= O(  ) times= 
Set<String> vocab = new TreeSet<String>();              // cost= O(  ) times= 
    try{
        Scanner sc = new Scanner (Path.of(fname));      // cost= O(  ) times= 
        while (sc.hasNext()){
            String word = sc.next();                    // cost= O(  ) times= 
            count++;                                    // cost= O(  ) times= 
            vocab.add(word);                            // cost= O(  ) times= 
        }
        sc.close();                                     // cost= O(  ) times= 
    }catch(IOException e){UI.println("bad"+e);}
UI.printf("Words = %,10d  Vocab = %,d\n", count, vocab.size());

                                                        // Total Cost = O( )

Part 2: Decision Trees (Binary Trees) (Weight: 60%)

A Decision Tree is a structure for getting to a decision by following through a tree of yes/no questions.
  • The root node of the tree is the starting point.
  • Each internal node contains a question and two children:
    • the Yes child is the subtree to follow if the answer to the question is yes,
    • the No child is the subtree to follow if the answer to the question is no
  • A leaf node of the tree contains the decision that the tree will give if the answers to the questions led to this point in the tree.

For this question, you are to complete a program for using, growing, displaying, saving, and loading decision trees.

The program uses the DTNode class to represent the nodes in the tree. DTNode.java is already complete, and defines constructors, getters, setters, and a draw method.

The DecisionTree class is the main program. The setupGUI method is written for you and sets up buttons for the methods that you need to complete. The loadTree method is also written for you and will create an initial decision tree in the theTree field.

You should complete the following methods

Core

  • printTree() should print out the whole tree. printTree() should be recursive - printing the text in a node, then (if it has children) printing the yes subtree and then the no subtree.
    For example, the sample-animal-tree should be printed as:
    has whiskers?
    yes:bigger than person?
    yes:Tiger
    no:Cat
    no:has legs?
    yes:has trunk?
    yes:Elephant
    no:Rhino
    no:Snake

  • runTree() should "run" the current decision tree by asking the user questions, following down the tree from the root to a leaf, according to the answers. When it gets to a leaf, it should print the answer.
    For example, with the sample tree, runTree() might generate the interaction:
        Is it true: has whiskers (Y/N): yes
        Is it true: bigger than person (Y/N): no
        The answer is: Cat

  • growTree() should play an interactive "game" with the user to grow the tree by adding new answers and new questions. For the sample tree, this would be adding new animals that the tree can classify and new questions to distinguish the new animals.

growTree() should start the same way as runTree by starting at the root of the tree, and working down, asking questions, until it gets to a leaf node.
When it is at a leaf, it prints the decision and asks the user if the decision is correct. If the decision was wrong, it
    • asks the user what the decision should have been,
    • asks for a property or question to distinguish the right decision from the wrong one
    • changes the text in the node to be the question
    • adds two new children (leaf nodes) to the node with the two decisions.

For example, with the sample tree, the interaction might be
    Is it true: has whiskers (Y/N): yes
    Is it true: bigger than person (Y/N): no
    I think I know. Is it a Cat? no
    OK, what should the answer be? Dog
    Oh. I can't distinguish a Cat from a Dog
    Tell me something that's true for a Dog but not for a Cat?
    Property: barks
    Thank you! I've updated my decision tree.
Printing the tree would now produce:
    has whiskers?
       y:bigger than person?
          y:Tiger
          n:barks?
             y:Dog
             n:Cat
       n:has legs?
          y:has trunk?
             y:Elephant
             n:Rhino
          n:Snake

Completion

  • Improve printTree() to print the decision tree using indentation and y/n to show the tree structure.
    For example, the sample-animal-tree.txt should be printed as:
    has whiskers?
       y:bigger than person?
          y:Tiger
          n:Cat
       n:has legs?
          y:has trunk?
             y:Elephant
             n:Rhino
          n:Snake

  • saveTree() should save the decision tree to a file in a way that it can be loaded back in (by the loadTree() method). The method will need to be recursive.

The format of a decision tree (the sample tree) in a text file for saving and loading:
    Question: has whiskers
    Question: bigger than person
    Answer: Tiger
    Answer: Cat
    Question: has legs
    Question: has trunk
    Answer: Elephant
    Answer: Rhino
    Answer: Snake

Challenge:

  • drawTree() should display the tree in the graphics window so that each node is drawn as a rectangle with text in it, and there are lines connecting parent nodes and their child nodes. See a simple example below.
    dt.png

  • Yes/No questions are quite limited. Extend the program (including DTNode) to allow multiway questions, for example a node could contain the property: "the color of the stripes", which might lead to five different children, corresponding to the answers "black", "brown", "green", "yellow", and "orange". Further extend these methods:
    • growTree() should be able to add a new answer to an existing question as well as add a new question.
    • runTree()
    • printTree()
    • saveTree()

Part 3: Earthquake Sorter (25%)

Earthquake Sorter is a small tool for quickly sorting and analysing the data for earthquakes within New Zealand. The program is able to sort based on various properties of the earthquakes, including magnitude and depth, location, date and time, and ID.

The EarthquakeSorter program has two classes:
  • The EarthquakeSorter class, which handles the user interaction, stores the list of earthquakes, and contains several methods for sorting the earthquakes according to various metrics.
    • The user interface has buttons for loading the earthquake data, as well as sorting it in various ways.
    • The loadData method loads the earthquake data from a data file (two have been provided: earthquake-data.txt and bigearthquake-data.txt).
    • The sortById method sorts the list of earthquakes according to their IDs.
    • The sortByMagnitude method sorts the list of earthquakes according to their magnitude, largest first
    • The sortByTime method sorts the earthquakes by date and time, with earlier events occurring first in the list.
    • The sortByRegion method sorts the earthquakes according to the region they occurred in. Within a region, earthquakes are sorted by magnitude (highest first). If two earthquakes have the same region and magnitude, they are sorted by depth (more shallow first).
    • The sortByProximity method asks the user for a latitude and longitude, and then sorts the earthquakes by distance to that location (with closer earthquakes at the top of the list).

  • The Earthquake class which represents information about a single earthquake.
    • The constructor in the Earthquake class requires a lot of parameters, and populates all of the relevant fields.
    • The distanceTo methods calculate the distance between this earthquake and another earthquake, or this earthquake and a specified latitude and longitude.
    • The get methods return the values of the fields.
    • The toString method provides a nicely formatted string for printing the earthquake.
    • The compareTo method provides the default ordering of earthquakes (by ID). (You need to complete this.)

Core

In Earthquake.java:
  • Complete the compareTo method to provide a natural ordering based on the ID. You may want to see if the String class has any methods you can use to help you here.

In EarthquakeSorter.java:
  • Complete the loadData method to load the data from a file and store it in the list of earthquakes.
    • Each line of the data file is one earthquake.
    • The file format is: ID date time latitude longitude magnitude depth region
    • Each line should be used to create an Earthquake object, which should be added to the list of earthquakes.
  • Complete the sortByID method to sort the list of earthquakes using their natural ordering.
  • Complete the sortByMagnitude method to sort the list of earthquakes according to their magnitude (largest first).

Completion

  • Complete the sortByTime method to sort the list of earthquakes according to the date and time that they occurred.
  • Complete the sortByRegion method in EarthquakeSorter.java to sort the list of earthquakes according to region. If two earthquakes have the same region, they should be sorted by magnitude and then depth.

Challenge

  • Complete the sortByProximity method in EarthquakeSorter.java to sort the list of earthquakes according to their proximity (distance) to a given location (closest earthquakes first). The earthquakes should be printed out with their distance to the location.
    • Hint: You may find the distanceTo method in Earthquake.java useful here.
      Note that this is not a particularly challenging challenge!

Note 1: We have provided output examples for each of the sorting methods, based on the bigearthquake-data.txt file. Use these to check your code and make sure that it is outputting data in the right order. EG:
  • output-ID.txt listing by ID.
  • output-Magnitude.txt listing by Magnitude.
  • output-Time.txt listing by date and time
  • output-Region.txt listing by region, and magnitude then depth within region
  • output-Proximity-ChCh.txt listing by proximity to Christchurch (location -43.5321, 172.6362)

Note 2: If you select "Set Input" on the UI frame's MENU then choose chch.txt, the UI will automatically input the latitude and longitude data from that file. This may help with testing your program when working on the completion and challenge.