CSci 3501 Algorithms and Computability - Sorting competition.

About the sorting competition

The Sorting Competition is a multi-lab exercise on choosing the fastest sorting algorithm for a given type of data. By "fast" we mean the actual running time and not the Big-Theta approximation. You will get description of the data and a sample input file to practice, but the final test will be done on a different file.

A portion of your grade will depend on your place in the competition. Three groups that get the best time in the final run will get extra credit (the amount depending on the place).

The assignment will have two lab periods: one to get preliminary competition results and the second one for the final competition. In addition there will be a correctness analysis assignmment, a presentation, and a final write-up.

Make sure to make your Eclipse workspace read-protected to keep your "trade secrets" secret.

Data and sorting criteria

The data generator and the sample (not particularly inefficient) sorting are available at https://github.com/elenam/SortingCompetitionMaterials2014. The best way to study properties of data and to understand the sorting criterion is by experimenting with the data generator and studying the comparison method in the comparator.

The data consists of strings of 0s and 1s that are generated as follows:

Your goal is to develop and implement an algorithm that sorts these data by the following:

For instance, the following strings appear in a sorted order:

001110
100110
011101
00010001
01000100
11110100
10111111
011111001
111110110
0110010100
1100100110
0101111011
100100100000
000101110000
001010100010
100001001010
100100010010
001001010101
101000011100
101100000011
110100100010
001101011001
010111000011
011001111000
011100100101
101011001001
101111000100
110010111000
111100001100
001110101101
100100111011
100100111110
100011110111
110111110011
111111011100

The parameter for the Poisson distribution (lambda) is between 0 (exclusive) and 8 (inclusive), and will be supplied with data. Note that lambda is a double, not an integer. You might want to study how the data changes with different values of lambda.

Given that sorting takes a parameter, at every round of the competition there will be three runs (with three different lambdas), and your place will be determined by the sum of your places in all three runs. If the sum turns out to be equal for two groups, the sum of medians of the run times for each lambda will be the deciding factor.

Rules and technical requirements

  1. You must use Java. While this may not be the best language choice for efficiency purposes, this assignment is not about languages, it's about algorithms. Thinking about implementation details is important, however.
  2. Multithreading or any other form of parallel processing isn't allowed. The program will run pinned to a single processor
  3. Calls to non-Java methods are not allowed.
  4. The program must take three command-line arguments, in this order:
  5. Your program must read data from the input file, store it internally in a data structure of some kind, sort it, and then output it into the output file. This link may be helpful for file reading/writing: http://www.javapractices.com/topic/TopicAction.do?Id=42 ; also please study the sample code on github.
  6. The first line in an input file is lambda, the rest is strings to be sorted, one per line.
  7. You need to implement the timing in the program, similarly to lab 1 (using System.currentTimeMillis()). The timer starts after you read the file and stops before you start writing out the result. You might want to stop the current thread for a few milliseconds after reading the file before you start the timer to allow time to close the file reader.
  8. Your program will consist of the following three parts:
    1. Reading data and preprocessing. Read the data from the file given as the first argument. Store the data in a structure of your choice (such as an array or an array list; you are not allowed to use a structure that provides some ordering for elements, such as a hashmap). The nubmer of elements is not known ahead of time.
      At this point you are allowed to collect information about the data as you are reading it. However, all such information must be stored in a constant space (a few variables and/or an array of a constant size with respect to the size of the data; the size of such array may depend on lambda). Your constant must be relatively small.
    2. Copy-and-sort loop. The purpose of the loop is so that we can better distinguish closer sorting times by repeating the sorting multiple times and taking the total. It also reduces interference from interactions with the operating system, such as file closing and a call to the timer. The number of times the loop runs is one of the command-line arguments of the program. The loop must be as following:
      • You must start the timer before the first itertation of the loop and stop it right after the last one. Do not restart the timer for every iteration since this leads to waiting for the OS and therefore less accurate time readings.
      • Inside the loop you first copy the data from the original container into a new one (which may or may not be the same type as the original one; it may not have any ordering).
      • Then you sort the data using any methods you would like (your own or pre-defined). At the end of the loop the data must be in the container that it was copied into in the beginning of the loop and it must be sorted.
    3. Writing output After the timer is stopped, write out the sorted data from the last loop iteration into the output file that was specified as the command-line argument, one string per line.
    4. Write out the time as measured by the timer to the standard output (console).
  9. You may use system diff command to check if your resulting ordering is the same as the one given by the sample sorting.

Round 1 (Oct. 9th): what to submit.

Every group will receive a number and should name their main file GroupN, where N is the number that you got. All your supllemental files can have any name. Please use the default package and assume that the input and output files are at the root of the project, so no file path is needed. Please make sure that your program takes the right command line arguments in the right order.

The first line of the input file is the value of lambda, for example 2.5.

Please submit all your Java files to me (CC your group) by email. You must have authors' names in every file and there should be comments clarifying what you are doing, but no other documentation is required for this round.

Round 2 (tentatively Oct 23rd): what to submit.

After the sorting programs trial run, continue working on your sorting program. For the final submission include, in addition to your Java files, a write-up that describes your sorting algorithm. Discuss its efficiency both in theory (the Big-O) and in practice: effects of programming choices, such as an array versus ArrayList, using specific Java methods, etc.

Use the same file name as you used in the first round. Make sure to include authors' names. Your code will be posted at this point.

If you changed anything in your program after the initial run, please briefly indicate what you changed and why. You also might want to include things that you tried that didn't work out.

Correctness Analysis.

Every group will be assigned a group to review. Your goal is to verify that they followed all of the rules. If any of the rules were broken, you should also assess if this has given the group unfair advantage in the sorting time. If all the rules were followed, please briefly justify your finding, for instance: "the resulting array is sorted because the numbers were stored as integers and Java-provided quicksort was applied to sort the integers".

Please upload your review to google docs and share it with me and to all members of the team you are reviewing.

Correctness analysis: for the program given to you please verify that all the rules have been followed, including the following:

Additionally you may analyze a sorting program from any group other than the one that the one you are assigned. This will not give you extra credit, unless you were the first to discover a problem that the assigned reviewing group missed. This part can be done individually or in groups.

Correctness Analysis Response.

If the group that was analysing your algorithm found any issues, you have an opportunity to respond to them (by e-mail, CC me) if you disagree with their findings. I will be happy to discuss this with you. Note that if you don't send a response before the presentation time then I assume that you agree with their findings.

Presentation.

You need to prepare a 3-5 minute presentation about your algorithm. Prepare 3-5 slides about your algorithm (PDF or OpenOffice so that you can display them in the lab).

Your presenation has to address:

Keep in mind that lists and tables work better for slides than long sentences.

Please take notes when listening to the presentation; you will need those for the final write-up.

Final Write-Up.

The final write-up has two parts: the description of your algorithm (if you presented on another group's algorithm then of their algorithm), and a summary of what algorithms overall turned out to be the most effective for this problem and why.

Each of the two parts should be about a page long.

The algorithm description should follow the guidelines for the presentation, only be more detailed. Examples (of data storage, etc.) help.

The algorithm comparison should list all approaches that were used for the problem by all teams (e.g. "three groups used variations of radix sort") and their comparative effectiveness. Note that data represntation plays a significant part in efficiency. What have you learned through the competition?