In the 1960s a number of new and efficient methods for sorting records stored in the main memory of a digital computer were developed: Merge sort, Heap sort, and QuickSort ran rings around their older competitors, such as Insertion and Bubble sort. They were so much better than the older methods that they appeared to work instantaneously, they were free, and they became widely implemented on most computers. Since that time, relatively little effort has gone into developing new methods for sorting, and numerous books and articles have been written on the mistaken implicit assumption that the problem had been solved as well as possible, for all time. A few people working on arcane real-time systems found their sorting performance to be unpredictable, dependent on the original order of the data (running QuickSort, normally the fastest of the group, on an already sorted list can be several times as slow as sorting a randomly ordered list), and they continued research to reduce the number of times data was moved around in the process. The instability of some algorithms was also an annoyance: records with equal keys were frequently switched during sorting.
In this article we will examine ABCsort, a different, revolutionary approach to internal sorting that is both faster than the classical methods in software and amenable to Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) implementations, where its speed advantage is even greater.
Most classical internal sorting methods are based on what is called the "comparison model" of computing: a framework for the design and implementation of algorithms that is driven by the comparison of data items. Everything from Bubble sort to HeapSort to QuickSort is designed on the basis of this model, and algorithms based on sorting, such as Median Filters and Convex Hull algorithms, can fall into the comparison model. As we shall see, this is not the only model of algorithm design, nor is it the best in all cases. Academically, the comparison model has been useful because it encourages extensive performance analysis, especially worst-case analysis, of the algorithms. However, the performance of this class of algorithms is bounded by the theoretical O(N log N) lower limit on sorting using comparisons.
The classical algorithms compare (key fields in) records and exchange them as needed to establish the desired order. Variants of these algorithms obtain improved performance by exchanging not records and keys, which may span several words, but single-word pointers to them. Along the way, they may produce an Index Table (IT), which is a list of pointers to the records in the desired order. (As a concrete example, the MATLAB function [A, IT] = sort(A) sorts a vector A of real numbers and produces an index table IT.) The pointers may be the actual addresses of records or (more usually) offsets from the base address of an array of records. The IT, usually a side effect of the sort, is then used to move records from one place to another, not by making sequential references to A[i] but to A[IT[i]]. Shuffling single-word IT entries is usually a lot faster than shuffling the records they point to. QuickerSort, the version of QuickSort that moves only pointers, is the fastest on average of the comparison methods.
Get this idea into your head now: ABCsort does not do key comparisons. Ever. It produces the IT directly, and in sequence, so you can use each IT entry immediately as it is generated, or save the whole table for later. But no records or keys are moved, no keys are compared, and any characters that don't matter are not even examined. (If there is only one key that starts with Q in your database, who cares what's in the rest of the record? It belongs between the P's and the R's, and that's all we need to know. By the time we discover that there is only one 'Q' key, ABCsort is done with all the P's, so that record is next in sequence.)
There are some peculiar advantages to using and saving the ITs without moving the records around. For one, it is possible to sort the list of records on different keys while maintaining the validity of previously produced ITs. So, for example, we can sort a file of telephone numbers and owners by either number or name, producing two views of the file that both work at the same time. Even if we append records to the file, old ITs remain valid, although the new records are not taken into account. On top of that, if we have available a shared-memory multiprocessor, we can sort a file in shared memory by multiple keys (for example, caller name, caller number, callee name, and callee number) in parallel, producing one IT per processor. And since ABCsort produces each IT in order, we can do some neat things with partial sorts, by cutting off the work at some intermediate point. For example, we can get a "top ten" look at the data, or stop halfway through with a statistically useful median value.
ABCsort is a relatively new algorithm (patented by Allen Beechick in 1993) that produces a virtual IT directly, without doing any comparisons or moving any record/key data. It is called a virtual IT because the table entries are produced in sequence and can be used and discarded immediately; they need not be stored. In producing the IT, ABCsort examines every key character at most once, so its performance is O(N) in both time and space (strictly it's O(N*K), where K is the number of key fields, but since K is fixed for any application, that is O(N)), and largely because it never moves records or keys, it outperforms all of the alternative methods. For example, in sorting a list of 10,000 records with 30-character keys, ABCsort is about 2.5 times as fast as QuickerSort, the variant of QuickSort that moves pointers to keys rather than the keys themselves. On a 4-processor SPARC-10 system, we have succeeded in sorting a file on 4 keys ten times as fast as the task could be done on a single-processor system using QuickerSort.
A fully parallel sort of a file on one key with two processors can be done by sorting in both ascending and descending order, and building the IT in halves from both ends toward the middle.
In application, ABCsort is split into two parts: (1) a pure sorting engine that builds virtual ITs without reference to any information about key semantics, and (2) a pair of functions, supplied by the user, that (a) provide all the key semantics and values, and (b) dispose of the sequential IT entries. The key semantics function, GetNextField, simply responds to a request by the sorting engine for the value in a specific key field of a specific record. In the case of character keys, its response is normally a single ASCII character. For more complex keys (e.g. floating-point numbers) the key semantics function returns a subfield of the datum, in most-to-least significance order. Since the sorting engine does no comparisons, the ability to do floating-point subtraction or comparison is not required. With the right key semantics, keys expressing ASCII dates, e.g. in DD-MMM-YYYY format, can be sorted without conversion to an internal form.
The core sorting engine is a single procedure whose logic is constant; it can be implemented in hardware as well as in firmware or software. It has two parameters: the number, K, of subfields in a key and the number, N, of records. The algorithm does not care where the records are stored or what form they take; it only asks for specific key bytes from specific records. Its access to the keys is read-only: they might well be stored in a ROM. As the algorithm obtains information, it builds and manipulates two tables, called the Record Table (RT) and the Letter Table (LT), to produce the IT entries in sequence. The application may require storage, as described below, to keep track of the IT.
The LT is a table of size <alphabet size> * <number of key fields>. As the algorithm proceeds, LT is occupied by pointers to entries in the RT, each erased when no longer needed. The RT is a vector of length N that is used to construct threads of records that have similar keys. Although it is rarely necessary, the RT can be used to store the real IT, since once the IT position of a record is determined, its RT entry is no longer needed.
The key semantics processing function GetNextField provides all the information required by the sorting engine in the form of short fields (typically corresponding to a single byte); it accepts as inputs the desired record number and the key field number. Doing its job may require somewhat convoluted (but efficient) code to decide what value to return. This is a function that definitely rewards the user's ingenuity (e.g. implementation as an in-line macro) with superior performance. Where the keys are ASCII strings, it simply returns the appropriate ASCII character, but other types of keys require more effort.
Let's take the example of using a date as a key. Most algorithms require that we first convert the date to an integer, but it's faster to leave it in DD-MMM-YYYY format. The core algorithm will ask for the fields in the order Y1, Y2, Y3, Y4, MMM, D1, D2, so GetNextField can start with:
switch (field) {
case 1: return date[7];      /* Y1 */
case 2: return date[8];      /* Y2 */
case 3: return date[9];      /* Y3 */
case 4: return date[10];     /* Y4 */
case 5: return hashmonth(3); /* MMM starts at date[3] */
case 6: return date[0];      /* D1 */
case 7: return date[1];      /* D2 */
}
where hashmonth(n) can be any function that returns 1...12 for a month name starting at character n. My favorite is
date[n] + date[n+1] - date[n+2] - constant,
which returns a unique index into a table of 31 values, of which twelve are 1...12 and the others are zero. Other languages, other hash functions. Note that the sequence of values representing the months need not be 1...12 nor even contiguous, only in order. 0...5, 10...15 works fine.
Sorting any list into reverse order is easy: just return the complement of the usual data in each field. The order can even be changed within keys, a feature not found in other sorts.
The other function (or macro, for speed) the user needs to provide, PutCurrRecord(RecordNo), simply deals with the IT and its associated values. It is handed the next IT entry in sequence, so a body as simple as "print data[RecordNo]" will display all the sorted records in order. Or,
...
#define PutCurrRecord(r) IT[++ITindex] = (r)
ITindex = 0;
...
will cause the IT to be saved for later.
Those two functions are all there is to the data semantics and control of the index table.
Now let's look at how the sorting engine does its job. Its action takes the form of two passes. On the first pass it looks at the first key field of every record and links each record number into a thread of record numbers in the Record Table (RT); each thread emanates from the entry for its key value in the top row of the Letter Table (LT). The second and subsequent passes are recursive: we walk along the current LT row, looking for nonempty entries. When a nonempty entry points to an empty RT entry, it is a record number that is ready to be output, and PutCurrRecord is called without further ado. If the LT entry points to a nonempty RT entry, then the two are among duplicates that need to be resolved, and we drop down a level and do a similar pass on the subsequent letters of all records in the current thread. Finally, when we reach the maximum key length -- the bottom of the LT -- we output every remaining record indicated by the current LT entry, since no further resolution of duplicate keys is possible. When each level of the LT has been fully resolved, we resume analysis at the next higher level, until that level has no more nonempty entries. When the top level is empty, we are done.
ABCsort is stable; that is, records with equal key values appear in the output in the same order as they appeared before sorting. This is ensured, when necessary, by inserting the records into the RT threads in reverse order during the first pass, so that the reversals performed by the later passes leave equal-keyed records in their original order. Duplicate key values can optionally be suppressed (keeping the last only) by reversing the stable ordering and ignoring all but the first entry (the chronologically last) for each key value.
At this point it is time to examine the sorting engine code. A simple version of the algorithm (in C) is given below. After you take a look at this version, we will discuss speed enhancements that make ABCsort so hot.
/* ABCsort in C
** The sorting algorithm represented by this program is the property of
** Allen Beechick and is protected under U. S. Patent No. 5,218,700.
** Use of this algorithm without permission of the owner is prohibited
** by law.
** The author of this program is:
** Lynn D. Yarbrough
** Mathematical and Computational Sciences
** Palm Desert, CA
**======================================================================
*/
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
/* Supplied elsewhere by the application: N, alphamin, alphamax,
** noduplicates, GetNextField, the PutCurrRecord macro, and the
** Numerical-Recipes-style allocators lvector/lmatrix. */
extern long N; /* Number of records */
extern int alphamin, alphamax; /* Range of field values */
extern int noduplicates; /* Nonzero: keep only the last of equal keys */
extern int GetNextField (long, int);
extern long *lvector (long, long);
extern long **lmatrix (long, long, long, long);
extern void free_lvector (long *, long, long);
extern void free_lmatrix (long **, long, long, long, long);
long *RT; /* The Record table */
long **LT; /* The Letter table */
void process (int, int);
void ABCsort (int keys) /* No. of key fields to use */
{
    register long i, j, recno;
    register int c;
    int nodup = noduplicates;
    long start, step, stop;
    /* Allocate storage for the internal tables */
    LT = lmatrix(1, keys, alphamin, alphamax);
    RT = lvector(1, N);
    /* Initialize the LT table: */
    for (j = alphamin; j <= alphamax; j++)
        for (i = 1; i <= keys; i++)
            LT[i][j] = 0;
    /* First pass: thread every record, by its first key field, into
    ** the top row of the LT.  Each head insertion reverses a thread
    ** of equal keys, and the later passes reverse it once per level,
    ** so the scan direction that yields a stable result depends on
    ** the parity of the number of key fields; suppressing duplicates
    ** (keep the last) flips it. */
    if ((keys & 1) ^ nodup) { start = N; stop = 0; step = -1; }
    else { start = 1; stop = N + 1; step = 1; }
    for (recno = start; recno != stop; recno += step) {
        c = GetNextField(recno, 1);
        RT[recno] = LT[1][c];
        LT[1][c] = recno;
    }
    process(1, keys); /* Second, recursive pass */
    free_lmatrix(LT, 1, keys, alphamin, alphamax);
    free_lvector(RT, 1, N);
}
/* Function to process data after first pass: */
void process (int level, int keys)
{
    long i, newlevel, nextrec, recno;
    int nodup = noduplicates;
    unsigned char c;
    for (i = alphamin; i <= alphamax; i++) { /* Loop over alphabet */
        recno = LT[level][i];
        LT[level][i] = 0;
        while (recno != 0) {
            if (RT[recno] == 0) { /* Lone record: output it now */
                PutCurrRecord(recno);
                recno = 0;
                continue;
            }
            if (level == keys) { /* Last field: flush the whole thread */
                while (recno != 0) {
                    PutCurrRecord(recno);
                    recno = RT[recno];
                    if (nodup) recno = 0;
                }
            }
            else { /* Move down a level */
                newlevel = level + 1;
                while (recno != 0) {
                    nextrec = RT[recno];
                    c = GetNextField(recno, newlevel);
                    RT[recno] = LT[newlevel][c];
                    LT[newlevel][c] = recno;
                    recno = nextrec;
                }
                process(newlevel, keys);
            }
        }
    }
} /* end of function "process" */
Note that the auxiliary procedures lvector and lmatrix allocate 1-based vector storage, but the 1-base is not essential. It does, however, free up index 0 to mean "empty". The recursive procedure "process" can be written iteratively, and is so implemented in the ASIC version described later.
If you run this version of the algorithm under a profiler, you will see that it spends a significant part of its time doing something not explicit in the listing, namely the failure case in the search of the LT for nonempty entries. Especially at the lowest levels (higher index values) of the table, where there are usually very few entries to be found, the time spent there dominates everything else.
There are a number of things that can be done to reduce the time spent in this loop, the most immediate being reducing the size of the LT by halving the number of bits examined per field. Going from 8-bit to 4-bit fields nearly doubles the number of fields tested, but reduces the width of the LT, and thus its search time, by a factor of 16 -- a net improvement by a factor of about 8 in this invisible piece of code. What about a still smaller table? The advantage is limited by the increased cost of extracting the bits from each key byte (in GetNextField), and my experimental implementations of this idea have not shown significant improvement in performance. However, some speed can also be gained by maintaining a bit-table of the LT's nonzero entries, which can be examined eight or sixteen at a time. This method, which depends somewhat on the system architecture, usually pays off, and combinations of the two methods may be worth more careful evaluation. I have neglected this approach in favor of investigating FPGA/ASIC implementation of the algorithm.
In 1998, working with Yuri Panchul of Compilogic Corp., we used Compilogic's C-to-Verilog compiler to produce a hardware-gate-level description of the algorithm, then ran it with a gate-level simulator to measure its performance at sorting an array of 32-bit integers. Although comparing a 200 MHz Pentium with a 120 MHz ASIC is a lot like comparing bananas and oranges, the clock cycle counts on the two systems show a distinct advantage for the hardware implementation of ABCsort: the Pentium required about 39,000 cycles to run our test, while the ASIC required only about 9,000. The total performance gain over "classical" methods -- ABCsort running on the ASIC vs. QuickerSort running on the Pentium -- is a staggering 7.5 to 1. And that's without using any parallelism.
Possibly the most effective application of this algorithm in embedded systems is in classical applications that depend on sorting, such as the Median Filter algorithm (used to eliminate noise spikes in audio or video signals) or the Convex Hull algorithm (used in certain computer graphics applications). Historically, the behavior of both these applications has been taken to be O(N log N) because they depend on sorting, but both are O(N) apart from the sort itself; so using ABCsort to implement them makes them both theoretically and practically faster.