12-Jan-2000 Hash tables
People have said that mathematicians are good at thinking up simple names for complex ideas. That probably goes for computer scientists, too. If you already know what hash tables are you probably don't mind seeing them in the title. If you don't know, you don't know enough to skip this entry because it might get technical. I'll save that part for last, anyway. I haven't got rid of The Happy Nigun going through my head. It's still there, and it has company. I'm also humming A Freyliche Nacht in Gan Eyden, aka a Jumping Night in the Garden of Eden, or maybe it should be translated An Evening in Paradise. Last night in klez we had two clarinets (three if you count Glen, but he mostly plays piano in class), an alto and a baritone sax, a flute, drummer (snare and hi-hat), trombone, cello, piano, electric bass, and me on trumpet or euphonium. Some people besides me must have practiced the tunes, because we were sounding good. BTW, Glen says that the Hypnotic Clambake still exists and tours. He gets a phone call from their leader once in a while. There were ruddy ducks (three) and hooded mergansers (ten) on Bullough's pond yesterday morning and again today. Yesterday I had binocs with me and got out of the car to look. An older woman out for her walk asked me what the proper names for the ducks were, Someone told me something like bubble... I handed her the binocs for a closer look. There was a bufflehead here about six weeks ago, I said, but these are... She asked if they were on their way somewhere, and I said I thought they would be around as long as there was open water. Later I realized that the answer is that they're as lazy as the rest of us, and they don't want to fly farther than they have to go to find food. It's work for them to go south. They're not going to Florida on frequent flyer miles. OK, if you made it this far, a hash table is supposed to be a quick way for a computer to file and retrieve a large number of pieces of information. When you have, let's say what I was working with, a dictionary with 200,000 words and their pronunciations, you don't want to put them in a list and go through the list every time you need to look one up. It would take a long time to go through the list, even for a computer, if they weren't in some order that you could take advantage of. It would take time to put them in order, if they didn't start out in the order you wanted. The hash table trick is to make a code number for each one, and put them in a big table filed by code number. When you want to look one up, you compute the code number and go right to the correct slot in the table. The catch is, what if two words give you the same code number? You have to look in the slot and see if you're looking at the right one, and look in the next slot if not, until you are. For 200,000 words, I want a table with about 600,000 slots so there's plenty of room, and I want to compute codes that spread my words around the table so I mostly get the right one on the first try. OK, now let's compute a code number by giving each letter a number, A=65, B=66, etc, a=97, b=98, etc, and compute a code number for a word by adding up the letter codes. Cool. Not so cool. Since most words are only 8 to 12 letters long, their codes are all less than 12*123, 1476. But I had 200,000 words, and there are only 1500 codes to go around, so almost all of them are duplicates. That's what the program was doing yesterday. I spent most of today figuring out and testing better ways to compute hash codes. Yesterday's program took 4 3/4 minutes to build a table of those 200,000 entries. After I got through, it took 11 seconds. That's what makes a software engineer feel constructive.
|