Hashing

Hashing is a concept in used in two distinct areas of programming.

A hash function is any function that can be used to map data of arbitrary size to a smaller/shorter value. Two or more keys can generate an identical hash. This phenomenon is called a collision. Your algorithm then has to cope with this.

The USE of the hash number may be for cryptography- if a user receives a file and its hash, done according to some stated protocol, they can verify the file has not been mangled in transmission, nor tampered with by a third party. So-called digital signatures.

The other use is to implement associative arrays ( dictionaries) of key/value pairs and be able to access them directly by the key value, rather than reading through sequentially until you find the key. Reading an array term by term until a value is found will involve on average reading and then skipping them if wrong will loop om average through half of the array. Hashing takes you either straight to the correct location, or to within a few places. Only when the table gets VERY full will it slow down.

Use of a hash function to index a hash table is called hashing or scatter storage addressing.

Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval. They require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a computationally and storage space-efficient form of data access.

I use a simple addition of ASCII codes of the index, then 'mod' it with 'tableSize' so it always points to an in-range index.

LB has the very useful word$() and instr() commands, that make it very easy to build associative array dictionaries. But LB4 has a flawed word$() that makes its use on large associative arrays slow. LB5 has removed the problem. I use indexing into an array, and word$() to separate the value part.

This diagram shows the hash generated is posting the asociated data in a different order, and consecutive storage locations may well be empty.

The next diagram shows that a collision occurs if two keys hash to the same index. The usual response is to then try consecutive array addresses until an empty one is found.

The animation below shows in slow motion the filling of the sparse array; the addition of a new entry for Wales; and the removal of the United Stats ( entry!).


LB Code

The code below reads a series of country names and their populations into such an array. The hash of the country name decides where to ( try to) store the population value. If that location is already in use it tries successively higher addresses. When looking up a country the population nwill be at either the first hash address, or somewhere a bit further on.

    '   The hashing function takes a string as input.
    '       It processes the string a byte at a time.
    '       The ASCII values for bytes are added together.
    '   In the end, the resulting sum is converted using the modulus operator
    '       so it addresses a storage space within an array.
    '   This array is pre-dimensioned to several times the number of expected entries.
    '       This makes for a sparce array, and collisions are less likely.

    '   If the hash points to an already in use term, keep trying for incrementing values
    '       of hash until it finds an empy slot or has tested all locations.

    '   Get by key. Given a key, lookup the pair matching the key and return the value.

    '   Set. Given a key and a value, add a new element for that key and value.
    '       Note that there should be no existing element matching the key.
    '       If there is, then keep incrementing hash and trying until space found.

    '   Delete by key. Remove the element addressed by the key, if it exists, ie set the element to null "".


    '   Format is "key,value".
    '       The hash is used on the first 'key' section to find the storage array index.
    '       The value is read/written to/from the second 'value' section.
    '       key$    =word$( drawer$, 1, ",")
    '       value$  =word$( drawer$, 2, ",")

    '   NB This is a demo version. 'tableSize' is 80 so can be shown on-screen.
    '       For this size use a range for t of up to 79 but ideally less than 40.
    '       Running under debug slows eerything down and you can see each new entry
    '           being out in a hashed/indexed position.

    nomainwin

    WindowWidth  =380
    WindowHeight =880

    graphicbox #w.gb1, 160, 2, 200, 840

    button     #w.b1, "Set new pair",   setClicked,    LR, 330, 600, 140, 30
    button     #w.b2, "Get by key",     getClicked,    LR, 330, 560, 140, 30
    button     #w.b3, "Delete by key",  deleteClicked, LR, 330, 520, 140, 30


    open "Hashing demo" for window as #w

    #w "trapclose quit"

    #w.gb1 " font Courier_New  7 bold"
    #w.b1  "!font Courier_New 12"
    #w.b2  "!font Courier_New 12"
    #w.b3  "!font Courier_New 12"

    global tableSize, Pair$
    tableSize   =80

    dim drawer$( tableSize)

    #w.gb1 "down ; fill white ; flush"
    #w.gb1 "getbmp scr 0 0 200 840"
    bmpsave "scr", "assoc0000.bmp"

    for t =1 to 39
        read keyT$, valueT$
        entry$      =keyT$ +"," +valueT$
        success     =0
        h               =hash( word$( entry$, 1, ","))
      [tryAgain]
        if drawer$( h) ="" or word$( drawer$( h), 1, ",") =word$( entry$, 1, ",")  then
            drawer$( h)    =entry$
            #w.gb1 "up ; goto 10 "; 10 +h *10
            #w.gb1 "down"
            op$             =str$( h) +"  " +drawer$( h)
            #w.gb1 "\"; op$
            success     =1
        end if
        scan
        if success =0 then h =( h +1) mod tableSize: goto [tryAgain]
        #w.gb1 "flush"
        #w.gb1 "getbmp scr 0 0 200 840"
        bmpsave "scr", "assoc" +right$( "0000" +str$( t), 4)+".bmp"
    next t

    scan

    wait

    sub setClicked    bh$   '   enter new key and value as CSV pair.
        prompt "New entry."; Pair$
        success =0

        if Pair$ <>"" then
            h               =hash( word$( Pair$, 1, ","))
         [tryAgain]
            if drawer$( h) ="" or word$( drawer$( h), 1, ",") =word$( Pair$, 1, ",")  then
                drawer$( h)    =Pair$
                #w.gb1 "up ; goto 10 "; 10 +h *10
                #w.gb1 "down ; color red ; backcolor white"
                op$             =str$( h) +"  " +drawer$( h) +space$( 30)
                #w.gb1 "\"; op$
                success     =1
            end if
            scan
            if success =0 then h =( h +1) mod tableSize: goto [tryAgain]
        end if
        #w.gb1 "flush ; getbmp scr 0 0 200 840"
        bmpsave "scr", "assoc0040.bmp"
    end sub

    sub getClicked    bh$   '   display entry by key
        prompt "Show entry."; Pair$
        if Pair$ <>"" then
            h               =hash( word$( Pair$, 1, ","))
            success =0
            tries =1
         [tryAgain]
            if word$( drawer$( h), 1, ",") =Pair$ then
                op$             =str$( h) +"  " +drawer$( h)
                notice op$
                success     =1
            end if
            scan
            if ( success =0) and ( tries "" then
            tries =1
            h           =hash( word$( k$, 1, ","))
            success     =0

          [tryAgain]
            if word$( drawer$( h), 1, ",") =k$ then
                drawer$( h)    =""
                #w.gb1 "up ; goto 10 "; 10 +h *10
                #w.gb1 "down ; backcolor green"
                #w.gb1 "\"; "                             "
                success     =1
            end if
            scan
            tries =tries +1
            'if tries >= tableSize then end if: goto [q]
            if success =0 then h =( h +1) mod tableSize: goto [tryAgain]
        end if
      [q]
        #w.gb1 "flush ; getbmp scr 0 0 200 840"
        bmpsave "scr", "assoc0041.bmp"
    end sub

    sub quit h$
        close #h$
        end
    end sub

    function hash( i$)
        hash    =0
        'Indx    =instr( i$, ",")
        for i =1 to len( i$)
            hash    =hash +asc( mid$( i$, i, 1))
        next i
        hash    =hash mod tableSize
    end function

data "India","1430829857"
data "China","1425611876"
data "United States","340302365"
data "Indonesia","277916769"
data "Pakistan","241272532"
data "Nigeria","224688878"
data "Brazil","216629296"
data "Bangladesh","173247489"
data "Russia","144361198"
data "Mexico","128612942"

data "Ethiopia","127057608"
data "Japan","123184509"
data "Philippines","117632643"
data "Egypt","113006610"
data "DR Congo","102815462"
data "Vietnam","98968428"
data "Iran","89282245"
data "Turkey","85891465"
data "Germany","83288317"
data "Thailand","71817069"

data "United Kingdom","67774698"
data "Tanzania","67765488"
data "France","64777637"
data "South Africa","60505551"
data "Italy","58842340"
data "Kenya","55282699"
data "Myanmar","54643789"
data "Colombia","52127275"
data "South Korea","51777743"
data "Uganda","48804975"

data "Sudan","48316383"
data "Spain","47512259"
data "Argentina","45821781"
data "Algeria","45721221"
data "Iraq","45674040"
data "Afghanistan","42427756"
data "Poland","40875535"
data "Canada","38836030"
data "Morocco","37903204"
data "Saudi Arabia","37035976"
data "Ukraine","36969906"

data "Angola","36869473"
data "Uzbekistan","35251316"
data "Yemen","34576146"
data "Peru","34407984"
data "Malaysia","34369580"
data "Ghana","34230410"
data "Mozambique","34056308"
data "Nepal","30954487"
data "Madagascar","30446789"
data "Ivory Coast","28994091"

data "Venezuela","28932713"
data "Cameroon","28770982"
data "Niger","27371797"
data "Australia","26482797"
data "North Korea","26175558"
data "Taiwan","23928013"
data "Mali","23412124"
data "Burkina Faso","23348857"
data "Syria","23415969"
data "Sri Lanka","21903053"

data "Malawi","21020702"
data "Zambia","20663425"
data "Romania","19838599"
data "Chile","19634327"
data "Kazakhstan","19641898"
data "Chad","18372256"
data "Ecuador","18222064"
data "Somalia","18236013"
data "Guatemala","18134133"
data "Senegal","17838956"

data "Netherlands","17627247"
data "Cambodia","16974827"
data "Zimbabwe","16724359"
data "Guinea","14246404"
data "Rwanda","14147843"
data "Benin","13773357"
data "Burundi","13296982"
data "Tunisia","12476645"
data "Bolivia","12418572"
data "Haiti","11748448"

data "Belgium","11690877"
data "Jordan","11344421"
data "Dominican Republic","11349815"
data "Cuba","11191291"
data "South Sudan",",11119324"
data "Sweden","10622613"
data "Honduras","10621694"
data "Czech Republic","10496874"
data "Azerbaijan","10421599"
data "Greece","10334961"

data "Papua New Guinea","10360985"
data "Portugal","10243394"
data "Hungary","10126238"
data "Tajikistan","10175123"