Dictionaries and content-addressable memory.

I've lived with and loved a Welsh girl for 50 years, and through her absorbed a lot of Welshness. And learned a bit of the language and culture. Hence perhaps some of my dictionary-test words!

 Search for a Welsh word
              Cegin, n. a kitchen
              Cariad, n. love; a lover
              Diolch, n. thanks; praise: v. to give thanks
              Draig, n. dragon; lightning
              Yma, adv. here, in this place

 Search for an English word
              Cegin, n. a kitchen
              Baban, n. a babe, baby
              Coflaid, n. what is embraced; a bosom friend; a darling
              Arlas, a. tipped with blue
              Anesgud, a. not quick, slow

I found an online download of a Welsh-->English dictionary on Gutenberg. There are plenty of online services such as Google Translate, but I wanted the fun of programming it myself.

The one I found is old and idiosyncratic. It has 50000+ words ( plus a preamble and a statement of conditions-of-use, which can be stripped off as long as you are not putting that version on the 'net.)

Its structure is: Welsh word with an initial capital letter, comma, English equivalent, followed by TWO CRLF pairs, e.g.

"Gwaladru, to arrange, to order<CRLF><CRLF>"
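
As a small illustration of that layout, here is a sketch of splitting one entry at its FIRST comma, assuming the entry has already been separated from its trailing CRLF pairs. The variable names ( entry$, welsh$, english$) are mine, not anything from the dictionary file.

    entry$ ="Gwaladru, to arrange, to order"

    c =instr( entry$, ",")                  '   first comma ends the Welsh headword
    if c >0 then
        welsh$   =left$( entry$, c -1)
        english$ =trim$( mid$( entry$, c +1))
        print welsh$; " --> "; english$     '   prints: Gwaladru --> to arrange, to order
    end if

Only the first comma is a separator, since the English side often contains commas of its own.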

This is pretty easy to read in LB. The problem is how to use it!

My initial thoughts involved sparse arrays and binary chops to locate words, hashing, and other ways to get at the data quickly. These would involve quite a lot of coding and would not necessarily be fast or efficient.
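
For reference, a binary chop over an array might look something like the sketch below. It is NOT what I ended up using, and it assumes the headwords are already in an array sorted in plain ASCII order ( Welsh alphabetical order differs). The array w$() and the function chop() are names made up for this illustration.

    dim w$( 5)
    w$( 1) ="Cariad": w$( 2) ="Cegin": w$( 3) ="Diolch": w$( 4) ="Draig": w$( 5) ="Yma"

    print chop( "Draig",  5)    '   position in the array, here 4
    print chop( "Mynydd", 5)    '   0, meaning not found
    end

    function chop( target$, n)
        '   Halve the search range until the word is found or the range is empty.
        lo    =1
        hi    =n
        found =0
        while lo <=hi and found =0
            m =int( ( lo +hi) /2)
            if w$( m) =target$ then
                found =m
            else
                if w$( m) <target$ then
                    lo =m +1
                else
                    hi =m -1
                end if
            end if
        wend
        chop =found
    end function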

The "word$(" problem.

My first attempts used "word$(" on the whole array. I find "word$(" a VERY useful command... BUT.... This is not just slow- it is, if you are LUCKY, VERY slow, and if unlucky it hogs all your machine resources and refuses to allow you to halt it. An 'interesting' bit of Carl's language that I've commented on before.

Even finding the FIRST word took a minute and a half!

See this code and graph of how "word$(" gets catastrophically slow on big strings and how to get round it.

    global N: N =1000 '   Use >=1000 but beware!!

    [again]     '   each pass rebuilds the file and repeats the timings with N doubled

    open "numbers1to"; str$( N); ".crlf" for output as #fOut
        for i =1 to N
            #fOut "John_"; str$( i)
        next i
    close #fOut

    open "numbers1to"; str$( N); ".crlf" for input as #fIn
        longString$ =input$( #fIn, lof( #fIn))
    close #fIn

        print "": print ""
        print "Showing the time to find n'th term in a LONG string of "; N; " terms."
        print "John_1//John_2//....  . . . John_N, but with separators CRLF."
        print ""

        time =time$( "ms")

        print data$( longString$,     1),: print time$( "ms") -time; " ms."
        print data$( longString$,     9),: print time$( "ms") -time; " ms."
        print data$( longString$,    99),: print time$( "ms") -time; " ms."
        print data$( longString$,   999),: print time$( "ms") -time; " ms."
        print data$( longString$,     N),: print time$( "ms") -time; " ms."

        print"": print "Now using LB's flawed word$() function."
        print ""

        CRLF$ =chr$( 13) +chr$( 10)

        time =time$( "ms")

        print word$( longString$,     1, CRLF$),: print time$( "ms") -time; " ms."
        print word$( longString$,     9, CRLF$),: print time$( "ms") -time; " ms."
        print word$( longString$,    99, CRLF$),: print time$( "ms") -time; " ms."
        print word$( longString$,   999, CRLF$),: print time$( "ms") -time; " ms."
        print word$( longString$,     N, CRLF$),: print time$( "ms") -time; " ms."

        N =N *2

        if N >40000 then print "Done.": end

    goto [again]


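    function data$( i$, n)
        '   Work-round for word$(): re-read the file and keep the n'th line.
        '   i$ is not used. It only keeps the call the same shape as word$().
        open "numbers1to"; str$( N); ".crlf"  for input as #fIn

            for i =1 to n
                input #fIn, data$
            next i

        close #fIn
    end function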
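
If the text is already in memory, another way round "word$(" is to walk the string once with "instr(", using its optional third ( start position) parameter. This is only a sketch on a three-term sample string, not timed against the code above. The names s$, p, q and count are mine.

    CRLF$ =chr$( 13) +chr$( 10)
    s$    ="John_1" +CRLF$ +"John_2" +CRLF$ +"John_3" +CRLF$

    p     =1                            '   start of the current term
    count =0
    q     =instr( s$, CRLF$, p)         '   position of the CRLF that ends it
    while q >0
        count =count +1
        print count; "  "; mid$( s$, p, q -p)
        p =q +len( CRLF$)               '   skip over the separator
        q =instr( s$, CRLF$, p)
    wend

Each term is reached by carrying on from where the previous one ended, instead of asking for the n'th term from scratch every time.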

CAM- content addressable memory.

What is really needed is a content-addressable structure- otherwise you have to read the original file sequentially ( or its entries stored in an array) until you find a matching term. SLOW!

Luckily "instr(" does not suffer the same coding problem. It took only a short time to throw together the following code. Welsh to English is easy. Use "instr(" to find the word. Print the rest of that entry as the English equivalent. English to Welsh is harder. Look for the English with "instr(". Work back to a capital letter which will be the start of the Welsh equivalent. Print from there to the comma.

This is NOT a perfect forward/reverse dictionary. For instance looking for English 'blue' gets me 'Arlas' rather than 'las', because the search stops at the first entry that contains the word. And Welsh has a lot of 'mutations'- consonant changes at the start of a word- as well as masculine/feminine distinctions.

Fun anyway!

LB code ( needs the dictionary file)

    open "cymraeg.txt" for input as #fIn
        dict$ =input$( #fIn, lof( #fIn))
    close #fIn

    print " Search for a Welsh word"'   forward look-up of Welsh word

    data "Cegin", "Cariad", "Diolch", "Draig", "Yma"

    for c =1 to 5
        read w$
        i       =instr( dict$, w$ +",")
        if i <>0 then print, upto$( mid$( dict$, i, 50), chr$( 13) +chr$( 10))
    next c

    print "": print " Search for an English word"'   backward look up of English

    data "kitchen", "baby", "darling", "blue", "house"

    for c =1 to 5
        read E$
        i       =instr( dict$, " " +E$)
        if i <>0 then
            j   =i
            do              '   step back to the capital that starts the Welsh headword
                j   =j -1
                m$  =mid$( dict$, j, 1)
            loop until j <=1 or instr( "ABCDEFGHIJKLMNOPQRSTUVWXYZ", m$) >0

            entry$ =mid$( dict$, j, i -j +len( E$) +1)
            print, entry$
        end if
    next c
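
One possible way to ease the 'blue' problem is to list EVERY entry containing the English word, not just the first. The sketch below walks the text entry by entry, using the double-CRLF separator, and tests each entry with "instr(". It runs on a tiny stand-in for dict$. The "Glas" line is an invented sample entry, not copied from the Gutenberg file, and sep$, p and q are names I have introduced here.

    CRLF$ =chr$( 13) +chr$( 10)
    sep$  =CRLF$ +CRLF$

    '   Stand-in for the real dict$. The "Glas" entry is made up for the demo.
    dict$ ="Arlas, a. tipped with blue" +sep$ +"Cegin, n. a kitchen" +sep$ +"Glas, a. blue" +sep$

    E$ ="blue"
    p  =1                                   '   start of the current entry
    q  =instr( dict$, sep$, p)              '   separator that ends it
    while q >0
        entry$ =mid$( dict$, p, q -p)
        if instr( entry$, " " +E$) >0 then print, entry$
        p =q +len( sep$)
        q =instr( dict$, sep$, p)
    wend

On the sample this prints the 'Arlas' and the 'Glas' entries. It also takes each entry from its true start, so a capital letter inside an English definition cannot be mistaken for a headword. I have not timed it on the full 50000+ entry file, where it will be slower than a single "instr(".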