Ronald A Fisher FRS was a namesake of my father, Ronald W Fisher, though not related, and like him a mathematician and a Cambridge graduate. He was a British polymath, active as a mathematician, statistician, geneticist and academic from the 1920s onwards, and is often referred to as 'the father of statistics'.
He set out the mathematical analysis of statistics, especially as applied to evolution and heredity. However, he was a strong believer in eugenic ideas that are now anathema, and may have been a bit choosy about which data items to accept and which to conveniently omit.
With modern interest in AI and machine recognition, his work is referred to in many AI courses. He published data on three species of iris, with measurements of sepal and petal length and width. He then showed how one might rigorously analyse a new set of samples so that a machine classified them correctly. All good 'pattern recognition' stuff.
Sepal length (cm), Sepal width (cm), Petal length (cm), Petal width (cm), Class / Species
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
etc . . .
Class / Species is one of Iris Setosa, Iris Versicolor or Iris Virginica.
A modern take on the problem uses data on a set of 300+ penguins of three species. A table is available giving numerical and qualitative data about each of them. An AI student may be set the exercise of writing code to sort them into groups. It turns out that if you use just two dimensions, ie two columns of data, you will classify a significant number wrongly. Chastening, but good for the student to learn!
Species, Culmen Length (mm), Culmen Depth (mm), Flipper Length (mm), Body Mass (g), Sex
Adelie Penguin (Pygoscelis adeliae), 39.1, 18.7, 181, 3750, MALE
Adelie Penguin (Pygoscelis adeliae), 39.5, 17.4, 186, 3800, FEMALE
Adelie Penguin (Pygoscelis adeliae), 40.3, 18.0, 195, 3250, FEMALE
etc . . .
I'd strongly suggest looking up k-means, RA Fisher, etc on Wikipedia. The data files for irises and penguins are on-line, but you may need to convert the line endings to CRLF (I use Linux), and/or change the space/comma/tab separators in the data files you find. I'll have my versions on the website.
My code takes the data set and looks at two numerical columns. These can be plotted as a scatter graph. If we ignore the (known) ACTUAL classification of each point we have a 'blob' of data points. I show this in all-white on a black background.
My code is told there are two (or three) species present. It allocates each data point at random to one of these species. It finds the 'centre of gravity/mass' of each class and colours each class differently. There is a large amount of overlap at first, of course. It then repeatedly recalculates the centres and moves any point into another class if that class's centre is nearer. You quickly reach a stable solution that divides the data into uni-coloured, single-class sub-populations.
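For anyone tidying a downloaded data file by script rather than by hand, here is a minimal sketch in Python (not the Liberty BASIC used for the program on this page). The function name `normalise_rows` and its `line_ending` parameter are made up for illustration. It assumes every field is a number or a label without spaces, so a species name such as 'Adelie Penguin (Pygoscelis adeliae)' would need different handling.

```python
import re

def normalise_rows(text, line_ending="\r\n"):
    """Return text with comma-separated fields and uniform line endings."""
    rows = []
    for line in text.splitlines():      # splitlines copes with LF, CRLF and CR
        line = line.strip()
        if not line:
            continue                    # drop blank lines
        # Treat any run of commas, tabs or spaces as a single separator.
        fields = [f for f in re.split(r"[,\t ]+", line) if f]
        rows.append(",".join(fields))
    return line_ending.join(rows) + line_ending

sample = "5.1 3.5\t1.4, 0.2, Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\r\n"
print(repr(normalise_rows(sample)))
```

Pass `line_ending="\n"` if the program reading the file prefers Unix line endings.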
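The steps just described can be sketched compactly in Python. This is an illustrative outline under stated assumptions, not the Liberty BASIC program listed further down: the names `k_means`, `max_rounds` and `seed` are invented here, there is no plotting, and the six test points are made-up values rather than penguin data.

```python
import math
import random

def k_means(points, k, max_rounds=100, seed=0):
    """Sort points into k sets; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Allocate every point to a set at random.
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_rounds):
        # Centre of mass of each set; an empty set keeps its previous centroid.
        for s in range(k):
            members = [p for p, lab in zip(points, labels) if lab == s]
            if members:
                centroids[s] = tuple(sum(col) / len(members) for col in zip(*members))
        # Move each point into the set whose centroid is nearest.
        moved = False
        live = [s for s in range(k) if centroids[s] is not None]
        for i, p in enumerate(points):
            nearest = min(live, key=lambda s: math.dist(p, centroids[s]))
            if nearest != labels[i]:
                labels[i] = nearest
                moved = True
        if not moved:
            break  # stable: no point changed set
    return labels, centroids

# Made-up points forming two obvious clumps (not real data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, 2)
print(labels)
print(centroids)
```

The result depends on the random starting allocation, so different seeds can number the sets differently, and an unlucky start can leave a set empty.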
However, with hindsight, we actually KNOW the correct sub-grouping, and can instead colour each point for its CORRECT classification. The amount this differs from the machine's attempt indicates how successful we are in our automatic classifying.
Machine's effort | 'Correct' grouping
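One way to put a number on that difference is sketched below in Python (again, not part of the Liberty BASIC program). Because the machine's set numbers are arbitrary, the sketch tries every way of matching set numbers to species and keeps the best score. The function name `agreement` and the short label lists are invented for illustration.

```python
from itertools import permutations

def agreement(machine, correct, k):
    """Fraction of points labelled correctly under the best renumbering of sets."""
    best = 0
    for perm in permutations(range(k)):
        # perm[m] is the species number we pretend machine set m stands for.
        hits = sum(1 for m, c in zip(machine, correct) if perm[m] == c)
        best = max(best, hits)
    return best / len(correct)

machine = [0, 0, 1, 1, 2, 2, 2]   # made-up machine labels
correct = [1, 1, 0, 0, 2, 2, 0]   # made-up true labels
print(agreement(machine, correct, 3))
```

Trying every permutation is fine for two or three species; it would become slow for a large number of classes.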
We are not limited to choosing only two columns of numerical data: you can calculate a 'Pythagoras distance' in as many dimensions as you like. And you can allocate arbitrary numbers to qualitative data like a 'colour' column. But it does show that you need to be sure of what you are doing and how good you want the result to be.
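As a hedged illustration in Python (not the page's Liberty BASIC), the distance generalises to any number of columns as the square root of the sum of squared differences. The names `distance`, `SEX_CODE` and `to_vector` are hypothetical, and the codes 0.0 and 1.0 for the Sex column are an arbitrary choice, exactly the kind of allocation mentioned above.

```python
import math

def distance(a, b):
    """'Pythagoras distance' between two points with any number of columns."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of columns")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Arbitrary numbers standing in for a qualitative column (an assumption).
SEX_CODE = {"MALE": 0.0, "FEMALE": 1.0}

def to_vector(length, depth, flipper, mass, sex):
    """Turn one penguin row into a purely numerical point."""
    return (length, depth, flipper, mass, SEX_CODE[sex])

a = to_vector(39.1, 18.7, 181, 3750, "MALE")
b = to_vector(39.5, 17.4, 186, 3800, "FEMALE")
print(distance(a, b))
```

Note that body mass in grams differs by far more than the other columns, so it dominates the distance unless the columns are rescaled first. That is one reason to be sure of what you are doing.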
'' bluatigro 11 aug 2018
'' k means algorithm :
''   put points in random set
''   while point moved makes improvement ( or timeout)
''     compute centroid of sets
''     move each point into closest set

start = time$( "seconds")

global pointmax, setmax, pi, tone
pointmax =329
setmax   = 2        ' ie sets #0, #1 ( and #2)
pi       =atn( 1) *4
tone     = 80       ' start of greyscale circles indicating improving sorting quality

dim x( pointmax), y( pointmax), set( pointmax)
dim setx( setmax +1), sety( setmax +1), avX( setmax +1), avY( setmax +1), count( setmax +1), n( setmax +1)

global maxX, maxY, minX, minY
minX =1e6: maxX =-1e6: minY =1e6: maxY =-1e6

WindowWidth  = 860
WindowHeight = 860

nomainwin

open "k clustering Penguins." for graphics_nsb as #m
#m "trapclose quit"
#m "fill black"

open "penguinData.csv" for input as #fIn
line input #fIn, header$
for i = 0 to pointmax
    line input #fIn, row$
    x( i ) = val( word$( row$, 1, ","))     ' length
    if x( i) > maxX then maxX =x( i)
    if x( i) < minX then minX =x( i)
    y( i ) = val( word$( row$, 2, ","))     ' depth
    if y( i) > maxY then maxY =y( i)
    if y( i) < minY then minY =y( i)
    set( i) =int( ( setmax +1) *rnd( 1))    ' allocate to a group at random
next i
close #fIn

'print "x", minX, maxX
'print "y", minY, maxY

call showdata
#m "getbmp scr 1 1 860 860"
bmpsave "scr", "kMeansPenguins1a.bmp"

radius =180
#m "setfocus"
callDLL #kernel32, "Sleep", 5000 as long, ret as void

true          =1
notKnown      =0
pointHasMoved = true

while ( pointHasMoved = true) and ( ( time$( "seconds") -start) <500)   '' end if no move, or timeout
    pointHasMoved = notKnown

    for i = 0 to setmax     '' hold number in each set
        n( i) =0
    next i

    for i = 0 to pointmax
        if set( i) = 0 then n( 0) =n( 0) +1
        if set( i) = 1 then n( 1) =n( 1) +1
        if set( i) = 2 then n( 2) =n( 2) +1
    next i

    X0 =0: Y0 =0: X1 =0: Y1 =0: X2 =0: Y2 =0

    for i = 0 to pointmax   '' calc centroids ( setx( j), sety( j)) of each set
        if set( i) =0 then X0 =X0 +x( i): Y0 =Y0 +y( i)     '' ie find mean x and mean y of group
        if set( i) =1 then X1 =X1 +x( i): Y1 =Y1 +y( i)
        if set( i) =2 then X2 =X2 +x( i): Y2 =Y2 +y( i)
    next i

    if n( 0) <>0 then setx( 0) = X0 /n( 0): sety( 0) = Y0 /n( 0)
    if n( 1) <>0 then setx( 1) = X1 /n( 1): sety( 1) = Y1 /n( 1)
    if n( 2) <>0 then setx( 2) = X2 /n( 2): sety( 2) = Y2 /n( 2)

    for i =0 to setmax
        '#m "goto " ; int( 50 +( setx( i)) *12) ; " " ; int( 640 -( sety( i )) *14)
        '#m "color "; tone; " "; tone; " "; tone
        '#m "down ; circle "; radius; " ; up"
    next i

    radius =radius -20: if radius < 20 then radius = 20
    tone =tone + 8: if tone >255 then tone =255

    for i = 0 to pointmax   '' see which set's centroid is closest to point
        smallestSepnSoFar = 10000           '' and move if necessary
        closestSetSoFar   = -1
        for s = 0 to setmax
            separation = dist( x( i) , y( i) , setx( s) , sety( s) )
            if separation < smallestSepnSoFar then
                smallestSepnSoFar = separation
                closestSetSoFar   = s
            end if
        next s
        scan
        if closestSetSoFar <> set( i) then
            pointHasMoved =true
            set( i) =closestSetSoFar        '' move point to closest set
        end if
    next i

    call showdata
    #m "getbmp scr 1 1 860 860"
    bmpsave "scr", "kMeans" +str$( 255 -radius) +".bmp"
    scan
wend

notice "Done!"
close #m
wait

function dist( x1, y1, x2, y2)      '' function for diagonal separation distance
    dist = sqr( ( x1 -x2)^2 +( y1 -y2 )^2)
end function

sub showdata
    #m "cls ; fill black"
    for i = 0 to pointmax
        #m "goto "; int( 50 +( x( i)) *12) ; " " ; int( 640 -( y( i )) *14)
        if tone >80 then
            select case set( i )
                case 0
                    #m "backcolor red ; color red"
                case 1
                    #m "backcolor green ; color green"
                case else
                    #m "backcolor blue ; color blue"
            end select
        else
            select case
                case i <153
                    #m "backcolor cyan ; color cyan"
                case i <276
                    #m "backcolor pink ; color pink"
                case i <343
                    #m "backcolor yellow ; color yellow"
            end select
        end if
        #m "down"
        #m "circlefilled 2"
        #m "up"
    next i
    callDLL #kernel32, "Sleep", 5000 as long, ret as void
end sub

sub quit k$
    close #m
    end
end sub
Length ,Depth,Flipper,Mass, 39.50,17.40,186.00,8.95,-24.69 40.30,18.00,195.00,8.37,-25.33 36.70,19.30,193.00,8.77,-25.32 39.30,20.60,190.00,8.66,-25.30 38.90,17.80,181.00,9.19,-25.22 39.20,19.60,195.00,9.46,-24.90 42.00,20.20,190.00,9.13,-25.09 37.80,17.10,186.00,8.63,-25.21 34.60,21.10,198.00,8.56,-25.23 38.70,19.00,195.00,9.19,-25.07 42.50,20.70,197.00,8.68,-25.14 34.40,18.40,184.00,8.48,-25.23 46.00,21.50,194.00,9.12,-24.77 37.80,18.30,174.00,8.74,-25.09 37.70,18.70,180.00,8.66,-25.06 35.90,19.20,189.00,9.22,-25.03 38.20,18.10,185.00,8.43,-25.23 38.80,17.20,180.00,9.64,-25.30 35.30,18.90,187.00,9.21,-24.36 40.60,18.60,183.00,8.94,-25.36 40.50,17.90,187.00,8.08,-25.49 37.90,18.60,172.00,8.38,-25.20 40.50,18.90,180.00,8.90,-25.12 39.50,16.70,178.00,9.70,-25.11 37.20,18.10,178.00,9.73,-25.01 39.50,17.80,188.00,9.67,-25.06 40.90,18.90,184.00,8.80,-25.15 36.40,17.00,195.00,9.18,-25.23 39.20,21.10,196.00,9.15,-25.03 38.80,20.00,190.00,9.19,-25.12 42.20,18.50,180.00,8.05,-25.50 37.60,19.30,181.00,9.41,-25.04 36.50,18.00,182.00,9.69,-24.42 36.00,18.50,186.00,9.51,-25.03 44.10,19.70,196.00,9.24,-24.53 37.00,16.90,185.00,9.36,-25.02 39.60,18.80,190.00,9.49,-24.10 36.00,17.90,190.00,9.52,-25.08 42.30,21.20,191.00,8.88,-25.19 39.60,17.70,186.00,8.47,-26.13 40.10,18.90,188.00,8.51,-26.56 35.00,17.90,190.00,8.20,-26.17 42.00,19.50,200.00,8.48,-26.31 34.50,18.10,187.00,8.42,-26.55 41.40,18.60,191.00,8.35,-26.28 39.00,17.50,186.00,8.57,-26.07 40.60,18.80,193.00,8.57,-25.99 36.50,16.60,181.00,9.08,-25.88 37.60,19.10,194.00,9.11,-25.90 35.70,16.90,185.00,8.96,-26.41 41.30,21.10,195.00,8.75,-26.38 37.60,17.00,185.00,8.58,-26.22 41.10,18.20,192.00,8.62,-26.60 36.40,17.10,184.00,8.63,-26.12 41.60,18.00,192.00,8.86,-26.09 35.50,16.20,195.00,8.56,-25.96 41.10,19.10,188.00,8.71,-25.81 35.90,16.60,190.00,8.48,-26.08 41.80,19.40,198.00,8.87,-26.06 33.50,19.00,190.00,7.89,-26.63 39.70,18.40,190.00,9.30,-25.23 39.60,17.20,196.00,8.34,-26.55 45.80,18.90,197.00,8.19,-26.46 
35.50,17.50,190.00,8.71,-26.15 42.80,18.50,195.00,8.30,-26.39 40.90,16.80,191.00,8.47,-26.02 37.20,19.40,184.00,8.36,-26.45 36.20,16.10,187.00,7.82,-26.51 42.10,19.10,195.00,9.06,-25.82 34.60,17.20,189.00,7.70,-26.54 42.90,17.60,196.00,8.63,-26.23 36.70,18.80,187.00,7.88,-26.25 35.10,19.40,193.00,8.90,-26.46 37.30,17.80,191.00,8.33,-26.38 41.30,20.30,194.00,9.15,-26.10 36.30,19.50,190.00,8.57,-26.22 36.90,18.60,189.00,8.59,-26.08 38.30,19.20,189.00,9.08,-26.12 38.90,18.80,190.00,8.37,-26.11 35.70,18.00,202.00,8.47,-26.06 41.10,18.10,205.00,8.77,-25.83 34.00,17.10,185.00,8.01,-26.70 39.60,18.10,186.00,8.50,-26.42 36.20,17.30,187.00,8.91,-26.30 40.80,18.90,208.00,8.48,-26.58 38.10,18.60,190.00,8.10,-26.50 40.30,18.50,196.00,8.39,-26.01 33.10,16.10,178.00,9.04,-26.16 43.20,18.50,192.00,8.97,-26.04 35.00,17.90,192.00,8.84,-26.28 41.00,20.00,203.00,9.01,-26.38 37.70,16.00,183.00,9.22,-26.23 37.80,20.00,190.00,9.52,-25.69 37.90,18.60,193.00,9.03,-25.86 39.70,18.90,184.00,8.86,-25.80 38.60,17.20,199.00,8.77,-26.49 38.20,20.00,190.00,9.59,-25.71 38.10,17.00,181.00,9.80,-25.27 43.20,19.00,197.00,9.32,-25.45 38.10,16.50,198.00,8.44,-26.58 45.60,20.30,191.00,8.65,-26.33 39.70,17.70,193.00,9.03,-26.06 42.20,19.50,197.00,8.80,-26.41 39.60,20.70,191.00,8.81,-26.79 42.70,18.30,196.00,8.91,-26.42 38.60,17.00,188.00,9.18,-25.77 37.30,20.50,199.00,9.50,-26.37 35.70,17.00,189.00,8.96,-23.90 41.10,18.60,189.00,9.32,-26.10 36.20,17.20,187.00,9.04,-26.19 37.70,19.80,198.00,9.11,-26.43 40.20,17.00,176.00,9.31,-25.61 41.40,18.50,202.00,9.59,-25.43 35.20,15.90,186.00,8.82,-25.95 40.60,19.00,199.00,9.23,-25.61 38.80,17.60,191.00,8.88,-25.90 41.50,18.30,195.00,8.53,-26.02 39.00,17.10,191.00,9.19,-25.74 44.10,18.00,210.00,9.11,-26.01 38.50,17.90,190.00,8.98,-25.58 43.10,19.20,197.00,8.86,-26.14 36.80,18.50,193.00,8.99,-25.58 37.50,18.50,199.00,8.57,-26.49 38.10,17.60,187.00,8.72,-25.78 41.10,17.50,190.00,8.94,-26.07 35.60,17.50,191.00,8.76,-25.98 40.20,20.10,200.00,8.96,-26.33 
37.00,16.50,185.00,8.62,-26.07 39.70,17.90,193.00,9.26,-25.89 40.20,17.10,193.00,9.29,-25.55 40.60,17.20,187.00,9.23,-26.02 32.10,15.50,188.00,8.80,-26.61 40.70,17.00,190.00,9.06,-25.80 37.30,16.80,192.00,9.07,-25.85 39.00,18.70,185.00,9.22,-26.03 39.20,18.60,190.00,9.11,-25.80 36.60,18.40,184.00,8.69,-25.83 36.00,17.80,195.00,8.94,-25.79 37.80,18.10,193.00,8.98,-26.03 36.00,17.10,187.00,8.93,-26.07 41.50,18.50,201.00,8.90,-26.07 46.10,13.20,211.00,7.99,-25.51 50.00,16.30,230.00,8.15,-25.39 48.70,14.10,210.00,8.15,-25.46 50.00,15.20,218.00,8.26,-25.40 47.60,14.50,215.00,8.23,-25.54 46.50,13.50,210.00,8.00,-25.33 45.40,14.60,211.00,8.25,-25.47 46.70,15.30,219.00,8.23,-25.43 43.30,13.40,209.00,8.14,-25.32 46.80,15.40,215.00,8.16,-25.38 40.90,13.70,214.00,8.20,-25.39 49.00,16.10,216.00,8.10,-25.51 45.50,13.70,214.00,7.78,-25.42 48.40,14.60,213.00,7.82,-25.48 45.80,14.60,210.00,7.80,-25.63 49.30,15.70,217.00,8.07,-25.52 42.00,13.50,210.00,7.64,-25.53 49.20,15.20,221.00,8.27,-25.00 46.20,14.50,209.00,7.84,-25.38 48.70,15.10,222.00,7.96,-25.40 50.20,14.30,218.00,7.90,-25.38 45.10,14.50,215.00,7.63,-25.47 46.50,14.50,213.00,7.90,-25.39 46.30,15.80,215.00,7.91,-25.38 42.90,13.10,215.00,7.69,-25.39 46.10,15.10,215.00,7.84,-25.43 44.50,14.30,216.00,7.97,-25.69 47.80,15.00,215.00,7.92,-25.48 48.20,14.30,210.00,7.69,-25.51 50.00,15.30,220.00,8.31,-25.19 42.80,14.20,209.00,7.63,-25.46 45.10,14.50,207.00,7.97,-25.54 59.60,17.00,230.00,7.77,-25.68 49.10,14.80,220.00,7.90,-26.63 48.40,16.30,220.00,8.04,-26.86 42.60,13.70,213.00,7.97,-26.71 44.40,17.30,219.00,8.14,-26.79 44.00,13.60,208.00,8.02,-26.68 48.70,15.70,208.00,8.15,-26.85 42.70,13.70,208.00,8.15,-26.59 49.60,16.00,225.00,8.38,-26.84 45.30,13.70,210.00,8.38,-26.73 49.60,15.00,216.00,8.27,-26.77 50.50,15.90,222.00,8.47,-26.60 43.60,13.90,217.00,8.27,-26.78 45.50,13.90,210.00,8.48,-26.62 50.50,15.90,225.00,8.66,-26.58 44.90,13.30,213.00,8.45,-26.90 45.20,15.80,215.00,8.56,-26.68 46.60,14.20,210.00,8.38,-26.86 
48.50,14.10,220.00,8.40,-26.79 45.10,14.40,210.00,8.52,-27.02 50.10,15.00,225.00,8.50,-26.61 46.50,14.40,217.00,8.49,-26.83 45.00,15.40,220.00,8.63,-26.76 43.80,13.90,208.00,8.58,-26.84 45.50,15.00,220.00,8.64,-26.75 43.20,14.50,208.00,8.48,-26.86 50.40,15.30,224.00,8.75,-26.80 45.30,13.80,208.00,8.65,-26.79 46.20,14.90,221.00,8.60,-26.84 45.70,13.90,214.00,8.63,-26.60 54.30,15.70,231.00,8.50,-26.84 45.80,14.20,219.00,8.60,-26.62 49.80,16.80,230.00,8.47,-26.69 46.20,14.40,214.00,8.24,-26.82 49.50,16.20,229.00,8.50,-26.75 43.50,14.20,220.00,8.65,-26.69 50.70,15.00,223.00,8.64,-26.74 47.70,15.00,216.00,8.53,-26.73 46.40,15.60,221.00,8.35,-26.71 48.20,15.60,221.00,8.25,-26.67 46.50,14.80,217.00,8.58,-26.59 46.40,15.00,216.00,8.48,-26.95 48.60,16.00,230.00,8.60,-26.71 47.50,14.20,209.00,8.39,-26.79 51.10,16.30,220.00,8.40,-26.77 45.20,13.80,215.00,8.25,-26.65 45.20,16.40,223.00,8.20,-26.66 49.10,14.50,212.00,8.36,-26.28 52.50,15.60,221.00,8.29,-26.28 47.40,14.60,212.00,8.19,-26.24 50.00,15.90,224.00,8.20,-26.40 44.90,13.80,212.00,8.11,-26.20 50.80,17.30,228.00,8.27,-26.30 43.40,14.40,218.00,8.23,-26.19 51.30,14.20,218.00,8.15,-26.34 47.50,14.00,212.00,8.13,-26.24 52.10,17.00,230.00,8.28,-26.12 47.50,15.00,218.00,8.30,-26.09 52.20,17.10,228.00,8.37,-25.90 45.50,14.50,212.00,8.16,-26.23 49.50,16.10,224.00,8.83,-25.69 44.50,14.70,214.00,8.20,-26.17 50.80,15.70,226.00,8.27,-26.11 49.40,15.80,216.00,8.04,-26.07 46.90,14.60,222.00,7.89,-26.05 48.40,14.40,203.00,8.17,-26.14 51.10,16.50,225.00,8.21,-26.37 48.50,15.00,219.00,8.10,-26.19 55.90,17.00,228.00,8.31,-26.35 47.20,15.50,215.00,8.31,-26.22 49.10,15.00,228.00,8.66,-25.79 47.30,13.80,216.00,8.26,-26.24 46.80,16.10,215.00,8.32,-26.06 41.70,14.70,210.00,8.12,-26.45 53.40,15.80,219.00,8.41,-26.34 43.30,14.00,208.00,8.42,-26.38 48.10,15.10,209.00,8.46,-26.23 50.50,15.20,216.00,8.25,-26.18 49.80,15.90,229.00,8.29,-26.21 43.50,15.20,213.00,8.22,-26.11 51.50,16.30,230.00,8.79,-25.76 46.20,14.10,217.00,8.30,-25.96 
55.10,16.00,230.00,8.08,-26.18 44.50,15.70,217.00,8.04,-26.18 48.80,16.20,222.00,8.34,-25.89 47.20,13.70,214.00,7.99,-26.21 46.80,14.30,215.00,8.41,-26.14 50.40,15.70,222.00,8.30,-26.04 45.20,14.80,212.00,8.24,-26.12 49.90,16.10,213.00,8.36,-26.16 46.50,17.90,192.00,9.04,-24.30 50.00,19.50,196.00,8.92,-24.24 51.30,19.20,193.00,9.29,-24.76 45.40,18.70,188.00,8.65,-24.63 52.70,19.80,197.00,9.01,-24.62 45.20,17.80,198.00,8.89,-24.49 46.10,18.20,178.00,8.86,-24.56 51.30,18.20,197.00,8.64,-24.84 46.00,18.90,195.00,8.47,-24.29 51.30,19.90,198.00,8.80,-24.36 46.60,17.80,193.00,8.95,-24.60 51.70,20.30,194.00,8.69,-24.39 47.00,17.30,185.00,8.72,-24.81 52.00,18.10,201.00,9.02,-24.39 45.90,17.10,190.00,9.12,-24.90 50.50,19.60,201.00,9.81,-24.73 50.30,20.00,197.00,10.02,-24.55 58.00,17.80,181.00,9.14,-24.58 46.40,18.60,190.00,9.32,-24.64 49.20,18.20,195.00,9.27,-24.64 42.40,17.30,181.00,9.35,-24.69 48.50,17.50,191.00,9.43,-24.26 43.20,16.60,187.00,9.35,-25.01 50.60,19.40,193.00,9.28,-24.97 46.70,17.90,195.00,9.74,-24.59 52.00,19.00,197.00,9.37,-24.47 50.50,18.40,200.00,8.94,-23.89 49.50,19.00,200.00,9.63,-24.35 46.40,17.80,191.00,9.37,-24.53 52.80,20.00,205.00,9.25,-24.70 40.90,16.60,187.00,9.08,-24.55 54.20,20.80,201.00,9.49,-24.60 42.50,16.70,187.00,9.37,-24.45 51.00,18.80,203.00,9.23,-24.17 49.70,18.60,195.00,9.75,-24.31 47.50,16.80,199.00,9.08,-25.15 47.60,18.30,195.00,8.84,-24.66 52.00,20.70,210.00,9.43,-24.68 46.90,16.60,192.00,9.81,-24.74 53.50,19.90,205.00,10.03,-24.91 49.00,19.50,210.00,9.53,-24.67 46.20,17.50,187.00,9.62,-24.66 50.90,19.10,196.00,10.02,-24.87 45.50,17.00,196.00,9.36,-24.66 50.90,17.90,196.00,9.44,-24.17 50.80,18.50,201.00,9.46,-24.36 50.10,17.90,190.00,9.47,-24.46 49.00,19.60,212.00,9.34,-24.45 51.50,18.70,187.00,9.69,-24.43 49.80,17.30,198.00,9.32,-24.42 48.10,16.40,199.00,9.47,-24.48 51.40,19.00,201.00,9.44,-24.36 45.70,17.30,193.00,9.42,-24.81 50.70,19.70,203.00,9.94,-24.59 42.50,17.30,187.00,9.57,-24.61 52.20,18.80,197.00,9.78,-24.56 
45.20,16.60,191.00,9.62,-24.79 49.30,19.90,203.00,9.89,-24.60 50.20,18.80,202.00,9.74,-24.40 45.60,19.40,194.00,9.47,-24.66 46.80,16.50,189.00,9.65,-24.48 45.70,17.00,195.00,9.27,-24.32 55.80,19.80,207.00,9.70,-24.53 43.50,18.10,202.00,9.38,-24.41 49.60,18.20,193.00,9.46,-24.71 50.80,19.00,210.00,9.98,-24.69 50.20,18.70,198.00,9.39,-24.25