Data points are x_{i} (i=1,...,m). In the applet, they are situated on the 607×400 workdesk.
The first
principal component. 
Principal component analysis (PCA) is the most popular method for data approximation by straight lines and planes, and for dimensionality reduction. PCA was invented by Karl Pearson. In 1901 he wrote: "In many physical, statistical, and biological investigations it is desirable to represent a system of points in plane, three, or higher dimensioned space by the "bestfitting" straight line or plane." The first principal component is the straight line that minimizes the sum of the squared distances from the data points. It is the least squares approximation of the data set by a line. The closest approximation is equivalent to the widest scattering of projections because of the Pythagorean theorem. Therefore, the first principal component maximizes the variance of projections of the data points on the straight line.To find the second principal component, it is sufficient to subtract from data vectors their projections on the first principal component and then find again the best straight line approximation, and so on. In this applet, we use the simple iterative splitting algorithm for calculation of the "bestfitting" straight line.
Two initial nodes, y_{1}, y_{2}, are defined by the user before learning. The data points x_{i} (i=1,...,m) are given.
Further reading
The SelfOrganizing Map (SOM), also known as
the Kohonen network, is a computational method for
the visualization, lowdimensional approximation and analysis of
highdimensional data. It works with two spaces: a lowdimensional space with a
regular grid of nodes and the higherdimensional space of data. In our
applet, the data space is twodimensional (a plane) and the nodes form a onedimensional
array.
Radius 3 linear
Bspline 
Each node n_{i} is mapped into the dataspace. Its image, y_{i} is called the coding vector, the weight vector or the model vector. Rather often, the images of nodes in the data space are also called "nodes", with some abuse of language, and the same notation is often used for the nodes and the corresponding coding vectors. The exact sense is usually clear from the context.
The data point belongs to (or is associated with) the nearest coding vector. For each i we have to select data points owned by y_{i}. Let the set of indices of these data points be C_{i}: C_{i}={l:x_{l}y_{i}^{2}≤x_{l}y_{j}^{2} ∀i≠j} and C_{i} be the number of points in C_{i}.
The learning of SOM is a special procedure
that aims to improve the approximation of data by the coding vectors and, at
the same time, approximately preserves the neighbourhood relations between
nodes. If nodes are neighbours then the corresponding coding vectors should
also be relatively close to each other. To improve the approximation of data,
each coding vector y_{i} moves during
learning towards the mean point of the data points owned by y_{i}.
It involves in this movement the neighbours to preserve neighbourhood relations
between nodes. This Batch SOM learning rule was proposed by Kohonen. The learning algorithm has one parameter, the
"Neighbourhood radius" h_{max}.
It is necessary to evaluate the neighbourhood function and this should
be assigned before learning. For this applet, the neighbourhood function, h_{ij}, has the simple Bspline form: 1h_{ij}=ij/(h_{max}+1) if ij=h_{max} and h_{ij}=0
if ij>h_{max} (the
example is given on the right).
1D SOM as a
broken line. 
The initial location of coding vectors should be assigned before the learning starts. There are three options for SOM initializations:
· The
user can the select coding vectors randomly from datapoints;
· The
coding vectors may be selected as a regular grid on the first principal
component with the same variance;
· The
coding vectors may be assigned manually, by mouse clicks.
All learning steps are saved into "Learning history". If learning restarts then the last record of Learning
history
is used instead of the user defined initial coding vectors.
Before
learning, all C_{i} are set to the empty set (C_{i}=∅),
and the steps counter is set to zero.
If we connect the coding vectors for the nearest nodes by
intervals then SOM is represented as a broken line in the data space. To
evaluate the quality of approximation, we have to calculate the distance from
the datapoints to the broken line and evaluate the Fraction of variance unexplained (FVU).
Approximation of a spiral by GSOM; 65 nodes, Approximation of a spiral by GSOM; 152 nodes. Approximation
of a spiral by GSOM. All attempts to
approximate a spiral by 
Further reading
The
growing selforganizing map (GSOM) was developed to identify a suitable map
size in the SOM. It starts with a minimal number of nodes and grows new nodes
on the boundary based on a heuristic. At the same time, as a byproduct, we can
sometimes get better approximation properties (see an example on the right).
There are many heuristics for GSOM growing. Our version is optimized for 1D
GSOMs, the model of principal curves. GSOM method is specified by three
parameters. Their values should be set before the learning starts:
· "Neighbourhood radius"
this parameter, h_{max}, is used to evaluate the
neighbourhood function, h_{ij} (the
same as for SOM).
· "Maximum number of nodes"
this parameter restricts the size
of the map.
· "Stop when fraction of variance unexplained percent is less
than..."
The threshold for FVU value.
The
GSOM algorithm includes learning and growing phases. The learning
phase is exactly the SOM leaning algorithm. The
only difference is in the number of learning steps. For SOM we use 100 batch
learning steps after each learning start or restart, whereas for GSOM we select
20 batch learning steps in a learning loop.
The initial location of coding vectors should be assigned before the learning starts. There are three options for GSOM initializations:
· The user can select coding vectors randomly from datapoints;
· The coding vectors may be selected as a regular grid on the
first principal component with the same variance;
· The coding vectors may be assigned manually, by mouse clicks.
We use n for the number of nodes in the map.
All learning steps (vectors y_{i}) are saved into "Learning history". If learning restarts then y_{i} from the last record of the Learning history are used instead of user defined initial coding vectors.
Beforethe
start, set the step counter to zero and the sets of data points associated with
nodes to the empty sets, C_{i}=∅.
Further
reading
The applet contains several menu panels (top left), the University of Leicester label (top right), work desk (center) and the label of the Attribution 3.0 Creative Commons publication license (bottom).
To wisit the website of the University of Leicester click on the University of Leicester logo.
To read the Creative Commons publication license click on the bottom license panel.

The
first menu panel allows you to create and edit the data set. Every data point
is displayed as a small circle on the work desk.
The
first part of the Data set panel contains one big coloured rectangle and
a palette built from six small differently coloured rectangles. To select one
of the six colors click on a small rectangle. The big
coloured rectangle displays the currently selected color.
In this applet, this is just the colour of the cursor. The colours are used for
data points in the applets Kmeans and Kmedoids and KNN and Potential Energy to distinguish points from different
classes and clusters.
The
third part of the Data set panel contains six buttons. The first three
buttons change the type of cursor brush.
· The button One point switches the brush to add single
points. Every mouse click on the work desk will add a point to the data set.
· The button Scatter allows you to add several points by
one mouse click. You can choose the number of added points in the Number of points spinner at the middle part of the menu panel.
Points are scattered randomly in a circle which radius is determined by the
slider Caliber in the middle part of menu
panel.
· The button Erase switches the mouse cursor to the
eraser. When you click the eraser mouse cursor on the work desk, all points,
whose centres are covered by the cursor, are removed from the work desk.
· The button Select is not used in this applet.
· When you press the Random button several points are added
on the work desk. The number of points is defined by the Number of points spinner. The locations of new points are
generated randomly on whole work desk.
· The button Clear all completely clears the work desk. It
removes all kinds of object: data points, centroids, test results, maps and so
on.
The All methods menu panel containes
two parts.

The
right part, Learning
history, contains five buttons, one field for data input and two
information fields.
· The
leftmost button shows the first step. The second button moves to the previous
step.
· The
third button starts the slide show with delay between steps selected in
editable field Delay (ms). It goes always from
the first step to the last one.
· The
fourth button moves to the next step. The rightmost button shows the last step.
The information field Step # shows the number of the current step.
· The
information field from indicates the total number of learning steps in
the learning history.
The left part of all methods menu panels contains the name
of the method, two fields to show the number of nodes, and six standard
buttons.

· The
Random init button serves to add a coding
vector at randomly selected data point to the current set of coding vectors.
· The
PCA init button serves to add a new coding
vector on the first principal conponent line. All
coding vectors are distributed on the first principal component line
equidistantly. The variance of this equidistribution
on a segment is selected equal to the total variance of the data points.
· The
Remove initialization button removes all nodes.
· The
Options button opens options dialog for customizing method parameters.
This dialog is specific for every method.
· The
Learn button starts the learning process, specific for every method.
· The
Show error/Hide error button shows/hides the tableau, which contains the
names of the methods and the corresponding errors (Fraction
of variance unexplained,). The tableau is located at the right bottom
corner of the work desk.
Use
the menu panel PCA to work with principal component analysis.

For PCA, all points and lines are in a blue color. The points for PCA have the form of fourray stars (see left). To start PCA learning two points are needed. You can select initial points by mouse clicks at choosen locations, or by the button Random initialization. When you try add the third point, the first point is removed.
After learning is finished, use the Learning history buttons at the right part of menu panel to look at the learning history and results. You will see one fourray star, the centroid of the data set, and one line, the current approximation of the first principal component.
Use
the menu panel SOM to work with self organizing
maps.

For SOM, all points and lines are in a red color. The points for SOM have the form of nineray stars (see left). To start SOM learning, the initial positions of all coding vectors are needed. To add a point to an initial approximation, click the mouse button at the choosen location, or use the button Random init or the button PCA init. When you add a new point, it is attached to the last added point.
After learning is finished, use the Learning history buttons at the right part of menu panel to look at the learning history and results.
Use
the menu panel GSOM to work with growing self
organizing map.

For GSOM, all points and lines are in a green color. The points for GSOM have the form of nineray stars (see left). To start learning, GSOM is needed to initiate one point. To place initial point click the mouse button in the chosen location, or push the button Random init or the button PCA init. When you add a new point, the new point is joined with the last added point.
After learning is finished, use the Learning history buttons at the right part of menu panel to look at the learning history and results.
The
menu panel Std. data is needed to work with some standard data set.

· The first button Clear all comletely
clears the work desk. It removes all kinds of objects: data points, coding
vectors and so on.
· The button Scatter serves to scatter existing data
points. The scattering procedure adds a specified number of new data points in
a specifyed circle around every data point, which
exists. Options of scattering are defined in dialog (see left).
· The Number of points to add parameter defines the number
of points to add to every data point.
· The Scattering radius parameter defines the radius of
the circle around each data point where these new data points should be equidistributed.
· The last button Save data saves all data point locations
to a text file named Data for mirkes.txt. In order to add this data set
to the Std. data panel file with a short description you needed to send
this file by email to Evgeny Mirkes, emmirkes@gmail.com. Warning: To use the
button Save data you should change
settings of you browser security system, or start this applet offline without brouser.
Any
other button sends a stored ("standard") image to the work desk.
We
approximate data by a straight line (PCA) or by a broken line (SOM and GSOM).
The dimensionless least square evaluation of the error is the Fraction
of variance unexplained (FVU). It is defined as the fraction:
[The
sum of squared distances from data to the approximating line line/The sum of
squared distances from data to the mean point].
The
distance from a point to a straight line is the length of a perpendicular
dropped from the point to the line. This definition allows us to evaluate FVU
for PCA. For SOM and GSOM we need to solve the following problem. For the given
array of coding vectors {y_{i}} (i=1,2, ... m) we have to calculate the distance from
each data point x to the broken line specified by a
sequence of points {y_{1}, y_{2}, ... y_{m}}. For the data point x, its
projection onto the broken line is defined, that is, the closest point. The
square of distance between the coding vector y_{i}
and the point x is d_{i}(x)=xy_{i}^{2} (i=1,2, ... m).
Projection of a data point onto a segment 
Let us calculate the squared distance from the data point x to the segment [y_{i}, y_{i}_{+1}] (i=1,2, ... m1). For each i, we calculate l_{i}(x)=(xy_{i} , y_{i}_{+1}y_{i} ) /y_{i}_{+1}y_{i}^{2} (see left).
If 0<l_{i}(x)<1 then the point, nearest to >x on the segment [y_{i}, y_{i}_{+1}], is the internal point of the segment. Otherwise, this nearest point is one of the segment's ends.
Let 0<l_{i}(x)<1 and >c be a projection of x onto segment [y_{i}, y_{i}_{+1}]. Then c  y_{i}^{2}=(l_{i}(x)y_{i}y_{i}_{+1})^{2} and, due to Pythagorean theorem, the squared distance from x to the segment [y_{i}, y_{i}_{+1}] is r_{i}(x)=xy_{i}^{2}(l_{i}(x) y_{i}y_{i}_{+1})^{2}.
Let d(x)=min{d_{i}(x)  i=1,2, ... m} and r(x)=min{r_{i}(x)  0< l_{i}(x) <1, 0>< i< m }. Then the squared distance from x to the broken line specified by the sequence of points {y_{1}, y_{2}, ... y_{m}} is D(x)=min{d(x), r(x)}.
For the given set of the data points x_{j} and the given approximation by a broken line, the sum of the squared distances from the data points to the broken line is S=∑_{j}D(x_{j}), and the fraction of variance unexplained is S/V, where V=∑_{j}(x_{j}  X )^{2} and X is the empirical mean: X=(1/n)∑x_{i}.