Mathematics/Statistics/Normalization

From Dev Wiki
Revision as of 13:23, 18 May 2020 by Brodriguez (talk | contribs) (Add video reference)

Normalization is a method to make values fall within a "common range".
For example, in Data Mining and Neural Networks, it's common to normalize values so that they fall into the (inclusive) range of [-1, 1] or [0, 1].

Normalization preserves the relative relationships between values within an attribute, while ensuring that no single attribute spans a significantly larger range than the others. When attribute ranges differ widely, the attribute with the larger range can carry more weight (and thus apparent importance) in statistical analysis, even when there is no reason to expect it to matter more.

For example, when analyzing "weight" and "height" attributes for a population of individuals, the choice of unit of measurement implicitly changes how much importance each attribute has in the analysis. If we instead normalize both attributes to the range [0, 1], they carry equal importance in the analysis, regardless of units of measurement.

Below are some common normalization types.

Min-Max Normalization

Min-max normalization is a simple linear transformation that rescales values from their original range to the desired interval.

Warning: If a value added at a later date falls outside of the original data range, its normalized value will fall outside of the target interval. In such a case, recompute the minimum and maximum and re-normalize the entire dataset from the original (un-normalized) data.

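$$v' = \frac{v - \text{min}_A}{\text{max}_A - \text{min}_A} \cdot (\text{newmax}_A - \text{newmin}_A) + \text{newmin}_A$$

Where:

  • $v$ is the original value of the item to adjust.
  • $\text{min}_A$ is the original minimum of the given attribute.
  • $\text{max}_A$ is the original maximum of the given attribute.
  • $\text{newmin}_A$ is the new minimum to use for the given attribute.
  • $\text{newmax}_A$ is the new maximum to use for the given attribute.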
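
As a rough illustration, min-max normalization could be sketched in Python as follows. The function name `min_max_normalize`, its parameter names, and the sample heights are hypothetical and chosen only for this example. The sketch assumes the usual form of the transformation: subtract the original minimum, divide by the original range, then stretch and shift to the new interval.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # A constant attribute has zero range, so the division is undefined.
        raise ValueError("cannot min-max normalize a constant attribute")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical sample data: heights in centimetres.
heights_cm = [150.0, 160.0, 175.0, 200.0]

normalized = min_max_normalize(heights_cm)             # target range [0, 1]
symmetric = min_max_normalize(heights_cm, -1.0, 1.0)   # target range [-1, 1]
```

Here the smallest height maps to the new minimum and the largest to the new maximum. A height of 210 added later would map above 1 if the old minimum and maximum were reused, which is why the whole dataset needs to be re-normalized in that case.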


Zero-Mean Normalization

Also known as "z-score normalization".

Zero-Mean Normalization normalizes attribute values based on the mean and standard deviation.
In other words, this calculates how many standard deviations a value is above or below the mean.


$$v' = \frac{v - \mu_A}{\sigma_A}$$

Where:

  • $v$ is the original value of the individual dataset item to adjust.
  • $\mu_A$ is the attribute mean.
  • $\sigma_A$ is the attribute standard deviation.
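
A minimal Python sketch of z-score normalization is shown below. The function name `z_score_normalize` and the sample weights are hypothetical. It also assumes the population standard deviation (`statistics.pstdev`); the text above does not say whether the population or sample standard deviation is intended, so swap in `statistics.stdev` if the sample version is wanted.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Express each value as the number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical, so a z-score is undefined.
        raise ValueError("cannot z-score normalize a constant attribute")
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data: weights in kilograms.
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]
z_scores = z_score_normalize(weights_kg)
```

After normalization, the values have a mean of approximately 0 and a (population) standard deviation of approximately 1. The middle weight, which equals the mean, maps to 0.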


For further explanation, see this YouTube video.


Normalization by Decimal Scaling

ToDo: this section has not been written yet.