Welcome to hypercone.com on July 5 2009.

Text normalization

From Wikipedia, the free encyclopedia

Text normalization is the process of transforming text into a single consistent form that it might not have had before. It is often performed before the text is processed further, for example before generating synthesized speech, automated language translation, storage in a database, or comparison.

Examples of text normalization:

  • Unicode normalization
  • converting all letters to lower or upper case
  • removing punctuation
  • removing accent marks and other diacritics from letters
  • expanding abbreviations
  • removing stopwords or "too common" words
  • stemming

While this may be done manually, as it usually is for ad hoc and personal documents, many programming languages provide mechanisms that support text normalization.
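
As an illustration, several of the steps listed above can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation: the function names and the tiny stopword list are assumptions made for the example, and stemming and abbreviation expansion are left out.

```python
import string
import unicodedata

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and"}


def strip_diacritics(text):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    """Apply a fixed sequence of normalization steps to a string."""
    text = unicodedata.normalize("NFC", text)  # Unicode normalization
    text = text.lower()                        # convert all letters to lower case
    text = strip_diacritics(text)              # remove accent marks and diacritics
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stopwords and collapse runs of whitespace.
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)
```

With these assumptions, normalize("The Café, and the RÉSUMÉ!") gives "cafe resume". The order of the steps matters: lower-casing before the stopword check lets "The" match "the".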

Text normalization is useful, for example, for comparing two sequences of characters that mean the same thing but are represented differently. Examples of this kind of normalization include, but are not limited to, "don't" vs "do not", "I'm" vs "I am", and "can't" vs "cannot".

Further, "1" and "one" are the same, "1st" is the same as "first", and so on. Instead of treating these strings as different, text processing makes it possible to treat them as the same.
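
A comparison of this kind can be sketched in Python with a small lookup table. The table below holds only the pairs mentioned above, and the names EXPANSIONS, expand and same_meaning are hypothetical, chosen for this example. A practical system would need a far larger table and would have to handle other apostrophe characters and ambiguous cases.

```python
import re

# Lookup table built only from the pairs mentioned in the text (an assumption;
# a real table would be much larger).
EXPANSIONS = {
    "don't": "do not",
    "i'm": "i am",
    "can't": "cannot",
    "1": "one",
    "1st": "first",
}


def expand(text):
    """Lower-case the text, split it into tokens, and expand known tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(EXPANSIONS.get(tok, tok) for tok in tokens)


def same_meaning(a, b):
    """Treat two strings as the same if their expanded forms are identical."""
    return expand(a) == expand(b)
```

Under these assumptions, same_meaning("Can't", "Cannot") and same_meaning("I'm 1st", "I am first") both return True, while same_meaning("1st", "one") returns False.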
