Other Utilities
Text analyser
Akhar also bundles a text analysis utility, a useful
tool for researchers working in computational
linguistics, translation, natural language processing,
text processing, lexicography, optical character
recognition, information retrieval and speech
recognition.
It performs quantitative analysis of text and generates
word-frequency lists, character-frequency lists,
concordances and other statistics, such as the counts of
running and unique words in a text, the token-to-type
ratio, the mean word length and the percentage frequency
of each word length. These statistics have applications
in linguistic and stylistic analysis. One researcher has
proposed that counts of the number of words of various
lengths (one-letter words, two-letter words, and so on)
could be graphed to produce a consistent fingerprint for
a writer, as long as the samples are sufficiently
large. A comparison of novels written by the same author
shows few and small differences in word lengths, while
a comparison of novels written by different authors
shows significant differences.
Automatic statistical routines are applied to very large
bodies of text to uncover facts about language that no
amount of manual searching could reveal. Computer
analysis of electronic texts makes it easy to answer
questions that could otherwise be answered only by
intuition, guesswork or mind-numbing research. The ten
most frequently occurring words in Sri Guru Granth Sahib
are displayed below.
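
Separately from the Akhar utility itself, the statistics named in this section (running and unique word counts, token-to-type ratio, mean word length and the percentage frequency of each word length) can be illustrated with a short Python sketch. This is not Akhar's implementation. The function names `tokenize` and `text_statistics` are hypothetical, and the sketch assumes that words are separated by whitespace, that only ASCII punctuation needs stripping, and that word length is measured in Unicode code points (which will over-count Gurmukhi words containing vowel signs).

```python
import string
from collections import Counter


def tokenize(text):
    """Split on whitespace, strip surrounding ASCII punctuation, lowercase.

    Assumption: whitespace-delimited words; a real analyser for Punjabi
    would also need to handle script-specific punctuation.
    """
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def text_statistics(text):
    """Compute basic quantitative statistics for a text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no words")

    word_freq = Counter(tokens)                     # word-frequency list
    length_freq = Counter(len(w) for w in tokens)   # words per word length
    n = len(tokens)

    return {
        "running_words": n,                         # tokens
        "unique_words": len(word_freq),             # types
        "token_type_ratio": n / len(word_freq),
        "mean_word_length": sum(len(w) for w in tokens) / n,
        # Percentage of running words having each length: the
        # word-length "fingerprint" discussed in the text.
        "length_percent": {
            length: 100 * count / n
            for length, count in sorted(length_freq.items())
        },
        "word_freq": word_freq,
    }


sample = "The cat sat on the mat. The dog sat too."
stats = text_statistics(sample)
print(stats["running_words"], stats["unique_words"])
print(stats["word_freq"].most_common(2))
print(stats["length_percent"])
```

Comparing the `length_percent` tables of two sufficiently large samples is one simple way to explore the word-length fingerprint idea; the toy sentence here is far too short to say anything about authorship.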

Some of the main features of the text analysis utility
are:
-
It performs quantitative analysis
of text and generates word-frequency lists,
character-frequency lists and other statistics, such
as the counts of running and unique words in a text,
the token-to-type ratio, the mean word length and the
percentage frequency of each word length.
-
It can analyse Unicode, ISCII or
font-encoded text stored in RTF, DOC or HTML files
as well as in plain text files. Multiple files can
also be selected for analysis.
-
The word lists can be displayed
in alphabetical order, in order of occurrence in the
text, by frequency or by word length, in either
ascending or descending order.
-
It can analyse both English and
Punjabi text and arranges the words in alphabetical
order according to the text’s language. Font-encoded
Punjabi text may use any of the popular fonts.
-
A concordance utility is also
provided (Fig. 6), which is very useful for context
analysis. The user can search for any word in a
document, and all occurrences of the word are
displayed in KWIC (Key Word in Context) format, in
which the keyword is highlighted and placed in the
centre, with its neighbouring text on either side.
-
The frequency lists and
statistics can be stored for future reference.
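
Two of the features listed above, configurable word-list ordering and KWIC concordance display, can be sketched in Python as follows. This is an illustrative sketch, not the code used in Akhar. The names `sort_word_list` and `kwic` and the `width` parameter are hypothetical, words are assumed to be whitespace-separated, and "alphabetical" here means Python's default code-point order, which only approximates proper collation for Punjabi.

```python
import string
from collections import Counter


def clean(token):
    """Strip surrounding ASCII punctuation and lowercase a token."""
    return token.strip(string.punctuation).lower()


def sort_word_list(text, key="alphabetical", descending=False):
    """Return the unique words of `text` in the requested order."""
    words = [w for w in map(clean, text.split()) if w]
    freq = Counter(words)  # iterates in order of first occurrence
    first_seen = {w: i for i, w in enumerate(freq)}
    orderings = {
        "alphabetical": lambda w: w,           # code-point order (assumption)
        "occurrence": lambda w: first_seen[w],
        "frequency": lambda w: freq[w],
        "length": len,
    }
    return sorted(freq, key=orderings[key], reverse=descending)


def kwic(text, keyword, width=30):
    """Return one KWIC line per occurrence of `keyword`.

    Each line holds `width` characters of left context, the keyword
    marked with brackets, then up to `width` characters of right context,
    so the keyword lines up in the same column on every line.
    """
    tokens = text.split()
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if clean(tok) == target:
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  [{tok}]  {right}")
    return lines


sample = "The cat sat on the mat. The dog sat too."
print(sort_word_list(sample, key="frequency", descending=True))
for line in kwic(sample, "sat", width=12):
    print(line)
```

Right-aligning the left context is what centres the keyword; a graphical tool would highlight it instead of using brackets.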