Lexical n-grams

This page displays frequent bigrams (two-word sequences) that contain a given word. Bigrams can be used to detect linguistic patterns or typical phraseology in the use of a word. They are therefore useful for studying word semantics and the general structure of a text.

The bigram page is opened from the "Lexical details" page by clicking the "n-grams" link at the top of that page (below the two links to the occurrences). The following settings can be changed in the bigram search:

  1. Pattern: "xxx - word" searches for combinations in which the given word is in the second position; the reverse pattern places it in the first position.
  2. Minimal frequency indicates how often a bigram must occur to be displayed as a result.
  3. Sorting order offers two ways to sort the results:
    1. by descending frequency and ...
    2. by "interestingness". This point deserves a few more words.

      When the sorting order "Frequency" is selected, the top entries are in most cases very frequent but uninteresting bigrams, in which the searched word is combined with some high-frequency term such as ca, tad or eva. You may test this with the word rājan. Simply inverting the sorting order does not produce more interesting bigrams (you may check this, too). Therefore, the interestingness of two-word combinations is calculated using the mathematical measure of mutual information. In short, mutual information emphasizes a combination if its components occur together in most of the cases in which they occur at all. An example may clarify this idea.

      Assume we have a text of 1000 words and two words A and B that occur 30 and 20 times in this text. In addition, A and B occur 19 times together in a bigram. The mutual information of this bigram is
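
      Assuming the standard definition of pointwise mutual information with a base-2 logarithm, the calculation can be sketched as follows. The base and the symbols P(A), P(B) and P(AB) are assumptions introduced for illustration; each probability is estimated as a count divided by the text length of 1000, and a different base would only rescale the value.

      ```latex
      \mathrm{MI}(A,B)
        = \log_2 \frac{P(AB)}{P(A)\,P(B)}
        = \log_2 \frac{19/1000}{(30/1000)\,(20/1000)}
        = \log_2 \frac{0.019}{0.0006}
        \approx \log_2 31.67
        \approx 4.98
      ```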

      Now, assume that the same two words occur only twice in a bigram in another text of 1000 words. In this case, mutual information is calculated as
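
      Under the same assumptions (standard pointwise mutual information, base-2 logarithm, probabilities estimated as counts divided by the text length of 1000), only the bigram count changes from 19 to 2:

      ```latex
      \mathrm{MI}(A,B)
        = \log_2 \frac{P(AB)}{P(A)\,P(B)}
        = \log_2 \frac{2/1000}{(30/1000)\,(20/1000)}
        = \log_2 \frac{0.002}{0.0006}
        \approx \log_2 3.33
        \approx 1.74
      ```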

      This value is clearly lower than the first one. In the first text, the two words formed bigrams in almost all possible cases (19 of 20), while they were combined only rarely in the second text. Ranking by mutual information detects cases such as the first one and is therefore useful for finding typical semantic patterns of a word.
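
The example can also be retraced in a few lines of Python. This is only a minimal sketch, not the code used by this page: it assumes the standard pointwise mutual information with a base-2 logarithm and estimates every probability as a count divided by the text length. The function name `pmi` and all counts in the `candidates` table are hypothetical and chosen for illustration.

```python
import math


def pmi(pair_count, count_a, count_b, total_words):
    """Pointwise mutual information (base 2) of a bigram, from raw counts.

    Assumption: probabilities are estimated as count / total_words.
    """
    p_pair = pair_count / total_words
    p_a = count_a / total_words
    p_b = count_b / total_words
    return math.log2(p_pair / (p_a * p_b))


# The two texts from the example: A occurs 30 times, B 20 times, 1000 words.
first_text = pmi(19, 30, 20, 1000)   # A and B form a bigram 19 times
second_text = pmi(2, 30, 20, 1000)   # A and B form a bigram only twice
print(round(first_text, 2), round(second_text, 2))  # 4.98 1.74

# Hypothetical counts for a word that occurs 100 times in 100000 words:
# partner -> (bigram count, total count of the partner word).
candidates = {
    "frequent_particle": (50, 5000),  # very common partner word
    "rare_partner": (9, 10),          # occurs almost only in this bigram
}

by_frequency = sorted(candidates, key=lambda w: candidates[w][0], reverse=True)
by_pmi = sorted(
    candidates,
    key=lambda w: pmi(candidates[w][0], 100, candidates[w][1], 100000),
    reverse=True,
)
print(by_frequency[0], by_pmi[0])  # frequent_particle rare_partner
```

With these made-up counts, sorting by frequency puts the combination with the common partner word first, while sorting by mutual information prefers the rarer partner that occurs almost exclusively in this bigram.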