Detecting parallel passages in the DCS
You can show information about parallel or similar passages in the corpus in two ways:
- Display the citations graph using the menu command "Parallels".
- Activate the option "Show parallels" on the text page to show parallels for a given chapter.
To show the parallels between two texts, click on the link that connects them in the citations graph. Some links are hard to hit, so move the mouse pointer with care and patience.
Sketch of the algorithm:
- Set minimum sentence length LMIN and maximum sentence length LMAX.
- Load all sentences that contain between LMIN and LMAX lexical units after "stop words" (ca, eva, as, ...), hapax legomena and duplicate lexemes have been removed.
- Order the remaining lexemes by a sort key.
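- Compare the sentences pairwise. Because the lexemes of each sentence are sorted, comparing two sentences of lengths n and m takes O(n+m) time.
- If two sentences sn and sm of lengths n and m have more than (min(n,m)-1) lexical units in common, create a link between the sentences. The weight w(nm) of this link is calculated as |sn ∩ sm| / |sn ∪ sm|, i.e. the number of shared lexical units divided by the number of distinct lexical units in both sentences (the Jaccard index).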
- Create a link diagram using GraphViz (GVedit). Each text is a vertex, and citations are links between vertices. The weight of a link is the number of passages shared by the two texts divided by the sum of the lexical-unit counts of both texts. The sums of w(nm) are printed as labels next to each link.
- Calculate the centrality of each text (= vertex), which can be interpreted as a measure of how frequently a text cites other texts or is cited by them. Centrality is shown by color: the greener a vertex, the higher its centrality.
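The sentence-level steps above (filtering, sorting, pairwise comparison, link weight) can be sketched in Python as follows. This is only an illustrative sketch, not the DCS implementation: the function names, the values chosen for LMIN and LMAX, the alphabetical sort key, and the toy sentences are all assumptions, and the stop-word set contains only the three examples named in the list. The graph and centrality steps are left out.

```python
from collections import Counter
from itertools import combinations

STOP_WORDS = {"ca", "eva", "as"}  # only the examples named above; the full list is not given
LMIN, LMAX = 2, 10                # hypothetical length limits


def preprocess(sentences):
    """Remove stop words, hapax legomena and duplicate lexemes, then sort.

    Returns {sentence index: sorted lexeme list} for sentences whose
    remaining length lies between LMIN and LMAX.
    """
    freq = Counter(lex for s in sentences for lex in s)
    prepared = {}
    for idx, s in enumerate(sentences):
        kept = sorted({lex for lex in s
                       if lex not in STOP_WORDS and freq[lex] > 1})
        if LMIN <= len(kept) <= LMAX:
            prepared[idx] = kept
    return prepared


def shared_count(a, b):
    """Count the common items of two sorted, duplicate-free lists in O(n+m)."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def find_links(sentences):
    """Return {(i, j): weight} for every pair of sentences that is linked."""
    prepared = preprocess(sentences)
    links = {}
    for (i, a), (j, b) in combinations(prepared.items(), 2):
        common = shared_count(a, b)
        # Link criterion from the list: more than min(n, m) - 1 shared units.
        if common > min(len(a), len(b)) - 1:
            union = len(a) + len(b) - common
            links[(i, j)] = common / union  # |sn ∩ sm| / |sn ∪ sm|
    return links


# Toy input: each sentence is a list of lexemes (invented for illustration).
example = [
    ["dharma", "ca", "artha", "kama"],
    ["artha", "dharma", "eva", "kama", "moksa"],
    ["moksa", "yoga", "jnana"],
]
print(find_links(example))  # {(0, 1): 0.75}
```

In the toy input, the third sentence drops out because only one lexeme survives the hapax filter. The first two sentences share all three lexemes of the shorter one, so they are linked with weight 3/4.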