Aug 05 2011
Tool: Corpex – Wikipedia Corpora Explorer

- the ten most frequent words that start with the typed sequence of letters (as a bar chart and a pie chart), and
- the most frequent letters following the already typed sequence of letters (again, as a bar chart and a pie chart).
Additionally, the ten words that most frequently follow any input word are visualized (as a bar chart and a pie chart).
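To make the three statistics concrete, here is a minimal sketch in Python of how they could be computed from a tokenized text. It is not Corpex's actual implementation: the function names, the toy sentence, and the choice to weight following letters by word frequency are all assumptions made for illustration.

```python
from collections import Counter


def top_words_with_prefix(word_counts, prefix, n=10):
    """Return the n most frequent words that start with `prefix`."""
    matches = Counter(
        {word: count for word, count in word_counts.items()
         if word.startswith(prefix)}
    )
    return matches.most_common(n)


def next_letter_counts(word_counts, prefix):
    """Count which letter follows `prefix`, weighted by word frequency.

    Weighting by frequency is an assumption; counting distinct words
    instead would be an equally plausible reading.
    """
    letters = Counter()
    for word, count in word_counts.items():
        if word.startswith(prefix) and len(word) > len(prefix):
            letters[word[len(prefix)]] += count
    return letters


def following_words(tokens, word, n=10):
    """Return the n words that most frequently come right after `word`."""
    followers = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word
    )
    return followers.most_common(n)


# Toy corpus standing in for a Wikipedia language edition (hypothetical data).
tokens = "the cat sat on the mat and the cat ran to the car".split()
word_counts = Counter(tokens)

prefix_hits = top_words_with_prefix(word_counts, "ca")
letter_hits = next_letter_counts(word_counts, "ca")
follower_hits = following_words(tokens, "the")
```

On this toy input, the prefix "ca" matches "cat" and "car", the letters following "ca" are "t" and "r", and the words following "the" are "cat", "mat", and "car". A real corpus would need proper tokenization and a more efficient index, such as a trie, for prefix lookups.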
This can be used in many applications where the occurrence of words in different language editions of Wikipedia is of interest. An API is also provided for easy access to the data.
Corpex is currently available in the following languages: German (de), English (en), Spanish (es), French (fr), Hungarian (hu), Romanian (ro), Albanian (sq), Bulgarian (bg), Czech (cs), Italian (it), Swedish (sv), Serbian (sr), Croatian (hr), Serbo-Croatian (sh), Bosnian (bs), and Simple English (simple). It is also available for the Brown Corpus (brown). Further languages are being prepared.
Corpex is still under development. The code is fully open source, and all the data is freely available. Feedback, and especially suggestions for cooperation, are welcome.
Try it out at render-project.eu/tools/Corpex