Modern organizations are accumulating huge volumes of textual documents. To turn archives into valuable knowledge sources, textual content must become explicit and queryable. Semantic tagging with markup languages such as XML satisfies both requirements. We thus introduce the DIAsDEM framework for extracting semantics from structural text units (e.g., sentences), assigning XML tags to them, and deriving a flat XML DTD for the archive. DIAsDEM focuses on archives characterized by a peculiar terminology and by an implicit structure, such as court filings and company reports. In the knowledge discovery phase, text units are iteratively clustered by similarity of their content. Each iteration outputs clusters satisfying a set of quality criteria. Text units contained in these clusters are tagged with semi-automatically determined cluster labels and the corresponding XML tags. Additionally, extracted named entities (e.g., persons) serve as attributes of XML tags. We apply the framework in a case study on the German Commercial Register.
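The tagging and DTD derivation steps described above can be illustrated with a minimal Python sketch. It is not the DIAsDEM implementation: the function names `tag_unit` and `flat_dtd`, the root element `Entry`, the labels, and the sample text units are all hypothetical, and the sketch assumes that clustering and named entity extraction have already assigned a label and a set of entities to each text unit.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_unit(text, label=None, entities=None):
    """Wrap one text unit in an XML tag; units without a label stay untagged."""
    if label is None:
        return escape(text)
    # Named entities become attributes of the tag (sorted for stable output).
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in sorted((entities or {}).items())
    )
    return f"<{label}{attrs}>{escape(text)}</{label}>"


def flat_dtd(root, units):
    """Derive a flat DTD: the root mixes text with one childless element per label."""
    attrs_by_label = {}
    for _text, label, entities in units:
        if label is not None:
            attrs_by_label.setdefault(label, set()).update(entities or {})
    labels = sorted(attrs_by_label)
    content = " | ".join(["#PCDATA"] + labels)
    lines = [f"<!ELEMENT {root} ({content}){'*' if labels else ''}>"]
    for label in labels:
        lines.append(f"<!ELEMENT {label} (#PCDATA)>")
        for name in sorted(attrs_by_label[label]):
            lines.append(f"<!ATTLIST {label} {name} CDATA #IMPLIED>")
    return "\n".join(lines)


# Hypothetical text units: (text, cluster label or None, extracted entities).
units = [
    ("The share capital amounts to 50,000 DM.", "ShareCapital",
     {"Amount": "50000", "Currency": "DM"}),
    ("Managing director is Hans Mueller.", "ManagingDirector",
     {"Person": "Hans Mueller"}),
    ("Not yet registered.", None, None),  # unit from a rejected cluster
]

document = "<Entry>" + " ".join(tag_unit(*u) for u in units) + "</Entry>"
print(document)
print(flat_dtd("Entry", units))
```

The DTD is flat in the sense that every tag is a direct child of the root and contains only text, which matches the abstract's description of tagging individual text units rather than building a nested hierarchy.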
Citation:
Henner Graubitz, Myra Spiliopoulou, Karsten Winkler, "The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques," First IEEE International Conference on Data Mining (ICDM'01), p. 171, 2001.