loading...
XStruct: Efficient Schema Extraction from Multiple and Large XML Documents
Atlanta, Georgia April 03-April 07
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDEW.2006.16622nd International Conference on Data ...
 This Article 
 
PDF
HTML
 
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Jan Hegewald, Humboldt-Universitat zu Berlin
Felix Naumann, Humboldt-Universitat zu Berlin
Melanie Weis, Humboldt-Universitat zu Berlin
XML is the de facto standard format for data exchange on the Web. While it is fairly simple to generate XML data, it is a complex task to design a schema and then guarantee that the generated data is valid according to that schema. As a consequence much XML data does not have a schema or is not accompanied by its schema. In order to gain the benefits of having a schema - efficient querying and storage of XML data, semantic verification, data integration, etc.- this schema must be extracted.

In this paper we present an automatic technique, XStruct, for XML Schema extraction. Based on ideas of [5], XStruct extracts a schema for XML data by applying several heuristics to deduce regular expressions that are 1-unambiguous and describe each element?s contents correctly but generalized to a reasonable degree. Our approach features several advantages over known techniques: XStruct scales to very large documents (beyond 1GB) both in time and memory consumption; it is able to extract a general, complete, correct, minimal, and understandable schema for multiple documents; it detects datatypes and attributes. Experiments confirm these features and properties.

Citation:
Jan Hegewald, Felix Naumann, Melanie Weis, "XStruct: Efficient Schema Extraction from Multiple and Large XML Documents," icdew, pp.81, 22nd International Conference on Data Engineering Workshops (ICDEW'06), 2006
Usage of this product signifies your acceptance of the Terms of Use.


Suggestions