ISBN-13: 9783639026740 / English / Paperback / 2008 / 108 pp.
Content in numerous data sources is not directly amenable to machine processing. This book describes techniques for automated semantic analysis of schematic content, that is, content populated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of HTML documents, the book devises a technique that bootstraps an annotation process for automatically identifying concept instances in other documents. The technique exploits the observation that semantically related items in schematic HTML documents exhibit consistency in presentation style and spatial locality, and it uses lightweight semantic features to learn statistical concept models. These models direct the annotation of diverse Web documents with similar content semantics. The power of these techniques is demonstrated through applications developed for real-life problems, including audio-based assistive browsing for non-visual Web access, focused browsing on handhelds with semantic bookmarks, text data cleaning, and accurate identification of remote homologs of biological protein sequences.