Using Comparable Corpora for Under-Resourced Areas of Machine Translation (book)

Name: Using Comparable Corpora for Under-Resourced Areas of Machine Translation
Brand: Springer
Price: 563.56 PLN
Availability: In stock

ISBN-13: 9783319990033 / English / Hardcover / 2019 / 323 pp.

Inguna Skadiņa; Robert Gaizauskas; Bogdan Babych

Price: 563.56 PLN
(net: 536.72, VAT: 5%)

Lowest price in the last 30 days: 539.74

Order fulfilment time:
approx. 16-18 working days.

Free delivery!

This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains.

Categories:

Computer Science, Databases

BISAC categories:

Computers > Artificial Intelligence - Natural Language Processing
Language Arts & Disciplines > Linguistics - General
Computers > Data Science - Data Analytics

Publisher:

Springer

Series:

Theory and Applications of Natural Language Processing

Language:

English

ISBN-13:

9783319990033

Year of publication:

2019

Edition:

2019

Number of pages:

323

Weight:

0.66 kg

Dimensions:

23.5 x 15.5 cm

Binding:

Hardcover

Volumes:

Additional information:

Comment
Illustrated edition

1 Introduction

2 Cross-language comparability and its applications for MT

2.1 Introduction: Definition and use of the concept of comparability

2.2 Development and calibration of comparability metrics on parallel corpora

2.2.1 Application of corpus comparability: Selecting coherent parallel corpora for domain-specific MT training

2.2.2 Methodology

2.2.2.1 Description of calculation method

2.2.2.2 Symmetric vs. asymmetric calculation of distance

2.2.2.3 Calibrating the distance metric

2.2.3 Validation of the scores: cross-language agreement for source vs. target sides of TMX files

2.2.4 Discussion

2.3 Exploration of comparability features in document-aligned comparable corpora: Wikipedia

2.3.1 Overview: Wikipedia as a source of comparable corpora

2.3.2 Previous work on using Wikipedia as a linguistic resource

2.3.3 Methodology

2.3.3.1 Document pre-processing

2.3.3.2 Similarity measures

2.3.3.3 Eliciting human judgments

2.3.4 Results and analysis

2.3.4.1 Responses to the questionnaire

2.3.4.2 Inter-assessor agreement

2.3.4.3 Correlation of similarity measures to human judgments

2.3.4.4 Classification task

2.3.5 Discussion

2.3.5.1 Features of ‘Similar’ articles

2.3.5.2 Measuring cross-language similarity

2.3.6 Section conclusions

2.4 Metrics for identifying comparability levels in non-aligned documents

2.4.1 Using parallel and comparable corpora for MT

2.4.2 Related work

2.4.3 Comparability metrics

2.4.3.1 Lexical mapping based metric

2.4.3.2 Keyword based metric

2.4.3.3 Machine translation based metrics

2.4.4 Experiments and evaluation

2.4.4.1 Data sources

2.4.4.2 Experimental results

2.4.5 Metric application to equivalent extraction

2.4.6 Discussion

2.4.6.1 Advantages and disadvantages of the metrics

2.4.6.2 Using semi-parallel equivalents in MT systems

2.4.7 Conclusion

3 Collecting comparable corpora

3.1 Introduction

3.2 Previous work in collecting comparable corpora

3.2.1 Web crawling

3.2.2 Identifying comparable text

3.3 ACCURAT techniques to collect comparable documents

3.3.1 Comparable corpora collection from Wikipedia

3.3.1.1 Extracting comparable articles

3.3.1.2 Measuring similarity in inter-language linked documents

3.3.2 Comparable corpora collection from news articles

3.3.3 Comparable corpora collection from narrow domains

3.3.3.1 Acquiring comparable documents

3.3.3.2 Aligning comparable document pairs

4 Extracting data from comparable corpora

4.1 Introduction

4.2 Term extraction, tagging, and mapping for under-resourced languages

4.2.1 Related work

4.2.2 Term Extraction, tagging, and mapping with the ACCURAT toolkit

4.2.2.1 Term candidate extraction with CollTerm

4.2.2.1.1 Linguistic filtering

4.2.2.1.2 Minimum frequency filter

4.2.2.1.3 Statistical ranking

4.2.2.1.4 Cut-off method

4.2.2.2 Term tagging in documents

4.2.2.2.1 Term tagging evaluation for Latvian and Lithuanian

4.2.2.2.2 Term tagging evaluation for Croatian

4.2.2.3 Term mapping

4.2.2.4 Comparable corpus term mapping task

4.2.2.5 Discussion

4.2.3 Experiments with English and Romanian term extraction

4.2.3.1 Single-word term extraction

4.2.3.2 Multi-word term extraction

4.2.3.3 Experiments and results

4.2.4 Multi-word term extraction and context-based mapping for English-Slovene

4.2.4.1 Resources and tools used

4.2.4.1.1 Comparable corpus

4.2.4.1.2 Seed lexicon

4.2.4.1.3 LUIZ

4.2.4.1.4 ccExtractor

4.2.4.2 Experimental setup

4.2.4.2.1 Term extraction

4.2.4.2.2 Term mapping

4.2.4.2.3 Extension of the Seed lexicon

4.2.4.3 Evaluation of the results

4.2.4.3.1 Evaluation of term extraction

4.2.4.3.2 Evaluation of term mapping

4.2.4.4 Discussion

4.3 Named entity recognition using TildeNER

4.3.2 Annotated corpora

4.3.3 System design

4.3.3.1 Feature function selection

4.3.3.2 Data pre-processing

4.3.3.3 NER model bootstrapping

4.3.3.4 Refinement methods

4.3.4 Evaluation

4.3.4.1 Non-comparative evaluation

4.3.4.2 Experimental comparative evaluation

4.3.5 Discussion

4.4 Lexica extraction

4.4.1 Related work

4.4.2 Experiments on bilingual lexicon extraction

4.4.2.1 Experimenting with key parameters

4.4.2.2 Corpus size and comparability

4.4.2.3 Seed lexicons

4.4.2.4 Vector building and comparison

4.4.2.5 Evaluation of results

4.4.3 Bilingual lexicon extraction for closely related languages

4.4.3.1 Building the comparable corpus

4.4.3.2 Building the Seed lexicon

4.4.3.3 Extending the Seed lexicon with cognates

4.4.3.4 Extending the Seed Lexicon with first translations

4.4.3.5 Combining cognates and first translations of the most frequent words to extend the Seed lexicon

4.4.3.6 Re-ranking of translation candidates with cognate clues

4.4.4 Discussion

5 Mapping and aligning units from comparable corpora

5.1 Introduction

5.2 Related work

5.3 Document alignment in comparable corpora

5.3.1 EMACC

5.3.2 EMACC evaluations

5.4 Parallel sentence mining from comparable corpora

5.4.1 LEXACC

5.4.1.1 Indexing target sentences

5.4.1.2 Finding translation candidates for source sentences

5.4.1.3 Filtering

5.4.1.4 The PEXACC translation similarity measure

5.4.1.4.1 Features

5.4.1.4.2 Learning the optimal weights

5.4.1.5 Evaluations

5.4.1.5.1 Experimental setting

5.4.1.5.2 Search engine efficiency

5.4.1.5.3 Filtering efficiency

5.4.1.5.4 Translation similarity efficiency

5.4.1.5.5 SMT experiments

5.4.2 PEXACC

5.4.2.1 The algorithm

5.4.2.2 Evaluations

5.4.2.2.1 Computing P, R, and F1

5.4.2.2.2 Comparison with the state of the art

5.4.2.2.3 Running PEXACC on real-world data

5.5 Parallel phrase mining from comparable corpora

5.5.1 Parallel phrase mining with SVM

5.5.1.1 Phrase pair generation

5.5.1.1.1 Training example extraction

5.5.1.1.2 Test instance generation

5.5.1.2 SVM classifier

5.5.1.2.1 Cognate-based methods for translation purposes

5.5.1.3 Experiments

5.5.1.3.1 Data sources

5.5.1.3.2 Phrase extraction for classifier training and testing

5.5.1.3.3 Phrase extraction from comparable corpora

5.5.1.3.4 Results

5.5.2 Parallel phrase mining with PEXACC

6 Training, enhancing, evaluating and using MT systems with comparable data

6.1 Introduction

6.2 Enriching general domain SMT systems with data from comparable corpora

6.2.1 Data used for experiments

6.2.2 Methodology

6.2.3 Experiments with data extracted from comparable corpora

6.2.4 Staggered experiments

6.3 Human evaluation of MT output

6.3.1 Evaluation methodology and the interface

6.3.2 Experiment set-up

6.3.3 Human evaluation results

6.4 MT adaptation for under-resourced domains

6.4.1 Initial extraction and alignment of terms and named entities

6.4.2 Comparable corpora collection

6.4.3 Extraction of term pairs from comparable corpus

6.4.4 Baseline system training

6.4.5 SMT system adaptation

6.5 MT adaptation to a narrow domain in case of resource-rich languages

6.5.1 Evaluation objects: Narrow-domain-tuned MT systems

6.5.2 Evaluation data

6.5.3 Evaluation methodology

6.5.5 Evaluation tools

6.5.6 Evaluation results

6.5.7 Conclusion

6.6 Application of Machine Translation in web authoring

6.6.1 The role of translation and MT in web authoring

6.6.2 Characteristics and requirements for translation in web authoring

6.6.3 MT systems enhanced with comparable corpora in web authoring – a use case

6.6.4 Conclusion

6.7 Systems for computer aided translation

6.7.1 Collecting and processing a comparable corpus

6.7.2 Building SMT systems

6.7.3 Automatic and comparative evaluation

6.7.4 Evaluation in localisation task

6.7.5 Discussion

7 New areas of application of comparable corpora

7.1 Automatic dictionary expansion using a Seed lexicon and non-parallel corpora

7.1.1 Motivation: Improving algorithms and boosting performance via cross-language transitivity

7.1.2 Approach

7.1.3 Language resources

7.1.4 Results

7.1.5 Discussion

7.2 Identifying word translations from comparable documents without a Seed lexicon

7.2.1 Motivation

7.2.2 Approach

7.2.2.1 Pre-processing steps

7.2.2.2 Alignment steps

7.2.2.3 Vocabularies

7.2.3 Evaluation setup

7.2.4 Results and evaluation

7.2.4.1 Comparison with other work

7.2.4.2 Application to other languages

7.2.5 Discussion

7.3 Chinese-Japanese parallel sentence extraction from quasi-comparable and comparable corpora

7.3.1 Motivation

7.3.2 Parallel sentence extraction system

7.3.3 Binary classification of parallel sentence identification

7.3.3.1 Training and testing

7.3.3.2 Features

7.3.4 Experiments

7.3.4.1 Data

7.3.4.2 Classification experiments

7.3.4.3 Extraction and translation experiments on quasi-comparable corpora

7.3.4.4 Extraction experiments on comparable corpora

7.3.5 Related work

7.3.6 Conclusion and future work

8 Appendices

8.1 Introduction

8.2 Tools for building a comparable corpus from the web

8.2.1 A workflow based corpora crawler

8.2.2 Focussed Monolingual Crawler (FMC)

8.2.3 Wikipedia retrieval tool

8.2.4 News information downloader using RSS feeds

8.2.5 News text crawler and RSS feed gatherer

8.2.6 News article alignment and downloading tool

8.3 Parallel data mining workflow

8.3.1 Tools to identify comparable documents and to extract parallel sentences and/or phrases from them

8.3.1.1 ComMetric: A toolkit for measuring comparability of comparable documents

8.3.1.2 DictMetric: A toolkit for measuring comparability of comparable documents

8.3.1.3 Features extractor and document pair classifier

8.3.1.4 EMACC: A textual unit aligner for comparable corpora using Expectation-Maximisation

8.3.2 Tools to extract parallel sentences and/or phrases from comparable documents

8.3.2.1 PEXACC: A parallel phrase extractor from comparable corpora

8.3.2.2 LEXACC: Fast parallel sentence mining from comparable corpora

8.4 The workflow for named entity and terminology extraction and mapping

8.4.1 Tools for named entity recognition

8.4.1.1 TildeNER

8.4.1.2 OpenNLP wrapper

8.4.1.3 NERA1: Named entity recognition for English and Romanian

8.4.2 Tools for terminology extraction

8.4.2.1 CollTerm – A tool for term extraction

8.4.2.2 Tilde’s wrapper system for CollTerm

8.4.2.3 KEA wrapper

8.4.2.4 Terminology extraction for English and Romanian

8.4.3 Tools for named entity and terminology mapping

8.4.3.1 Multi-lingual named entity and terminology mapper

8.4.3.2 NERA2: Language-independent named entity mapper

8.4.3.3 A language-independent terminology aligner

8.4.3.4 P2G: A tool to extract term candidates from aligned phrases

8.5 Sisyphos-II: MT-Evaluation tools

8.6 Conclusions and related information

Prof. Inguna Skadiņa has been working on language technologies for over 25 years. Her research interests are in machine translation, human-computer interaction, and language resources and tools for under-resourced languages. She has coordinated and participated in many national and international projects related to human language technologies, and has authored or co-authored more than 60 peer-reviewed research papers.

Bogdan Babych is an Associate Professor of Translation Studies at the University of Leeds, UK. He holds a PhD in machine translation and in Ukrainian linguistics. Dr. Babych was a coordinator of the EU FP7 Marie Curie project HyghTra, and received a Leverhulme Early Career Fellowship for his project Translation Strategies in Comparable Corpora. He previously worked as a computational linguist at L&H Speech Products, Belgium.

Robert Gaizauskas is a Professor of Computer Science and head of the Natural Language Processing group, Department of Computer Science, University of Sheffield, UK. His research interests are in computational semantics, information extraction, text summarization and machine translation. He holds a DPhil from the University of Sussex, UK (1992), and has published more than 150 papers in peer-reviewed journals and conference proceedings.

Nikola Ljubešić is an Assistant Professor at the Department of Information Science, University of Zagreb, Croatia, and researcher at the "Jožef Stefan" Institute in Ljubljana, Slovenia. His main research interests are in language technologies for South Slavic languages, linguistic processing of non-standard texts, author profiling and social media analytics.

Prof. Dan Tufiș, director of RACAI and full member of the Romanian Academy, has been active in computational and corpus linguistics for more than 30 years. His expertise is in tagging, word alignment, multilingual WSD, SMT, QA in open domains, lexical ontologies, language resource annotation and encoding. He has authored or co-authored more than 250 peer-reviewed papers, book chapters and books.

Andrejs Vasiļjevs is a co-founder and chairman of the board of Tilde, a leading European language technology and localization company. His expertise is in terminology management, machine translation and human-computer interaction. He initiated and coordinated the ACCURAT project as well as several other international research and innovation projects. He holds a PhD in computer science from the University of Latvia and a Dr.h. from the Latvian Academy of Sciences.

This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data that can be used for the machine translation task. It is divided into several sections, each covering a specific task such as building, processing, and using comparable corpora, focusing particularly on under-resourced language pairs and domains.

The book is intended for anyone interested in data-driven machine translation for under-resourced languages and domains, especially for developers of machine translation systems, computational linguists and language workers. It offers a valuable resource for specialists and students in natural language processing, machine translation, corpus linguistics and computer-assisted translation, and promotes the broader use of comparable corpora in natural language processing and computational linguistics.
