ISBN-13: 9786208225414 / Angielski / Miękka / 2024 / 204 str.
This study presents a novel approach to extract parallel data from a comparable English-Punjabi corpus, addressing the scarcity of parallel corpora for this language pair. Unlike previous research, this approach focuses on creating high-precision parallel data using minimal resources. The data is sourced from diverse domains, including Wikipedia articles, TDIL's noisy parallel sentences, and Gyan Nidhi reports. The methodology consists of three phases: extracting and aligning documents, translating Punjabi texts into English using OpenNMT-py, and calculating content similarity through three measures-Euclidean Distance, Cosine, and Jaccard. These algorithms are run individually, and then their results are integrated to improve accuracy. By combining the scores of all three measures, the system achieves a precision of 93% and an accuracy of 86%. This integrated approach significantly enhances parallel data extraction for English-Punjabi corpora and holds potential for improving Statistical Machine Translation (SMT) models.