This study presents a novel approach to extract parallel data from a comparable English-Punjabi corpus, addressing the scarcity of parallel corpora for this language pair. Unlike previous research, this approach focuses on creating high-precision parallel data using minimal resources. The data is sourced from diverse domains, including Wikipedia articles, TDIL's noisy parallel sentences, and Gyan Nidhi reports. The methodology consists of three phases: extracting and aligning documents, translating Punjabi texts into English using OpenNMT-py, and calculating content similarity through three...
This study presents a novel approach to extract parallel data from a comparable English-Punjabi corpus, addressing the scarcity of parallel corpora fo...