2024 HPCC Systems Summit: Using NLP++ to build a Brazilian address cleaner + Enhancing Legal...

Views: 159 | Uploaded: 9 months ago

HPCC Systems

About this video

This joint presentation includes two talks involving NLP++ with HPCC Systems.

Using NLP++ to build a Brazilian address cleaner in HPCC Systems - Guilherme da Silva, LexisNexis Risk Solutions

NLP++ is a new programming language designed specifically for building deep text parsers. This talk uses it to build a Brazilian address analyzer and cleaner that improves on the current cleaning process. The approach is transparent, which makes problems easy to identify and correct, and it shows great potential for future use in production.

---------------
Enhancing Legal Assistance through Data Enrichment with HPCC Systems - Nihar Mandahas, Skanda P R, Manvith L B, Pratheek Rao MP, Arya Hariharan & Dr. Jyoti Shetty, RV College of Engineering

Lawyers and legal professionals often utilize digital databases to carry out research to build strong cases, provide accurate legal advice and stay informed about legal changes. However, the vast amount of data can be overwhelming, requiring strong research skills to find relevant references. To address this issue, we propose an application that enhances legal research through data enrichment using HPCC Systems.

The application uses Natural Language Processing (NLP), specifically Named Entity Recognition (NER), to extract keywords from a case abstract entered by the user. It can recognize up to 12 different legal entities in any given abstract. These keywords are then used to search for case statements in a main cases dataset. The extracted keywords and their corresponding case statements are stored in a separate dataset containing 1,200,000 keywords. This dataset was sprayed onto the HPCC Systems cluster, where it was indexed for efficient reference searches. A full keyed join between the keywords and the main indexed dataset retrieves relevant references with one or more matches. The distributed architecture of HPCC Systems enables parallelism, which improves search efficiency. Users interact with the application through a web interface, and after search and retrieval, the relevant references are displayed through the same interface.
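The lookup step described above can be illustrated with a minimal Python sketch. This is not the authors' ECL implementation: it assumes the keywords have already been extracted (the NER step is not shown), and it uses an in-memory dictionary as a stand-in for the indexed keyword dataset and keyed join on the cluster. The names CASES, KEYWORD_ROWS, build_index, and find_references, along with the toy data, are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data standing in for the main cases dataset.
CASES = {
    101: "Tenant dispute over unpaid rent",
    102: "Breach of contract in a supply agreement",
    103: "Contract dispute between landlord and tenant",
}

# Hypothetical (keyword, case_id) rows standing in for the keyword dataset.
KEYWORD_ROWS = [
    ("tenant", 101), ("rent", 101),
    ("contract", 102), ("breach", 102),
    ("contract", 103), ("tenant", 103), ("landlord", 103),
]


def build_index(rows):
    """Map each lowercased keyword to the set of case ids that contain it."""
    index = defaultdict(set)
    for keyword, case_id in rows:
        index[keyword.lower()].add(case_id)
    return index


def find_references(index, cases, keywords):
    """Return (case_id, match_count, statement) for cases with at least one match.

    Results are ordered by descending match count, then by case id.
    """
    hits = defaultdict(int)
    # Deduplicate the query keywords so a repeated keyword is counted once.
    for keyword in {k.lower() for k in keywords}:
        for case_id in index.get(keyword, ()):
            hits[case_id] += 1
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [(case_id, count, cases[case_id]) for case_id, count in ranked]


if __name__ == "__main__":
    index = build_index(KEYWORD_ROWS)
    for case_id, count, statement in find_references(index, CASES, ["Tenant", "contract"]):
        print(case_id, count, statement)
```

In this sketch the dictionary lookup plays the role of the index read, and counting matches per case mirrors retrieving references "with one or more matches". On a real cluster the index would be distributed across nodes rather than held in one process.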

The application performs better than comparable implementations built on Hadoop and on multithreaded Python. When tested on the same dataset using a single-node local cluster, the HPCC Systems application had an average latency of 1.7 seconds, while Hadoop had an average latency of 6 seconds. A multithreaded Python application was significantly slower, averaging 13 seconds to search a dataset of 300 records using five keywords.
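To put the reported latencies in relative terms, the short Python sketch below computes the ratio of each baseline to the HPCC Systems figure. The helper name speedup is hypothetical. The Python figure is reported for a 300-record dataset searched with five keywords, so its ratio is only indicative and may not reflect an identical setup.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


HPCC_LATENCY = 1.7     # seconds, average, as reported in the description
HADOOP_LATENCY = 6.0   # seconds, average, same dataset, single-node cluster
PYTHON_LATENCY = 13.0  # seconds, average, 300 records with five keywords

if __name__ == "__main__":
    print(f"vs Hadoop: {speedup(HADOOP_LATENCY, HPCC_LATENCY):.2f}x")
    print(f"vs Python: {speedup(PYTHON_LATENCY, HPCC_LATENCY):.2f}x")
```

By these figures, the HPCC Systems application is roughly 3.5 times faster than Hadoop and roughly 7.6 times faster than the multithreaded Python application.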

The proposed application enhances legal research efficiency and accuracy by using NLP for keyword extraction and leveraging HPCC Systems for rapid data retrieval, ensuring quick and relevant reference searches. Among those who will benefit from this application are lawyers who wish to simplify the task of finding relevant legal references, academics and law students who usually conduct extensive research, and legal organizations overall.

© 2024 LexisNexis Risk Solutions
