
UIMA Colloquium: DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse

Speaker(s): Richard Eckart de Castilho, Iryna Gurevych
Event type: Talk
Level: Advanced
Date: Thursday 9 July 2009
Time: 10:50
Duration: 20 minutes
Language: English
Location: Room D202 - Ireste

User-generated discourse from Web 2.0 poses particular challenges to natural language processing (NLP) because it is noisy and error-prone. A data cleansing step that precedes the analysis steps in an NLP pipeline can reduce these problems. While recent efforts provide general-purpose collections of UIMA-based analysis components, data cleansing does not yet seem to be covered.

The five-stage data cleansing approach proposed here offers maximum flexibility in identifying problematic artifacts, deciding how to deal with them, and analysing the cleansed data. At the same time, it allowed us to create reusable UIMA-based components for the actual data cleansing and for mapping annotations created on the cleansed data back to the original representation. These components are released as part of the Darmstadt Knowledge Processing Software Repository (DKPro) under the name DKPro-UGD.
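
The idea of mapping annotations from cleansed text back to the original can be illustrated with a minimal Python sketch. It is not the DKPro-UGD implementation and does not use the UIMA API; the function names `cleanse` and `map_back`, the regular expression, and the sample sentence are all hypothetical. The sketch assumes that cleansing consists of regex-based replacement and that an annotation is a half-open character span.

```python
import re
from typing import List, Tuple


def cleanse(text: str, pattern: str, replacement: str = "") -> Tuple[str, List[int], List[int]]:
    """Replace every match of `pattern` and record an alignment.

    Returns the cleansed text plus two lists of equal length:
    starts[i] / ends[i] give the span in the ORIGINAL text that
    character i of the cleansed text corresponds to.
    """
    cleaned: List[str] = []
    starts: List[int] = []
    ends: List[int] = []
    pos = 0
    for match in re.finditer(pattern, text):
        # Characters before the match are copied unchanged.
        for i in range(pos, match.start()):
            cleaned.append(text[i])
            starts.append(i)
            ends.append(i + 1)
        # Every replacement character points at the whole matched span.
        for ch in replacement:
            cleaned.append(ch)
            starts.append(match.start())
            ends.append(match.end())
        pos = match.end()
    # Copy the tail after the last match.
    for i in range(pos, len(text)):
        cleaned.append(text[i])
        starts.append(i)
        ends.append(i + 1)
    return "".join(cleaned), starts, ends


def map_back(begin: int, end: int, starts: List[int], ends: List[int]) -> Tuple[int, int]:
    """Map a non-empty half-open span [begin, end) on the cleansed text
    to the corresponding span on the original text."""
    if not 0 <= begin < end <= len(starts):
        raise ValueError("span must be non-empty and inside the cleansed text")
    return starts[begin], ends[end - 1]


# Hypothetical example: strip simple markup tags before analysis.
original = "I like <b>UIMA</b> a lot"
cleaned, starts, ends = cleanse(original, r"</?\w+>")

# Pretend an analysis component annotated "UIMA" on the cleansed text.
begin = cleaned.index("UIMA")
orig_begin, orig_end = map_back(begin, begin + len("UIMA"), starts, ends)
```

Keeping a start and an end offset per cleansed character, rather than a single offset, lets a mapped span stop before any removed material that directly follows it in the original text.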

Submitted paper (PDF, 90.8 KB)
Presentation slides (PDF, 801.2 KB)