Design of regular expressions to analyze and extract information from PDF and Word files

Monday, February 11, 2019

Reading Comprehension with Freeling

In this entry we present how a reading comprehension application was built using Freeling. Freeling is an open source library for linguistic analysis. It analyzes the structure of a sentence and assigns each word a label that identifies what kind of word it is (verb, noun, pronoun, adjective, etc.). This feature is very useful when analyzing texts.

The application can read a text and answer questions about it. The texts used are in English. The application was written in Python.

Figure 1 shows a diagram of how the program operates:

Fig. 1: Operation of the reading comprehension application

1. Select the directory

Select the directory that contains the files you want to analyze. These files must be in .txt format, and the directory can contain any number of them. Each file must include the set of questions you want answered, placed at the end of the text.
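A minimal sketch of this loading step follows. The post does not say how the questions are told apart from the text, so the sketch assumes they are the trailing lines that end with a question mark; the function names `split_text_and_questions` and `load_directory` are hypothetical.

```python
from pathlib import Path


def split_text_and_questions(content):
    """Split file content into (body, questions).

    Assumption: the questions are the trailing block of non-empty
    lines that end with a question mark.
    """
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    questions = []
    # Walk backwards while the last line still looks like a question.
    while lines and lines[-1].endswith("?"):
        questions.insert(0, lines.pop())
    return " ".join(lines), questions


def load_directory(directory):
    """Return {file name: (body, questions)} for every .txt file found."""
    result = {}
    for path in sorted(Path(directory).glob("*.txt")):
        content = path.read_text(encoding="utf-8")
        result[path.name] = split_text_and_questions(content)
    return result
```

Calling `load_directory` on the selected folder gives one entry per .txt file, each with the text to analyze and the list of questions found at its end.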

2. Text analysis

The analysis is mainly based on determining which sentences in the text carry the greatest semantic load in relation to the question asked. Each candidate sentence is given a percentage, and the most probable answer is determined from it.
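The post does not give the exact scoring formula, so the following is only a sketch of one plausible approach: the percentage is the share of the question's content words that also appear in a sentence. The stop list, the naive sentence splitter and all function names are assumptions made for illustration.

```python
import re

# Hypothetical stop list so that function words do not inflate the score.
STOP_WORDS = {
    "what", "who", "where", "when", "why", "is", "are", "was", "were",
    "do", "does", "did", "the", "a", "an", "of", "in",
}


def tokenize(text):
    """Lowercase a string and return its alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_percentage(question, sentence):
    """Percentage of the question's content words found in the sentence."""
    question_words = set(tokenize(question)) - STOP_WORDS
    if not question_words:
        return 0.0
    sentence_words = set(tokenize(sentence))
    return 100.0 * len(question_words & sentence_words) / len(question_words)


def rank_sentences(question, text):
    """Return (percentage, sentence) pairs, highest percentage first."""
    scored = [(overlap_percentage(question, s), s) for s in split_sentences(text)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

For the question "Who is Christopher Robin?", a sentence containing both "Christopher" and "Robin" scores 100, while a sentence containing neither scores 0.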

To write the code it is important to take into account the grammatical structure of both the questions and the answers:

The questions begin with the so-called W words (what, who, where, when and why).
The sentences have a subject, a verb and a complement.
The question and all possible answers must be analyzed. An example is described below:

For the question: Who is Christopher Robin?

Freeling Analysis

Freeling tags the words as follows: the W-word "Who" as WP (interrogative pronoun); "is" as VBZ (verb, third person singular present); "Christopher Robin" as NP (proper noun); and "?" as Fit (question mark).

Once the application has analyzed the question word by word, it is easier to find the answers. The answer to this question should have the form NP + VBZ + C (complement). The application gives higher priority to the answers that meet this criterion.
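The NP + VBZ + C check can be sketched on sentences that have already been tagged. The example below does not call Freeling: the (word, tag) pairs are written by hand, treating every tag that starts with "F" as punctuation is an assumption generalized from the Fit tag, and the function names are hypothetical.

```python
def is_punctuation(tag):
    """Assumption: punctuation tags start with 'F' (as in 'Fit')."""
    return tag.startswith("F")


def matches_np_vbz_c(tagged_sentence, subject):
    """Return True if the sentence looks like NP + VBZ + complement.

    tagged_sentence is a list of (word, tag) pairs; subject is the
    proper noun taken from the question.
    """
    if len(tagged_sentence) < 3:
        return False
    (first_word, first_tag), (_, second_tag) = tagged_sentence[0], tagged_sentence[1]
    # The complement is any non-punctuation token after the verb.
    has_complement = any(not is_punctuation(tag) for _, tag in tagged_sentence[2:])
    return (
        first_tag == "NP"
        and first_word.lower() == subject.lower()
        and second_tag == "VBZ"
        and has_complement
    )


def prioritize(candidates, subject):
    """Sort (percentage, tagged_sentence) pairs: pattern matches first,
    then by percentage, both in descending order."""
    return sorted(
        candidates,
        key=lambda c: (matches_np_vbz_c(c[1], subject), c[0]),
        reverse=True,
    )
```

With this ordering, a sentence such as "Christopher Robin is a boy." is ranked above a sentence that shares more words with the question but does not follow the expected structure.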

3. Answers

Once the whole text has been analyzed, the window shows all possible answers to the given question. Each answer is shown with a percentage; the higher the percentage, the more likely the answer is correct.

It is important to mention that, because English has a large number of grammatical structures, the answer with the highest percentage will not always be the correct one.

Conclusion

Freeling is a very useful tool for text analysis, and it is easy to implement. To create any application it is important to have the right tools; however, it is even more important to know how to use them. For that, the functionality the application must provide has to be clear, so that it can be programmed in a logical and rational way.

The application meets its main objective of reading texts and answering the questions posed. There is some margin of error because the answers are based on the words each sentence shares with the question. However, the tests carried out show that, in general, the answers with the highest percentage are the correct ones.

Sunday, December 2, 2018

Integrative project: Regular Expressions

A good way to put regular expressions to use is to build an application around them. This post explains how to develop an application that reads files in pdf or docx format, then analyzes them and extracts the regular patterns and texts that meet a given criterion.

To develop the application we have to take into account the following criteria:

Fig. 1: Development criteria

Figure 2 shows a schematic diagram that explains how the proposal was developed.

Fig. 2: Schematic diagram

The programming language used was Java, together with several libraries. Figure 3 shows the final result of the app:

Fig. 3: Application

First, there is the Upload file button. Clicking it opens a window for finding a file, where you can choose a pdf or docx file (Figure 4). We used the “jxDocument” (pdf files) and “Apache POI” (docx files) libraries to read the files. Once a file is selected, the file appears in the left block and its text appears in the center one. You can select as many files as you want (Figure 5).

Fig. 4: Upload file

Fig. 5: Selected file

Another function we added is reading text aloud. Select the part of the text that you want to be read and then click the Read button (Figure 6). To achieve this, we used the "freetts" library.

Fig. 6: Read text

Another function of the application is FIND. With it you can search for different parameters such as e-mails, dates, names, decimal numbers and telephone numbers (Figure 7). For this we used regular expressions:

Fig. 7: Analyze parameters

· The regular expression for dates validates that the day goes from 01 to 31 and the month from 01 to 12, separated by dashes or slashes. The month can also be written in words, from “de enero del” or “de febrero del” through “de diciembre del”. Years are validated up to 2018 (Figure 8):

· The regular expression for e-mails: the first part can use any uppercase or lowercase letter, as well as hyphens, underscores or dots. Then comes the @ (at) sign. After it there may be several domain parts, each between 2 and 15 letters long. Finally comes the last domain part, which must have 2, 3 or 4 uppercase or lowercase letters:

· The regular expression for names: it starts with an uppercase letter between A and Z, which can also be a special letter such as an accented vowel. After the uppercase letter come lowercase letters between a and z or accented vowels. Finally, the name must have four or more letters:

· The regular expression for decimal numbers: the number may start with an optional minus sign. The integer part can be zero or begin with a digit between 1 and 9; it can also be omitted, so that the number starts directly with the minus sign followed by the decimal point. After the decimal point come digits between zero and nine, and the last digit is between one and nine:

· The regular expression for telephone numbers: the number can start after a colon or a space, or with a zero. Next comes a digit between one and nine, and the remaining digits can be between zero and nine. In total, the number has ten or eleven digits:
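The actual patterns are not reproduced in the text, so the following is only a sketch that follows the five descriptions above. It is written with Python's `re` module rather than the Java used by the application. Some details are assumptions: the lower bound of the year range, treating the colon or space before a telephone number as context rather than part of the match, and every constant name.

```python
import re

MONTH_NAMES = (
    "enero|febrero|marzo|abril|mayo|junio|julio|agosto|"
    "septiembre|octubre|noviembre|diciembre"
)
DAY = r"(?:0[1-9]|[12][0-9]|3[01])"        # 01 to 31
MONTH_NUM = r"(?:0[1-9]|1[0-2])"           # 01 to 12
YEAR = r"(?:1[0-9]{3}|200[0-9]|201[0-8])"  # 1000 to 2018 (lower bound assumed)

# Day, then either a numeric month between dashes/slashes or a month in words.
DATE = re.compile(
    rf"\b{DAY}(?:[-/]{MONTH_NUM}[-/]| de (?:{MONTH_NAMES}) del ){YEAR}\b"
)

# Letters, dots, underscores or hyphens; '@'; domain parts of 2-15 letters;
# a final part of 2-4 letters.
EMAIL = re.compile(r"[A-Za-z._-]+@(?:[A-Za-z]{2,15}\.)+[A-Za-z]{2,4}\b")

# One uppercase letter plus at least three lowercase ones (four or more in total).
NAME = re.compile(r"\b[A-ZÁÉÍÓÚ][a-záéíóú]{3,}\b")

# Optional minus, optional integer part, decimal point, last digit from 1 to 9.
DECIMAL = re.compile(r"-?(?:0|[1-9][0-9]*)?\.[0-9]*[1-9]")

# Ten or eleven digits, starting with 0 or 1-9, after a colon, a space
# or the start of the text.
PHONE = re.compile(
    r"(?:(?<=[: ])|^)(?:0[1-9][0-9]{8,9}|[1-9][0-9]{9,10})(?![0-9])"
)
```

For example, `DATE.findall("12/05/2018 and 03 de enero del 2017")` finds both dates, while "32/05/2018" and "12-05-2019" are rejected because the day and the year fall outside the validated ranges.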

You can select all the parameters that you want to analyze. The application searches the text for all the terms that match each regular expression and generates a report showing how many times they occur in the file (Figure 8).
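The counting step behind such a report could look like the sketch below. It is written in Python for illustration (the application itself is in Java), the two patterns are simplified stand-ins rather than the application's own, and `count_matches` and `format_report` are hypothetical names.

```python
import re

# Simplified stand-in patterns, only to demonstrate the counting step.
SAMPLE_PATTERNS = {
    "e-mails": re.compile(r"[A-Za-z._-]+@[A-Za-z]+\.[A-Za-z]{2,4}"),
    "decimal numbers": re.compile(r"-?[0-9]+\.[0-9]+"),
}


def count_matches(text, patterns):
    """Return {parameter name: number of matches found in text}."""
    return {label: len(pattern.findall(text)) for label, pattern in patterns.items()}


def format_report(counts):
    """Render the counts as one 'name: count' line per parameter."""
    return "\n".join(f"{label}: {count}" for label, count in counts.items())
```

Passing only the selected parameters in the `patterns` dictionary mirrors the choice made in the interface; the resulting counts are the values a chart like the one in the report would plot.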

Fig. 8: Selecting all the parameters

The report, in pdf format, is saved in this folder. For the pdf file, the chart shows that the text contains only names (Figure 9). For the docx file, it shows that the text contains dates, e-mails, decimal numbers, telephone numbers and names (Figure 10).

Fig. 9: PDF report

Fig. 10: Docx report

Finally, there is the Eliminate button: choose the file that you want to remove and then click the button (Figure 11).

Fig. 11: Eliminate button

The result is an application that analyzes documents in pdf or docx format according to the diverse parameters described through regular expressions. It also shows how other functions can be added, such as adding or removing files or reading text aloud.

The next video shows how the application works: