Sunday, December 2, 2018

Integrative project: Regular Expressions


A practical way to demonstrate the usefulness of regular expressions is to build an application around them. This post explains how to develop an application that reads files in PDF or DOCX format, then analyzes the text and extracts the patterns that meet a given criterion.

To develop the application, we took the following criteria into account:
Fig. 1: Development criteria

Figure 2 shows a schematic diagram that explains how the proposal was developed.
Fig. 2: Schematic diagram

The programming language used was Java, together with several libraries. Figure 3 shows the final result of the app:

Fig. 3: Application

First, there is the Upload File button. When you click it, a file chooser window opens, where you can pick a PDF or DOCX file (Figure 4). We used the “jxDocument” library to read PDF files and the “Apache POI” library to read DOCX files. Once a file is selected, the file appears in the left panel and its text appears in the center panel. You can select as many files as you want (Figure 5).
Fig. 4: Upload file

Fig. 5: Selected file
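
Since each format is read with a different library, the application has to decide which reader to use for the chosen file. The post does not show how this is done, so the following is only a minimal sketch that assumes the decision is made from the file extension. The class name `FileTypeSketch`, the enum `Kind` and the method `kindOf` are hypothetical names used for illustration, and the actual calls to the PDF and DOCX libraries are left out.

```java
import java.util.Locale;

class FileTypeSketch {
    // Hypothetical categories: one per reader library mentioned in the post.
    enum Kind { PDF, DOCX, UNSUPPORTED }

    // Assumption: the file type is inferred from the extension only.
    static Kind kindOf(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return Kind.PDF;   // would be handed to the PDF reader
        }
        if (lower.endsWith(".docx")) {
            return Kind.DOCX;  // would be handed to the DOCX reader
        }
        return Kind.UNSUPPORTED;
    }

    public static void main(String[] args) {
        System.out.println(kindOf("Report.PDF"));  // PDF
        System.out.println(kindOf("thesis.docx")); // DOCX
        System.out.println(kindOf("notes.txt"));   // UNSUPPORTED
    }
}
```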

Another function we added is reading text aloud. Select the part of the text that you want to hear and then click the Read button (Figure 6). To achieve this, we used the "FreeTTS" library.

Fig. 6: Read text

Another function of the application is FIND. With it, you can search for different kinds of items: e-mails, dates, names, decimal numbers and telephone numbers (Figure 7). For this we used regular expressions:
Fig. 7: Analyze parameters


·      The regular expression for dates validates that the day goes from 01 to 31 and the month from 01 to 12. Day, month and year can be separated by dashes or slashes. The month can also be written in words, from “de enero del” or “de febrero del” through “de diciembre del”. Years are accepted only up to 2018 (Figure 8):



·      The regular expression for e-mails: the part before the @ (at) sign can use any uppercase or lowercase letter, as well as hyphens, underscores or dots. After the @ there may be one or more domain labels, each between 2 and 15 letters long. Finally comes the top-level domain, which must have 2, 3 or 4 uppercase or lowercase letters:


·      The regular expression for names: a name starts with an uppercase letter between A and Z, which may also be a special letter such as an accented vowel. It continues with lowercase letters between a and z or accented vowels. In total, the name must have four or more letters:


·      The regular expression for decimal numbers: the negative sign is optional. The integer part can be zero or start with a digit between 1 and 9. The number can also start directly with the minus sign (-) followed by the decimal point. After the decimal point come digits between zero and nine, ending with a digit between one and nine:


·      The regular expression for telephone numbers: the number can start with a colon, a space or a zero. Next comes a digit between one and nine, and the remaining digits can be between zero and nine. In total, the number has ten or eleven digits:
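
The original expressions are not reproduced in the text, so the sketch below rebuilds them from the descriptions above using `java.util.regex`. It is one possible reading, not the application's actual code. The class name `PatternSketch`, the lower bound of 1000 for years, the single spaces around the month name, and the choice to count the leading colon, space or zero within the ten or eleven characters of a telephone number are all assumptions.

```java
import java.util.regex.Pattern;

class PatternSketch {
    // Day 01-31 and month 01-12, as described in the text.
    static final String DAY = "(?:0[1-9]|[12][0-9]|3[01])";
    static final String MONTH_NUMBER = "(?:0[1-9]|1[0-2])";
    static final String MONTH_NAME =
        "(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto"
        + "|septiembre|octubre|noviembre|diciembre)";
    // Assumption: four-digit years from 1000 up to 2018.
    static final String YEAR = "(?:1[0-9]{3}|200[0-9]|201[0-8])";

    // Either "dd/mm/yyyy" (dashes or slashes) or "dd de <month> del yyyy".
    static final Pattern DATE = Pattern.compile(
        DAY + "(?:[-/]" + MONTH_NUMBER + "[-/]| de " + MONTH_NAME + " del )" + YEAR);

    // Letters, dots, underscores or hyphens, then @, then one or more
    // labels of 2-15 letters, then a final label of 2-4 letters.
    static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z._-]+@(?:[A-Za-z]{2,15}\\.)+[A-Za-z]{2,4}");

    // One uppercase letter (or accented uppercase vowel) followed by at
    // least three lowercase letters (or accented lowercase vowels).
    static final Pattern NAME = Pattern.compile(
        "[A-Z\u00C1\u00C9\u00CD\u00D3\u00DA][a-z\u00E1\u00E9\u00ED\u00F3\u00FA]{3,}");

    // Optional minus, optional integer part, decimal point, and a
    // fractional part whose last digit is between 1 and 9.
    static final Pattern DECIMAL = Pattern.compile(
        "-?(?:0|[1-9][0-9]*)?\\.[0-9]*[1-9]");

    // Assumption: leading colon, space or zero, one digit 1-9, then 8 or 9
    // more digits, for ten or eleven characters in total.
    static final Pattern PHONE = Pattern.compile("[: 0][1-9][0-9]{8,9}");

    // True only when the whole string matches the pattern.
    static boolean matches(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(DATE, "31/12/2018"));           // true
        System.out.println(matches(DATE, "15 de enero del 2017")); // true
        System.out.println(matches(DATE, "01/01/2019"));           // false
        System.out.println(matches(EMAIL, "user@uni.edu.ec"));     // true
        System.out.println(matches(NAME, "Ana"));                  // false
        System.out.println(matches(DECIMAL, "-.25"));              // true
        System.out.println(matches(PHONE, "0991234567"));          // true
    }
}
```

Following the description literally, the e-mail pattern accepts no digits before the @ sign, and the decimal pattern rejects numbers that end in zero, such as 1.50.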


You can select every type of item that you want to analyze. The application searches the text for all the terms that match each regular expression and generates a report showing how many times those terms occur in the file (Figure 8).

Fig. 8: Selecting all the parameters
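
The counting step behind the report can be sketched with `Matcher.find()`, which scans the text for each successive match. This is a minimal illustration, not the application's code: the class `MatchReportSketch`, its method names, and the two sample patterns (written from the descriptions in this post) are assumptions, and generating the PDF chart is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchReportSketch {
    // Hypothetical registry of item types; only two are shown here.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("e-mails",
            Pattern.compile("[A-Za-z._-]+@(?:[A-Za-z]{2,15}\\.)+[A-Za-z]{2,4}"));
        PATTERNS.put("decimal numbers",
            Pattern.compile("-?(?:0|[1-9][0-9]*)?\\.[0-9]*[1-9]"));
    }

    // Counts the non-overlapping matches of one pattern in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Builds the report data: item type mapped to number of occurrences.
    static Map<String, Integer> buildReport(String text) {
        Map<String, Integer> report = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            report.put(entry.getKey(), countMatches(entry.getValue(), text));
        }
        return report;
    }

    public static void main(String[] args) {
        String text = "Write to ana@mail.com or luis@uni.edu.ec before 3.5 days pass.";
        System.out.println(buildReport(text)); // {e-mails=2, decimal numbers=1}
    }
}
```

A `LinkedHashMap` keeps the item types in insertion order, so the report lists them in the same order every time.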


The report is saved in PDF format in this folder. For the PDF file, the chart shows that the text contains only names (Figure 9). For the DOCX file, we can see that it contains dates, e-mails, decimal numbers, telephone numbers and names (Figure 10).

Fig. 9: PDF report

Fig. 10: Docx report 

Finally, there is the Eliminate button: choose the file that you want to remove and then click the button (Figure 11).

Fig. 11: Eliminate button


In this way, we obtain an application that analyzes documents in PDF or DOCX format according to the diverse parameters described through regular expressions. It also shows how other functions can be added, such as adding or removing files or reading text aloud.

The following video shows how the application works:
