A good way to put regular expressions to practical use is to build an application around them. This post explains how to develop an application that reads files in PDF or DOCX format, analyzes their text, and extracts the patterns and passages that meet a given criterion.
To develop the application, we took the following criteria into account:
Fig. 1: Development criteria
Figure 2 shows a schematic diagram that explains how the proposal was developed.
Fig. 2: Schematic diagram
The programming language used was Java, together with several libraries. Figure 3 shows the final result of the app:
Fig. 3: Application
First, there is the Upload file button. When you click it, a file chooser window opens, where you can choose a PDF or DOCX file (Figure 4). We used the “jxDocument” library to read PDF files and “Apache POI” to read DOCX files. Once a file is selected, its name appears in the left block and its text appears in the center block. You can select as many files as you want (Figure 5).
Fig. 4: Upload file
Fig. 5: Selected file
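As a rough illustration of the upload step, the sketch below restricts a file chooser to PDF and DOCX files and allows several files to be picked at once. It assumes a Swing user interface, which the post does not state; the class name `UploadFilterSketch` and the method `chooseDocuments` are hypothetical and not taken from the application.

```java
import java.io.File;
import javax.swing.JFileChooser;
import javax.swing.filechooser.FileNameExtensionFilter;

class UploadFilterSketch {
    // Accepts files whose extension is pdf or docx (the comparison ignores case).
    // Note that this filter also accepts directories so the user can browse into them.
    static final FileNameExtensionFilter DOCUMENTS =
            new FileNameExtensionFilter("PDF and DOCX documents", "pdf", "docx");

    // Opens the chooser and returns the selected files, or an empty array
    // if the dialog was cancelled. Needs a graphical environment to run.
    static File[] chooseDocuments() {
        JFileChooser chooser = new JFileChooser();
        chooser.setFileFilter(DOCUMENTS);
        chooser.setMultiSelectionEnabled(true); // "as many files as you want"
        int result = chooser.showOpenDialog(null);
        if (result == JFileChooser.APPROVE_OPTION) {
            return chooser.getSelectedFiles();
        }
        return new File[0];
    }

    public static void main(String[] args) {
        // The filter itself can be checked without opening any window.
        System.out.println(DOCUMENTS.accept(new File("report.pdf")));
        System.out.println(DOCUMENTS.accept(new File("notes.txt")));
    }
}
```

Once a file is accepted, its extension can decide which reading library handles it.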
Another function that we added is reading text aloud. Select the part of the text that you want to be read and then click the Read button (Figure 6). To achieve this, we used the "FreeTTS" library.
Fig. 6: Read text
Another function of the application is FIND. With it, you can find different parameters such as e-mails, dates, names, decimal numbers and telephone numbers (Figure 7). For this we used regular expressions:
Fig. 7: Analyze parameters
· The regular expression for dates: it validates that the day goes from 01 to 31 and the month from 01 to 12, separated by dashes or slashes. The month can also be written out in words, from “de enero del” and “de febrero del” through “de diciembre del”. Years are validated up to 2018 (Figure 8):
· The regular expression for e-mails: the first part can use any uppercase or lowercase letter, as well as hyphens, underscores or dots. Then comes the @ (at) sign. After it there can be one or more domain parts of between 2 and 15 letters each. Finally comes the last domain, which must have 2, 3 or 4 uppercase or lowercase letters:
· The regular expression for names: it starts with an uppercase letter between A and Z, which can also be a special letter such as an accented vowel. It is followed by lowercase letters between a and z or accented vowels. In total, the name must have four or more letters:
· The regular expression for decimal numbers: the number may or may not start with the minus sign. The integer part can be zero or begin with a digit between 1 and 9. The number can also begin directly with the – (minus) followed by the decimal point. After the decimal point come digits between zero and nine, and the last digit must be between one and nine:
· The regular expression for telephone numbers: the number can start with a colon, a space or a zero. Then comes a digit between one and nine. The remaining digits can be between zero and nine. In total, the number has between ten and eleven digits:
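The descriptions in this list could be written with `java.util.regex` roughly as follows. This is only a sketch reconstructed from the prose, not the application's actual expressions. The class and constant names are hypothetical. The lower year bound of 1900 is an assumption, and the telephone pattern assumes that the leading colon, space or zero counts toward the total of ten to eleven characters.

```java
import java.util.regex.Pattern;

class PatternSketch {
    // Accented vowels are written as Unicode escapes so the source stays ASCII-only.
    static final String UPPER = "A-Z\u00C1\u00C9\u00CD\u00D3\u00DA";
    static final String LOWER = "a-z\u00E1\u00E9\u00ED\u00F3\u00FA";

    static final String MONTHS = "enero|febrero|marzo|abril|mayo|junio|julio|agosto"
            + "|septiembre|octubre|noviembre|diciembre";

    // Day 01-31, then either a numeric month 01-12 between dashes or slashes,
    // or a month in words (" de enero del "). Years 1900-2018 (lower bound assumed).
    static final Pattern DATE = Pattern.compile(
            "(0[1-9]|[12][0-9]|3[01])"
            + "([-/](0[1-9]|1[0-2])[-/]| de (" + MONTHS + ") del )"
            + "(19[0-9]{2}|200[0-9]|201[0-8])");

    // Letters, dots, underscores or hyphens, then @, then one or more
    // domain parts of 2-15 letters, then a final part of 2-4 letters.
    static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z._-]+@([A-Za-z]{2,15}\\.)+[A-Za-z]{2,4}");

    // One uppercase letter followed by at least three lowercase letters.
    static final Pattern NAME = Pattern.compile(
            "[" + UPPER + "][" + LOWER + "]{3,}");

    // Optional minus, optional integer part (0 or no leading zero),
    // decimal point, then digits ending in a non-zero digit.
    static final Pattern DECIMAL = Pattern.compile(
            "-?(0|[1-9][0-9]*)?\\.[0-9]*[1-9]");

    // Colon, space or zero, then a digit 1-9, then 8 or 9 more digits.
    static final Pattern PHONE = Pattern.compile("[: 0][1-9][0-9]{8,9}");

    // True when the whole candidate string matches the pattern.
    static boolean isValid(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(DATE, "15 de enero del 2010"));
        System.out.println(isValid(EMAIL, "ana.perez@mail.example.com"));
        System.out.println(isValid(DECIMAL, "-0.25"));
    }
}
```

Here `matches()` checks a whole string. To scan a document for occurrences, the same patterns would be used with `Matcher.find()` instead.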
You can select all the parameters that you want to analyze. The application searches the text for every term that matches the regular expressions and generates a report that shows how many times each kind of term appears in the file (Figure 8).
Fig. 8: Selecting all the parameters
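The counting step behind such a report can be sketched as a loop over `Matcher.find()` for each selected parameter. The code below is an illustration under assumptions, not the application's code. The class `MatchCounter`, its methods and the sample text are hypothetical, and the three patterns are simplified stand-ins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCounter {
    // Counts the non-overlapping matches of each pattern in the text.
    // The result keeps the order in which the patterns were supplied.
    static Map<String, Integer> countMatches(String text, Map<String, Pattern> patterns) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            counts.put(entry.getKey(), count);
        }
        return counts;
    }

    // Simplified stand-in patterns, used only for this illustration.
    static Map<String, Pattern> samplePatterns() {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("emails", Pattern.compile("[A-Za-z._-]+@([A-Za-z]{2,15}\\.)+[A-Za-z]{2,4}"));
        patterns.put("decimals", Pattern.compile("-?[0-9]+\\.[0-9]+"));
        patterns.put("dates", Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
        return patterns;
    }

    public static void main(String[] args) {
        String text = "Write to ana@mail.com or luis@uni.edu.ec; prices were 3.5 and -0.25.";
        Map<String, Integer> counts = countMatches(text, samplePatterns());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

A map from parameter name to count is a convenient shape for this step because it can feed a chart or a table in the report directly.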
The report is saved in PDF format in this folder. For the PDF file, the chart shows that the text contains only names (Figure 9). For the DOCX file, we can see that it contains dates, e-mails, decimal numbers, telephone numbers and names (Figure 10).
Fig. 9: PDF report
Fig. 10: Docx report
Finally, there is the Eliminate button: choose the file that you want to remove and then click the button (Figure 11).
Fig. 11: Eliminate button
In this way, we obtain an application that analyzes documents in PDF or DOCX format according to the diverse parameters described through regular expressions. It also shows how other functions can be added, such as adding or removing files or reading text aloud.
The next video shows how the application works: