NeuralMind, a Unicamp spin-off located in the Scientific and Technological Park, works with artificial intelligence and has just released, free of charge and for the first time, a version of a Google algorithm trained for Brazilian Portuguese.
The algorithm in question is the platform's open-source Bidirectional Encoder Representations from Transformers (BERT), which was released in December and aims to make searches more accurate through natural language processing. In other words, with this new processing the tool better understands what users are looking for when they type their keywords.
According to Google, 15% of the searches carried out on its platform each day are new, which justifies developing the algorithm to offer more accurate results. This is just one of the applications of the code in artificial intelligence, explains Roberto Lotufo, professor at the Faculty of Electrical and Computer Engineering (FEEC) at Unicamp and technical director of NeuralMind.
“Google’s more accurate search is just one of the many applications of BERT. At NeuralMind, for example, we use BERT in other natural language processing tasks such as data extraction, whether of people's names, addresses, institutions or dates”, explains Lotufo.
Despite the benefits of the code, Google distributed the algorithm trained only in English, in Mandarin and in a multilingual version, a generic model used for languages not otherwise covered. Because the generic version is not as effective as a model trained on a specific language, several organizations around the world decided to train the tool in their own languages.
“In Brazil, we trained a Portuguese BERT, as it delivers better results than BERT-multilingual. The algorithm is now available free of charge to spread the technology in Brazil and other Portuguese-speaking countries, which can contribute to the advancement of research and to the development of products in this area, such as chatbots”, says the professor of the feat, a first in the country.
For training, the spin-off needed a large body of text in Brazilian Portuguese and used the freely available corpus Brazilian Web as Corpus (BrWaC). Lotufo recalls that the training “was a Herculean effort, involving several days of Google Cloud machines, in addition to several weeks of data preparation”, but one that produced positive results.
Companies and developers who wish to adopt the solution can now access it on NeuralMind's page on GitHub, the source code hosting platform used by the spin-off.
More information is available on the company's website.