Lab: Inferring the language of intranet pages
From zero to hero
The other day a colleague asked me if I had a script to detect the language of a website: he wanted to check the language each intranet page was written in, to verify that there was a Catalan version of all of them.
Unfortunately, I had never done exactly this before (although I have played with similar needs). So, before answering him, I said to myself: if I can manage to have something working in less than 2 hours, even "from scratch", it would be almost the equivalent of answering "yes, look, I have this".
Challenge accepted! And with it, a new Lab to demonstrate the technological moment we are in, where very useful things can be built in very little time; things that in the pre-LLM era you simply couldn't invest time in, because it was too expensive.
A little context
The website to review is a Google Sites site, a kind of CMS from Google that is very, very user-friendly but has serious shortcomings of all kinds. Above all, it has no API or any other friendly way to consume its content (that I know of, and the product is more than 5 years old by now; very Google-style...).
I also have to deal with Google Login, which is quite "hostile", and since Sites has no API, all the machine-to-machine literature on authenticating with Google credentials is useless here. Initially I thought that sending requests with the session cookies from another browser where I was already logged in would be enough, but no.
What we will do
Bypass Google login
Download the HTML of all intranet pages
Clean up the foul HTML that Google Sites generates (Microsoft Word-level bloat) and save it to .txt files
Use a text language-detection library (no LLMs needed; this already exists)
And finally, a bit of manual work to round it off, because automating it would take longer.
Let's begin
I ask ChatGPT o1 to make me a Python script that uses Playwright to open the intranet home page and click through the Google login screens.
No sooner said than done: I soon have a "bot" that gets past the Google login (after confirming that plain requests wouldn't get me anywhere). Now it's time to do some scraping.
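A minimal sketch of what that login bot looks like. The intranet URL, the environment variable names and the selectors are my assumptions (Google changes its login markup often), so take it as a starting point rather than a recipe:

```python
# Minimal login-bot sketch. INTRANET_URL, the env var names and the
# selectors are assumptions; Google's login markup changes often.
import os
from playwright.sync_api import sync_playwright

INTRANET_URL = "https://sites.google.com/your-org/intranet"  # hypothetical

with sync_playwright() as p:
    # Headful Chromium: Google is far more suspicious of headless browsers.
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(INTRANET_URL)  # redirects to accounts.google.com

    # Email screen, then password screen.
    page.fill('input[type="email"]', os.environ["GOOGLE_USER"])
    page.click("text=Next")
    page.fill('input[type="password"]', os.environ["GOOGLE_PASS"])
    page.click("text=Next")

    # Once we are redirected back to the intranet, the session is live.
    page.wait_for_url(f"{INTRANET_URL}**")
```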
Back to o1, I ask it to expand the functionality to download all the pages sequentially, making sure it doesn't follow links outside the intranet. In a couple of messages I have a version of the script I like; I run it, and bingo! The "bot" is clicking everywhere and downloading the HTML files.
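The crawl itself is conceptually simple: a breadth-first walk over same-site links, one page at a time, saving the rendered HTML of each. A hedged sketch, reusing the authenticated `page` from the login sketch above (the output directory and the details are assumptions):

```python
# Crawl sketch: breadth-first over in-site links, saving each page's
# rendered HTML. Assumes `page` is the logged-in Playwright page above.
from pathlib import Path
from urllib.parse import urldefrag, urlparse

OUT_DIR = Path("html")  # hypothetical output directory
OUT_DIR.mkdir(exist_ok=True)

site_host = urlparse(INTRANET_URL).netloc
queue, seen = [INTRANET_URL], {INTRANET_URL}

while queue:
    url = queue.pop(0)
    page.goto(url)
    page.wait_for_load_state("networkidle")

    # Save the rendered HTML under a filesystem-safe name.
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    (OUT_DIR / f"{name}.html").write_text(page.content(), encoding="utf-8")

    # Queue in-site links only; drop fragments and external domains.
    for href in page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)"):
        link, _ = urldefrag(href)
        if urlparse(link).netloc == site_host and link not in seen:
            seen.add(link)
            queue.append(link)
```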
Here I want to pause. This way of "downloading" a website is very inefficient: after all, you have a running Chromium opening all the pages, one by one, one after the other. But the website didn't have that many pages (about 350, a tediously long wait), and it simply worked: get past the login, grab the pages, nobody is in a hurry. And for when we are in a hurry, it's not hard to bring the times down. There are a thousand scraping tools out there, but I already have some experience, and getting past the Google login is not trivial.
While the "bot" was running, running, I wasn't fiddling around looking, I knew what the next steps were and I kept asking GePeTo to have them ready.
A Python script to loop through all the HTML files in a directory, clean up the HTML in each one, and save a .txt equivalent in another destination. It succeeded at the first prompt, using BeautifulSoup4.
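Something along these lines, with hypothetical directory names; the idea is simply to drop the markup and keep the visible text:

```python
# Cleanup-loop sketch; the SRC/DST directory names are assumptions.
from pathlib import Path
from bs4 import BeautifulSoup

SRC, DST = Path("html"), Path("txt")
DST.mkdir(exist_ok=True)

for html_file in SRC.glob("*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-content blocks
    text = soup.get_text(separator="\n", strip=True)
    (DST / f"{html_file.stem}.txt").write_text(text, encoding="utf-8")
```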
A script to read all the .txt files in a directory and generate a resultat_idiomes.md file with a Markdown table of detected language and file name. First-time success.
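A sketch of what that script could look like. I'm guessing at langdetect as the library, since I didn't note down the exact one:

```python
# Detection-step sketch; langdetect is an assumption, not necessarily
# the exact library the generated script used.
from pathlib import Path
from langdetect import detect, LangDetectException

rows = ["| File | Language |", "| --- | --- |"]
for txt_file in sorted(Path("txt").glob("*.txt")):
    try:
        lang = detect(txt_file.read_text(encoding="utf-8"))  # e.g. "ca", "es", "en"
    except LangDetectException:  # empty or too-short text
        lang = "??"
    rows.append(f"| {txt_file.name} | {lang} |")

Path("resultat_idiomes.md").write_text("\n".join(rows) + "\n", encoding="utf-8")
```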
And that's it, I already had what they asked me for. Now, you know, handing a Markdown file to a manager is not "polite" ;-) so let's "Excelcify" it. There are probably tools for that too, but GePeTo nails these things.
A Python script to convert a Markdown table looking like that one to .csv. First-time success.
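For the curious, the conversion really is just a few lines; a sketch, carrying over the file names from above:

```python
# Markdown-table-to-CSV sketch; file names carried over from the
# previous sketches.
import csv
from pathlib import Path

lines = Path("resultat_idiomes.md").read_text(encoding="utf-8").splitlines()
with open("resultat_idiomes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for line in lines:
        if not line.strip() or set(line) <= set("|-: "):
            continue  # skip blanks and the |---|---| separator row
        writer.writerow(c.strip() for c in line.strip().strip("|").split("|"))
```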
I convert the .csv to XLS, upload the XLS to Google Drive to turn it into a Google Sheet (with the convenience of sharing it), apply a few formatting tweaks to the table in GSheets, and it's ready to consume.
Conclusions
Total time invested? Less than 2 hours. What took longer? The automatic, unassisted download of the pages (it took more than 2 hours, but by then I was already on to something else).
Now, to be fair, a lot of my productivity here comes from knowing which tools to use and, watch out!, from putting some creativity into it. There are probably better tools out there, but we work with what we know.
I had already solved Google login problems in the past, and I spent hours messing around until I came to the conclusion that a “bot” with Playwright is worth more than a thousand magical scraping tools.
We also need to know the capabilities of our tools. LLMs can do countless things, and one of the things they are best at is manipulating text. If you can reduce your problem, or part of it, to manipulating text, you are well on your way.
Learn things, try things, play with things... Sooner or later, this experience will come back to help you do more and better.