Python Simple Web Scraper

broken image


AUTOMATE BORING STUFF

Basic web scraping in python

Automating annoying stuff is essentially what developers do for themselves and others everyday. An example of annoying stuff for me was to check on the web the meaning of words while I was reading a slightly difficult book. It was really frustrating to browse the web every time. Then I had the idea: create a Telegram bot that sending a word sends back a message with its pronunciation and meaning.

I started by looking for a dictionary REST API with the purpose of reusing a simple Node.js project that I have developed one year ago but I was not able to find any free dictionary API service. With this in mind, I understood that web scraping was the right way.

WHAT IS WEB SCRAPING AND WHY USE PYTHON

Web scraping is a technique used to extract data from websites. It could be really useful and powerful. An example could be a program that notifies you when a new Thinkpad appears on Ebay or when the price of a product in your Amazon wishlist decreases.

Jan 09, 2020 web scraping with python. Web scraping with python is very easy and simple, I thought this is the best way to learn how to extract the data from different websites. There are so many libraries for web scraping in python like Scrapy, Beautiful soup, urlib, requests, etc. In this webs scraping with python tutorial, we use BeautifulSoup. Oct 10, 2019 BeautifulSoup is an amazing parsing library in Python that enables the web scraping from HTML and XML documents. BeautifulSoup automatically detects encodings and gracefully handles HTML documents even with special characters.

Actually I have tried to develop an Amazon price tracking service but, unfortunately, Amazon has a vested interest in protecting its data and some basic anti-scraping measures in place. It's possible to avoid bot detection but it needs a lot of effort, hundreds of proxy servers, and honestly I do not really like the cost benefits.

I used Python because it is easy to use and read, and, not only in my opinion, the best tool to develop a web scraping program.

LET'S START SCRAPING

In order to start scraping the web we have to create a Python project and import the following libraries: requests for HTTP requests, pprint to prettify our debug logs and BeautifulSoup, we will use it to parse HTML pages.

Once we have installed and imported our libraries, the first thing we have to do is define a function that will replicate the following user behavior:

  1. Browse a dictionary website.
  1. Search for the unknown word and start reading the word meaning in the page result.

Now, let's take a look at the page result link:

If you are slightly smart you should have already realized how this URL is composed.

The red portion is responsible for the magic trick. Whatever we write after the static portion will be interpreted by the website as a research, like the one described before. I have made some attempts, we can just add the word after the static portion and avoid the ?q=blabla thing. Having said that, let's go back to our Python project. We are finally ready to define our function.

I decided to call it getMeaning, it will receive the unknown word as parameter and will be responsible of:

  1. Creating the link we will use for the request
  2. Making the HTTP request
  3. Receiving an HTML response
  4. Retrieving pronunciation and meaning

Let's try to build our program step by step. Firstly, we will cover all the first three point mentioned before. I used a static string variable called word as parameter after have stripping and lowering it, this will be really useful when we will receive it as user input.

In the following GIF image we can see the Python program in action.

The pprint(soup) command at line 12 is logging the resulting HTML page of the GET request with link:

Now that we have the HTML result, we need to identify the point where the meaning and the pronunciation are located. To do so we can use the awesome function provided by our browser called 'Inspect element' .

The meaning is located into a span with class='def', the pronunciation into a div with class='sound audio_play_button pron-uk icon-audio', this div has a property called data-src-mp3 that will provide us an mp3 file with the word pronunciation!

It is something more complicated but essentially similar to the following:

To retrieve the text into the span and the link into the data-src-mp3 property, we will use the BeautifulSoup method called find. You can see the implementation after line 12.

As excepted, this is the output with the meaning and the pronunciation (the link to the mp3 file):

HOW TO CREATE A TELEGRAM BOT

Creating a Telegram Bot is stupidly easy. Assumed that we already have a Telegram account, we need to start a chat with @BotFather that will immediately reply us with a command list. At this point we can type /newbot and follow its instructions.

Once we have decided the bot name and its username (make sure it ends in bot or _bot) we receive a token that will be useful in a following step to access the HTTP API.

Python Web Scraping Beautifulsoup

I have decided to use a library called telepot. In order to convert our web scraper in a Telegram bot, let's take a look at the following code. It is a basic async bot (for telepot projects), it just sends to the user the dump of the JSON received from the client.

We have to import some needed libraries: asyncio, telepot, telepot.aio and MessageLoop. Do not be afraid if you do not understand everything is written down here, it is not needed to understand all lines of code, the most important part is the handle function because it is exactly where we will implement our web scraping program.

This is our Python console in action when our bot receive a message:

What we have to do now is integrate our web scraper in order to convert it into a telegram bot.

Web

Python Simple Web Scraper Free

We start by adding the three missing libraries: pprint, BeautifulSoup and requests. At this point we can copy our getMeaning function and paste it there. Then we turn it into an async function and finally instead of printing the pronunciation link and the meaning in our console we will send them to the user, one as file and the other as text. I decided to define a global variable called chat_id (to reserve a better communication within the two functions) and checked the user input. Now we can call our async function from the handle and we are finally ready to test our Telegram bot!

Finally, we can see our Telegram bot in action!

THAT'S IT!

Python simple web scraper code

Automating annoying stuff is essentially what developers do for themselves and others everyday. An example of annoying stuff for me was to check on the web the meaning of words while I was reading a slightly difficult book. It was really frustrating to browse the web every time. Then I had the idea: create a Telegram bot that sending a word sends back a message with its pronunciation and meaning.

I started by looking for a dictionary REST API with the purpose of reusing a simple Node.js project that I have developed one year ago but I was not able to find any free dictionary API service. With this in mind, I understood that web scraping was the right way.

WHAT IS WEB SCRAPING AND WHY USE PYTHON

Web scraping is a technique used to extract data from websites. It could be really useful and powerful. An example could be a program that notifies you when a new Thinkpad appears on Ebay or when the price of a product in your Amazon wishlist decreases.

Jan 09, 2020 web scraping with python. Web scraping with python is very easy and simple, I thought this is the best way to learn how to extract the data from different websites. There are so many libraries for web scraping in python like Scrapy, Beautiful soup, urlib, requests, etc. In this webs scraping with python tutorial, we use BeautifulSoup. Oct 10, 2019 BeautifulSoup is an amazing parsing library in Python that enables the web scraping from HTML and XML documents. BeautifulSoup automatically detects encodings and gracefully handles HTML documents even with special characters.

Actually I have tried to develop an Amazon price tracking service but, unfortunately, Amazon has a vested interest in protecting its data and some basic anti-scraping measures in place. It's possible to avoid bot detection but it needs a lot of effort, hundreds of proxy servers, and honestly I do not really like the cost benefits.

I used Python because it is easy to use and read, and, not only in my opinion, the best tool to develop a web scraping program.

LET'S START SCRAPING

In order to start scraping the web we have to create a Python project and import the following libraries: requests for HTTP requests, pprint to prettify our debug logs and BeautifulSoup, we will use it to parse HTML pages.

Once we have installed and imported our libraries, the first thing we have to do is define a function that will replicate the following user behavior:

  1. Browse a dictionary website.
  1. Search for the unknown word and start reading the word meaning in the page result.

Now, let's take a look at the page result link:

If you are slightly smart you should have already realized how this URL is composed.

The red portion is responsible for the magic trick. Whatever we write after the static portion will be interpreted by the website as a research, like the one described before. I have made some attempts, we can just add the word after the static portion and avoid the ?q=blabla thing. Having said that, let's go back to our Python project. We are finally ready to define our function.

I decided to call it getMeaning, it will receive the unknown word as parameter and will be responsible of:

  1. Creating the link we will use for the request
  2. Making the HTTP request
  3. Receiving an HTML response
  4. Retrieving pronunciation and meaning

Let's try to build our program step by step. Firstly, we will cover all the first three point mentioned before. I used a static string variable called word as parameter after have stripping and lowering it, this will be really useful when we will receive it as user input.

In the following GIF image we can see the Python program in action.

The pprint(soup) command at line 12 is logging the resulting HTML page of the GET request with link:

Now that we have the HTML result, we need to identify the point where the meaning and the pronunciation are located. To do so we can use the awesome function provided by our browser called 'Inspect element' .

The meaning is located into a span with class='def', the pronunciation into a div with class='sound audio_play_button pron-uk icon-audio', this div has a property called data-src-mp3 that will provide us an mp3 file with the word pronunciation!

It is something more complicated but essentially similar to the following:

To retrieve the text into the span and the link into the data-src-mp3 property, we will use the BeautifulSoup method called find. You can see the implementation after line 12.

As excepted, this is the output with the meaning and the pronunciation (the link to the mp3 file):

HOW TO CREATE A TELEGRAM BOT

Creating a Telegram Bot is stupidly easy. Assumed that we already have a Telegram account, we need to start a chat with @BotFather that will immediately reply us with a command list. At this point we can type /newbot and follow its instructions.

Once we have decided the bot name and its username (make sure it ends in bot or _bot) we receive a token that will be useful in a following step to access the HTTP API.

Python Web Scraping Beautifulsoup

I have decided to use a library called telepot. In order to convert our web scraper in a Telegram bot, let's take a look at the following code. It is a basic async bot (for telepot projects), it just sends to the user the dump of the JSON received from the client.

We have to import some needed libraries: asyncio, telepot, telepot.aio and MessageLoop. Do not be afraid if you do not understand everything is written down here, it is not needed to understand all lines of code, the most important part is the handle function because it is exactly where we will implement our web scraping program.

This is our Python console in action when our bot receive a message:

What we have to do now is integrate our web scraper in order to convert it into a telegram bot.

Python Simple Web Scraper Free

We start by adding the three missing libraries: pprint, BeautifulSoup and requests. At this point we can copy our getMeaning function and paste it there. Then we turn it into an async function and finally instead of printing the pronunciation link and the meaning in our console we will send them to the user, one as file and the other as text. I decided to define a global variable called chat_id (to reserve a better communication within the two functions) and checked the user input. Now we can call our async function from the handle and we are finally ready to test our Telegram bot!

Finally, we can see our Telegram bot in action!

THAT'S IT!

Online Web Scraper

All things considered, writing a simple web scraper Telegram bot using Python is really quick and interesting. I hope you enjoyed this reading and that this little project will intrigue your mind and will help you in creating something similar or even better. Feel free to contact me if you struggle in understanding something tricky or just to let me know what you developed.





broken image