Web Scraping Instagram with Selenium Python

A. M. Lani Anhar
Published in Analytics Vidhya
10 min read · Apr 7, 2021

Selenium is one of many tools that can be used to scrape a website. In this article, I explain how we can use it to scrape an Instagram account through the Instagram website.

Why use Selenium?

Many tools can be used to scrape data from a website, and the three most popular are Scrapy, Beautiful Soup, and Selenium. Each of them has its own strengths.

You can read about Scrapy, Beautiful Soup, and other scraping tools in many other places. Here I want to explain why we choose Selenium this time.

Selenium is a powerful tool for scraping because it can automate complex browser interactions. For example, we need to log in to our Instagram account before we can scrape Instagram’s website, and Selenium can fill in and submit the login form automatically.

Secondly, Selenium lets us pause between actions for as long as we need. This is very helpful because Instagram bans automated scraper bots on its website. With these pauses, our scraper does not act too rapidly, which can reduce the risk of being banned by Instagram.
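As a rough illustration of the pause idea above, here is a minimal sketch of a delay with a slightly random length. It is only an assumption that varying the pause is useful. The names polite_delay and polite_sleep and the default bounds are made up for this example; they are not part of Selenium or of the original script.

```python
import random
import time


def polite_delay(min_seconds=3.0, max_seconds=7.0):
    """Pick a random delay (in seconds) between the two bounds."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    return random.uniform(min_seconds, max_seconds)


def polite_sleep(min_seconds=3.0, max_seconds=7.0):
    """Sleep for a random duration and return how long we slept."""
    delay = polite_delay(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

A call such as polite_sleep(3, 7) could then take the place of a fixed time.sleep(5) between two browser actions.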

Requirements

  1. Read this article to the end first, and then practice the steps yourself.
  2. Familiarity with the Python programming language, especially the basics of OOP.
  3. A code editor and Python installed on your PC or laptop.
  4. The Google Chrome browser. The options mentioned in this article are the ones available in Google Chrome.

What you’ll learn

You will learn the basics of Selenium with Python: automatic login, automatic clicking, automatic scrolling down a page, and automatic downloading.

Download chromedriver or geckodriver

One of the tools we must prepare to run a Selenium program is a browser driver: chromedriver (for Chrome) or geckodriver (for Firefox). You can download chromedriver from its official download page (for Chrome users) or geckodriver from its official releases page (for Firefox users).

Installing the required libraries

First, we install the selenium library from the terminal with the command below:

pip install selenium

Once that is done, we install the requests library in the same way. The time and urllib modules that we also use are part of the Python standard library, so there is nothing to install for them:

pip install requests

Great! Our scraping environment is ready, so let’s code!
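Before moving on, we can ask Python itself whether the modules used in this article are importable. The sketch below is only a convenience check; the function name missing_modules is made up for this example.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # time is built in; selenium and requests must be installed with pip.
    missing = missing_modules(["time", "selenium", "requests"])
    print("Missing modules:", missing or "none")
```

If selenium or requests shows up in the list, the matching pip install command has not been run in the Python environment you are using.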

Importing the libraries

Here is the code that imports the libraries required for scraping with Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time, urllib.request
import requests

In this tutorial, I use Chrome for scraping, so we import the webdriver module from selenium as in the code above. The Keys class lets us send special keys, such as Enter, to the page.

Setting the PATH variable

The PATH variable holds the location of chromedriver, which connects our code to the browser. The code is below:

PATH = r"C:\download\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(PATH)

The code above assumes that the downloaded chromedriver package was extracted into the download folder. The PATH variable must match the actual location of chromedriver on your PC.
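A wrong PATH is an easy mistake to make, so it can help to check the file before starting the browser. The sketch below does that. It also shows how the driver is started in newer Selenium releases (4.x), where the path is passed through a Service object instead of directly to webdriver.Chrome. The function names check_driver_path and start_chrome are made up for this example.

```python
from pathlib import Path


def check_driver_path(path):
    """Return the path as a Path object, or raise if no file exists there."""
    driver_path = Path(path)
    if not driver_path.is_file():
        raise FileNotFoundError(f"chromedriver not found at: {driver_path}")
    return driver_path


def start_chrome(path):
    """Start Chrome with the Selenium 4 style Service object."""
    # Imported here so that check_driver_path works even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(str(check_driver_path(path))))
```

With Selenium 4 installed, driver = start_chrome(PATH) would play the same role as driver = webdriver.Chrome(PATH). Note that Selenium 4.3 and later also removed the find_element_by_* methods used in this article in favor of find_element(By.CSS_SELECTOR, ...) and find_element(By.XPATH, ...).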

Open Instagram’s website

After setting the PATH variable, we open Instagram’s website, which is our scraping target. The code is below:

driver.get("https://www.instagram.com/")

Then you can save this code and run it! And see…

Well done! Our automatic bot succeeded in opening Instagram’s website automatically.

Log in to our Instagram account

The next step is to log in to our Instagram account. I recommend using a secondary Instagram account for this exercise, which reduces the risk of losing your main account.

First, we pause the script with sleep from the time library so that the page can load, and then we write the code that logs in to our Instagram account, as below:

#login
time.sleep(5)
username=driver.find_element_by_css_selector("input[name='username']")
password=driver.find_element_by_css_selector("input[name='password']")
username.clear()
password.clear()
username.send_keys("xxxxxx")
password.send_keys("123456")
login = driver.find_element_by_css_selector("button[type='submit']").click()

Here, 5 is the number of seconds that the script waits before the next line runs.
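A fixed pause either wastes time or turns out too short on a slow connection. As an alternative, here is a minimal sketch of a polling helper that keeps checking until something is ready. The name wait_until and its defaults are made up for this example; Selenium also ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With the driver from this article, a call could look like wait_until(lambda: driver.find_elements_by_css_selector("input[name='username']")), because find_elements returns an empty list while the box is not on the page yet.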

Next, we need to identify the element for the username box on the login form by inspecting the page (CTRL+SHIFT+I), as in the image below:

and the Elements panel looks like the image below:

The first step in logging in automatically is to locate the username box, so that we can fill it with our Instagram username.

According to the image above, the tag of the username box is input and its name attribute is username. The code is therefore username=driver.find_element_by_css_selector("input[name='username']").

The same approach applies to the password variable, as in the image below:

and the Elements panel is below:

So the CSS selector for the password box uses the tag input and the name password, and the code is password=driver.find_element_by_css_selector("input[name='password']").

After that, we type our username and password into those boxes with the send_keys method. In this example, "xxxxxx" and "123456" stand in for my username and password, so replace them with your own.

The next step is to click the login button automatically. The code is login = driver.find_element_by_css_selector("button[type='submit']").click(), which locates the button in the same way as the username and password boxes. To click it, we add the .click() method at the end of the line.
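Typing a real password into the script is risky if the file is ever shared. One possible alternative, sketched below, is to read the login details from environment variables. The variable names IG_USERNAME and IG_PASSWORD and the function load_credentials are made up for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the login details from environment variables.

    IG_USERNAME and IG_PASSWORD are hypothetical names chosen for this sketch.
    """
    try:
        return env["IG_USERNAME"], env["IG_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"please set the environment variable {missing}"
        ) from None
```

The two values can then be passed to send_keys in place of the hardcoded strings, for example with my_username, my_password = load_credentials().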

Well, let’s run the code, and see!

Great job! We have done it!

Skip the pop-up automatically

After logging in to our Instagram account, the next step is to click either Not Now or Save Info in the pop-up that appears. In this case, I use the Not Now option, but you can choose Save Info if you want.

Then, just as we did for the username, password, and login variables, we write the code for the Not Now variable as below:

#save your login info?
time.sleep(10)
notnow = driver.find_element_by_xpath("//button[contains(text(), 'Not Now')]").click()
#turn on notif
time.sleep(10)
notnow2 = driver.find_element_by_xpath("//button[contains(text(), 'Not Now')]").click()

The code above shows that the first step in skipping the pop-up named Save Your Login Info is to create the notnow variable: notnow = driver.find_element_by_xpath("//button[contains(text(), 'Not Now')]").click().

Here we locate the Not Now element with find_element_by_xpath, as explained in the Selenium with Python documentation. When we use the XPath method to locate an element, we can match on its visible text. For example, we use the text Not Now to locate the Not Now button in the pop-up notification, as in the image below:

and the Elements panel looks like the image below:

Let’s run it! Once it works, you will see the second pop-up notification, as in the image below:

The image above shows the second pop-up notification. It appears after the first pop-up, Save Your Login Info, has been dismissed. We will call this second pop-up the Turn on Notifications pop-up.

So, we create a variable named notnow2 for the Turn on Notifications pop-up. We get it in the same way, by locating its element as in the image below:

and the Elements panel is below:

So, the code is notnow2 = driver.find_element_by_xpath("//button[contains(text(), 'Not Now')]").click().

Let’s check the code! Run it and see! (When this step has finished, you will land on the home page of your Instagram account.)
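Both pop-ups are dismissed with exactly the same XPath, so the two steps can be folded into one small loop. The sketch below assumes the driver object and the find_element_by_xpath method used in this article; the names button_xpath, NOT_NOW, and dismiss_popups are made up for this example.

```python
import time


def button_xpath(label):
    """Build an XPath that matches a <button> whose text contains `label`."""
    if "'" in label:
        raise ValueError("this simple sketch does not handle apostrophes")
    return f"//button[contains(text(), '{label}')]"


NOT_NOW = button_xpath("Not Now")


def dismiss_popups(driver, count=2, pause=10):
    """Wait `pause` seconds, click Not Now, and repeat `count` times."""
    for _ in range(count):
        time.sleep(pause)
        driver.find_element_by_xpath(NOT_NOW).click()
```

Calling dismiss_popups(driver) would then handle the Save Your Login Info pop-up and the Turn on Notifications pop-up in turn.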

Searchbox handling

Now that we are on the home page of our Instagram account, we go to the target Instagram account by typing its name into the search box at the top of the page, as in the image below:

Then, we locate the search box element so that we can fill it in automatically. In this case, the target is the public Instagram account named host.py.

Thus, following the same approach we used to handle the pop-ups in the previous step, the code is as below:

#searchbox
time.sleep(5)
searchbox=driver.find_element_by_css_selector("input[placeholder='Search']")
searchbox.clear()
searchbox.send_keys("host.py")
time.sleep(5)
searchbox.send_keys(Keys.ENTER)
time.sleep(5)
searchbox.send_keys(Keys.ENTER)

The code above shows that creating the searchbox variable is the first step in handling the search box automatically. We create it in the same way as the variables in the previous steps.

We create the searchbox variable with the code searchbox=driver.find_element_by_css_selector("input[placeholder='Search']"), where the tag of this box is input and the selector matches its placeholder attribute, whose value is Search, as in the image below (you can find it by inspecting the elements):

After that, we use the send_keys method to type the target Instagram account into the search box automatically: searchbox.send_keys("host.py").

Then, we press Enter automatically by sending Keys.ENTER. (Note: we must send it twice, and we should check the behavior on the Instagram website itself to debug it.)

Lastly, if the code runs successfully, we will see the target Instagram profile, as in the image below:

Well done! You got it!
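The search box is one way to reach the profile. Since a public profile such as host.py also has its own address, another option is to build that address and open it directly. This is only a sketch under that assumption, and the function name profile_url is made up for this example.

```python
from urllib.parse import quote


def profile_url(username):
    """Build the profile address for a username such as host.py."""
    name = username.strip().lstrip("@")
    if not name:
        raise ValueError("username must not be empty")
    return f"https://www.instagram.com/{quote(name)}/"
```

With the driver from this article, driver.get(profile_url("host.py")) would open the profile page without typing into the search box.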

Scroll down the profile

Now that we have the profile page of the target user, it may seem that we can start scraping right away. However, we must first scroll down the page automatically so that the rest of the posts can load. Here is the code:

#scroll
# Scroll to the bottom once and remember the page height
scrolldown=driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
match=False
while not match:
    last_count = scrolldown
    time.sleep(3)
    scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
    # Stop when the height no longer grows, meaning nothing new was loaded
    if last_count==scrolldown:
        match=True

Then let’s check the code by running it and see! (If it works, the page scrolls down automatically.)
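The scroll loop can also be written as a small function with an upper limit on the number of rounds, so that it cannot run forever on a page that keeps growing. This is a sketch only; the names SCROLL_JS and scroll_to_bottom and the max_rounds limit are made up for this example.

```python
import time

# Scroll to the bottom of the page and report the current page height.
SCROLL_JS = (
    "window.scrollTo(0, document.body.scrollHeight);"
    "return document.body.scrollHeight;"
)


def scroll_to_bottom(execute_script, pause=3.0, max_rounds=50):
    """Scroll until the page height stops growing; return the last height."""
    height = execute_script(SCROLL_JS)
    for _ in range(max_rounds):
        time.sleep(pause)
        new_height = execute_script(SCROLL_JS)
        if new_height == height:
            break  # nothing new was loaded, so we reached the bottom
        height = new_height
    return height
```

With the driver from this article, the call would be scroll_to_bottom(driver.execute_script).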

Get the post URLs

Now, it is time to get the URLs of the posts published at https://instagram.com/host.py/!

There are 11 image posts on that profile. We can write the code as in the image below:

According to the image above, we create an empty list named posts to hold all the post URLs, and we can type the code like below:

posts = []

Then, we create the links variable, which holds all the elements with the tag name "a". Next, we write a for loop that collects all the post URLs by following the code in the image above, and print the result!
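The loop itself is only visible in the image, so the sketch below is a guess at its logic rather than the original code. It assumes that post addresses contain /p/, as in https://www.instagram.com/p/CNMnQ0JAPfA/, and the function name collect_post_urls is made up for this example.

```python
def collect_post_urls(hrefs):
    """Keep only the links that point to a post, in order, without repeats."""
    posts = []
    for href in hrefs:
        # An <a> element without an href gives None, so skip those first.
        if href and "/p/" in href and href not in posts:
            posts.append(href)
    return posts
```

With the driver from this article, the input could come from links = driver.find_elements_by_tag_name("a"), followed by posts = collect_post_urls(link.get_attribute("href") for link in links) and print(posts).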

Now the code to get all the post URLs is complete. We can run it and see! (Just wait a few minutes.)

Exactly! The image above shows all the post URLs on the page, 11 URLs in total. Good job!

Download all of the posts

Lastly, we download all of the posts and save them to the directory where the script is saved. The code is below:

The image above shows that we first create a variable named shortcode to hold the name of each image that we want to download. For example, when the URL of the first post is https://www.instagram.com/p/CNMnQ0JAPfA/, we name the file CNMnQ0JAPfA.jpg. The same rule applies to the other files.
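The naming rule described above can be sketched as follows. The original code is only visible in the image, so this is one possible way to get the shortcode from a post URL; the function names shortcode_from_url and image_filename are made up for this example.

```python
from urllib.parse import urlparse


def shortcode_from_url(post_url):
    """Return the part of a post URL that follows /p/."""
    parts = [part for part in urlparse(post_url).path.split("/") if part]
    if len(parts) < 2 or parts[0] != "p":
        raise ValueError(f"not a post URL: {post_url}")
    return parts[1]


def image_filename(post_url):
    """Build the file name for a post, for example CNMnQ0JAPfA.jpg."""
    return shortcode_from_url(post_url) + ".jpg"
```

For the first post in this article, image_filename("https://www.instagram.com/p/CNMnQ0JAPfA/") gives the file name CNMnQ0JAPfA.jpg.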

I assume that by now we understand how to get the selector for the download_url variable written in the image above.
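Once download_url is known, saving the file takes only a few lines. The sketch below uses urllib.request, which is imported at the top of the script, but the original code in the image may do this differently. The function name save_image is made up for this example, and download_url is taken as a given value.

```python
import urllib.request
from pathlib import Path


def save_image(download_url, filename, directory="."):
    """Download the file behind download_url and save it as directory/filename."""
    target = Path(directory) / filename
    # urlretrieve copies the content behind the URL into the target file.
    urllib.request.urlretrieve(download_url, str(target))
    return target
```

A call such as save_image(download_url, "CNMnQ0JAPfA.jpg") would save the image next to the script when the script is run from its own directory.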

Conclusion

Finally, we have all the pieces of the code. So what are we waiting for? Here is the code:

Let’s run and see!

Hope this helps!

Let’s connect https://github.com/arrlanyhars
