Tutorials
2022-07-28

How to run Selenium on Abstra Cloud

The Selenium library is one of Python’s most versatile tools. The open-source framework lets you instantiate browsers and simulate user behavior on a web page, which makes it useful for a wide range of purposes: automatically testing web applications, collecting data and creating navigation routines.

In this example, we’ll showcase how to use Selenium in Abstra Cloud for web scraping, building a robot that automatically collects headlines into a .csv file or a markdown list. This can be done via Forms or Jobs, depending on how you’d like to run your script and retrieve your output. Let’s get into it.

First, if you haven’t yet, sign up for free to access your workspace, where you can create your own projects.

Getting started

Let’s begin by importing the required packages: Selenium itself, Abstra Cloud’s own Hackerforms lib for the UI, os to read environment variables, Pandas to organize the CSV file and datetime to handle dates.


from hackerforms import *
from os import getenv
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
from datetime import date

Setting up Selenium

Now let’s set up Selenium.

You’ll need to use our webdriver’s environment variable for access. Don’t worry, it’s available in every workspace!

We’ll configure the webdriver’s settings (a Chrome browser with the required arguments) in the <options> variable, then call <webdriver.Remote> to start up the browser instance.


ABSTRA_SELENIUM_URL = getenv('ABSTRA_SELENIUM_URL')

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")

driver = webdriver.Remote(command_executor=ABSTRA_SELENIUM_URL, options=options)
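
As a quick aside (not required for this tutorial), if you ever run the same script somewhere the ABSTRA_SELENIUM_URL variable isn’t set, one possible fallback is a local Chrome driver. Here’s a minimal sketch, assuming Chrome and a matching chromedriver are installed on that machine:


# Hypothetical fallback for running outside the workspace: use a local
# Chrome driver when the ABSTRA_SELENIUM_URL variable isn't available.
if ABSTRA_SELENIUM_URL:
    driver = webdriver.Remote(command_executor=ABSTRA_SELENIUM_URL, options=options)
else:
    driver = webdriver.Chrome(options=options)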

Next, using the <driver.get> command, we’ll open the website we want to scrape in the browser. For this example, we’ll go with Hackernoon.


driver.get("https://hackernoon.com/")

Scraping the website

As we’ve mentioned before, Selenium simulates a real user’s behavior navigating a webpage: finding an element to read, hovering the cursor, or clicking, for example. Since we want to retrieve the webpage’s headlines, we’ll use the find_elements command.
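
Just as an aside (you won’t need this here), those interactions can be scripted too. Below is a purely illustrative sketch of hovering over an element and clicking it with Selenium’s ActionChains; the "nav-menu" id is a hypothetical example, not an element on Hackernoon’s page.


from selenium.webdriver.common.action_chains import ActionChains

# Illustrative only: hover over an element and then click it, the way a
# real user would. "nav-menu" is a hypothetical element id.
menu = driver.find_element(By.ID, "nav-menu")
ActionChains(driver).move_to_element(menu).click().perform()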

Take a look at Hackernoon’s HTML structure (you can inspect it with your browser’s developer tools).

The data we’re looking for sits inside elements with the “title-wrapper” HTML class, so we’ll pass that as an argument.


elements = driver.find_elements(By.CLASS_NAME, "title-wrapper")

Now, let’s select the info we really need and make sure it’s organized correctly. For each headline on the webpage, we’d like to collect the title and its corresponding link. Both live in an <a> tag, which is nested inside an <h2> tag within the <title-wrapper> element.

So, let’s create an empty list called <headlines>. Then, for each element with the <title-wrapper> class, we’ll drill down to its <a> tag and store it in the <link> variable.

To store the link’s URL and title separately, we’ll organize each list entry as a dictionary. Using the .append command and Python’s dictionary syntax, we’ll get the <href> attribute as the headline’s URL and the <innerHTML> text as the headline’s title.


headlines = []
for element in elements:
    link = element.find_element(By.TAG_NAME, 'h2').find_element(By.TAG_NAME, 'a')
    headlines.append({"url": link.get_attribute("href"), "title": link.get_attribute("innerHTML")})

By now, your webscraper is up and running. But how can we store and display this data?

Storing the results

Let’s check out two options.

First, we can generate a markdown list with links.

Using markdown syntax, we’ll format a heading titled “Hackernoon headlines” and, for each headline, a clickable link. Then just call Abstra Cloud’s display_markdown widget to show your results on the screen!


markdown = "# Hackernoon headlines: \n"
for headline in headlines:
    markdown += f"- [{headline['title']}]({headline['url']})\n"

display_markdown(markdown)

Another useful option is to use the Pandas library to generate a CSV file with the collected data.

It’s super simple. We’ll start by creating a dataframe with the headlines. Since this script might be run several times or even daily, we can use the datetime lib to build a distinct file name with today’s date. Finally, the df.to_csv command creates the CSV file we need, and Abstra Cloud’s display_file generates a beautiful UI for you to download your results!


df = pd.DataFrame(headlines)
filename = f"headlines-{date.today()}.csv"
df.to_csv(filename, index=False)
display_file(filename)

To wrap things up, let’s close the browser by calling the close() method:


driver.close()
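
One optional improvement: if something goes wrong mid-scrape (a selector changes or the page times out), the remote browser would stay open. Here’s a minimal sketch of wrapping the steps above in try/finally so the browser always shuts down; driver.quit() ends the whole session, while close() only closes the current window.


# Optional sketch: make sure the browser shuts down even if scraping fails.
driver = webdriver.Remote(command_executor=ABSTRA_SELENIUM_URL, options=options)
try:
    driver.get("https://hackernoon.com/")
    # ... find_elements and headline collection from above ...
finally:
    driver.quit()  # ends the session even when an exception was raised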

Sharing your webscraper

Publish the finished version and share it in a click. Your form’s link is now live, and anyone can access it and run your script. Your webscraper is already a hit! 🎉

Check out the full code:


from hackerforms import *
from os import getenv
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
from datetime import date

# Set up Selenium
ABSTRA_SELENIUM_URL = getenv('ABSTRA_SELENIUM_URL')

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")

driver = webdriver.Remote(command_executor=ABSTRA_SELENIUM_URL, options=options)

# Opens website
driver.get("https://hackernoon.com/")


# Get each headline and its url from the homepage
elements = driver.find_elements(By.CLASS_NAME, "title-wrapper")

headlines = []
for element in elements:
    link = element.find_element(By.TAG_NAME, 'h2').find_element(By.TAG_NAME, 'a')
    headlines.append({"url": link.get_attribute("href"), "title": link.get_attribute("innerHTML")})


# Generate a markdown list with links
markdown = "# Hackernoon headlines: \n"
for headline in headlines:
    markdown += f"- [{headline['title']}]({headline['url']})\n"

# Display that list
display_markdown(markdown)

# Generate a CSV file with the collected data
df = pd.DataFrame(headlines)
filename = f"headlines-{date.today()}.csv"
df.to_csv(filename, index=False)
display_file(filename)

# Closes the driver
driver.close()

If you want to run your script regularly, you can turn it into a Job by using our scheduler.

Since it will run on our backend, there’s no need for a UI. Just skip the displays and generate a CSV straight from the results. The file will be stored in your workspace’s Files right away. Check out the Job’s code:


from hackerforms import *
from os import getenv
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
from datetime import date

# Set up Selenium
ABSTRA_SELENIUM_URL = getenv('ABSTRA_SELENIUM_URL')

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")

driver = webdriver.Remote(command_executor=ABSTRA_SELENIUM_URL, options=options)

# Opens website
driver.get("https://hackernoon.com/")

# Get each headline and its url from the homepage
elements = driver.find_elements(By.CLASS_NAME, "title-wrapper")

headlines = []
for element in elements:
    link = element.find_element(By.TAG_NAME, 'h2').find_element(By.TAG_NAME, 'a')
    headlines.append({"url": link.get_attribute("href"), "title": link.get_attribute("innerHTML")})

# Generate a CSV file with the collected data
df = pd.DataFrame(headlines)
filename = f"headlines-{date.today()}.csv"
df.to_csv(filename, index=False)

# Closes the driver
driver.close()

Want to see this example in action? Check it out.

Log in to our console right now to start creating your own Selenium projects.