Web Scraping Exercises Python

Hey data hackers! Looking for a rapid way to pull down unstructured data from the Web? Here’s a 5-minute analytics workout across two simple approaches to how to scrape the same set of real-world web data using either Excel or Python. All of this is done with 13 lines of Python code or one filter and 5 formulas in Excel.

Web Scraping is a powerful tool to gather information from a website. To scrape multiple URLs, we can use a Python library called Newspaper3k. Use an HTML Parser for Web Scraping in Python Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.

All of the code and data for this post are available at GitHub here. Never scraped web data in Python before? No worries! The Jupyter notebook is written in an interactive, learning-by-doing style that anyone without knowledge of web scraping in Python through the process of understanding web data and writing the related code step by step. Stay tuned for a streaming video walkthrough of both approaches.

Huh? Why scrape web data?

If you’ve not found the need to scrape web data yet, it won’t be long … Much of the data you interact with daily on the web is not structured in a way that you could easily pull it down for analysis. Reading your morning feed of news and catch a great data table in the article? Unless the journalist links to machine-readable data, you’ll have to scrape it straight from the article itself! Looking to find the best deal across multiple shopping websites? P0848 honda. Best believe that they’re not offering easy ways to download those data and compare! In this small instance, we’ll explore augmenting one’s Kindle library with Audible audiobooks.

Audiobooks whilst traveling …

When in transit during travel (ahh, travel … I remember that …) I listen to podcasts and audiobooks. I was, the other day, wondering what the total cost would be to add Audible audiobook versions of every Kindle book that I own, where they are available. The clever people at Amazon have anticipated just such a query, and offer the Audible Matchmaker tool, which scans your Kindle library and offers Audible versions of Kindle books you own.

Sadly, there is not an option to “buy all” or even the convenience function of a “total cost” calculation anywhere. “No worries,” I thought, “I’ll just knock this into a quick Excel spready and add it up myself.”

Part I: Web Scraping in Excel

Excel has become super friendly to non-spreadsheet data in the past years. To wit, I copied the entire page (after clicking through all of the “more” paging button until all available titles were shown on one page) and simply pasted this into a tab in the spready.

Removing all of the images left us with a column of a mishmash of text, only some of which is useful to our objective of calculating the total purchase price for all unowned Audible books. Filtering on the repeating “Audible Upgrade Price” text reduces the column down to the values we are after.

Perfect! All that remains is a bit of text processing to extract the prices as numbers we can sum. All of these steps are detailed in the accompanying spreadsheet and data package to this post.

$465.07 … not a horrible price for 60 audiobooks … just about $8 each. (As an Audible subscriber, each month’s book credit costs $15, so this is roughly half of that cost and not a bad deal …) NB – we could have used Excel’s cool Web Query capabilities to import the data from the website, though we’ll cover that separately as we cover scraping Web data requiring logins elsewhere.

I’m sure my Kindle library is like yours, in that it significantly expands with each passing week, so it occurred to me that this won’t be the last time I have this question. Let’s make this repeatable by coding a solution in Python.

Part II: Web Scraping in Python

The approach in Python is quite similar, conceptually, to the Excel-based approach.

Pull the data from the Audible Matchmaker page
Parse it into something mathematically useful & sum audiobook costs

Copy the data from the Audible Matchmaker page

The BeautifulSoup library in Python provides an easy interface to scraping Web data. (It’s actually quite a bit more useful than that, but let’s discuss that another time.) We simply load the BeautifulSoup class from the bs4 module, and use it to parse a request object made by calling the get() method of the requests module.

Parse the data into something mathematically useful

Similar to our approach in Excel, we’ll use the BeautifulSoup module to filter the page elements down to the price values we’re after. We do this in 3 steps:

Find all of the span elements which contain the price of each Audible book
Convert the data in the span elements to numbers
Sum them up!

Donezo! See the accompanying Jupyter Notebook for a detailed walkthrough of this code.

There it is – scraping web data in 5 minutes using both Excel and Python! Stay tuned for a streamed video walkthrough of both approaches.

Web Scraping with Python: BeautifulSoup, Requests & Selenium. With the help of this course you can Web Scraping and Crawling with Python: Beautiful Soup, Requests & Selenium.

Python Web Scraping Pdf

This course was created by GoTrained Academy & Waqar Ahmed. It was rated 4.4 out of 5 by approx 5758 ratings. There are approx 77080 users enrolled with this course, so don’t wait to download yours now. This course also includes 8 hours on-demand video, 8 Articles, 19 Supplemental Resources, 1 Coding exercise, Full lifetime access, Access on mobile and TV, Assignments & Certificate of Completion.

What Will You Learn?

Python Web Scraping Sample

Python Refresher: Review of Data Structures, Conditionals, File Handling
How Websites are Hosted on Servers; Basic Calls to Server (GET, POST Methods)
Web Scraping with Python Beautiful Soup and Requests
Using Selenium to handle JavaScript and AJAX
Diverse Web Scraping Exercises
Source codes (*.py files) for all Exercises can be downloaded
Q&A board to send your questions and get them answered quickly

Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting, etc.) is a technique for extracting large amounts of data from websites and save the the extracted data to a local file or to a database.

In this course, you will learn how to perform web scraping using Python 3 and the Beautiful Soup, a free open-source library written in Python for parsing HTML.

We will use lxml, which is an extensive library for parsing XML and HTML documents very quickly; it can even handle messed up tags. We will also be using the Requests module instead of the already built-in urllib2 module due to improvements in speed and readability.

Finally, we will use Selenium alongside Beautiful Soup to crawl AJAX & JavaScript driven pages.

The course cover the following topics: accessing web pages programmatically; scraping web pages to extract the required data using Beautiful Soup to parse web pages; interacting with web pages to do different things with them programmatically; and using Selenium for web scraping and when we need it.

By the end of this course, you will be able to understand how websites and servers function, diverse data extraction techniques, and methods of handling and organizing data.

This Web Scraping course covers the following topics:

Web Scraping Exercises Python Pdf

Review of data structures (Lists, Dictionaries, Tuples, File Handling)
How websites are hosted on servers
Calls to the server (GET, POST methods)
Review of HTML and CSS
Requests Module and BeautifulSoup Module overview
Parsing HTML using BeautifulSoup
Filtering elements using BeautifulSoup and navigating the Parse Tree
JavaScript and AJAX overview
Selenium and the need for it
Selecting elements using Selenium
CSS selectors
XPath selectors
Navigating pages using Selenium
Practical Projects

Python Web Scraping Tutorial

Rating:
4.3