Last updated: August 25, 2023

A detailed guide to web scraping using ChatGPT Code Interpreter and its plugins.

Unless you’re building something entirely new, chances are you need some background information before you begin. You might also want to study the competition for useful input. Beyond that, there are countless reasons someone might be interested in a specific website’s content.

Web scraping is the process that serves such use cases.

There are a few ways to go about it. You can subscribe to heavyweight tools built for professional scraping of big websites. Alternatively, you may need a dedicated setup for on-premises processing.

Either way, the approach is expensive, time-consuming, and tedious for beginners, especially if you only need to scrape a few web pages.

Overview of ChatGPT for Web Scraping

I don’t need to introduce ChatGPT to you, do I?

In short, ChatGPT is a generative AI that responds in a human-like way. You get a chat interface for asking it to complete various tasks, such as answering questions about historical events, writing essays, summarizing, translating, coding, etc.

ChatGPT replies in text. However, there are ChatGPT plugins that extend its capabilities in many ways, and we’ll be using one such plugin. In addition, we’ll use its Code Interpreter for scraping websites with complicated page structures or active anti-scraping measures.

Please note that ChatGPT has free and paid versions. You’ll need the paid subscription (currently $20 a month) to use the web scraper plugin or the Code Interpreter engine.

In further sections, I’ll illustrate the process step-by-step.

Disclaimer: Before proceeding, please confirm that the target website allows its content to be scraped. If it doesn’t, you can contact the site’s admin and ask for permission to avoid any legal trouble.
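
One quick, partial check is the site’s robots.txt file. The sketch below uses Python’s standard urllib.robotparser module on a made-up robots.txt; the rules and the example.com URLs are assumptions for illustration only. Passing this check is not legal permission, so the site’s terms still apply.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
```

For a live site, RobotFileParser also provides set_url() and read() to fetch the file directly instead of passing the text in.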

Web Scraping Using ChatGPT Plugin

Log in to your OpenAI account, hover over GPT-4 (its current paid version), and click Plugins.

Next, click No plugins enabled, scroll down, and click Plugin Store.

Please note that instead of No plugins enabled, you’ll see a plugin icon if one is already active. In that case, click that icon to open the drop-down and click Plugin Store at the bottom.

This will open the Plugin Store. Search for Scraper and hit Install.

Select this plugin in the ChatGPT interface.

Once it is selected, prompt ChatGPT with the target URL and the content you want scraped.

I have done this for a few websites. Check this out.

Scraping a Publication

We are a tech-focussed publication, and I have chosen our home page, geekflare.com/ for this illustration.

Here’s the prompt:

check this webpage: https://geekflare.com/ and prepare a table indicating the article title, author, publication date, and excerpt for the top 10 articles.

You can also re-prompt to convert the data into CSV format, paste it into a text file with a .csv extension, and open it in a spreadsheet application like MS Excel.
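
If you would rather convert the table yourself, here is a minimal sketch. It assumes the table copied from the chat is pipe-delimited Markdown; the function name and the sample rows are hypothetical. Cells that themselves contain a pipe character would need extra handling.

```python
import csv
import io


def markdown_table_to_csv(table_text: str) -> str:
    """Convert a pipe-delimited Markdown table into CSV text."""
    rows = []
    for line in table_text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    buffer = io.StringIO()
    # The csv module quotes cells that contain commas.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample data for illustration.
SAMPLE_TABLE = """
| Title | Author |
|---|---|
| Example article, part 1 | Jane Doe |
"""

print(markdown_table_to_csv(SAMPLE_TABLE))
```

Write the returned text to a file ending in .csv and it should open in any spreadsheet application.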

Scraping a Deal or Coupon Webpage

The Geekflare deals section is where we have handpicked some offers on top-tech projects. How about fetching every deal in a tabular format?

Prepare a list of deals from this webpage: https://geekflare.com/deals/. present the result in a tabular format.

Scraping Wikipedia

Summarize in tabular format the latest news from the "in the news" section from this wikipedia page: https://en.wikipedia.org/wiki/Main_Page

Scraping E-commerce Stores

Lastly, I tried scraping Amazon.com for laptops by applying a few filters and feeding the URL to ChatGPT. The attempt was blocked.

The problem is that this isn’t an isolated case. You’ll find many websites with anti-scraping measures in place. In this situation, you’ll need an alternative way to get the data if subscribing to industry-standard scrapers isn’t an option.

The following section describes one such solution.

Web Scraping Using ChatGPT Code Interpreter

Code Interpreter is a newly launched ChatGPT engine that caters to programming-related tasks. While the default engine relies heavily on text responses, Code Interpreter can visualize outputs; parse, debug, and execute code; integrate with software binaries; and handle many other programming-centric tasks.

In this process, we will download the source HTML, upload it to ChatGPT Code Interpreter, and proceed with the scraping.

For this extraction, I have taken an Amazon search results page listing laptops.

We will begin by saving the webpage as HTML. For that, go to the webpage and press Ctrl+S.

Now we have the file for scraping. Let’s figure out the prompt.

In addition to the text prompt, I give it sample elements to fast-track the scraping. Because Amazon’s page structure is complex, the scraping attempt might fail or return nothing without these samples.

Getting these elements is fairly easy. Right-click anywhere on the target webpage and click Inspect in the context menu.

First, click the element-selection icon at the top of the developer tools panel. This will highlight the details while you select elements from the page. Next, select the container element for any specific product.

Make sure you select the innermost container. As you hover over the page, the browser keeps highlighting elements. Once the highlight covers just that product block, click it and go over to the right side to copy the element’s div class.

Similarly, select the samples for other elements.

Finally, upload the HTML file and use a prompt similar to this:

check out this webpage html and extract the laptop titles, price, and ratings. present the result in a tabular format within this chat interface and also give the results in a CSV to download.

div class="s-card-container s-overflow-hidden aok-relative puis-include-content-margin puis puis-vfcg1duwvmpo42mcln9ojhiljk s-latency-cf-section s-card-border"
sample title element: span class="a-size-medium a-color-base a-text-normal"
sample price element: span class="a-price-whole"
sample ratings element: span class="a-size-base puis-bold-weight-text"

This will take some time while ChatGPT Code Interpreter does its work. The chat will show only a few rows, while the complete results will be in the embedded CSV file.
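
For a sense of what this kind of extraction involves, here is a minimal sketch using only Python’s standard html.parser module. It is not the code ChatGPT actually runs. The class names come from the sample elements in the prompt above, while the ProductParser class, the extract_products function, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Class names copied from the sample elements in the prompt.
# Amazon changes its markup often, so treat them as placeholders.
CONTAINER_CLASS = "s-card-container"
FIELD_CLASSES = {
    "title": {"a-size-medium", "a-color-base", "a-text-normal"},
    "price": {"a-price-whole"},
    "rating": {"a-size-base", "puis-bold-weight-text"},
}


class ProductParser(HTMLParser):
    """Collects one record per product container found in the page."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field currently being captured, if any
        self._depth = 0     # open <span> tags inside that field

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "div" and CONTAINER_CLASS in classes:
            self.products.append({name: "" for name in FIELD_CLASSES})
        elif tag == "span" and self.products:
            if self._field is not None:
                self._depth += 1  # nested span inside a captured field
                return
            for name, wanted in FIELD_CLASSES.items():
                # Keep only the first match for each field per product.
                if wanted <= classes and not self.products[-1][name]:
                    self._field, self._depth = name, 1
                    break

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            self._depth -= 1
            if self._depth == 0:
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.products[-1][self._field] += data


def extract_products(html):
    """Return a list of {title, price, rating} dicts from saved HTML."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    for product in parser.products:
        for name in product:
            product[name] = product[name].strip()
        # The whole-number price span ends with a nested decimal point.
        product["price"] = product["price"].rstrip(".")
    return parser.products


# Made-up markup that mimics the structure described in the prompt.
SAMPLE_HTML = """
<div class="s-card-container s-card-border">
  <span class="a-size-medium a-color-base a-text-normal">Example Laptop A</span>
  <span class="a-size-base puis-bold-weight-text">4.5</span>
  <span class="a-price-whole">499<span class="a-price-decimal">.</span></span>
</div>
<div class="s-card-container s-card-border">
  <span class="a-size-medium a-color-base a-text-normal">Example Laptop B</span>
  <span class="a-price-whole">1,299<span class="a-price-decimal">.</span></span>
</div>
"""

for row in extract_products(SAMPLE_HTML):
    print(row)
```

A product with no matching element, such as the missing rating in the second sample, ends up with an empty string, which makes gaps easy to spot later.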

You can observe that the table has a few entries not present on the original web page, especially at the start. In such cases, you need to double-check and clean the data for any redundancies.

If there are any, you can re-prompt ChatGPT to get a clean CSV.
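
If you prefer to clean the file yourself, a minimal sketch follows. What counts as a bad row is an assumption here: the hypothetical clean_rows function drops rows without a title and exact repeats, and the sample rows are made up.

```python
def clean_rows(rows):
    """Drop rows with no title and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for row in rows:
        title = (row.get("title") or "").strip()
        if not title:
            continue  # nothing to identify the product by
        key = (title, row.get("price"), row.get("rating"))
        if key in seen:
            continue  # same product already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up rows for illustration.
SAMPLE_ROWS = [
    {"title": "", "price": "299", "rating": ""},
    {"title": "Example Laptop A", "price": "499", "rating": "4.5"},
    {"title": "Example Laptop A", "price": "499", "rating": "4.5"},
    {"title": "Example Laptop B", "price": "1,299", "rating": "4.2"},
]

for row in clean_rows(SAMPLE_ROWS):
    print(row)
```

Rows that are present in the CSV but absent from the page itself still need a manual comparison against the original webpage.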

Final Thoughts

ChatGPT does many things, and basic web scraping is one of them. Agreed, it might not be suitable for someone scraping hundreds of pages. Still, it’ll get you started in the right direction and is ideal for a short scraping session.

In this guide, we have used one of its scraping plugins and Code Interpreter. While plugins work on many standard websites, the second method is for custom webpage structures or if the page has dynamic elements (endless scroll, read more, etc.).

And to reiterate, go through the target website’s terms before scraping.

PS: Check out these cloud scraping solutions and our own Geekflare scraping API.

  • Hitesh Sant
    Author
    Hitesh works as a senior writer at Geekflare and dabbles in cybersecurity, productivity, games, and marketing. He also holds a master’s in transportation engineering. His free time is mostly about playing with his son, reading, or lying…