Processing dynamic web content in R: identifying and using APIs to obtain data

Barbara Streisand

Aug 27, 2025 pm 08:24 PM

This tutorial addresses a common problem when scraping dynamically loaded web content with R's rvest package. When traditional HTML parsing cannot reach data rendered by JavaScript, the core strategy is to identify the API that the web page calls behind the scenes. We demonstrate how to extract the required information efficiently and accurately by requesting that API directly and parsing the JSON it returns, which sidesteps the limitations of front-end dynamic rendering.

In modern web design, JavaScript is widely used to load and render content dynamically, so much of the data users see in the browser is not embedded in the initially loaded HTML document. When you scrape such sites with traditional HTML parsing tools such as rvest, you often fail to get the expected data and receive, for example, an empty character vector. This usually happens because the target data is fetched from a backend API through an asynchronous JavaScript call (XMLHttpRequest/AJAX or the Fetch API) and then inserted into the page DOM in the browser.

Limitations of traditional HTML parsing

When the HTML document loaded by rvest::read_html() does not contain the data you are looking for (such as nested links), the content is most likely generated by client-side JavaScript after the page loads. In that case, parsing the static HTML alone is not enough. For example, the following code tries to extract links from a Thrive Market page but gets an empty result:

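library(rvest)
library(xml2)

# URL of the Thrive Market page to scrape (substitute the page you are working with)
page1_url <- "..."

# Read the static HTML and look for links inside the product containers
page1_urls <- rvest::read_html(page1_url) %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//div[contains(@class, 'd85qmy-0 kRbsKs')]") %>%
  rvest::html_attr('href')

print(page1_urls) # The result is usually empty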

This indicates that the link we are looking for does not exist in the HTML source code originally obtained by read_html.
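The effect can be reproduced offline. The following is a minimal sketch, assuming a made-up HTML fragment (the `products` id and the script comment are hypothetical, not taken from the Thrive Market page): the container that would hold the links is empty in the static markup, so the selector finds nothing.

```r
library(rvest)

# Hypothetical static HTML: the product container stays empty until a script fills it
static_html <- '<html><body>
  <div id="products"></div>
  <script>/* JavaScript would request the data and insert the links here */</script>
</body></html>'

page <- read_html(static_html)

# Look for links inside the container, as a scraper would
links <- page %>%
  html_nodes("#products a") %>%
  html_attr("href")

print(length(links))  # 0: the static markup contains no <a> nodes
```

An empty result like this is a hint to look at the network requests rather than to keep adjusting the selector.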

Identify API sources for dynamic content

The key to solving this problem is to identify the actual source of the data, which is usually the website's backend API. The browser's developer tools (usually opened with F12) let us observe the network requests a page makes.

Open Developer Tools: On the target page, press F12 to open Developer Tools.
Switch to the Network tab: This displays all HTTP requests made while the browser loads the page.
Reload the page: Clear the network request list, then refresh the page to capture all requests.
Filter and check requests: Focus on requests of type XHR (XMLHttpRequest) or Fetch, which are usually used to load data asynchronously. Check their URL, request method (GET/POST), and response content. The response is usually JSON or XML and contains the data displayed dynamically on the page.

In this way, we can find that many dynamically loaded product lists or details are actually obtained through a specific API endpoint, such as https://thrivemarket.com/api/v1/products.

Use API to directly obtain data

Once the API endpoint is identified, we can bypass the front-end JavaScript rendering, directly send an HTTP request to the API and parse the data it returns. This approach is usually more efficient and stable than emulating browser behavior.

Step 1: Build the API request URL

Based on the API request information captured in the developer tools, we can build the corresponding API URL. These URLs usually contain query parameters that control paging, filtering, and so on. For example, the Thrive Market product data API may take parameters such as page_size and cur_page.
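When several pages are needed, it can be easier to assemble the URL from its parts than to type it by hand. The following is a rough sketch using httr::modify_url(); the helper name `build_page_url` and the page size of 60 are assumptions for illustration, and the real parameter set should be copied from the request seen in the Network tab.

```r
library(httr)

# Endpoint observed in the browser's Network tab
base_url <- "https://thrivemarket.com/api/v1/products"

# Hypothetical helper: build the URL for one page of results.
# The default page size of 60 is an assumed value for illustration.
build_page_url <- function(page, page_size = 60L) {
  modify_url(base_url, query = list(page_size = page_size, cur_page = page))
}

# URLs for the first three pages
page_urls <- vapply(1:3, build_page_url, character(1))
print(page_urls)
```

Passing the parameters as a named list lets httr handle the URL encoding, which matters once filter values contain spaces or special characters.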

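# Build API request URL
# The endpoint comes from the Network tab; the parameter values are illustrative
api_url <- "https://thrivemarket.com/api/v1/products?page_size=60&cur_page=1"

Step 2: Send HTTP request and parse JSON response

The httr package is a powerful tool in R for sending HTTP requests. httr::GET() sends a GET request, and httr::content() parses the response body, including JSON.

library(httr)
library(dplyr)
library(jsonlite) # explicitly load jsonlite to process JSON data

# Build API request URL (parameter values are illustrative)
api_url <- "https://thrivemarket.com/api/v1/products?page_size=60&cur_page=1"

# Send the GET request
response <- GET(api_url)

# Continue only if the request succeeded and the body is JSON
if (status_code(response) == 200 && http_type(response) == "application/json") {
  json_data <- content(response, as = "parsed", type = "application/json")

  # Assumption: products sit in a top-level `products` list with `title` and `url`
  # fields; check the real response in the Network tab and adjust these names
  if (!is.null(json_data$products) && length(json_data$products) > 0) {
    product_data <- json_data$products %>%
      lapply(function(p) list(product = p$title, url = p$url)) %>%
      bind_rows() %>%
      as_tibble()

    print(product_data)
  } else {
    message("Product data not found or JSON structure does not meet expectations.")
  }
} else {
  message(paste("API request failed or returns non-JSON data. Status code:", status_code(response)))
}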

Sample output (part):

# A tibble: 60 x 2
   product                                          url
   <chr>                                            <chr>
 1 Organic Extra Virgin Olive Oil                   https://thrivemarket.com/p/~
 2 Grass-Fed Collagen Peptides                      https://thrivemarket.com/p/~
 3 Grass-Fed Beef Sticks, Original                  https://thrivemarket.com/p/~
 4 Organic Dry Roasted & Salted Cashews             https://thrivemarket.com/p/~
 5 Organic Vanilla Extract                          https://thrivemarket.com/p/~
 6 Organic Raw Cashews                              https://thrivemarket.com/p/~
 7 Organic Coconut Milk, Regular                    https://thrivemarket.com/p/~
 8 Organic Robust Maple Syrup, Grade A, Value Size  https://thrivemarket.com/p/~
 9 Organic Coconut Water                            https://thrivemarket.com/p/~
10 Non-GMO Avocado Oil Potato Chips, Himalayan Salt https://thrivemarket.com/p/~
# ... with 50 more rows

With this approach, we extract the product names and their URLs from the dynamically loaded page without having to deal with JavaScript rendering at all.
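To see how a JSON payload maps onto a table without touching the network, the parsing step can be tried on a small sample. The sketch below assumes a made-up payload; the field names `products`, `title`, `url`, and `pricing` are hypothetical and will differ from the real API response.

```r
library(jsonlite)

# Hypothetical response body; read the real field names from the API response
sample_json <- '{
  "products": [
    {"title": "Example Product A", "url": "https://example.com/p/a", "pricing": {"price": 4.99}},
    {"title": "Example Product B", "url": "https://example.com/p/b", "pricing": {"price": 7.49}}
  ]
}'

# flatten = TRUE turns nested objects into columns such as "pricing.price"
parsed <- fromJSON(sample_json, flatten = TRUE)
products <- parsed$products

print(products[, c("title", "url", "pricing.price")])
```

Because fromJSON() simplifies an array of objects into a data frame, a regular payload like this needs no explicit loop; irregular or deeply nested payloads may still call for lapply or purrr::map.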

Things to note

API structure and stability: A website's API may change over time. If your scraping script suddenly fails, first check whether the API endpoint URL or the JSON response structure has changed.
Request frequency and limits: Most APIs enforce rate limits. Frequent or excessive requests may get your IP temporarily or permanently banned, or your API access restricted. Be sure to comply with the website's robots.txt file and terms of use.
Authentication and authorization: Some APIs require authentication (such as an API key or OAuth token). In that case, include the credentials in the HTTP request headers; httr::add_headers() can add them.
Error handling: Real applications need robust error handling, such as checking the HTTP response status code (status_code(response)) and handling network connection errors and JSON parsing failures.
Data parsing complexity: JSON data can be deeply nested. Analyze its hierarchy carefully and use tools such as lapply, the purrr::map family of functions, or jsonlite::flatten() to unpack it.
Law and ethics: When scraping any data, comply with the target website's terms of service, its privacy policy, and local laws and regulations.
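The rate-limit and error-handling notes above can be combined into a small retry wrapper. This is a minimal sketch in base R, not a definitive implementation: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the failing function only simulates a network error so the example runs offline.

```r
# Call `fetch` (any zero-argument function) up to `max_tries` times,
# pausing between attempts; returns NULL if every attempt fails.
fetch_with_retry <- function(fetch, max_tries = 3, pause_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    # Wait before the next attempt so the server is not hit in a tight loop
    if (attempt < max_tries) Sys.sleep(pause_seconds)
  }
  NULL
}

# Stand-in for a real request: fails twice, then succeeds
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "ok"
}

result <- fetch_with_retry(flaky_fetch, max_tries = 5, pause_seconds = 0.1)
print(result)
```

With httr, the function passed in could be something like `function() httr::stop_for_status(httr::GET(api_url))`, which raises an error on a 4xx or 5xx status so that the wrapper retries it.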

Summary

When scraping based on traditional HTML parsing (such as rvest) hits a wall with dynamically loaded content, identifying the website's backend API and talking to it directly is a powerful and efficient alternative. By analyzing network requests in the browser's developer tools, we can find the API endpoint the data comes from, then use httr to send the HTTP request and parse the returned JSON. This method retrieves data more accurately and makes scraping more efficient and stable, which makes it an important skill for advanced web data acquisition in R. Always remember to abide by the website's usage rules and to act ethically.
