Web Scraping in Go

Sep 10, 2024, 02:30 PM

Getting started

First, Go must be installed; see the official instructions for downloading and installing Go.

Create a new folder for the project, change into that directory, and run the following command:

go mod init scraper

The go mod init command initializes a new Go module in the directory where it is run and creates a go.mod file to track the code's dependencies. See: Managing dependencies.

Now let's install Colibri:

go get github.com/gonzxlez/colibri

Colibri is a Go package that lets us crawl and extract structured data from the web using a set of rules defined in JSON. Repository: github.com/gonzxlez/colibri


Extraction rules

We define the rules that Colibri will use to extract the data we need. See the Colibri documentation.

We are going to make an HTTP request to the URL https://pkg.go.dev/search?q=xpath, which contains the results of a search on Go Packages for Go packages related to xpath.

Using the developer tools built into our web browser, we can inspect the HTML structure of the page. (See: What are browser developer tools?)

<div class="SearchSnippet">
   <div class="SearchSnippet-headerContainer">
      <h2>
         <a href="/github.com/antchfx/xpath" data-gtmc="search result" data-gtmv="0" data-test-id="snippet-title">
         xpath
         <span class="SearchSnippet-header-path">(github.com/antchfx/xpath)</span>
         </a>
      </h2>
   </div>
   <div class="SearchSnippet-infoLabel">
      <a href="/github.com/antchfx/xpath?tab=importedby" aria-label="Go to Imported By">
      <span class="go-textSubtle">Imported by </span><strong>143</strong>
      </a>
      <span class="go-textSubtle">|</span>
      <span class="go-textSubtle">
      <strong>v1.2.5</strong> published on <span data-test-id="snippet-published"><strong>Oct 26, 2023</strong></span>
      </span>
      <span class="go-textSubtle">|</span>
      <span data-test-id="snippet-license">
      <a href="/github.com/antchfx/xpath?tab=licenses" aria-label="Go to Licenses">
      MIT
      </a>
      </span>
   </div>
</div>

Fragment of the HTML structure that represents one search result.

So we need a “packages” selector that finds every div element in the HTML with the class SearchSnippet. Within each of those elements, a “name” selector takes the text of the a element inside an h2 element, and a “path” selector takes the value of the href attribute of that same a element. In other words, “name” yields the Go package name and “path” the package path :)
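To make it concrete what “name” and “path” are expected to pick out, here is a minimal sketch that reads a trimmed copy of the fragment above using only the standard library. It is not how Colibri works internally: encoding/xml is used here only because this particular fragment happens to be well-formed, which real HTML pages often are not. The names fragment, snippet and nameAndPath are made up for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// fragment is a trimmed copy of the search result shown above.
const fragment = `<div class="SearchSnippet">
   <div class="SearchSnippet-headerContainer">
      <h2>
         <a href="/github.com/antchfx/xpath" data-test-id="snippet-title">
         xpath
         <span class="SearchSnippet-header-path">(github.com/antchfx/xpath)</span>
         </a>
      </h2>
   </div>
</div>`

// snippet (hypothetical) maps only the nodes the two nested selectors touch.
type snippet struct {
	Link struct {
		Href string `xml:"href,attr"` // the value //h2/a/@href refers to
		Text string `xml:",chardata"` // direct text of <a>, as in //h2/a/text()
	} `xml:"div>h2>a"`
}

// nameAndPath returns the package name and path found in one result fragment.
func nameAndPath(doc string) (name, path string, err error) {
	var s snippet
	if err = xml.Unmarshal([]byte(doc), &s); err != nil {
		return "", "", err
	}
	// The direct text of <a> is surrounded by whitespace, so trim it.
	return strings.TrimSpace(s.Link.Text), s.Link.Href, nil
}

func main() {
	name, path, err := nameAndPath(fragment)
	if err != nil {
		panic(err)
	}
	fmt.Println("name:", name) // name: xpath
	fmt.Println("path:", path) // path: /github.com/antchfx/xpath
}
```

Note that the text inside the nested span is not part of the direct text of the a element, which is why only the short package name is returned.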

{
    "method": "GET",
    "url":    "https://pkg.go.dev/search?q=xpath",
    "timeout": 10000,
    "selectors": {
        "packages": {
            "expr": "div.SearchSnippet",
            "all": true,
            "type": "css",
            "selectors": {
                "name": "//h2/a/text()",
                "path": "//h2/a/@href"
            }
        }
    }
}
  • method: the HTTP method (GET, POST, PUT, ...).
  • url: the request URL.
  • timeout: time limit for the HTTP request, in milliseconds.
  • selectors: the selectors.
    • “packages”: the name of the selector.
      • expr: the selector expression.
      • all: specifies that every element matching the expression should be found.
      • type: the expression type, in this case a CSS selector.
      • selectors: nested selectors.
        • “name” and “path” are the names of the nested selectors; their values are expressions, in this case XPath expressions.
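The shape of the rules described above can be explored with a short sketch that decodes the JSON using only encoding/json. The types ruleShape and selectorShape and the functions describe and summarize are hypothetical, written just for this illustration; they are not Colibri's own Rules type. The sketch relies on one thing visible in the rules themselves: a selector value is either a bare string (an expression) or an object with expr, all, type and nested selectors.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

const exampleRules = `{
    "method": "GET",
    "url":    "https://pkg.go.dev/search?q=xpath",
    "timeout": 10000,
    "selectors": {
        "packages": {
            "expr": "div.SearchSnippet",
            "all": true,
            "type": "css",
            "selectors": {
                "name": "//h2/a/text()",
                "path": "//h2/a/@href"
            }
        }
    }
}`

// ruleShape and selectorShape are hypothetical types for this sketch only.
type ruleShape struct {
	Timeout   int64                      `json:"timeout"` // milliseconds
	Selectors map[string]json.RawMessage `json:"selectors"`
}

type selectorShape struct {
	Expr      string                     `json:"expr"`
	All       bool                       `json:"all"`
	Type      string                     `json:"type"`
	Selectors map[string]json.RawMessage `json:"selectors"`
}

// describe returns one line per selector, in name order, handling both the
// short form (a bare string) and the long form (an object).
func describe(sels map[string]json.RawMessage, indent string) ([]string, error) {
	names := make([]string, 0, len(sels))
	for name := range sels {
		names = append(names, name)
	}
	sort.Strings(names)

	var lines []string
	for _, name := range names {
		var expr string
		if err := json.Unmarshal(sels[name], &expr); err == nil {
			lines = append(lines, fmt.Sprintf("%s%s: expr=%q", indent, name, expr))
			continue
		}
		var s selectorShape
		if err := json.Unmarshal(sels[name], &s); err != nil {
			return nil, fmt.Errorf("selector %q: %w", name, err)
		}
		lines = append(lines, fmt.Sprintf("%s%s: expr=%q type=%s all=%t", indent, name, s.Expr, s.Type, s.All))
		nested, err := describe(s.Selectors, indent+"  ")
		if err != nil {
			return nil, err
		}
		lines = append(lines, nested...)
	}
	return lines, nil
}

// summarize decodes the rules and returns the timeout as a time.Duration
// together with the selector description.
func summarize(raw string) (time.Duration, []string, error) {
	var r ruleShape
	if err := json.Unmarshal([]byte(raw), &r); err != nil {
		return 0, nil, err
	}
	lines, err := describe(r.Selectors, "")
	if err != nil {
		return 0, nil, err
	}
	return time.Duration(r.Timeout) * time.Millisecond, lines, nil
}

func main() {
	timeout, lines, err := summarize(exampleRules)
	if err != nil {
		panic(err)
	}
	fmt.Println("timeout:", timeout) // timeout: 10s
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Running it lists “packages” with its CSS expression, followed by the indented “name” and “path” entries with their XPath expressions.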

Go code

We are ready to create a scraper.go file, import the necessary packages, and define the main function:

package main

import (
    "encoding/json"
    "fmt"

    "github.com/gonzxlez/colibri"
    "github.com/gonzxlez/colibri/webextractor"
)

var rawRules = `{
    "method": "GET",
    "url":    "https://pkg.go.dev/search?q=xpath",
    "timeout": 10000,
    "selectors": {
        "packages": {
            "expr": "div.SearchSnippet",
            "all": true,
            "type": "css",
            "selectors": {
                "name": "//h2/a/text()",
                "path": "//h2/a/@href"
            }
        }
    }
}`

func main() {
    we, err := webextractor.New()
    if err != nil {
        panic(err)
    }

    var rules colibri.Rules
    err = json.Unmarshal([]byte(rawRules), &rules)
    if err != nil {
        panic(err)
    }

    output, err := we.Extract(&rules)
    if err != nil {
        panic(err)
    }

    fmt.Println("URL:", output.Response.URL())
    fmt.Println("Status code:", output.Response.StatusCode())
    fmt.Println("Content-Type:", output.Response.Header().Get("Content-Type"))
    fmt.Println("Data:", output.Data)
}

The webextractor package provides default interfaces for Colibri, ready for crawling or extracting data from the web.

Using the New function from webextractor, we create a Colibri structure with everything needed to start extracting data.

We then convert our JSON rules into a Rules structure and call the Extract method, passing the rules as the argument.

Finally, we take the output and print the URL of the HTTP response, the HTTP status code, the content type of the response, and the data extracted by the selectors. See the documentation for the Output structure.
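The “path” values come from href attributes, so they are relative to the site, for example /github.com/antchfx/xpath. A possible follow-up step is to turn them into absolute URLs by resolving them against the page URL. The sketch below uses only net/url; the function absoluteURLs is made up for this example, and the paths slice is sample data copied from the HTML fragment, not real Colibri output. How to read the values out of output.Data depends on its type, which is not shown in this article.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURLs (hypothetical helper) resolves each relative path against
// the URL of the page it was extracted from.
func absoluteURLs(pageURL string, paths []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, fmt.Errorf("parse base %q: %w", pageURL, err)
	}
	out := make([]string, 0, len(paths))
	for _, p := range paths {
		ref, err := url.Parse(p)
		if err != nil {
			return nil, fmt.Errorf("parse path %q: %w", p, err)
		}
		out = append(out, base.ResolveReference(ref).String())
	}
	return out, nil
}

func main() {
	// Sample value taken from the HTML fragment; not real Colibri output.
	paths := []string{"/github.com/antchfx/xpath"}

	urls, err := absoluteURLs("https://pkg.go.dev/search?q=xpath", paths)
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u) // https://pkg.go.dev/github.com/antchfx/xpath
	}
}
```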

Run the following command:

go mod tidy

The go mod tidy command makes sure that the dependencies listed in go.mod match the module's source code.

Finally, we compile and run our Go code with the command:

go run scraper.go

Conclusion

In this post, we learned how to do web scraping in Go with the Colibri package, defining extraction rules with CSS and XPath selectors. Colibri is an option for anyone looking to automate web data collection in Go. Its rule-based approach and ease of use make it attractive to developers of all experience levels.

Web scraping in Go is a powerful and versatile technique that can be used to extract information from a wide range of websites. It should be done ethically: respect each site's terms and conditions and avoid overloading its servers.
