Regex vs GoQuery vs net/html — Go HTML Parsing Performance Benchmark

Parsing the Web

For as long as the web has existed, we have been writing parsers to extract and categorize its data. Today, this effort has scaled into a massive wave of bot traffic. Whether it is search engines like Googlebot, AI agents feeding LLMs, or specialized scrapers, the goal remains the same: transforming raw HTML into structured, actionable information.

While the specific requirements vary by industry, the fundamental need for data sourcing is universal. E-commerce platforms track competitor pricing; AI systems extract plain text to train models; outreach tools scrape contact details and addresses. At scale, the bottleneck isn’t the data itself; it’s how efficiently your parser can chew through megabytes of potentially broken HTML.

Implementation Approaches

The requirement for data extraction has led to various tooling options. These range from managed services to the low-level libraries used in custom development.

Python is a common choice for this work, supported by a mature set of libraries for building functional parsers. For systems requiring higher concurrency, Go is an effective alternative. It allows for the development of parsers that manage large datasets with efficient resource utilization.

Technical Context

EdgeComet is implemented in Go. Within our rendering service, we maintain a module for extracting page titles and metadata. While these are standard operations, the choice of implementation method involved specific technical trade-offs.

The initial version utilized regular expressions for their low overhead. This decision sparked a heated discussion within the team. Some members strongly disagreed with using regex, advocating for the goquery library to improve robustness. Their primary concern was the handling of malformed HTML, while my focus remained on performance: regex should be significantly faster than parsing a full page and building a DOM tree in memory.

The Regex Implementation

Using regex seems straightforward. After rendering a page in Headless Chrome, we have the full page content in a string. I simply fed it into a couple of regex calls.

The catch is that modern webpages can be huge. We aren’t talking about tens of kilobytes anymore; I often see pages weighing in at a couple of megabytes.

package extractor

import (
    "regexp"
    "strings"
)

// Patterns for HTML tag and attribute extraction
var (
    titlePattern       = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)
    metaTagPattern     = regexp.MustCompile(`(?i)<meta\s+[^>]*>`)
    nameAttrPattern    = regexp.MustCompile(`(?i)\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))`)
    contentAttrPattern = regexp.MustCompile(`(?i)\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))`)
    blockingPattern    = regexp.MustCompile(`(?i)\b(noindex|none)\b`)
)

// ExtractTitle returns the page title, truncated to 200 characters
func ExtractTitle(html string) string {
    matches := titlePattern.FindStringSubmatch(html)
    if len(matches) < 2 {
        return ""
    }
    title := strings.TrimSpace(matches[1])
    if runes := []rune(title); len(runes) > 200 {
        return string(runes[:200])
    }
    return title
}

// IsNoindex checks if page is blocked by meta robots or googlebot tags
func IsNoindex(html string) bool {
    metaTags := metaTagPattern.FindAllString(html, -1)

    var robotsContent, googlebotContent string
    for _, tag := range metaTags {
        name := strings.ToLower(extractAttr(tag, nameAttrPattern))
        content := extractAttr(tag, contentAttrPattern)
        switch name {
        case "robots":
            robotsContent = content
        case "googlebot":
            googlebotContent = content
        }
    }

    // Googlebot directive takes priority over robots
    if googlebotContent != "" {
        return blockingPattern.MatchString(googlebotContent)
    }
    return blockingPattern.MatchString(robotsContent)
}

// extractAttr returns the first non-empty capture group: the attribute
// value in its double-quoted, single-quoted, or unquoted form
func extractAttr(tag string, pattern *regexp.Regexp) string {
    matches := pattern.FindStringSubmatch(tag)
    for i := 1; i < len(matches); i++ {
        if matches[i] != "" {
            return matches[i]
        }
    }
    return ""
}

// Usage:
// title := ExtractTitle(html)
// blocked := IsNoindex(html)

One of the biggest issues with regex parsing is fragility. Pages on the internet do not follow strict standards or best practices. In the wild, you will encounter edge cases you can’t even imagine.

A <title> tag might contain 5 MB of data. A meta tag might be unclosed and bleed into the <body>. These scenarios make regex complicated to maintain. I found myself writing not just one, but a chain of regex patterns just to extract a single piece of text while handling various edge cases.
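To make the fragility concrete, here is a minimal sketch that runs the same title pattern against three small, made-up inputs: a commented-out title, a page whose only <title> belongs to an inline SVG, and a title that is never closed. The firstTitle helper and the sample strings are illustrative assumptions, not part of the production extractor.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same title pattern as in the extractor above.
var titlePattern = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// firstTitle returns the first regex match, trimmed, or "" if there is none.
// Hypothetical helper for this example only.
func firstTitle(html string) string {
	m := titlePattern.FindStringSubmatch(html)
	if len(m) < 2 {
		return ""
	}
	return strings.TrimSpace(m[1])
}

func main() {
	// A commented-out title precedes the real one: the regex cannot tell
	// that the first occurrence sits inside an HTML comment.
	commented := `<head><!-- <title>Old title</title> --><title>Real title</title></head>`

	// No document title at all, but an inline SVG carries its own <title>.
	svgOnly := `<html><body><svg><title>Close icon</title></svg></body></html>`

	// The closing tag is missing, so the pattern never matches.
	unclosed := `<head><title>Broken page<meta charset="utf-8"></head><body>text</body>`

	fmt.Printf("%q\n", firstTitle(commented))
	fmt.Printf("%q\n", firstTitle(svgOnly))
	fmt.Printf("%q\n", firstTitle(unclosed))
}
```

The first two inputs return "Old title" and "Close icon", neither of which is the document title; the third returns an empty string because the closing tag never appears.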

We all remember the old saying: “If you have a problem and you decide to use regex, now you have two problems.”

GoQuery

https://github.com/PuerkitoBio/goquery

goquery is a staple in the Go ecosystem. It’s more than 6 years old and enjoys massive popularity. Essentially, it’s a wrapper around net/html, the HTML5 parser from golang.org/x/net (maintained by the Go team, but not part of the standard library).

net/html implements a tokenizer and a parser. It reads the HTML document, splits it into tokens, and builds a full tree of the document in memory. If you are thinking about processing thousands of pages per second, that sounds like a huge overhead: parsing every document, building a tree, and allocating all those objects.

At the same time, it makes life much easier. I don’t need to care about broken tags, non-compliant markup, or weird nesting. The library offers a clean interface similar to jQuery, making data extraction easy and, frankly, quite enjoyable.

package extractor

import (
    "io"
    "regexp"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

var blockingPattern = regexp.MustCompile(`(?i)\b(noindex|none)\b`)

// Document wraps goquery.Document for metadata extraction
type Document struct {
    doc *goquery.Document
}

// NewDocument parses HTML from reader and returns a Document
func NewDocument(r io.Reader) (*Document, error) {
    doc, err := goquery.NewDocumentFromReader(r)
    if err != nil {
            return nil, err
    }
    return &Document{doc: doc}, nil
}

// Title returns the page title, truncated to 200 characters
func (d *Document) Title() string {
    title := strings.TrimSpace(d.doc.Find("head title").First().Text())
    if runes := []rune(title); len(runes) > 200 {
        return string(runes[:200])
    }
    return title
}

// IsNoindex checks if page is blocked by meta robots/googlebot tags
func (d *Document) IsNoindex() bool {
    var googlebotContent, robotsContent string

    d.doc.Find("head meta").Each(func(_ int, s *goquery.Selection) {
        name := strings.ToLower(s.AttrOr("name", ""))
        content := s.AttrOr("content", "")
        switch name {
        case "googlebot":
            googlebotContent = content
        case "robots":
            robotsContent = content
        }
    })

    // Googlebot directive takes priority
    if googlebotContent != "" {
        return blockingPattern.MatchString(googlebotContent)
    }
    return blockingPattern.MatchString(robotsContent)
}

// Usage:
// doc, err := extractor.NewDocument(strings.NewReader(html))
// if err != nil {
//   return err
// }
//
// title := doc.Title()
// blocked := doc.IsNoindex()

However, goquery and the net/html library are not a universal solution. In production environments, we encountered scenarios where parsing became problematic, particularly with unusually large <head> sections. This typically occurs when web pages include hundreds of kilobytes of inline JavaScript and CSS directly in the head for “optimization” purposes.

Comparison

After our internal heated discussion between the regex and goquery camps, I decided to settle it with testing. The plan was to download a set of real pages, run each implementation on them, and compare which method was actually faster in practice.

I grabbed the top 1000 websites from dataforseo.com and created a simple parser benchmark.

Test Methodology

I built a benchmark tool that:

Fetches HTML from each website using a Chrome user-agent.
Extracts metadata (title, meta description, h1/h2 counts, link counts) using three methods: goquery, regex, and Go’s native net/html parser.
Measures execution time for each method.
Also benchmarks script cleaning (removing executable scripts while preserving JSON-LD).

Out of 1000 domains, 673 returned valid HTML responses. The rest failed due to HTTP errors (403 Forbidden), timeouts, or SSL issues. I ran all tests with 15 concurrent workers and a 10-second timeout per request.
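The P50, P90, and P99 columns in the results are percentiles over per-page timings. Below is a minimal sketch of how such figures can be computed, assuming the nearest-rank method. The benchmark tool's actual code is not shown in this article, so measure, percentile, and millis are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// measure times a single call of fn.
func measure(fn func()) time.Duration {
	start := time.Now()
	fn()
	return time.Since(start)
}

// percentile returns the nearest-rank percentile p (1..100) of samples.
// It sorts a copy, so the caller's slice is left untouched.
func percentile(samples []time.Duration, p int) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := make([]time.Duration, len(samples))
	copy(sorted, samples)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Integer ceiling of p/100 * N, clamped to a valid rank.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// millis converts whole milliseconds into durations (example helper).
func millis(values ...int) []time.Duration {
	out := make([]time.Duration, len(values))
	for i, v := range values {
		out[i] = time.Duration(v) * time.Millisecond
	}
	return out
}

func main() {
	// Fixed sample set so the percentiles are easy to check by hand.
	samples := millis(7, 3, 10, 1, 5, 9, 2, 8, 4, 6)
	fmt.Println("P50:", percentile(samples, 50)) // P50: 5ms
	fmt.Println("P90:", percentile(samples, 90)) // P90: 9ms
	fmt.Println("P99:", percentile(samples, 99)) // P99: 10ms

	// Timing one call; the workload is a stand-in for an extraction method.
	elapsed := measure(func() {
		_ = strings.Count(strings.Repeat("<a>x</a>", 10000), "<a>")
	})
	fmt.Println("one call took:", elapsed)
}
```

With only ten samples, P99 is simply the slowest one, which is why tail percentiles only become meaningful on a larger set such as the 673 pages used here.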

Results

Benchmark Results Summary

Performance comparison across 673 websites from top 1000 domains

Operation	Method	P50	P90	P99	Max
Full Metadata Extraction (title, description, h1/h2, links)
	Regex	1.35ms	5.28ms	14.1ms	36.2ms
	net/html	1.60ms	5.29ms	12.8ms	18.4ms
	GoQuery	1.80ms	5.82ms	14.7ms	22.4ms
Title-Only Extraction (includes parse time)
	Regex	8μs	14μs	29μs	4.46ms
	net/html	1.59ms	5.30ms	12.7ms	21.0ms
	GoQuery	1.63ms	5.52ms	13.4ms	18.6ms
Script Cleaning (remove executable, preserve JSON-LD)
	Regex	3.89ms	22.3ms	82.4ms	171.2ms
	net/html	1.92ms	6.24ms	16.1ms	50.5ms

Test environment: macOS, Go 1.21+, 15 concurrent workers, 10-second HTTP timeout per domain.
327 domains failed due to HTTP errors (403, timeouts, SSL issues), leaving 673 successful measurements.
Title-only extraction includes full parse time for fair comparison – regex is ~200x faster for single-element extraction.
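The speedup figures quoted in this article are plain ratios of the table's percentiles. The following sketch recomputes three of them; ratio is a hypothetical helper, and the inputs are the values from the summary table converted to microseconds.

```go
package main

import "fmt"

// ratio reports how many times larger slow is than fast.
// Both arguments are expected in the same unit (microseconds here).
func ratio(slow, fast float64) float64 {
	return slow / fast
}

func main() {
	// Title-only extraction, P50: net/html 1.59ms vs regex 8μs.
	fmt.Printf("title-only P50: %.0fx\n", ratio(1590, 8)) // title-only P50: 199x

	// Script cleaning, P50: regex 3.89ms vs net/html 1.92ms.
	fmt.Printf("script cleaning P50: %.1fx\n", ratio(3890, 1920)) // script cleaning P50: 2.0x

	// Script cleaning, P99: regex 82.4ms vs net/html 16.1ms.
	fmt.Printf("script cleaning P99: %.1fx\n", ratio(82400, 16100)) // script cleaning P99: 5.1x
}
```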

Full Metadata Extraction

For extracting title, description, heading counts, and link counts from HTML:

Method	P50	P90	P99	Max
Regex	1.35ms	5.28ms	14.1ms	36.2ms
net/html	1.60ms	5.29ms	12.8ms	18.4ms
GoQuery	1.80ms	5.82ms	14.7ms	22.4ms

[Chart: Full Metadata Extraction Performance. Time to extract title, description, h1/h2 counts, and links from HTML, for GoQuery, Regex, and net/html. Values in milliseconds (ms); lower is better. Based on 673 successful website fetches from the top 1000 domains.]

Regex leads at P50 with 1.35ms. The net/html parser is close at 1.60ms, while goquery trails at 1.80ms. However, at P99, net/html shows the most predictable worst-case performance (12.8ms vs 14.1ms for regex).

Title-Only Extraction

When extracting just the title tag:

Method	P50	P90	P99
Regex	8μs	14μs	29μs
net/html	1.59ms	5.30ms	12.7ms
GoQuery	1.63ms	5.52ms	13.4ms

[Chart: Title-Only Extraction Performance. Time to parse HTML and extract just the title tag (673 websites), for GoQuery, Regex, and net/html. Values in milliseconds (ms); lower is better. Includes parse time for DOM methods.]

Regex is ~200x faster for single-element extraction. This is the key insight: regex doesn’t parse the document. It scans the HTML string until it finds the pattern and stops. DOM parsers must tokenize and build the full document structure before any extraction can happen.

Note: DOM parser times include full document parsing. Once the document is parsed, DOM traversal to find the title takes ~0μs, but you still pay the parsing cost upfront.

For extracting a single element, regex wins decisively. But if you need multiple elements, the DOM parsing cost gets amortized across all extractions: you parse once, then traverse for essentially free.
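One way to see why a first-match search can finish so quickly is to look at where the match sits in the input. The sketch below builds a synthetic page of about 2 MB and prints the byte range of the title match. buildPage and titleSpan are hypothetical helpers, and the markup is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same title pattern as in the regex extractor.
var titlePattern = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// buildPage returns a synthetic document: a small <head> followed by
// the given number of identical body rows (44 bytes each).
func buildPage(rows int) string {
	head := `<!doctype html><html><head><meta charset="utf-8"><title>Example</title></head><body>`
	body := strings.Repeat(`<div class="row"><a href="/x">link</a></div>`, rows)
	return head + body + `</body></html>`
}

// titleSpan returns the byte offsets of the first <title>...</title> match,
// or (-1, -1) if the pattern does not match.
func titleSpan(html string) (start, end int) {
	loc := titlePattern.FindStringIndex(html)
	if loc == nil {
		return -1, -1
	}
	return loc[0], loc[1]
}

func main() {
	page := buildPage(50000) // roughly 2.2 MB of markup
	start, end := titleSpan(page)
	fmt.Printf("page size: %d bytes\n", len(page))
	fmt.Printf("title match: bytes %d-%d\n", start, end)
}
```

On this input the title match ends within the first hundred bytes, so a search that returns at the first match has no reason to look at the remaining body markup. A DOM parser has to consume all of it before the title can be queried.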

Script Cleaning

I also tested removing executable scripts while preserving JSON-LD and template scripts:

Method	P50	P90	P99
Regex	3.89ms	22.3ms	82.4ms
net/html	1.92ms	6.24ms	16.1ms

[Chart: Script Cleaning Performance. Time to remove executable scripts while preserving JSON-LD and template scripts (673 websites), for Regex and net/html. Values in milliseconds (ms); lower is better.]

Here the results flip. The net/html parser is 2x faster at P50 and 5.1x faster at P99 (16.1ms vs 82.4ms for regex). The regex spike shows up on pages with complex nested scripts or unusual patterns. Note that Go’s regexp package uses an RE2-style engine that guarantees linear-time matching and does not backtrack, so this is not catastrophic backtracking in the PCRE sense; a more plausible explanation is the cost of running patterns over multi-megabyte input and rebuilding the string on every replacement. Either way, the net/html parser provides more predictable performance.
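For reference, here is a sketch of what regex-based script cleaning can look like in Go. It is not the code used in the benchmark. Go's regexp syntax has no lookahead, so a pattern cannot directly express "a script tag whose type is not JSON-LD"; this version matches every script element and decides in a callback. The function name CleanScripts and the set of preserved types in keepTypes are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// One whole <script ...>...</script> element, case-insensitive,
	// with "." allowed to span newlines.
	scriptPattern = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`)
	// A type attribute inside the opening tag (quoted or unquoted value).
	typePattern = regexp.MustCompile(`(?i)\stype\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))`)
)

// keepTypes lists script types treated as non-executable data.
// The exact set is an assumption made for this sketch.
var keepTypes = map[string]bool{
	"application/ld+json": true,
	"application/json":    true,
	"text/template":       true,
}

// CleanScripts removes script elements unless their type is in keepTypes.
func CleanScripts(html string) string {
	return scriptPattern.ReplaceAllStringFunc(html, func(el string) string {
		// The pattern guarantees that the first '>' ends the opening tag.
		open := el[:strings.IndexByte(el, '>')+1]
		m := typePattern.FindStringSubmatch(open)
		for i := 1; i < len(m); i++ {
			if keepTypes[strings.ToLower(strings.TrimSpace(m[i]))] {
				return el // data script: keep as-is
			}
		}
		return "" // executable or untyped script: drop
	})
}

func main() {
	page := `<p>a</p><script>alert(1)</script>` +
		`<script type="application/ld+json">{"@type":"Thing"}</script>` +
		`<script src="x.js"></script>`
	fmt.Println(CleanScripts(page))
}
```

Every script element is matched and inspected separately here, which fits the observation that script-heavy pages are where the regex approach slows down.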

Conclusions

Single Element Extraction: Regex Works

For extracting just one tag (a title, a specific meta tag), regex is a reasonable choice. It’s fast, has no dependencies, and the pattern is easy to understand. The 200x speed advantage over DOM parsers for title extraction is real (8 microseconds vs 1.6 milliseconds). Regex doesn’t parse the document; it scans until it finds the pattern and stops. If you’re building a high-throughput crawler that only needs the page title, regex delivers.

Multiple Extractions: net/html Wins

The picture changes when you need several pieces of data from the same page. With net/html (or goquery), you pay the parsing cost once. After that, accessing any element is essentially free: my benchmarks showed ~0 microseconds for title extraction after the DOM was built.

If you need title, description, canonical URL, Open Graph tags, and heading structure from each page, the DOM approach becomes more efficient. The upfront parsing cost amortizes across all extractions, and you avoid running multiple regex patterns over megabytes of HTML.

Developer Experience Matters

goquery wraps net/html with a jQuery-like API. It’s slightly slower than raw net/html, but the code is cleaner:

// GoQuery - readable and maintainable
title := doc.Find("title").Text()
canonical := doc.Find(`link[rel="canonical"]`).AttrOr("href", "")

// vs regex - faster but harder to extend
matches := titleRegex.FindStringSubmatch(html)

When throughput isn’t your primary constraint, the maintainability advantage of goquery may outweigh the performance difference. New team members understand doc.Find("title") immediately.

Beyond Speed: Robustness and Maintainability

In practical applications, raw speed isn’t everything. Consider:

Robustness – DOM parsers handle malformed HTML gracefully. Regex patterns can break on unexpected markup, missing quotes, or nested tags.
Maintainability – Adding a new field to extract is one line with goquery, but might require crafting and testing a new regex pattern.
Debugging – When extraction fails, DOM-based code is easier to debug. You can inspect the parsed tree. With regex, you’re staring at pattern matching failures.
Extraction Reliability – As noted earlier, even net/html has limits. Extremely large documents or <head> sections filled with massive inline CSS and JS can cause the parser to fail or return inconsistent results.
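As one concrete instance of the robustness point, here is a small sketch that reuses the meta and content patterns from the regex extractor on a tag whose attribute value contains a > character. metaContent and both sample tags are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same patterns as in the regex extractor.
var (
	metaTagPattern     = regexp.MustCompile(`(?i)<meta\s+[^>]*>`)
	contentAttrPattern = regexp.MustCompile(`(?i)\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))`)
)

// metaContent returns the content attribute of the first <meta> tag,
// exactly as the two-step regex approach sees it (hypothetical helper).
func metaContent(html string) string {
	tag := metaTagPattern.FindString(html)
	m := contentAttrPattern.FindStringSubmatch(tag)
	for i := 1; i < len(m); i++ {
		if m[i] != "" {
			return m[i]
		}
	}
	return ""
}

func main() {
	plain := `<meta name="description" content="Prices from 10 EUR">`
	tricky := `<meta name="description" content="Prices > 10 EUR">`

	fmt.Printf("%q\n", metaContent(plain))
	fmt.Printf("%q\n", metaContent(tricky))
}
```

The first tag yields Prices from 10 EUR. For the second, the tag match stops at the > inside the attribute value, so the quoted alternative never finds its closing quote and the unquoted fallback returns "Prices, truncated and with a stray leading quote. An HTML parser treats the > as part of the attribute value.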

Our Decision

Initially, I used regex for EdgeComet’s metadata extraction. After internal discussion and running these benchmarks, I decided to switch to net/html.

Why? We extract a variety of data from each page: not just the title, but description, canonical URLs, robots directives, Open Graph tags, and more. Accurate extraction of these elements is fundamental for many edge SEO optimization workflows. EdgeComet works with a huge variety of HTML pages from sources we don’t control and can’t predict. I’ve seen everything from perfectly valid HTML5 to decade-old markup with unclosed tags and invalid nesting.

Robustness and reliability won over raw speed. When the rendering process itself takes 3, 5, or 10 seconds in Headless Chrome, adding 20-50 milliseconds for extraction is negligible. The DOM parser handles edge cases I haven’t even encountered yet, and the code is easier to extend when we need to extract new fields.

The Takeaway

Choose based on your actual requirements:

One element, maximum speed: Regex
Multiple elements from same page: net/html or goquery
Maintainability over raw performance: goquery
Unknown/messy HTML sources: DOM parsers

The roughly 19% gap at P50 between regex (1.35ms) and net/html (1.60ms) for full metadata extraction might matter at scale, or it might be negligible compared to your network latency. Profile your specific use case rather than following general advice.

If you are dealing with JavaScript rendering challenges and would rather focus on your project tasks, EdgeComet might be the right fit. We’ve built our service to handle these exact scenarios at scale, managing the JS rendering so you don’t have to.

— Leo Richter, Dev lead
