crawler

package module
v0.7.2
Published: Jan 27, 2026 License: MIT Imports: 22 Imported by: 0

README

Crawler

This module crawls the given web page and returns a Page object on the channel configured for each response status code.

You can configure the channels using the Channels map, like this:

	chans := crawler.Channels{
		404: make(chan crawler.Page),
		200: make(chan crawler.Page),
	}

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Channels

type Channels map[int]chan Page

Channels is a map of Page channels indexed by response status code, so different behavior can be defined for different response codes.

type Config added in v0.1.6

type Config struct {
	StartURL               string
	AllowedDomains         []string // Domains to stay within
	UserAgents             []string
	CrawlDelay             time.Duration // Delay between requests to the same domain
	MaxDepth               int           // Maximum crawl depth
	MaxRetries             int           // Max retries for a failed request
	RequestTimeout         time.Duration
	QueueIdleTimeout       time.Duration
	ProxyURL               string // e.g., "http://user:pass@host:port"
	RobotsUserAgent        string // User agent to use for robots.txt checks
	ConcurrentRequests     int    // Number of concurrent fetch workers
	Channels               Channels
	Headers                map[string]string
	LanguageCode           string
	Filters                []func(*Page, *Config) bool
	MaxIdleConnsPerHost    int
	MaxIdleConns           int
	Proxies                []string
	RequireHeadless        bool
	ProcessSitemaps        bool
	DisableUserAgentHeader bool
}

Config holds crawler configuration
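As an illustration only (the field values below are arbitrary examples, not recommendations from this module), a Config might be populated like this:

```go
cfg := crawler.Config{
	StartURL:           "https://example.com/",
	AllowedDomains:     []string{"example.com"}, // stay within this domain
	UserAgents:         []string{"my-crawler/0.1"},
	CrawlDelay:         2 * time.Second,  // delay between requests to the same domain
	MaxDepth:           3,
	MaxRetries:         2,
	RequestTimeout:     10 * time.Second,
	ConcurrentRequests: 4, // number of concurrent fetch workers
	Channels: crawler.Channels{
		200: make(chan crawler.Page),
		404: make(chan crawler.Page),
	},
}
```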

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler represents the web crawler

func NewCrawler

func NewCrawler(config Config, queue queue.QueueInterface) (*Crawler, error)

NewCrawler initializes a new Crawler

func (*Crawler) Start added in v0.6.0

func (c *Crawler) Start()

Start begins the crawling process

type Headless added in v0.7.0

type Headless struct {
}

func NewHeadless added in v0.7.0

func NewHeadless() *Headless

type Page

type Page struct {
	URL  *url.URL      // Page url
	Resp *PageResponse // Page response as returned from the GET request
	Body string        // Response body string
}

Page carries the scanned URL, the response, and the response body as a string.

type PageResponse added in v0.7.0

type PageResponse struct {
	StatusCode int
	Headers    map[string]any
	Body       io.Reader
}
