Documentation
Index ¶
- type Geziyor
- func (g *Geziyor) Do(req *client.Request, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) Get(url string, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) GetRendered(url string, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) Head(url string, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) Start()
- type Options
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Geziyor ¶
type Geziyor struct {
Opt *Options
Client *client.Client
Exports chan interface{}
// contains filtered or unexported fields
}
Geziyor is the main scraper type.
func NewGeziyor ¶
NewGeziyor creates a new Geziyor with default values. If options are provided, options …
func (*Geziyor) GetRendered ¶
GetRendered issues a GET request using a headless browser. It opens a new Chrome instance, makes the request, waits for the HTML DOM to render, and then closes the instance. Rendered requests are supported only for GET requests.
type Options ¶
type Options struct {
// AllowedDomains lists the domains to which requests are allowed.
// If empty, any domain is allowed.
AllowedDomains []string
// Chrome headless browser WS endpoint.
// If you want to run your own Chrome browser runner, provide its endpoint here.
// For example: ws://localhost:3000
BrowserEndpoint string
// Cache storage backends.
// - Memory
// - Disk
// - LevelDB
Cache cache.Cache
// Policies for caching.
// - Dummy policy (default)
// - RFC2616 policy
CachePolicy cache.Policy
// If true, disables response charset detection for decoding to UTF-8
CharsetDetectDisabled bool
// Concurrent requests limit
ConcurrentRequests int
// Limit on concurrent requests per domain, keyed by request.URL.Host.
// Subdomains are counted separately from the top domain.
ConcurrentRequestsPerDomain int
// If true, cookies are not sent.
CookiesDisabled bool
// For extracting data
Exporters []export.Exporter
// Disable logging by setting this true
LogDisabled bool
// Max body reading size in bytes. Default: 1GB
MaxBodySize int64
// Maximum number of redirects to follow. Default: 10
MaxRedirect int
// Scraper metrics exporting type. See metrics.Type
MetricsType metrics.Type
// ParseFunc is the callback for StartURLs responses.
ParseFunc func(g *Geziyor, r *client.Response)
// If true, HTML parsing is disabled to improve performance.
ParseHTMLDisabled bool
// Request delays
RequestDelay time.Duration
// RequestDelayRandomize uses a random interval between 0.5 * RequestDelay and 1.5 * RequestDelay
RequestDelayRandomize bool
// Called before each request is made, to manipulate the request
RequestMiddlewares []middleware.RequestProcessor
// Called after a response is received
ResponseMiddlewares []middleware.ResponseProcessor
// Which HTTP response codes to retry.
// Other errors (DNS lookup issues, lost connections, etc.) are always retried.
// Default: []int{500, 502, 503, 504, 522, 524, 408}
RetryHTTPCodes []int
// Maximum number of times to retry, in addition to the first download.
// Set -1 to disable retrying
// Default: 2
RetryTimes int
// If true, disable robots.txt checks
RobotsTxtDisabled bool
// StartRequestsFunc called on scraper start
StartRequestsFunc func(g *Geziyor)
// The first requests are made to the URLs in this slice, concurrently.
StartURLs []string
// Timeout is global request timeout
Timeout time.Duration
// Revisiting the same URLs is disabled by default; set true to enable it.
URLRevisitEnabled bool
// User Agent.
// Default: "Geziyor 1.0"
UserAgent string
}
Options is the custom options type for Geziyor.
Directories
| Path | Synopsis |
|---|---|
| cache | Package cache provides a http.RoundTripper implementation that works as a mostly RFC-compliant cache for http responses. |
| diskcache | Package diskcache provides an implementation of cache.Cache that uses the diskv package to supplement an in-memory map with persistent storage |
| leveldbcache | Package leveldbcache provides an implementation of cache.Cache that uses github.com/syndtr/goleveldb/leveldb |