Documentation

Overview

Package wikiparse is a library for understanding the Wikipedia XML dump format.

The dumps are available from the Wikimedia group here: http://dumps.wikimedia.org/

In particular, I've worked mostly with the enwiki dumps from here: http://dumps.wikimedia.org/enwiki/

See the example programs in the subpackages for an idea of how I've made use of these things.
Index
- Variables
- func FindFiles(text string) []string
- func FindLinks(text string) []string
- func URLForFile(name string) string
- type Contributor
- type Coord
- type IndexEntry
- type IndexReader
- type IndexSummaryReader
- type IndexedParseSource
- type Page
- type Parser
- type ReadSeekCloser
- type Redirect
- type Revision
- type SiteInfo
Constants
This section is empty.
Variables
var ErrNoCoordFound = errors.New("no coord data found")
ErrNoCoordFound is returned from ParseCoords when there's no coordinate data found.
Functions

func FindFiles

func FindFiles(text string) []string

FindFiles finds all the File references from within an article body. This includes references found in comments, as many of those I encountered were commented out.

func FindLinks

func FindLinks(text string) []string

FindLinks finds all the links from within an article body.

func URLForFile

func URLForFile(name string) string

URLForFile gets the wikimedia URL for the given named file.
Types

type Contributor

A Contributor is a user who contributed a revision.

type Coord
type Coord struct {
Lon, Lat float64
}
Coord is a longitude/latitude pair from a coordinate match.
func ParseCoords
ParseCoords parses geographical coordinates as specified in http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Geographical_coordinates
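The {{coord}} template linked above comes in several forms. Below is a minimal standalone sketch that handles only the decimal variant, e.g. {{coord|44.112|-87.913|display=title}}; a full parser must also handle degree/minute/second forms. The function name `parseDecimalCoords` and the regular expression are mine, and `errNoCoord` mirrors the package's ErrNoCoordFound:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// Coord is a longitude/latitude pair, as in the package.
type Coord struct {
	Lon, Lat float64
}

var errNoCoord = errors.New("no coord data found")

// Matches the decimal form {{coord|lat|lon|...}} only; DMS forms
// such as {{coord|57|18|22|N|...}} are not handled by this sketch.
var coordRe = regexp.MustCompile(`\{\{[Cc]oord\|(-?[0-9.]+)\|(-?[0-9.]+)[|}]`)

func parseDecimalCoords(text string) (Coord, error) {
	m := coordRe.FindStringSubmatch(text)
	if m == nil {
		return Coord{}, errNoCoord
	}
	lat, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return Coord{}, err
	}
	lon, err := strconv.ParseFloat(m[2], 64)
	if err != nil {
		return Coord{}, err
	}
	return Coord{Lon: lon, Lat: lat}, nil
}

func main() {
	c, err := parseDecimalCoords("text {{coord|44.112|-87.913|display=title}} more")
	fmt.Println(c, err)
}
```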
type IndexEntry
An IndexEntry is an individual article from the index.
func (IndexEntry) String
func (i IndexEntry) String() string
type IndexReader
type IndexReader struct {
// contains filtered or unexported fields
}
An IndexReader is a wikipedia multistream index reader.
func NewIndexReader
func NewIndexReader(r io.Reader) *IndexReader
NewIndexReader gets a wikipedia index reader.
func (*IndexReader) Next
func (ir *IndexReader) Next() (IndexEntry, error)
Next gets the next entry from the index stream.
This assumes the numbers were meant to be incremental.
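Each line of a multistream index file has the form offset:pageID:title, where offset is the byte position of a bzip2 stream within the data file. A standalone sketch of parsing one such line follows; the `indexEntry` type and its field names are mine, and the package's IndexEntry may differ:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// indexEntry mirrors what one multistream index line carries.
type indexEntry struct {
	Offset int64  // byte offset of the bzip2 stream in the data file
	PageID uint64 // numeric page ID
	Title  string // article title
}

func parseIndexLine(line string) (indexEntry, error) {
	// Titles may themselves contain colons, so split at most twice.
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return indexEntry{}, fmt.Errorf("malformed index line: %q", line)
	}
	off, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil {
		return indexEntry{}, err
	}
	id, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return indexEntry{}, err
	}
	return indexEntry{Offset: off, PageID: id, Title: parts[2]}, nil
}

func main() {
	e, err := parseIndexLine("616:10:AccessibleComputing")
	fmt.Println(e, err)
}
```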
type IndexSummaryReader
type IndexSummaryReader struct {
// contains filtered or unexported fields
}
IndexSummaryReader gets offsets and counts from an index.
If you don't want to know the individual articles, just how many and where, this is for you.
func NewIndexSummaryReader
func NewIndexSummaryReader(r io.Reader) (rv *IndexSummaryReader, err error)
NewIndexSummaryReader gets a new IndexSummaryReader from the given stream of index lines.
type IndexedParseSource
type IndexedParseSource interface {
OpenIndex() (io.ReadCloser, error)
OpenData() (ReadSeekCloser, error)
}
An IndexedParseSource provides access to a multistream xml dump and its index.
This is typically downloaded as two files, but a seekable interface such as HTTP with range requests can also serve.
type Page
type Page struct {
Title string `xml:"title"`
ID uint64 `xml:"id"`
Redir Redirect `xml:"redirect"`
Revisions []Revision `xml:"revision"`
Ns uint64 `xml:"ns"`
}
A Page in the wiki.
type Parser
type Parser interface {
// Get the next page from the parser
Next() (*Page, error)
// Get the toplevel site info from the stream
SiteInfo() SiteInfo
}
A Parser emits wiki pages.
func NewIndexedParser
NewIndexedParser gets an indexed/parallel wikipedia dump parser from the given index and data files.
func NewIndexedParserFromSrc
func NewIndexedParserFromSrc(src IndexedParseSource, numWorkers int) (Parser, error)
NewIndexedParserFromSrc creates a Parser that can parse multiple pages concurrently from a single source.
type ReadSeekCloser
type ReadSeekCloser interface {
io.ReadSeeker
io.Closer
}
ReadSeekCloser is io.ReadSeeker + io.Closer.
type Redirect
type Redirect struct {
Title string `xml:"title,attr"`
}
A Redirect to another Page.
type Revision
type Revision struct {
ID uint64 `xml:"id"`
Timestamp string `xml:"timestamp"`
Contributor Contributor `xml:"contributor"`
Comment string `xml:"comment"`
Text string `xml:"text"`
}
A Revision to a page.
type SiteInfo
type SiteInfo struct {
SiteName string `xml:"sitename"`
Base string `xml:"base"`
Generator string `xml:"generator"`
Case string `xml:"case"`
Namespaces []struct {
Key string `xml:"key,attr"`
Case string `xml:"case,attr"`
Value string `xml:",chardata"`
} `xml:"namespaces>namespace"`
}
SiteInfo is the toplevel site info describing basic dump properties.
Source Files

Directories

| Path | Synopsis |
|---|---|
| tools | |
| tools/cbload (command) | Load a wikipedia dump into CouchBase |
| tools/couchload (command) | Load a wikipedia dump into CouchDB |
| tools/esload (command) | Load a wikipedia dump into ElasticSearch |
| tools/mgoload (command) | |
| tools/traverse (command) | Sample program that finds all the geo data in wikipedia pages. |