Scxrapper — The highly automated Python scraping engine


About

Scxrapper (skrap-per) is a scraping engine written in Python 3.2 that runs on virtually any platform. As a purely GUI-less ("headless") application toolkit, it adapts easily to a wide variety of situations.

Scxrapper can perform a wide range of tasks, from pushing updates to a website form, to indexing a catalog of information for analysis, to gathering files from a tediously alphabetized list of pages.

Quote

Quotes are free of charge and carry no commitment. Please submit your contact information via the form below, or call +1.707.633.8446.

When to Scrape

Scraping engines are good at finding patterns of repeating information.

Ideally, a page lays out its information neatly, usually in repeating rows of data. The scraping task is also much easier when pages follow a general template, where tidbits of information appear in predictable locations.

On the other hand, automated engines like Scxrapper don't work very well when you need to collect static, one-off data points from multiple sites. Because each site is structured differently, it's cumbersome to build a program to find information you could just as easily locate by hand.

Workflows

Sites built with common frameworks tend to use familiar traversal patterns. Scxrapper contains a number of helper functions that make it easier to grab predictable values, such as VIEWSTATEs, form actions, next-page URLs, and
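
To illustrate what grabbing such predictable values can look like, here is a minimal sketch that uses only the Python standard library, not Scxrapper's own functions. The names FormFieldParser and extract_traversal_values, the sample markup, and the assumption that the next-page link carries rel="next" are all hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects the first form action, hidden inputs, and a rel="next" link."""

    def __init__(self):
        super().__init__()
        self.form_action = None
        self.hidden_fields = {}
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.form_action is None:
            self.form_action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            name = attrs.get("name")
            if name:
                self.hidden_fields[name] = attrs.get("value", "")
        elif tag == "a" and attrs.get("rel") == "next":
            # Assumption: the site marks its pagination link with rel="next".
            self.next_url = attrs.get("href")


def extract_traversal_values(html):
    """Return the values a scraper typically needs to move to the next page."""
    parser = FormFieldParser()
    parser.feed(html)
    return {
        "action": parser.form_action,
        "viewstate": parser.hidden_fields.get("__VIEWSTATE"),
        "next_url": parser.next_url,
    }


# Hypothetical page fragment in the style of an ASP.NET search form.
SAMPLE_PAGE = """
<form action="/search.aspx" method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ow==" />
  <input type="text" name="q" />
</form>
<a rel="next" href="/search.aspx?page=2">Next</a>
"""

values = extract_traversal_values(SAMPLE_PAGE)
```

Real sites vary in how they label pagination links, so the rel="next" check would likely need adjusting per project.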

Discretion

Because people work hard to run good websites, Scxrapper includes several features that keep a low profile as data is indexed, and that provide caches and checkpoints so the website is not asked for the same data repeatedly.

Since sparing a few extra minutes or hours usually isn't a big deal, Scxrapper can easily rate-limit itself to make regular or random pauses between requests for uncached page content.
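
The idea of pausing only before uncached requests can be sketched roughly as follows. This is not Scxrapper's implementation; PoliteFetcher, its parameters, and the injected fetch and sleep callables are hypothetical names used so the example can run without touching the network.

```python
import random
import time


class PoliteFetcher:
    """Caches page bodies and pauses a random interval between real requests."""

    def __init__(self, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
        self._fetch = fetch          # callable(url) -> page body
        self._min_delay = min_delay  # seconds
        self._max_delay = max_delay  # seconds
        self._sleep = sleep          # injectable so demos and tests need not wait
        self._cache = {}
        self._has_fetched = False

    def get(self, url):
        # Cached content costs the remote site nothing, so no pause is needed.
        if url in self._cache:
            return self._cache[url]
        # Pause before every uncached request except the very first one.
        if self._has_fetched:
            self._sleep(random.uniform(self._min_delay, self._max_delay))
        body = self._fetch(url)
        self._has_fetched = True
        self._cache[url] = body
        return body


# Demo with a fake fetch function and a recorded (not real) sleep.
pauses = []
fetcher = PoliteFetcher(
    fetch=lambda url: "<html>%s</html>" % url,
    sleep=pauses.append,
)
fetcher.get("http://example.com/a")  # first request: no pause
fetcher.get("http://example.com/a")  # served from cache: no pause
fetcher.get("http://example.com/b")  # uncached: pauses, then fetches
```

For a regular rather than random pause, min_delay and max_delay can be set to the same value.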

Concurrency

Some projects are just too big for a single instance to complete by itself. Every Scxrapper project has a default multi-process mode, which uses a central datastore shared by concurrent copies of the project, along with custom application signals that let those copies communicate.
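
One way a central datastore can coordinate concurrent copies is a shared job table from which each copy claims work inside a transaction. The sketch below assumes SQLite as the datastore and uses hypothetical names (open_queue, add_jobs, claim_job, the jobs table); it does not show Scxrapper's actual multi-process mode or its signalling.

```python
import sqlite3


def open_queue(path):
    """Open (or create) the shared job table."""
    # isolation_level=None means autocommit; transactions are issued explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    return conn


def add_jobs(conn, urls):
    """Queue URLs, ignoring any that are already present."""
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (url) VALUES (?)", [(u,) for u in urls]
    )


def claim_job(conn, worker_id):
    """Claim one pending URL for this worker, or return None if none remain."""
    # BEGIN IMMEDIATE takes the write lock up front, so two copies cannot
    # select and claim the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT url FROM jobs WHERE status = 'pending' ORDER BY url LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'claimed', worker = ? WHERE url = ?",
                (worker_id, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row else None


# Demo on an in-memory database; real copies would share a file or a server.
conn = open_queue(":memory:")
add_jobs(conn, ["http://example.com/a", "http://example.com/b"])
first = claim_job(conn, "worker-1")
second = claim_job(conn, "worker-2")
third = claim_job(conn, "worker-1")
```

An in-memory database is private to one connection, so separate processes would need to pass a shared file path (or use a server-based engine) instead of ":memory:".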

Offline Processing

Scxrapper can optionally store remote pages locally on your hard drive, making it easy to do multiple passes without sending needless traffic to the remote host. If collection specifications change, another pass at the data requires going no further than the hard drive. Offline examinations are also far faster than repeated remote requests.

Built-in programming components allow for easy conditional requests when offline versions are missing or out of date.
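
A conditional request of this kind might look like the sketch below: read the local copy if it exists and is recent enough, otherwise fetch and store it. The function name cached_page, its parameters, and the one-day default age are assumptions made for illustration rather than Scxrapper's built-in components.

```python
import hashlib
import os
import tempfile
import time


def cached_page(url, fetch, cache_dir, max_age_seconds=86400):
    """Return the page body, fetching only if the local copy is missing or stale."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(cache_dir, name)

    is_fresh = (
        os.path.exists(path)
        and (time.time() - os.path.getmtime(path)) <= max_age_seconds
    )
    if not is_fresh:
        body = fetch(url)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
        return body

    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Demo with a fake fetch function that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>catalog page</html>"


cache_dir = tempfile.mkdtemp()
first = cached_page("http://example.com/catalog", fake_fetch, cache_dir)
second = cached_page("http://example.com/catalog", fake_fetch, cache_dir)
```

In the demo the second call is served from disk, so the fake fetch function runs only once.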

Text Processing

Unresolved HTML entities, extraneous whitespace, and unimportant markup tags can make tiresome work out of searching for patterns, but Scxrapper provides a library of pluggable functions to deal with any combination of circumstances at any required processing point.

Since there will undoubtedly be situations that require custom filtering and pre- and post-processing, Scxrapper provides a simple interface for creating new filters and chaining them together as required.
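
Filter chaining of this sort can be sketched with plain functions that each take and return a string. The filter names and the chain helper below are hypothetical and do not represent Scxrapper's interface; the tag stripper is a deliberately naive regular expression, and html.unescape assumes a current Python 3.

```python
import html
import re


def strip_tags(text):
    """Replace markup tags with a space (naive; fine for simple fragments)."""
    return re.sub(r"<[^>]+>", " ", text)


def resolve_entities(text):
    """Turn HTML entities such as &amp; into the characters they stand for."""
    return html.unescape(text)


def collapse_whitespace(text):
    """Squeeze runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chain(*filters):
    """Build one filter that applies the given filters from left to right."""
    def run(text):
        for apply_filter in filters:
            text = apply_filter(text)
        return text
    return run


# Tags are stripped before entities are resolved, so an escaped "&lt;b&gt;"
# in the page text is not mistaken for real markup.
clean = chain(strip_tags, resolve_entities, collapse_whitespace)
result = clean("<td>  Caf&eacute; &amp;   Bar\n</td>")
```

Because every filter shares the same string-in, string-out shape, a custom filter is just another function passed to chain.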

Database Storage

Scxrapper supports any native Python database driver, which grants easy access to popular engines such as MySQL, MSSQL, and SQLite. Built-in Python tools also make it easy to write data to CSV spreadsheet files.
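
As a rough illustration using only the standard library's sqlite3 and csv modules, the sketch below stores a few scraped rows in SQLite and then writes them out as CSV. The items table, its columns, and the sample rows are made up for the example; how Scxrapper itself hands data to a driver is not shown here.

```python
import csv
import io
import sqlite3

# Hypothetical scraped rows: (name, price as displayed on the page).
rows = [("Widget", "4.99"), ("Gadget", "12.50")]

# Store the rows in SQLite; an in-memory database keeps the demo self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price TEXT)")
conn.executemany("INSERT INTO items (name, price) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT name, price FROM items ORDER BY name").fetchall()

# Write the same rows as CSV; a StringIO stands in for a file opened with
# open("items.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "price"])
writer.writerows(stored)
csv_text = buffer.getvalue()
```

Other engines follow the same pattern through their own DB-API drivers, although the parameter placeholder style ("?" here) differs between drivers.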

Alternatively, Scxrapper can communicate over other HTTP-based protocols (like SOAP or REST or other RPC interfaces) or lower-level socket-based connections.

API

Because Scxrapper is at its core no more than a programming library, its functionality is not tied to any specific UI layout, or limited to only the standard processing functions.

Scxrapper is built to interpret statically declared source files, such as a Python script or a YAML or XML document, but fundamentally also supports programmatic construction of your application logic, enabling one to take whatever complex actions are required to get the job done.