You can get command-line help with `pyspider --help` and `pyspider all --help`. Global options apply to all subcommands.
```
Usage: pyspider [OPTIONS] COMMAND [ARGS]...
A powerful spider system in python.
Options:
-c, --config FILENAME a json file with default values for subcommands.
{"webui": {"port":5001}}
--logging-config TEXT logging config file for built-in python logging
module [default: pyspider/pyspider/logging.conf]
--debug debug mode
--queue-maxsize INTEGER maxsize of queue
--taskdb TEXT database url for taskdb, default: sqlite
--projectdb TEXT database url for projectdb, default: sqlite
--resultdb TEXT database url for resultdb, default: sqlite
--message-queue TEXT connection url to message queue, default: builtin
multiprocessing.Queue
--amqp-url TEXT [deprecated] amqp url for rabbitmq. please use
--message-queue instead.
--beanstalk TEXT [deprecated] beanstalk config for beanstalk queue.
please use --message-queue instead.
--phantomjs-proxy TEXT phantomjs proxy ip:port
--data-path TEXT data dir path
--version Show the version and exit.
--help Show this message and exit.
```
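The config file is a JSON file holding config values for global options or for subcommands (a sub-dict named after the subcommand). Example: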
```
{
  "taskdb": "mysql+taskdb://username:password@host:port/taskdb",
  "projectdb": "mysql+projectdb://username:password@host:port/projectdb",
  "resultdb": "mysql+resultdb://username:password@host:port/resultdb",
  "message_queue": "amqp://username:password@host:port/%2F",
  "webui": {
    "username": "some_name",
    "password": "some_passwd",
    "need-auth": true
  }
}
```
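As a minimal sketch, a config file shaped like the example above could be generated from Python and then passed to `pyspider -c`. The helper name `build_config`, the host, and the credentials are placeholders invented for illustration. Only the keys come from the example above.

```python
import json


def build_config(db_host, db_user, db_password, webui_user, webui_password):
    """Return a dict shaped like the example config (hypothetical helper)."""
    template = "mysql+{kind}://{user}:{password}@{host}/{kind}"
    config = {
        kind: template.format(kind=kind, user=db_user,
                              password=db_password, host=db_host)
        for kind in ("taskdb", "projectdb", "resultdb")
    }
    # Options for a subcommand live in a sub-dict named after that subcommand.
    config["webui"] = {
        "username": webui_user,
        "password": webui_password,
        "need-auth": True,
    }
    return config


# Placeholder values; print the JSON so it can be redirected into a file.
print(json.dumps(
    build_config("127.0.0.1:3306", "spider", "secret", "some_name", "some_passwd"),
    indent=2,
))
```

Saving the printed output as, say, `config.json` would let you start pyspider with `pyspider -c config.json all`. Passwords that contain URL-reserved characters would need escaping, which this sketch does not handle.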
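`--queue-maxsize` sets the queue size limit; 0 means no limit.

`--taskdb`, `--projectdb`, and `--resultdb` accept database URLs in the following formats: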
```
mysql:
mysql+type://user:passwd@host:port/database
sqlite:
# relative path
sqlite+type:///path/to/database.db
# absolute path
sqlite+type:////path/to/database.db
# memory database
sqlite+type://
mongodb:
mongodb+type://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]
more: http://docs.mongodb.org/manual/reference/connection-string/
sqlalchemy:
sqlalchemy+postgresql+type://user:passwd@host:port/database
sqlalchemy+mysql+mysqlconnector+type://user:passwd@host:port/database
more: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
local:
local+projectdb://filepath,filepath
type:
should be one of `taskdb`, `projectdb`, `resultdb`.
```
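To make the `engine+type://...` convention above concrete, here is a small illustrative splitter. It is not pyspider's own parsing code. The function name `split_db_url` is an assumption made for this sketch, and it only checks the scheme portion of the URL.

```python
VALID_TYPES = {"taskdb", "projectdb", "resultdb"}


def split_db_url(url):
    """Split 'engine[+dialect...]+type://rest' into (engine, type, rest).

    Illustrative only: the last '+'-separated part of the scheme is taken
    as the type, and everything before it as the engine.
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("missing '://' in %r" % url)
    parts = scheme.split("+")
    if len(parts) < 2:
        raise ValueError("expected 'engine+type' scheme in %r" % url)
    db_type = parts[-1]
    if db_type not in VALID_TYPES:
        raise ValueError("type must be one of %s" % sorted(VALID_TYPES))
    return "+".join(parts[:-1]), db_type, rest


print(split_db_url("mysql+taskdb://user:passwd@host:3306/database"))
print(split_db_url("sqlite+resultdb:///path/to/database.db"))
print(split_db_url("sqlalchemy+postgresql+projectdb://user:passwd@host:5432/database"))
```

Note that the relative and absolute SQLite forms differ only in the number of slashes after `://`, so the remainder is returned untouched.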
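`--phantomjs-proxy` is the address of the phantomjs proxy. You need phantomjs installed and a phantomjs proxy running, which you can start with the command `pyspider phantomjs`.

`--data-path` is the path where SQLite database files and counter dump files are saved.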
```
Usage: pyspider all [OPTIONS]
Run all the components in subprocess or thread
Options:
--fetcher-num INTEGER instance num of fetcher
--processor-num INTEGER instance num of processor
--result-worker-num INTEGER instance num of result worker
--run-in [subprocess|thread] run each components in thread or subprocess.
always using thread for windows.
--help Show this message and exit.
```
```
Usage: pyspider one [OPTIONS] [SCRIPTS]...
One mode not only means all-in-one, it runs every thing in one process
over tornado.ioloop, for debug purpose
Options:
-i, --interactive enable interactive mode, you can choose crawl url.
--phantomjs enable phantomjs, will spawn a subprocess for phantomjs
--help Show this message and exit.
```
Note: WebUI does not run in `one` mode.

In `one` mode, results are written to stdout by default. You can capture them with `pyspider one > result.txt`.
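SCRIPTS is the path to the script files of projects. The project status is set to running, and rate and burst can be set through comments in the script: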
```
# rate: 1.0
# burst: 3
```
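The comment format above can be illustrated with a small reader that pulls `rate` and `burst` values out of a script's text. This is a hedged sketch: the regular expression and the function name `read_rate_burst` are assumptions made for illustration, and pyspider's own parsing may differ.

```python
import re

# Hypothetical pattern: a whole-line comment such as "# rate: 1.0" or "# burst: 3".
_SETTING = re.compile(
    r"^#[ \t]*(rate|burst)[ \t]*:[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*$",
    re.MULTILINE,
)


def read_rate_burst(script_text):
    """Collect rate/burst values from script comments (illustrative only)."""
    return {name: float(value) for name, value in _SETTING.findall(script_text)}


sample = "# rate: 1.0\n# burst: 3\nprint('project script body')\n"
print(read_rate_burst(sample))
```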
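When SCRIPTS is set, taskdb and resultdb use an in-memory SQLite database by default (this can be overridden with the global options `--taskdb` and `--resultdb`). The `on_start` callback is triggered at startup.

In interactive mode (`-i`), pyspider starts an interactive console that asks what to do in the next loop of the process. In the console, you can use: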
```
crawl(url, project=None, **kwargs)
Crawl given url, same parameters as BaseHandler.crawl
url - url or taskid, parameters will be used if in taskdb
project - can be omitted if only one project exists.
quit_interactive()
Quit interactive mode
quit_pyspider()
Close pyspider
```
You can use `pyspider.libs.utils.python_console()` to open an interactive console from within your script.
```
Usage: pyspider bench [OPTIONS]
Run Benchmark test. In bench mode, in-memory sqlite database is used
instead of on-disk sqlite database.
Options:
--fetcher-num INTEGER instance num of fetcher
--processor-num INTEGER instance num of processor
--result-worker-num INTEGER instance num of result worker
--run-in [subprocess|thread] run each components in thread or subprocess.
always using thread for windows.
--total INTEGER total url in test page
--show INTEGER show how many urls in a page
--help Show this message and exit.
```
```
Usage: pyspider scheduler [OPTIONS]
Run Scheduler, only one scheduler is allowed.
Options:
--xmlrpc / --no-xmlrpc
--xmlrpc-host TEXT
--xmlrpc-port INTEGER
--inqueue-limit INTEGER size limit of task queue for each project, tasks
will been ignored when overflow
--delete-time INTEGER delete time before marked as delete
--active-tasks INTEGER active log size
--loop-limit INTEGER maximum number of tasks due with in a loop
--scheduler-cls TEXT scheduler class to be used.
--help Show this message and exit.
```
Set `--scheduler-cls` to use a custom Scheduler class.
```
Usage: run.py phantomjs [OPTIONS] [ARGS]...
Run phantomjs fetcher if phantomjs is installed.
Options:
--phantomjs-path TEXT phantomjs path
--port INTEGER phantomjs port
--auto-restart TEXT auto restart phantomjs if crashed
--help Show this message and exit.
```
Additional ARGS are appended to the phantomjs command line.
```
Usage: pyspider fetcher [OPTIONS]
Run Fetcher.
Options:
--xmlrpc / --no-xmlrpc
--xmlrpc-host TEXT
--xmlrpc-port INTEGER
--poolsize INTEGER max simultaneous fetches
--proxy TEXT proxy host:port
--user-agent TEXT user agent
--timeout TEXT default fetch timeout
--fetcher-cls TEXT Fetcher class to be used.
--help Show this message and exit.
```
`--proxy` sets the default proxy used by the fetcher; it can be overridden by `self.crawl` options.
```
Usage: pyspider processor [OPTIONS]
Run Processor.
Options:
--processor-cls TEXT Processor class to be used.
--help Show this message and exit.
```
```
Usage: pyspider result_worker [OPTIONS]
Run result worker.
Options:
--result-cls TEXT ResultWorker class to be used.
--help Show this message and exit.
```
```
Usage: pyspider webui [OPTIONS]
Run WebUI
Options:
--host TEXT webui bind to host
--port INTEGER webui bind to host
--cdn TEXT js/css cdn server
--scheduler-rpc TEXT xmlrpc path of scheduler
--fetcher-rpc TEXT xmlrpc path of fetcher
--max-rate FLOAT max rate for each project
--max-burst FLOAT max burst for each project
--username TEXT username of lock -ed projects
--password TEXT password of lock -ed projects
--need-auth need username and password
--webui-instance TEXT webui Flask Application instance to be used.
--help Show this message and exit.
```
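`--cdn` is the CDN service for JS/CSS libraries. The URL must be compatible with cdnjs.

`--fetcher-rpc` is the XML-RPC path URI of the fetcher XML-RPC server. If it is not set, a Fetcher instance is used.

`--need-auth`: if true, all pages require the username and password specified via `--username` and `--password`.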