
pyspider Command Line

Updated 2023-02-16 16:06

Command-line commands

Global configuration

You can get command-line help with `pyspider --help` and `pyspider all --help`. Global options apply to all subcommands.

```
Usage: pyspider [OPTIONS] COMMAND [ARGS]...

  A powerful spider system in python.

Options:
  -c, --config FILENAME    a json file with default values for subcommands.
                           {"webui": {"port":5001}}
  --logging-config TEXT    logging config file for built-in python logging
                           module  [default: pyspider/pyspider/logging.conf]
  --debug                  debug mode
  --queue-maxsize INTEGER  maxsize of queue
  --taskdb TEXT            database url for taskdb, default: sqlite
  --projectdb TEXT         database url for projectdb, default: sqlite
  --resultdb TEXT          database url for resultdb, default: sqlite
  --message-queue TEXT     connection url to message queue, default: builtin
                           multiprocessing.Queue
  --amqp-url TEXT          [deprecated] amqp url for rabbitmq. please use
                           --message-queue instead.
  --beanstalk TEXT         [deprecated] beanstalk config for beanstalk queue.
                           please use --message-queue instead.
  --phantomjs-proxy TEXT   phantomjs proxy ip:port
  --data-path TEXT         data dir path
  --version                Show the version and exit.
  --help                   Show this message and exit.
```

--config

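The config file is a JSON file holding config values for global options or subcommands (a subcommand's options go in a nested object named after that subcommand). Example: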

    {
      "taskdb": "mysql+taskdb://username:password@host:port/taskdb",
      "projectdb": "mysql+projectdb://username:password@host:port/projectdb",
      "resultdb": "mysql+resultdb://username:password@host:port/resultdb",
      "message_queue": "amqp://username:password@host:port/%2F",
      "webui": {
        "username": "some_name",
        "password": "some_passwd",
        "need-auth": true
      }
    }
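
As a minimal sketch, a config file like the one above can also be generated from Python with the standard `json` module. The helper name `write_config`, the file name `config.json`, and the SQLite path are assumptions made for illustration; the keys are the ones that appear in this section.

```python
import json
import os
import tempfile


def write_config(path, config):
    """Write a pyspider-style config dict to `path` as JSON (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
    return path


# Keys mirror the examples in this section; the values are placeholders.
example_config = {
    "taskdb": "sqlite+taskdb:///data/task.db",
    "webui": {"port": 5001, "need-auth": False},
}

config_path = write_config(
    os.path.join(tempfile.mkdtemp(), "config.json"), example_config
)

# The file would then be passed to pyspider with the -c / --config option.
print("pyspider -c", config_path, "all")
```

Writing the file from code keeps the nesting explicit: global options sit at the top level, and subcommand options sit under the subcommand's name.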

--queue-maxsize

The queue size limit; 0 means no limit.

--taskdb, --projectdb, --resultdb

```
mysql:
    mysql+type://user:passwd@host:port/database
sqlite:
    # relative path
    sqlite+type:///path/to/database.db
    # absolute path
    sqlite+type:////path/to/database.db
    # memory database
    sqlite+type://
mongodb:
    mongodb+type://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]
    more: http://docs.mongodb.org/manual/reference/connection-string/
sqlalchemy:
    sqlalchemy+postgresql+type://user:passwd@host:port/database
    sqlalchemy+mysql+mysqlconnector+type://user:passwd@host:port/database
    more: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
local:
    local+projectdb://filepath,filepath

type:
    should be one of `taskdb`, `projectdb`, `resultdb`.
```
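
To make the `engine+type://` pattern concrete, here is a small sketch that splits such a URL into its engine, its type, and the remainder. The function `split_db_url` is hypothetical and written only for illustration; it is not pyspider's own parser, and the hosts and ports in the sample URLs are placeholders.

```python
VALID_TYPES = ("taskdb", "projectdb", "resultdb")


def split_db_url(url):
    """Split 'engine[+driver...]+type://rest' into (engine, dbtype, rest).

    Illustrative only: the first '+'-separated part of the scheme is taken
    as the engine and the last part as the database type.
    """
    scheme, _, rest = url.partition("://")
    parts = scheme.split("+")
    if len(parts) < 2:
        raise ValueError("expected 'engine+type://...', got %r" % url)
    engine, dbtype = parts[0], parts[-1]
    if dbtype not in VALID_TYPES:
        raise ValueError("type must be one of %s, got %r" % (VALID_TYPES, dbtype))
    return engine, dbtype, rest


print(split_db_url("mysql+taskdb://user:passwd@host:3306/database"))
print(split_db_url("sqlite+resultdb:///path/to/database.db"))
print(split_db_url("sqlalchemy+postgresql+projectdb://user:passwd@host:5432/database"))
```

The same value of `type` should match the option it is passed to, for example a `taskdb` URL for `--taskdb`.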

--phantomjs-proxy

The phantomjs proxy address. You need phantomjs installed and a phantomjs proxy running, which you can start with the command `pyspider phantomjs`.

--data-path

The path where the SQLite database files and counter dump files are saved.

all

```
Usage: pyspider all [OPTIONS]

  Run all the components in subprocess or thread

Options:
  --fetcher-num INTEGER         instance num of fetcher
  --processor-num INTEGER       instance num of processor
  --result-worker-num INTEGER   instance num of result worker
  --run-in [subprocess|thread]  run each components in thread or subprocess.
                                always using thread for windows.
  --help                        Show this message and exit.
```

one

```
Usage: pyspider one [OPTIONS] [SCRIPTS]...

  One mode not only means all-in-one, it runs every thing in one process
  over tornado.ioloop, for debug purpose

Options:
  -i, --interactive  enable interactive mode, you can choose crawl url.
  --phantomjs        enable phantomjs, will spawn a subprocess for phantomjs
  --help             Show this message and exit.
```

NOTE: the WebUI does not run in one mode.

In one mode, results are written to stdout by default. You can capture them with `pyspider one > result.txt`.
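
The same capture can be scripted. The sketch below builds the `pyspider one` command line from the options listed above and redirects stdout to a file with the standard `subprocess` module. The helper names and the script name `my_project.py` are assumptions for illustration, and `run_one_to_file` only works where pyspider is installed.

```python
import subprocess


def build_one_command(scripts, interactive=False, phantomjs=False):
    """Build the argv list for `pyspider one` using only the documented flags."""
    argv = ["pyspider", "one"]
    if interactive:
        argv.append("--interactive")
    if phantomjs:
        argv.append("--phantomjs")
    argv.extend(scripts)
    return argv


def run_one_to_file(scripts, out_path):
    """Run one mode and write its stdout to out_path (requires pyspider)."""
    with open(out_path, "w", encoding="utf-8") as out:
        return subprocess.run(build_one_command(scripts), stdout=out).returncode


# Only the command is printed here; nothing is executed.
print(build_one_command(["my_project.py"]))
```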

[SCRIPTS]

The file paths of the project scripts. The project status is set to running, and rate and burst can be set through comments in the script:

    # rate: 1.0
    # burst: 3

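When SCRIPTS is set, the task database and result database use an in-memory SQLite database by default (this can be overridden with the global options `--taskdb` and `--resultdb`). The `on_start` callback is triggered at startup.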
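
To illustrate what the rate and burst comments look like inside a script, the following sketch extracts them with a regular expression. This is only an illustration of the comment format, not pyspider's actual implementation; `SETTING_RE` and `read_rate_burst` are hypothetical names.

```python
import re

# Matches whole comment lines such as "# rate: 1.0" or "# burst: 3".
SETTING_RE = re.compile(
    r"^#\s*(rate|burst)\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def read_rate_burst(script_text):
    """Return a dict with the rate/burst values found in comment lines."""
    return {name: float(value) for name, value in SETTING_RE.findall(script_text)}


script = """#!/usr/bin/env python
# rate: 1.0
# burst: 3
# handler code would follow here
"""

settings = read_rate_burst(script)
print(settings)
```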

-i, --interactive

In interactive mode, pyspider starts an interactive console that asks what to do in the next loop of the process. In the console, you can use:

    ```
    crawl(url, project=None, **kwargs)
    Crawl given url, same parameters as BaseHandler.crawl
    
    url - url or taskid, parameters will be used if in taskdb
    project - can be omitted if only one project exists.
    
    quit_interactive()
    Quit interactive mode
    
    quit_pyspider()
    Close pyspider
    ```

You can call `pyspider.libs.utils.python_console()` in your script to open an interactive console.

bench

```
Usage: pyspider bench [OPTIONS]

  Run Benchmark test. In bench mode, in-memory sqlite database is used
  instead of on-disk sqlite database.

Options:
  --fetcher-num INTEGER         instance num of fetcher
  --processor-num INTEGER       instance num of processor
  --result-worker-num INTEGER   instance num of result worker
  --run-in [subprocess|thread]  run each components in thread or subprocess.
                                always using thread for windows.
  --total INTEGER               total url in test page
  --show INTEGER                show how many urls in a page
  --help                        Show this message and exit.
```

scheduler

```
Usage: pyspider scheduler [OPTIONS]

  Run Scheduler, only one scheduler is allowed.

Options:
  --xmlrpc / --no-xmlrpc
  --xmlrpc-host TEXT
  --xmlrpc-port INTEGER
  --inqueue-limit INTEGER  size limit of task queue for each project, tasks
                           will been ignored when overflow
  --delete-time INTEGER    delete time before marked as delete
  --active-tasks INTEGER   active log size
  --loop-limit INTEGER     maximum number of tasks due with in a loop
  --scheduler-cls TEXT     scheduler class to be used.
  --help                   Show this message and exit.  
```

--scheduler-cls

Set this option to use a customized scheduler class.

phantomjs

```
Usage: run.py phantomjs [OPTIONS] [ARGS]...

  Run phantomjs fetcher if phantomjs is installed.

Options:
  --phantomjs-path TEXT  phantomjs path
  --port INTEGER         phantomjs port
  --auto-restart TEXT    auto restart phantomjs if crashed
  --help                 Show this message and exit.
```

ARGS

Additional args to pass to the phantomjs command line.

fetcher

```
Usage: pyspider fetcher [OPTIONS]

  Run Fetcher.

Options:
  --xmlrpc / --no-xmlrpc
  --xmlrpc-host TEXT
  --xmlrpc-port INTEGER
  --poolsize INTEGER      max simultaneous fetches
  --proxy TEXT            proxy host:port
  --user-agent TEXT       user agent
  --timeout TEXT          default fetch timeout
  --fetcher-cls TEXT      Fetcher class to be used.
  --help                  Show this message and exit.
```

--proxy

The default proxy used by the fetcher. It can be overridden by the `self.crawl` option.

processor

    Usage: pyspider processor [OPTIONS]

      Run Processor.

    Options:
      --processor-cls TEXT  Processor class to be used.
      --help                Show this message and exit.

result_worker

    Usage: pyspider result_worker [OPTIONS]

      Run result worker.

    Options:
      --result-cls TEXT  ResultWorker class to be used.
      --help             Show this message and exit.

webui

```
Usage: pyspider webui [OPTIONS]

  Run WebUI

Options:
  --host TEXT            webui bind to host
  --port INTEGER         webui bind to host
  --cdn TEXT             js/css cdn server
  --scheduler-rpc TEXT   xmlrpc path of scheduler
  --fetcher-rpc TEXT     xmlrpc path of fetcher
  --max-rate FLOAT       max rate for each project
  --max-burst FLOAT      max burst for each project
  --username TEXT        username of lock -ed projects
  --password TEXT        password of lock -ed projects
  --need-auth            need username and password
  --webui-instance TEXT  webui Flask Application instance to be used.
  --help                 Show this message and exit.
```

--cdn

The CDN service for the JS/CSS libraries. The URL must be compatible with cdnjs.

--fetcher-rpc

The XML-RPC path URI of the fetcher XMLRPC server. If not set, a Fetcher instance is used.

--need-auth

If true, every page requires the username and password specified via `--username` and `--password`.
