
[Solved] Merge the search-based fallback dialog into the product demo


The basic Naturling product demo was implemented earlier; it mainly covers the dialog conversations.

The NLP team has now implemented search-based QA dialog, to be used as the fallback.

The updated code has been pulled:

Next, it needs to be merged into the earlier product demo.

The entry-point demo is:

The first step is to get it running locally:

<code>qa = Qa('qa')
reply = qa.get_responses(i, "qa")
</code>
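For local testing, the call pattern above can be wrapped in a tiny interactive loop. This is only an illustrative sketch: the Qa class and the get_responses(question, "qa") signature come from iqa.py as shown above, while the REPL loop itself is not part of the original demo.

<code># illustrative sketch: drive the search-based QA interactively,
# using the Qa('qa') / get_responses(question, "qa") calls shown above
from iqa import Qa

def main():
    qa = Qa('qa')
    while True:
        question = input("input: ")
        if not question.strip():
            break
        reply = qa.get_responses(question, "qa")
        print("reply=%s" % reply)

if __name__ == "__main__":
    main()
</code>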

Much of the code here imports NLP- and AI-related libraries, such as:

numpy

solr: used for search

word2vec

and so on.

It looks like these libraries need to be installed first.

Setting that aside for now, let's try debugging the current file in PyCharm:

nlp/search/qa/iqa.py

and see what happens.

Sure enough, it failed as soon as it ran:

[Solved] PyCharm debugging error: ModuleNotFoundError: No module named
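The details are in the linked post; the general idea, which also shows up later in addNlpRelationPath below, is to make sure the project root is on sys.path before the nlp packages get imported. A minimal hedged sketch, using the local checkout path that appears later in this post:

<code># minimal sketch: append the project root to sys.path so that imports
# like "from nlp.search.qa.iqa import ..." resolve when debugging in PyCharm;
# the path is the local checkout path used elsewhere in this post
import sys

naturlingRootPath = "/Users/crifan/dev/dev_root/xxx"
if naturlingRootPath not in sys.path:
    sys.path.append(naturlingRootPath)
</code>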

Then running it failed again:

<code>  File "/Users/crifan/dev/dev_root/xy/search/qa/iqa.py", line 168, in <module>
    main()
  File "/Users/crifan/dev/dev_root/xy/search/qa/iqa.py", line 159, in main
    qa = Qa('qa')
  File "/Users/crifan/dev/dev_root/xy/search/qa/iqa.py", line 38, in __call__
    Singleton, cls).__call__(*args, **kwargs)
  File "/Users/crifan/dev/dev_root/xy/search/qa/iqa.py", line 56, in __init__
    self.word2vec = GloveEmbeddingHelper()
  File "/Users/crifan/dev/dev_root/xy/search/mypy/singleton.py", line 37, in __call__
    instance = super(Singleton, cls).__call__(*args, **kwargs)
  File "/Users/crifan/dev/dev_root/xy/search/utils/embedding/glove_embedding_helper.py", line 40, in __init__
    binary=False)
  File "/Users/crifan/.virtualenvs/xxx-gXiJ4vtz/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1436, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/Users/crifan/.virtualenvs/xxx-gXiJ4vtz/lib/python3.6/site-packages/gensim/models/utils_any2vec.py", line 171, in _load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "/Users/crifan/.virtualenvs/xxx-gXiJ4vtz/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 181, in smart_open
    fobj = _shortcut_open(uri, mode, **kw)
  File "/Users/crifan/.virtualenvs/xxx-gXiJ4vtz/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 287, in _shortcut_open
    return io.open(parsed_uri.uri_path, mode, **open_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/word2vec/glove.6B/glove.6B.300d.w2vformat.txt'

Process finished with exit code 1
</code>
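From the traceback, the failing step is GloveEmbeddingHelper asking gensim to load the GloVe vectors (already converted to word2vec text format) from /opt/word2vec/glove.6B/. Roughly, that loading step boils down to something like the following sketch, based only on the traceback, not on the actual helper code:

<code># sketch of the loading step that fails above: gensim reads the GloVe
# vectors converted to word2vec text format (hence binary=False)
from gensim.models import KeyedVectors

word2vec_file = "/opt/word2vec/glove.6B/glove.6B.300d.w2vformat.txt"
word2vec = KeyedVectors.load_word2vec_format(word2vec_file, binary=False)
print(word2vec.most_similar("hello", topn=3))
</code>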

After some searching:

Went to check whether naturling/data contains glove.6B/glove.6B.50d.w2vformat.txt.

It turns out there is no naturling/data folder at all, and it is not on the online server either.

Later I learned that the dev server has a runnable demo:

<code>cd /root/xxx/search/qa
python iqa.py
</code>

Try running it there.

Then download it locally and compare it with the current code to see what differs.

It runs there:

However, before merging the fallback dialog, the code structure needs to be cleaned up first:

[Solved] Optimize the code structure across the system's multiple projects

Then continued testing the search functionality, and found the previously missing data.

It is on the dev server, so download it:

<code>[root@xxx-general-01 xxx]# ll /opt/
xxxEnv/ word2vec/     
[root@xxx-general-01 xxx]# ll /opt/word2vec/
.DS_Store  glove.6B/  
[root@xxx-general-01 xxx]# ll /opt/word2vec/glove.6B/glove.6B.
glove.6B.100d.txt            glove.6B.200d.txt            glove.6B.300d.txt            glove.6B.300d.w2vformat.txt  glove.6B.50d.txt             glove.6B.50d.w2vformat.txt
[root@xxx-general-01 xxx]# ll /opt/word2vec/glove.6B/glove.6B.^C
[root@xxx-general-01 xxx]# ll /opt/word2vec/glove.6B/
total 3378124
-rwxrwxrwx 1 root root  347116733 Jul 18 21:52 glove.6B.100d.txt
-rwxrwxrwx 1 root root  693432828 Jul 18 21:57 glove.6B.200d.txt
-rwxrwxrwx 1 root root 1037962819 Jul 18 21:43 glove.6B.300d.txt
-rwxrwxrwx 1 root root 1037962830 Jul 18 21:50 glove.6B.300d.w2vformat.txt
-rwxrwxrwx 1 root root  171350079 Jul 18 21:53 glove.6B.50d.txt
-rwxrwxrwx 1 root root  171350089 Jul 18 21:44 glove.6B.50d.w2vformat.txt
[root@xxx-general-01 xxx]# pwd
/root/xxx
[root@xxx-general-01 xxx]# 7za a -t7z -r -bt glove_6b.7z /opt/word2vec/glove.6B/*

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Scanning the drive:
6 files, 3459175378 bytes (3299 MiB)

Creating archive: glove_6b.7z

Items to compress: 6

14% 1 + glove.6B.200d.txt
</code>

It turns out to be more than 3 GB, quite big...

It eventually compressed down to 1.1 GB:

-rw-r--r--   1 root root 1.1G Aug 15 18:18 glove_6b.7z
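Incidentally, the directory listing above shows each glove.6B.*.txt file next to a *.w2vformat.txt twin; presumably those were produced by converting the raw GloVe text files into the word2vec format that gensim expects, which can be done with gensim's bundled converter, roughly like this (an illustrative sketch using the paths from the listing):

<code># illustrative: convert a raw GloVe text file into the word2vec text
# format expected by KeyedVectors.load_word2vec_format
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = "/opt/word2vec/glove.6B/glove.6B.300d.txt"
w2v_file = "/opt/word2vec/glove.6B/glove.6B.300d.w2vformat.txt"
glove2word2vec(glove_file, w2v_file)
</code>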

Now the problem becomes:

[Solved] Speed up downloading large files from a remote CentOS server on a Mac

Then extract it:

It seems the earlier 2.2 MB glove_6b.7z was corrupted and could not be extracted.

Renamed glove_6b.7z.0 to glove_6b_new.7z and tried extracting that instead.

Extraction produced 3.46 GB of data.

Put it in the corresponding location:

xxx/data/glove.6B

Then went back to debugging the search functionality:

<code># old absolute path on the dev server:
# self.word2vec_file = "/opt/word2vec/glove.6B/glove.6B.300d.w2vformat.txt"
# instead, resolve the GloVe file relative to the naturling data directory
cur_pwd = os.getcwd()
dir_naturling_data = os.path.join(cur_pwd, "../../..", "data")
self.word2vec_file = os.path.join(dir_naturling_data, "glove.6B/glove.6B.300d.w2vformat.txt")
</code>

With that, debugging could continue:

Then another error appeared:

[Solved] Solr error in Python: SolrClient.exceptions.ConnectionError NewConnectionError Failed to establish a new connection Errno 61 Connection refused

At this point, the merge is basically working.

To really run against Solr, though, the online server environment is still needed, since that is where an actual Solr service is running.

But that in turn required:

[Record] Delete and recreate the Solr core, then re-import the data and rebuild the index

Then went back to debugging the dialog:

But it still errored:

<code>input: hi
http://localhost:8983/solr/qa/select?q=question_str%3A%22+hi%22&fq=%2A%3A%2A+AND+scene%3Aqa&rows=1&fl=question%2Canswer%2Cid&wt=json&indent=false
http://localhost:8983/solr/qa/select?q=question%3A%22+hi%22&fq=%2A%3A%2A+AND+scene%3Aqa&rows=100&fl=question%2Canswer%2Cid&wt=json&indent=false
failed to find an answer
</code>
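The two lines in the log are plain Solr select requests. For reference, the first one can be reproduced outside the demo with a few lines of Python; requests is used here purely for illustration (the project itself goes through SolrClient, as the earlier ConnectionError shows):

<code># illustrative: replay the first Solr query from the log above
import requests

params = {
    "q": 'question_str:" hi"',
    "fq": "*:* AND scene:qa",
    "rows": 1,
    "fl": "question,answer,id",
    "wt": "json",
    "indent": "false",
}
resp = requests.get("http://localhost:8983/solr/qa/select", params=params)
print(resp.json())
</code>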

Turns out I had forgotten to restart Solr:

<code>➜  solr git:(master) ✗ solr stop -all
Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 43113 to stop gracefully.
➜  solr git:(master) ✗ solr start
Waiting up to 180 seconds to see Solr running on port 8983 [-]
Started Solr server on port 8983 (pid=46413). Happy searching!

➜  solr git:(master) ✗ solr status

Found 1 Solr nodes:

Solr process 46413 running on port 8983
{
  "solr_home":"/usr/local/Cellar/solr/7.2.1/server/solr",
  "version":"7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:54:21",
  "startTime":"2018-08-21T07:27:57.769Z",
  "uptime":"0 days, 0 hours, 0 minutes, 48 seconds",
  "memory":"21.6 MB (%4.4) of 490.7 MB"}
</code>

Debug again and check the result:

[Solved] Error while debugging the Solr-based fallback dialog: AttributeError: 'list' object has no attribute 'lower'

Then yet another problem appeared:

[Solved] Python Solr debugging error: ValueError: a must be 1-dimensional

Then continued with the earlier task:

[Solved] Optimize the code structure across the system's multiple projects

The fallback dialog has since been merged into the product demo.

The relevant changes are:

(1) resources/qa.py

The qa initialization calls SearchBasedQA:

<code>resourcesPath = os.path.abspath(os.path.dirname(__file__))
log.info("resourcesPath=%s", resourcesPath)
addNlpRelationPath(getNaturlingRootPath(resourcesPath))
from nlp.search.qa.iqa import SearchBasedQA
log.info('[%s] initing SearchBasedQA', datetime.now())
searchBasedQa = SearchBasedQA(settings.SOLR_CORE)
log.info('[%s] SearchBasedQA loaded', datetime.now())
</code>
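Note the ordering here: addNlpRelationPath(...) runs before the `from nlp.search.qa.iqa import SearchBasedQA` line, since that import can only resolve once the naturling/nlp directories have been appended to sys.path.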

common/util.py

<code>import uuid
import io
from flask import send_file
import os, sys
from conf.app import settings

def getNaturlingRootPath(resourcesPath):
    print("getNaturlingRootPath: resourcesPath=%s" % resourcesPath)
    if settings.FLASK_ENV == "production":
        # production: online dev server
        # /xxx/resources/qa.py
        robotDemoPath = os.path.abspath(os.path.join(resourcesPath, ".."))
        print("robotDemoPath=%s" % robotDemoPath)
        serverPath = os.path.abspath(os.path.join(robotDemoPath, ".."))
        print("serverPath=%s" % serverPath)
        webPath = os.path.abspath(os.path.join(serverPath, ".."))
        print("webPath=%s" % webPath)
        naturlingRootPath = os.path.abspath(os.path.join(webPath, ".."))
        print("naturlingRootPath=%s" % naturlingRootPath)
    else:
        # development: local debug
        naturlingRootPath = "/Users/crifan/dev/dev_root/xxx"

    return naturlingRootPath

def addNlpRelationPath(naturlingRootPath):
    print("addNlpRelationPath: naturlingRootPath=%s" % naturlingRootPath)
    nlpPath = os.path.join(naturlingRootPath, "nlp")
    dialogPath = os.path.join(nlpPath, "dialog")
    searchPath = os.path.join(nlpPath, "search")
    searchQaPath = os.path.join(searchPath, "qa")

    print("sys.path=%s" % sys.path)
    if naturlingRootPath not in sys.path:
        sys.path.append(naturlingRootPath)
        print("added to sys.path: %s" % naturlingRootPath)

    if nlpPath not in sys.path:
        sys.path.append(nlpPath)
        print("added to sys.path: %s" % nlpPath)

    if dialogPath not in sys.path:
        sys.path.append(dialogPath)
        print("added to sys.path: %s" % dialogPath)

    if searchPath not in sys.path:
        sys.path.append(searchPath)
        print("added to sys.path: %s" % searchPath)

    if searchQaPath not in sys.path:
        sys.path.append(searchQaPath)
        print("added to sys.path: %s" % searchQaPath)
</code>
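With these two helpers, the demo server can import the nlp packages without installing them: in production the naturling root is derived by walking up the directory tree from resources/qa.py, while for local debugging it falls back to the hard-coded checkout path, and addNlpRelationPath then puts the nlp, dialog, search and search/qa directories onto sys.path.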

Then, when the result of the existing query comes back empty, this SearchBasedQA is called to return a fallback answer:

<code>def getSearchBasedResponse(input_question):
    global searchBasedQa
    log.info("getSearchBasedResponse: input_question=%s", input_question)
    reply = searchBasedQa.get_responses(input_question, settings.SOLR_CORE)
    log.info("reply=%s", reply)
    answer = reply.answer()
    log.info("answer=%s", answer)
    return answer

...

respDict["data"]["input"] = inputStr

aiResult = QueryAnalyse(inputStr, aiContext)
log.info("aiResult=%s", aiResult)
"""
aiResult={'mediaId': None, 'response': None, 'control': 'continue'}
aiResult={'mediaId': None, 'response': None, 'control': 'stop'}
{'mediaId': None, 'response': None, 'control': 'next'}
"""
resultResponse = aiResult["response"]
resultControl = aiResult["control"]
resultMediaId = aiResult["mediaId"]
log.info("resultResponse=%s, resultControl=%s, resultMediaId=%s", resultResponse, resultControl, resultMediaId)

if resultResponse:
    respDict["data"]["response"]["text"] = resultResponse

if resultControl:
    respDict["data"]["control"] = resultControl

responseIsEmpty = not resultResponse
mediaIdIsEmpty = not resultMediaId
controlIsInvalid = (resultControl != "continue") and (resultControl != "stop")
log.info("responseIsEmpty=%s, resultControl=%s, controlIsInvalid=%s", responseIsEmpty, mediaIdIsEmpty, controlIsInvalid)

if responseIsEmpty and mediaIdIsEmpty and controlIsInvalid:
    log.info("query answer is empty, so use search (based on SOLR) based response")
    searchBasedResponse = getSearchBasedResponse(inputStr)
    log.info("inputStr=%s -&gt; searchBasedResponse=%s", inputStr, searchBasedResponse)
    # searchBasedResponse is always not empty
    respDict["data"]["response"]["text"] = searchBasedResponse
</code>
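In short, the fallback only kicks in when QueryAnalyse returns neither a response nor a mediaId and the control value is something other than continue or stop; since the search-based QA always returns some answer (per the comment above), the response text is guaranteed to be filled in that case.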

Note that:

searchBasedQa = SearchBasedQA(settings.SOLR_CORE)

takes a long time to initialize, currently around 4 minutes (presumably most of it spent loading the ~1 GB GloVe embedding file), and will need to be optimized later.

(2) The nlp search part

xx/search/static/context.py

<code>
class Context(metaclass=Singleton):
    def __init__(self):
        ...
        # self.word2vec_file = "/opt/word2vec/glove.6B/glove.6B.300d.w2vformat.txt"
        # cur_pwd = os.getcwd()
        cur_pwd = os.path.abspath(os.path.dirname(__file__))
        dir_naturling_data = os.path.join(cur_pwd, "../../..", "data")
        self.word2vec_file = os.path.join(dir_naturling_data, "glove.6B/glove.6B.300d.w2vformat.txt")
</code>
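Compared with the earlier attempt that used os.getcwd(), resolving the data directory from os.path.dirname(__file__) is more robust: getcwd() depends on where the process happens to be launched from (PyCharm, the command line, celery, etc.), while dirname(__file__) always points at context.py itself, so the glove.6B data is found regardless of how the demo is started.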

The merged behavior was also debugged successfully on the local machine:

Run and debug it locally in PyCharm:

Then, on the command line, make sure all of the required services are running.

First, redis-server:

<code>➜  ~ redis-server
13044:C 27 Aug 13:41:58.271 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
13044:C 27 Aug 13:41:58.271 # Redis version=4.0.9, bits=64, commit=00000000, modified=0, pid=13044, just started
13044:C 27 Aug 13:41:58.271 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
13044:M 27 Aug 13:41:58.273 * Increased maximum number of open files to 10032 (it was originally set to 4864).
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 4.0.9 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
(    '      ,       .-`  | `,    )     Running in standalone mode
|`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
|    `-._   `._    /     _.-'    |     PID: 13044
  `-._    `-._  `-./  _.-'    _.-'
|`-._`-._    `-.__.-'    _.-'_.-'|
|    `-._`-._        _.-'_.-'    |           http://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
|`-._`-._    `-.__.-'    _.-'_.-'|
|    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

13044:M 27 Aug 13:41:58.277 # Server initialized
13044:M 27 Aug 13:41:58.279 * DB loaded from disk: 0.001 seconds
13044:M 27 Aug 13:41:58.279 * Ready to accept connections
13044:M 27 Aug 14:24:09.841 * 100 changes in 300 seconds. Saving...
13044:M 27 Aug 14:24:09.842 * Background saving started by pid 16287
...
13044:M 27 Aug 17:36:10.779 * Background saving started by pid 19913
19913:C 27 Aug 17:36:10.781 * DB saved on disk
13044:M 27 Aug 17:36:10.884 * Background saving terminated with success
</code>

Then the celery worker:

<code>cd /Users/crifan/dev/dev_root/xxx
pipenv shell
celery worker -A resources.tasks.celery --loglevel=DEBUG
</code>

Then celery beat:

<code>cd /Users/crifan/dev/dev_root/xxx
pipenv shell
celery beat -A resources.tasks.celery -s runtime/celerybeat-schedule --loglevel=DEBUG
</code>

Output:

<code>➜  naturlingRobotDemoServer git:(master) ✗ celery beat -A resources.tasks.celery -s runtime/celerybeat-schedule --loglevel=DEBUG
cur_flask_environ=None
FLASK_ENV=development
cur_dir=/Users/crifan/dev/dev_root/xxx/conf/app
env_folder=development
dotenv_path=/Users/crifan/dev/dev_root/xxx/conf/app/development/.env
dotenv_load_ok=True
After  load .env: DEBUG=True, MONGODB_HOST=ip, FILE_URL_HOST=127.0.0.1
in extensions_celery: celery=<Celery RobotQA at 0x11155f470>
create_celery_app return: celery=<Celery RobotQA at 0x11155f470>, log=<Logger resources.extensions_celery (WARNING)>
celery beat v4.2.1 (windowlicker) is starting.
__    -    ... __   -        _
LocalTime -> 2018-08-27 14:19:09
Configuration ->
    . broker -> redis://localhost:6379/0
    . loader -> celery.loaders.app.AppLoader
    . scheduler -> celery.beat.PersistentScheduler
    . db -> runtime/celerybeat-schedule
    . logfile -> [stderr]@%DEBUG
    . maxinterval -> 5.00 minutes (300s)
[2018-08-27 14:19:09,694: DEBUG/MainProcess] Setting default socket timeout to 30
[2018-08-27 14:19:09,695: INFO/MainProcess] beat: Starting...
[2018-08-27 14:19:09,765: DEBUG/MainProcess] Current schedule:
<ScheduleEntry: refresh ms Azure token every less than 10 minutes resources.tasks.refreshAzureSpeechToken() <freq: 1.00 minute>
<ScheduleEntry: celery.backend_cleanup celery.backend_cleanup() <crontab: 0 4 * * * (m/h/d/dM/MY)>
[2018-08-27 14:19:09,766: DEBUG/MainProcess] beat: Ticking with max interval->5.00 minutes
[2018-08-27 14:19:09,780: INFO/MainProcess] Scheduler: Sending due task refresh ms Azure token every less than 10 minutes (resources.tasks.refreshAzureSpeechToken)
[2018-08-27 14:19:09,794: DEBUG/MainProcess] beat: Synchronizing schedule...
[2018-08-27 14:19:09,834: DEBUG/MainProcess] resources.tasks.refreshAzureSpeechToken sent. id->fa8c45e4-1be7-447f-9240-6f61c58c1e9d
[2018-08-27 14:19:09,837: DEBUG/MainProcess] beat: Waking up in 59.92 seconds.
</code>

Then the Solr server:

<code>➜  xxxServer git:(master) ✗ solr start
Waiting up to 180 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=12925). Happy searching!

➜  xxxServer git:(master) ✗ solr status

Found 1 Solr nodes:

Solr process 12925 running on port 8983
{
  "solr_home":"/usr/local/Cellar/solr/7.2.1/server/solr",
  "version":"7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:54:21",
  "startTime":"2018-08-27T05:41:30.394Z",
  "uptime":"0 days, 0 hours, 0 minutes, 13 seconds",
  "memory":"51.2 MB (%10.4) of 490.7 MB"}
</code>

After that, the next step is deploying it to the online environment:

[Record] Deploy the product demo with the merged search-based fallback dialog to the online environment
