
[Solved] Batch-importing text-format script content into a backend system with Python

Python crifan 2378 views 0 comments

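I have text-format content:

dialogue scripts that have already been written

which need to be processed with a Python script so they can be batch-imported into the system.

Until now they were entered into the system by hand, so the first step was to work out

which backend API the manual entry calls: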

1. Request URL: http://localhost:65000/api/v1/scripts/

2. Request Method: POST

3. Request payload:

    1. age_end: 5

    2. age_start: 4

    3. dialogs: [{type: "0", speaker: "A", content: "hello"}, {type: "0", speaker: "B", content: "hi "},…]

        1. 0: {type: "0", speaker: "A", content: "hello"}

        2. 1: {type: "0", speaker: "B", content: "hi "}

        3. 2: {type: "1", audio_file_id: "5b21c7887f4d384d04535fe2", audio_file_name: "The Tunnel.mp3"}

    4. operate_mark: "save"

    5. place: "office 0716-1"

    6. second_level_topic: "sleepy stories"

    7. title: "title 0716-1"

    8. topic: "bedtime"
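As a rough sketch, the captured request body can be expressed as a Python dict and serialized to JSON. The helper name `build_script_payload` is hypothetical; the field names and sample values are taken from the captured request above.

```python
import json


def build_script_payload(title, place, topic, second_level_topic,
                         age_start, age_end, dialogs):
    """Assemble the body for POST /api/v1/scripts/ (field names from the captured request)."""
    return {
        "operate_mark": "save",
        "title": title,
        "place": place,
        "topic": topic,
        "second_level_topic": second_level_topic,
        "age_start": age_start,
        "age_end": age_end,
        "dialogs": dialogs,
    }


payload = build_script_payload(
    title="title 0716-1",
    place="office 0716-1",
    topic="bedtime",
    second_level_topic="sleepy stories",
    age_start=4,
    age_end=5,
    dialogs=[
        # type "0" is a text dialog, type "1" an audio dialog
        {"type": "0", "speaker": "A", "content": "hello"},
        {"type": "0", "speaker": "B", "content": "hi "},
        {"type": "1", "audio_file_id": "5b21c7887f4d384d04535fe2",
         "audio_file_name": "The Tunnel.mp3"},
    ],
)

# JSON text that would be sent as the request body
body = json.dumps(payload)
```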

The corresponding web front-end source code and the backend source code were also checked.

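What is needed now:

first, use Python to parse out the content of each individual script,

then call the API with POST to create it.

[Solved] Matching multi-section script content with regular expressions in Python 3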
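A minimal sketch of the multi-section matching, assuming the `topic / place / age / title` header order of the second input file. The pattern follows the one in the final script; the sample text is shortened and uses a plain hyphen, and the function name `split_scripts` is made up for illustration.

```python
import re

SAMPLE_TEXT = (
    "topic: Health - illness\n"
    "Place: home\n"
    "Age: 4-5\n"
    "title: have a headache\n"
    "A: Bob, I'm not feeling good.\n"
    "B: What's wrong?\n"
    "\n"
    "topic: House - kitchen\n"
    "Place: home\n"
    "Age: 6-7\n"
    "title: feeling hungry\n"
    "A: Hey, Bob, what are you doing here?\n"
    "B: I'm hungry.\n"
    "\n"
)

# re.I lets "place:" match "Place:"; re.S lets "." in the content group span lines.
SCRIPT_PATTERN = re.compile(
    r"topic:(?P<topic>[^\n]+)\n"
    r"place:(?P<place>[^\n]+)\n"
    r"age:(?P<age>[^\n]+)\n"
    r"title:(?P<title>[^\n]+)\n"
    r"(?P<content>.+?\n)\n+",
    flags=re.I | re.S,
)


def split_scripts(text):
    """Return one dict per script: stripped header fields plus the dialog text."""
    scripts = []
    for match in SCRIPT_PATTERN.finditer(text):
        fields = {name: value.strip() for name, value in match.groupdict().items()}
        scripts.append(fields)
    return scripts


scripts = split_scripts(SAMPLE_TEXT)
```

Note that the pattern ends with `\n+`, so every script, including the last one in the file, has to be followed by a blank line or it will not be matched.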

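Next:

store each script in a dict, then call the backend API; it turns out a token also has to be obtained before the POST.

Along the way I also needed:

[Solved] Checking in Python whether one or more words are all lowercase or start with a lowercase letter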
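The case handling matters because topic names typed into the text files do not always match the casing stored in the system. A small sketch of the idea, trying the original spelling and then the lower-case, capitalized, and title-case variants; the function name `find_real_name` and the contents of `known` are illustrative assumptions.

```python
def find_real_name(name, known_names):
    """Return the spelling of `name` that exists in known_names, or None."""
    candidates = [name, name.lower(), name.capitalize(), name.title()]
    for candidate in candidates:
        if candidate in known_names:
            return candidate
    return None


# Illustrative set of names as they might be stored in the system
known = {"Food", "sleepy stories", "Classroom Layout"}

# Built-in checks for the casing of a string
all_lower = "sleepy stories".islower()       # every cased character is lower case
title_cased = "Classroom Layout".istitle()   # every word starts with an upper-case letter
not_title = "Classroom layout".istitle()     # second word starts lower case
```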

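To be able to call the backend API, I then had to work out how to log in as a user and pass the JWT token:

[Solved] How to simulate a user login and pass the JWT token in Python to get permission to access the API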
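A sketch of the login flow under some assumptions: the token endpoint accepts `username` and `password` as form fields and answers with a JSON object containing `token`, and later requests carry it in an `Authorization: JWT <token>` header, as the final script does. The HTTP call is injected as `post` (normally `requests.post`) so the example runs without a server; `FakeResponse`, `fake_post`, and the token value are stand-ins invented for this illustration.

```python
def fetch_jwt_token(post, token_url, username, password):
    """Log in and return the JWT string.

    `post` is any callable used like requests.post(url, data=...).
    """
    response = post(token_url, data={"username": username, "password": password})
    return response.json()["token"]


def build_auth_headers(token):
    """Headers for subsequent API calls, using the "JWT " prefix."""
    return {
        "Content-Type": "application/json; charset=utf-8",
        "Accept": "application/json",
        "Authorization": "JWT " + token,
    }


class FakeResponse:
    """Stand-in for a requests.Response, for illustration only."""

    def json(self):
        return {"token": "abc.def.ghi"}


def fake_post(url, data=None):
    """Stand-in for requests.post that always returns the same token."""
    return FakeResponse()


token = fetch_jwt_token(
    fake_post, "http://localhost:65000/api/v1/jwt-token-auth/", "user1", "pwd1")
headers = build_auth_headers(token)
```

In real use, `requests.post` would be passed in place of `fake_post`.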

Calling the save API then failed:

[Solved] Error when calling the API from Python: TypeError string indices must be integers
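This post does not say what triggered the error here. A typical cause, assumed for this sketch, is indexing JSON text (a `str`) with a key as if it were already a dict. The sample body is made up.

```python
import json

raw_body = '{"token": "abc.def.ghi"}'   # JSON text, i.e. still a str

try:
    raw_body["token"]                   # indexing a str with a str key
except TypeError as error:
    message = str(error)                # "string indices must be integers..."

parsed = json.loads(raw_body)           # decode first, then index the dict
token = parsed["token"]
```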

Then another error appeared:

[Solved] Python API call returns a 400 error: {'detail': 'JSON parse error - Expecting value: line 1 column 1 (char 0)'}
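One likely cause of this message, assumed here rather than confirmed by this post, is sending a form-encoded body to an endpoint that parses JSON. With `requests`, passing a dict as `data=` form-encodes it, while `json=` serializes it as JSON, which is what the final script uses. The sketch below reproduces the parse failure with the standard library only; `server_side_parse` is a made-up stand-in for the backend.

```python
import json
from urllib.parse import urlencode

payload = {"title": "title 0716-1", "age_start": 4, "age_end": 5}

form_body = urlencode(payload)    # what a form-encoded POST sends
json_body = json.dumps(payload)   # what a JSON POST sends


def server_side_parse(body):
    """Mimic a backend that decodes the request body as JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        return "JSON parse error - %s" % error


form_result = server_side_parse(form_body)
json_result = server_side_parse(json_body)
```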

After that, one more thing to solve:

[Solved] Resetting a user's encrypted password in the Django backend

[Summary]

In the end, the following code was used:

#!/usr/bin/env python

# -*- encoding: utf-8 -*-

# Created on 2018-07-11 14:12:12

# Project: BatchImportScript

# Function: Batch process written script content from txt into Naturling CMS system

# Author: Crifan Li

# Updated: 20180717

import os

import re

import logging

import requests

import sys

currentPath = os.path.split(os.path.realpath(__file__))[0]

crifanlibPath = os.path.abspath(os.path.join(currentPath, "crifanLib"))

sys.path.append(crifanlibPath)

from crifanLib import crifanMysql, crifanFile, crifanLogging

################################################################################

# Global Config

################################################################################

# is online/production or local debug environment

IsOnline = True

################################################################################

# Global Settings / Variables

################################################################################

ScriptFolder = "input"

ScriptFilenameList = [ "person_1.txt", "person_2.txt"]

# # for debug

# ScriptFilenameList = ["person_2.txt", "person_1.txt"]

MysqlConfigDict_Local = {

  'host': "localhost",

  'port': 3306,

  'user': "root",

  'password': "crifan_mysql",

  'db': "naturling",

  'charset': "utf8",

}

MysqlConfigDict_Online = {

  'host': "xxxxxx",

  'port': 3306,

  'user': "root",

  'password': "pwd",

  'db': "naturling",

  'charset': "utf8",

}

MysqlConfigDict = None

if IsOnline:

  MysqlConfigDict = MysqlConfigDict_Online

else:

  MysqlConfigDict = MysqlConfigDict_Local

gTopicDict  = {}

UserInfoDict = {

  "wuman" : {

    "username": "user1",

    "password": "pwd1",

  },

  "zhouyue" : {

    "username": "user2",

    "password": "pwd2",

  },

}

ApiHost = None

if IsOnline:

  ApiHost = "http://x.x.x.x"

else:

  ApiHost = "http://localhost"

Port = 65000

ApiVersion = "/api/v1"

ApiPrefix = ApiHost + ":" + str(Port) + ApiVersion # http://localhost:65000/api/v1/

GetJwtTokenUrl = ApiPrefix + "/jwt-token-auth/" # http://localhost:65000/api/v1/jwt-token-auth/

CreateScriptUrl = ApiPrefix + "/scripts/" # http://localhost:65000/api/v1/scripts/

gJwtToken = ""

gHeaders = {

  'Content-Type': 'application/json; charset=utf-8',

  "Accept": 'application/json',

  "Authorization": "",

}

################################################################################

# Functions

################################################################################

def updateTokenAndHeader(userInfo):

  global gJwtToken, gHeaders

  postBody = {

    "username": userInfo["username"],

    "password": userInfo["password"],

  }

  logging.info("GetJwtTokenUrl=%s, postBody=%s", GetJwtTokenUrl, postBody)

  getTokenResp = requests.post(GetJwtTokenUrl, data=postBody)

  logging.info("getTokenResp=%s", getTokenResp)

  respJson = getTokenResp.json()

  logging.info("respJson=%s", respJson)

  gJwtToken =  respJson["token"]

  logging.info("gJwtToken=%s", gJwtToken)

  if gJwtToken:

    gHeaders["Authorization"] = "JWT " + gJwtToken

    logging.info("gHeaders=%s", gHeaders)

def generateTopic(mysqlConn):

  global gTopicDict

  searchTopicLevel1Sql = """SELECT * from `keyword` WHERE type="sectorTopic" ORDER BY name"""

  findTopicLevel1Ok, resultDict = mysqlConn.executeSql(searchTopicLevel1Sql)

  logging.debug("%s -> %s, %s", searchTopicLevel1Sql, findTopicLevel1Ok, resultDict)

  if not findTopicLevel1Ok:

    logging.error("Fail to find level 1 topics")

    return

  topicLevel1ItemList = resultDict["data"]

  for eachTopicLevel1Item in topicLevel1ItemList:

    topicLevel1Id = eachTopicLevel1Item["id"]

    topicLevel1Name = eachTopicLevel1Item["name"]

    gTopicDict[topicLevel1Name] = {

      "id": topicLevel1Id,

      "children": {}

    }

    findRelationSql = """SELECT * from `keyword_rel` WHERE keyword1=%d""" % (topicLevel1Id)

    findRelationOk, resultDict = mysqlConn.executeSql(findRelationSql)

    logging.debug("%s -> %s, %s", findRelationSql, findRelationOk, resultDict)

    if not findRelationOk:

      logging.error("Failed to find topic level 1 and level 2 relation")

      continue

    level2TopicIdList = resultDict["data"]

    for eachRelation in level2TopicIdList:

      level2TopicId = eachRelation["keyword2"]

      findTopicLevel2Sql = """SELECT * from `keyword` WHERE `id`=%d and `type`="%s" """ % (level2TopicId, "topic")

      findTopicLevel2Ok, resultDict = mysqlConn.executeSql(findTopicLevel2Sql)

      logging.debug("%s -> %s, %s", findTopicLevel2Sql, findTopicLevel2Ok, resultDict)

      if findTopicLevel2Ok:

        if resultDict["data"]:

          level2Topic = resultDict["data"][0]

          level2TopicName = level2Topic["name"]

          gTopicDict[topicLevel1Name]["children"][level2TopicName] = {

            "id": level2TopicId

          }

        else:

          logging.error("Can not find topic id=%d", level2TopicId)

def findRealTopic(curTopic, topicList=[]):

  isExisted, realTopicName = (False, "")

  originTopic = curTopic

  allLowcaseTopic = curTopic.lower()

  capitalizedTopic = curTopic.capitalize()

  titledTopic = curTopic.title()

  logging.info("originTopic=%s,allLowcaseTopic=%s,capitalizedTopic=%s,titledTopic=%s",

    originTopic, allLowcaseTopic, capitalizedTopic, titledTopic)

  if originTopic in topicList:

    isExisted = True

    realTopicName = originTopic

  elif allLowcaseTopic in topicList:

    isExisted = True

    realTopicName = allLowcaseTopic

  elif capitalizedTopic in topicList:

    isExisted = True

    realTopicName = capitalizedTopic

  elif titledTopic in topicList:

    isExisted = True

    realTopicName = titledTopic

  else:

    isExisted = False

    realTopicName = ""

    logging.error("Not found topic: %s", curTopic)

  logging.info("topic: %s -> isExisted=%s, realTopicName=%s", curTopic, isExisted, realTopicName)

  return isExisted, realTopicName

def checkTopic(curTopic, isChild=False, parentTopicName=""):

  isExisted, realTopicName = (False, "")

  level1TopicList = gTopicDict.keys()

  if isChild and parentTopicName:

    # is child topic

    if parentTopicName in level1TopicList:

      level1Topic = gTopicDict[parentTopicName]

      childTopicDict = level1Topic["children"]

      childTopicKeyList = childTopicDict.keys()

      isExisted, realTopicName = findRealTopic(curTopic, childTopicKeyList)

    else:

      logging.error("Can not found parent topic %s for child topic %s", parentTopicName, curTopic)

  else:

    # is level 1=parent topic

    isExisted, realTopicName = findRealTopic(curTopic, level1TopicList)

  return isExisted, realTopicName

def saveScript(curScriptDict):

  # {

  #   "place": "office 0716-1",

  #   "title": "title 0716-1",

  #   "topic": "bedtime",

  #   "second_level_topic": "sleepy stories",

  #   "age_start": 4,

  #   "age_end": 5,

  #   "operate_mark": "save",

  #   "dialogs": [{

  #     "type": "0",

  #     "speaker": "A",

  #     "content": "hello"

  #   }, {

  #     "type": "0",

  #     "speaker": "B",

  #     "content": "hi "

  #   }, {

  #     "type": "1",

  #     "audio_file_id": "5b21c7887f4d384d04535fe2",

  #     "audio_file_name": "The Tunnel.mp3"

  #   }]

  # }

  logging.debug("curScriptDict=%s", curScriptDict)

  saveScriptResp = requests.post(CreateScriptUrl, headers=gHeaders, json=curScriptDict)

  logging.info("saveScriptResp=%s", saveScriptResp)

  if saveScriptResp.ok:

    respJson = saveScriptResp.json()

    logging.debug("respJson=%s", respJson)

    logging.info("+++OK to create script: %s", respJson)

  else:

    logging.error("Fail to create script: %s", curScriptDict)

def processSingleScriptMatch(singleScriptMatch):

  curScriptDict = {

    "operate_mark": "save",

    "place": "",

    "title": "",

    "age_start": -1,

    "age_end": -1,

    "topic": "",

    "second_level_topic": "",

    "dialogs": []

  }

  singleScript = singleScriptMatch.group("singleScript")

  logging.debug("singleScript=%s", singleScript)

  place = singleScriptMatch.group("place")

  topic = singleScriptMatch.group("topic")

  title = singleScriptMatch.group("title")

  age = singleScriptMatch.group("age")

  logging.debug("place=%s,topic=%s,title=%s,age=%s", place, topic, title, age)

  place = place.strip()

  title = title.strip()

  curScriptDict["place"] = place

  curScriptDict["title"] = title

  topic = topic.strip()

  age = age.strip()

  ageMatch = re.search(r"(?P<ageStart>\d+)\s*-\s*(?P<ageEnd>\d+)", age)

  if ageMatch:

    ageStart = ageMatch.group("ageStart")

    ageEnd = ageMatch.group("ageEnd")

    ageStartInt = int(ageStart)

    ageEndInt = int(ageEnd)

    logging.debug("ageStartInt=%d,ageEndInt=%d", ageStartInt, ageEndInt)

    curScriptDict["age_start"] = ageStartInt

    curScriptDict["age_end"] = ageEndInt

  else:

    logging.error("!!! Can not recognize age format for: %s", age)

  # Classroom layout-routine

  # Food-vegetable

  # topicMatch = re.search("(?P<topicLevel1>\w+)\s*(\-\s*(?P<topicLevel2>[\w/\s]+))?", topic)

  # topicMatch = re.search("(?P<topicLevel1>[\w\s]+)\s*(\-\s*(?P<topicLevel2>[\w/\s]+))?", topic)

  # topicMatch = re.search("(?P<topicLevel1>[\w\s]+)\s*([\-|-]\s*(?P<topicLevel2>[\w/\s]+))?", topic)

  # sports – skiing

  # topicMatch = re.search("(?P<topicLevel1>[\w\s]+)\s*([\-|-|–]\s*(?P<topicLevel2>[\w/\s]+))?", topic)

  # Sports – tae-kwon-do

  # topicMatch = re.search("(?P<topicLevel1>[\w\s]+)\s*([\-|-|–]\s*(?P<topicLevel2>[\w/\-\s]+))?", topic)

  # Food – cakes & desserts

  topicMatch = re.search(r"(?P<topicLevel1>[\w\s]+)\s*([\-|-|–]\s*(?P<topicLevel2>[\w/\-\&\s]+))?", topic)

  if topicMatch:

    topicLevel1 = topicMatch.group("topicLevel1")

    topicLevel1 = topicLevel1.strip()

    isExisted, realTopicLevel1 = checkTopic(topicLevel1)

    if isExisted:

      curScriptDict["topic"] = realTopicLevel1

      topicLevel2 = topicMatch.group("topicLevel2")

      if topicLevel2:

        topicLevel2 = topicLevel2.strip()

        isExisted, realTopicLevel2 = checkTopic(topicLevel2, isChild=True, parentTopicName=realTopicLevel1)

        if isExisted:

          curScriptDict["second_level_topic"] = realTopicLevel2

        else:

          logging.error("Can not find real child topic %s from parent %s", topicLevel2, realTopicLevel1)

      else:

        logging.warning("Not found topic level 2 from topic string: %s" % topic)

    else:

      logging.error("Can not find level1 topic: %s", topicLevel1)

  else:

    logging.error("!!! Can not recognize topic format for: %s" % topic)

  content = singleScriptMatch.group("content")

  logging.debug("content=%s", content)

  

  singleDialogPattern = r"(?P<speaker>\w+):\s*(?P<sentence>[^\n]+)\n"

  singleDialogMatchIterator = re.finditer(singleDialogPattern, content, flags=re.I | re.M | re.S)

  for dialogIdx, eachDialog in enumerate(singleDialogMatchIterator):

    dialogNum = dialogIdx + 1

    logging.debug("[%d] eachDialog=%s", dialogNum, eachDialog)

    speaker = eachDialog.group("speaker")

    logging.debug("speaker=%s", speaker)

    sentence = eachDialog.group("sentence")

    logging.debug("sentence=%s", sentence)

    DialogTypeText = "0"

    # DialogTypeAudio = "1"

    curDialog = {

      "type": DialogTypeText,

      "speaker": speaker,

      "content": sentence

    }

    curScriptDict["dialogs"].append(curDialog)

  logging.info("curScriptDict=%s", curScriptDict)

  saveScript(curScriptDict)

################################################################################

# Main

################################################################################

# init logging

logFilename = crifanFile.getInputFileBasenameNoSuffix() + ".log"

crifanLogging.loggingInit(logFilename)

logging.info("Logging initialized to %s", logFilename)

mysqlConn = crifanMysql.MysqlDb(config=MysqlConfigDict)

logging.info("mysqlConn=%s", mysqlConn)

generateTopic(mysqlConn)

logging.info("gTopicDict=%s", gTopicDict)

curPath = os.getcwd()

for eachFilename in ScriptFilenameList:

  eachFullFilePath = os.path.join(curPath, ScriptFolder, eachFilename)

  logging.info("eachFullFilePath=%s", eachFullFilePath)

  singleScriptPattern = ""

  if eachFilename == "person_1.txt":

    singleScriptPattern = r"(?P<singleScript>place:(?P<place>[^\n]+)\ntopic:(?P<topic>[^\n]+)\ntittle:(?P<title>[^\n]+)\nage:(?P<age>[^\n]+)\n(?P<content>.+?\n))\n+"

    updateTokenAndHeader(UserInfoDict["wuman"])

  elif eachFilename == "person_2.txt":

    singleScriptPattern = r"(?P<singleScript>topic:(?P<topic>[^\n]+)\nplace:(?P<place>[^\n]+)\nage:(?P<age>[^\n]+)\ntitle:(?P<title>[^\n]+)\n(?P<content>.+?\n))\n+"

    updateTokenAndHeader(UserInfoDict["zhouyue"])

  with open(eachFullFilePath, "r") as fp:

    allLine = fp.read()

    # print("allLine=%s" % allLine)

    allScriptMatchIterator = re.finditer(singleScriptPattern, allLine, flags=re.I | re.M | re.S)

    logging.info("allScriptMatchIterator=%s", allScriptMatchIterator)

    # if allScriptMatchIterator:

    for scriptIdx, curScriptMatch in enumerate(allScriptMatchIterator):

      scriptNum = scriptIdx + 1

      logging.info("[%d] curScriptMatch=%s", scriptNum, curScriptMatch)

      processSingleScriptMatch(curScriptMatch)
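For orientation, `generateTopic` fills `gTopicDict` with a two-level structure: each level 1 topic name maps to its id and a `children` dict of level 2 topic names. A small sketch of that shape and of a lookup of the child names; the ids and the helper name `child_topic_names` are hypothetical.

```python
# Shape of the topic dictionary; the ids below are made up for illustration
topic_dict = {
    "Health": {"id": 1, "children": {"illness": {"id": 11}}},
    "House": {"id": 2, "children": {"kitchen": {"id": 21}}},
}


def child_topic_names(topics, parent_name):
    """Return the sorted level 2 topic names under a level 1 topic ([] if unknown)."""
    parent = topics.get(parent_name)
    if parent is None:
        return []
    return sorted(parent["children"])


house_children = child_topic_names(topic_dict, "House")
missing_children = child_topic_names(topic_dict, "Sports")
```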

It processed the following input:

Place: School canteen

Topic: food

Tittle:Have lunch

Age: 3-4

J: What did you have for lunch?

L: I ate rice, fish and bread.

J: Do you like rice?

L: Yes, I do.

J: Do you like fish?

L: Yes, I do.

J: Do you like bread?

L: No, I don’t.

J: What did you drink?

L: I drank milk.

J: Do you like milk?

L: Yes, I do.

Place: home

Topic: house

Tittle: Doing housework

Age: 4-5

J: Do you like cooking, mom?

M: Yes, I do a lot. What about you?

J: Mom, you know me. I can’t cook.

M: But can you help me wash dishes?

J: Yes, I can help you.

M: Let’s make a deal, ok?

J: What kind of deal?

M: I’m going to cook.

J: And then?

M: Then you wash the dishes after the meal.

J: That’s ok. I’ d like to help you mom.

M: You are a good boy.

and:

topic: Health – illness

Place: home

Age: 4-5

title: have a headache

A: Bob, I’m not feeling good.

B: What’s wrong?

A: I’ve got a headache.

B: Oh, take some medicine, honey. Let me get some for you.

A: Thanks, Bob.

topic: House – kitchen

Place: home

Age: 6-7

title: feeling hungry

A: Hey, Bob, what are you doing here?

B: I’m hungry. I’m looking for some food.

A: We still have some vegetables in the refrigerator. I can make you a simple salad.

B: Great! Thank you!

A: What kind of dressing do you want?

B: French dressing is fine.
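To show how the header and dialog lines of such input become payload fields, here is a condensed sketch of the age, topic, and dialog parsing. The patterns follow the script above, with one simplification: the separator class only accepts a hyphen or an en dash (written as `\u2013`). The function names are made up for illustration.

```python
import re

AGE_PATTERN = re.compile(r"(?P<ageStart>\d+)\s*-\s*(?P<ageEnd>\d+)")
# \u2013 is the en dash that appears between topic levels in the input files
TOPIC_PATTERN = re.compile(
    r"(?P<topicLevel1>[\w\s]+)\s*([\-\u2013]\s*(?P<topicLevel2>[\w/\-&\s]+))?"
)
DIALOG_PATTERN = re.compile(r"(?P<speaker>\w+):\s*(?P<sentence>[^\n]+)\n")


def parse_age(age_text):
    """'4-5' -> (4, 5); None if the format is not recognized."""
    match = AGE_PATTERN.search(age_text)
    if not match:
        return None
    return int(match.group("ageStart")), int(match.group("ageEnd"))


def parse_topic(topic_text):
    """Split 'Level1 - Level2' into (level1, level2); level2 is '' if absent."""
    match = TOPIC_PATTERN.search(topic_text)
    if not match:
        return None
    level1 = match.group("topicLevel1").strip()
    level2 = (match.group("topicLevel2") or "").strip()
    return level1, level2


def parse_dialogs(content):
    """Turn 'Speaker: sentence' lines into text dialog dicts (type "0")."""
    return [
        {"type": "0", "speaker": m.group("speaker"), "content": m.group("sentence")}
        for m in DIALOG_PATTERN.finditer(content)
    ]


age_range = parse_age("4-5")
topic_levels = parse_topic("Health \u2013 illness")
dialogs = parse_dialogs("A: Thanks, Bob.\nB: Great! Thank you!\n")
```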

Finally, the scripts were saved into the CMS system.
