2020-05
2

Isn't Changing the Dictionary Rule Just Better?

By xrspook @ 20:55:44 Filed under: 扮IT

Changing the key-value rule of the dictionary makes picking a random word from a book trivially easy, and I really don't understand why the reference answer goes to so much trouble. In Chapter 13 of Think Python 2, the dictionary's default rule is that the word is the key and its frequency is the value. Since this exercise needs a unique index to reach a random word, why not just make the value a unique serial number and be done with it? Then one zip swaps the dictionary's keys and values, and random.choice() lands straight on a random word. I only changed the rule for building the dictionary and it took 0.12 seconds; the reference answer went through far more contortions and took 0.42 seconds. The reason the reference answer doesn't modify the dictionary rule is that it wants to instill Python's habit of composing small modules. Composition is convenient, but as this shows, it isn't necessarily the most efficient.

This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.

An alternative is: Use keys to get a list of the words in the book. Build a list that contains the cumulative sum of the word frequencies (see Exercise 2). The last item in this list is the total number of words in the book, n. Choose a random number from 1 to n. Use a bisection search (See Exercise 10) to find the index where the random number would be inserted in the cumulative sum. Use the index to find the corresponding word in the word list.

Exercise 7: Write a program that uses this algorithm to choose a random word from the book. Solution: http://thinkpython2.com/code/analyze_book3.py.
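For comparison, here is a minimal sketch of the cumulative-sum + bisection method the book describes, assuming a hist dict of word frequencies like the one Chapter 13 builds (random_word and the other names are mine, not the book's). My own solution, which just swaps keys and values instead, follows after it.

import random
from bisect import bisect

def random_word(hist):
    words = list(hist)            # every distinct word in the book
    cumulative = []               # running total of the frequencies
    total = 0
    for w in words:
        total += hist[w]
        cumulative.append(total)
    x = random.randint(1, total)  # pick a position among all word occurrences
    return words[bisect(cumulative, x - 1)]  # bisect finds which word owns that position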

import string
import random
from time import time
def set_book(fin1):
    useless = string.punctuation + string.whitespace + '“' + '”' # chop punctuation and whitespace clean off
    d = {}
    i = 0 # start at 0 so the integer keys line up with random.choice's 0-based index (starting at 1 risks a KeyError on 0)
    for line in fin1:
        line = line.replace('-', ' ') # hyphenated words get split in two; is that really a good idea?
        for word in line.split():
            word = word.strip(useless)
            word = word.lower()
            if word not in d:
                d[word] = i # the value assigned at entry is the word's serial number
                i += 1
            # d[word] = d.get(word, 0) + 1 # I'm not counting frequencies anyway, so no need for this
    return d
fin1 = open('emma.txt', encoding='utf-8')
start = time()
book1 = set_book(fin1)
book2 = dict(zip(book1.values(), book1.keys())) # swap keys and values: the serial number becomes the unique index
print('100 random words in book')
for i in range(100):
    if i > 1 and i % 8 == 0:
        print()
    print(random.choice(book2), end=' ') # index straight to a word, as fast as it gets
print()
end = time()
print(end - start)
# 100 random words in book
# solicit laughing preserve inebriety elton's unimpeded effusions unselfish
# intimate connect native judges charities travel informs colours
# enigmas bragge case greensward cox's particularly unexampled promise
# prone greensward dignity maps fourth christmas creature maximum
# graver mildest pleasant corrected increased named partridge marks
# following kept gloom conjecturing parlour inheriting say consulting
# magnified abundant produces sons malt add unenforceability beautifully
# richly striking confuse greatness asleep steps humility upon
# already paper delight liberties confide appendages undecided male
# prophecies esteem unadorned likelihood shopping deeply unbiased horrors
# man's dumplings business chapter shakespeare sees counsels attentive
# silenced ventured singular double mean waltzes requisite checks
# unattended qualified blessed surmises
# 0.12100672721862793
2020-04
26

Counting the Words in a Book

By xrspook @ 18:12:57 Filed under: 扮IT

Counting the words in a book should be about as ordinary and simple as jobs get, but in practice all sorts of situations crop up. Some you can anticipate, such as full-width punctuation in the typesetting; the program deletes punctuation by default, but what if the typesetter didn't use spaces properly? Some you would never anticipate, such as bizarre strings created by typos. Long ago I noticed that Notepad++ and Word disagree on word counts, with Notepad++ usually reporting the larger number. Which is right? Let it be; knowing roughly the right figure is enough. After all, in the gaokao nobody really docks your marks for coming in a few words short of 800.

I'm appreciating the love-hate relationship between dictionaries and lists more and more deeply.

words.txt is here, and emma.txt is here.

Exercise 1: Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase. Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let’s see if we can make Python swear:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Also, you might consider using the string methods strip, replace and translate.

Exercise 2: Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. Then modify the program to count the total number of words in the book, and the number of times each word is used. Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 3: Modify the program from the previous exercise to print the 20 most frequently used words in the book.

Exercise 4: Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

import string
fin = open('words.txt')
mydict = {}
for line in fin:
    word = line.strip()
    mydict[word] = ''
file = open('emma.txt', encoding='utf-8')
essay = file.read().lower()
essay = essay.replace('-', ' ')
pun = {}
str_all = '“' + '”' + string.punctuation
for x in str_all: # build a dict of every punctuation character
    pun[x] = ''
useless = essay.maketrans(pun) # maketrans with strings needs equal-length arguments; a dict solves that neatly
l = essay.translate(useless).split() # hyphenated words die a horrible death here, but each piece still counts as a word
print('this book has', len(l), 'words')
book = {}
for item in l: # file read as a string, string split into a word list, list folded into a counting dict: word as key, count as value
    book[item] = book.get(item, 0) + 1
list_words1 = sorted(list(zip(book.values(), book.keys())), reverse=True) # dict turned into a list with keys and values swapped
print('this book has', len(list_words1), 'different words')
print('times', 'word', sep='\t')
count = 1
word_len = 0 # minimum word length filter
for times, word in list_words1: # print the 20 most frequent words above a given length (with no filter, the trivially short words of 3 letters and under flood the screen)
    if len(word) > word_len:
        print(times, word, sep='\t')
        count += 1
    if count > 20:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
print(count, 'words in book not in dict') # the result is grim: 590 in total
# this book has 164065 words
# this book has 7479 different words
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2365    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 590 words in book not in dict
# ----------------------------- Solution 2 ----------------------------- the only real difference is how the words are split
import string
def set_book(fin1):
    useless = string.punctuation + string.whitespace + '“' + '”'
    d = {}
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.split():
            word = word.strip(useless)
            word = word.lower()
            d[word] = d.get(word, 0) + 1
    return d
def set_dict(fin2):
    d = {}
    for line in fin2:
        word = line.strip()
        d[word] = d.get(word, 0) + 1
    return d
fin1 = open('emma.txt', encoding='utf-8')
fin2 = open('words.txt')
book = set_book(fin1)
mydict = set_dict(fin2)
l = sorted(list(zip(book.values(), book.keys())), reverse=True)
count = 0
for key in book:
    count = count + book[key]
print('this book has', count, 'words')
print('this book has', len(book), 'different words')
num = 20
print(num, 'most common words in this book')
print('times', 'word', sep='\t')
for times, word in l:
    print(times, word, sep='\t')
    num -= 1
    if num < 1:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
# print()
print(count, 'words in book not in dict')
# this book has 164120 words
# this book has 7531 different words
# 20 most common words in this book
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2364    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 683 words in book not in dict
2020-04
25

What I Should Know but Don't

By xrspook @ 11:35:30 Filed under: 烂日记

Taking two whole days to finally understand one programming exercise is really too much. With someone to point the way it certainly wouldn't have taken that long. And if I had already understood the syntax of everything written in the reference answer before I saw it, I wouldn't have spent so much time either. For me the process of understanding was like solving a riddle: which things are True and which are False. In theory this is extremely simple, but in practice, when you are actually asked, if nobody has ever told you the rules you simply cannot reason it out. For someone like me, whatever I can't figure out I just throw at Python and have it show me what happens.

In the past, when I wrote a condition, I used things that were plainly one way or the other, things everyone knows are True or False. For example, the condition 1 > 2 obviously doesn't hold, so it is False and that branch won't run. But what if I pass a list into the condition? Is a list True or False? What about a list with something in it, numbers or characters? What if it holds only a pair of single quotes, that is, an empty string? And another case: the list is just a pair of square brackets, an empty list. What is that? Normally I wouldn't create these ambiguous headaches for myself; in my own conditions I wouldn't torture myself this way, I'd add an explicit test. If I really did pass a list in as a condition, I would ask whether it is empty, whether its length is greater than 0. As long as there is anything in the list, its length is greater than 0, whether it holds numbers, characters, other lists, or even tuples. Even a list holding only a pair of single quotes, an empty string, has length; such a list has length 1. But an empty list, just a pair of brackets, has length 0. And what if inside the brackets there sits an empty pair of parentheses? Seen from outside, the list has an element; seen from the inside, that tuple is empty. These things are awkward to talk about, and if you don't know the rules you can't possibly answer. Yet the answers are completely definite: black or white, no other choice.
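Here is the kind of quick interpreter check I mean, covering exactly those ambiguous cases:
>>> bool([]), len([])          # an empty list is falsy
(False, 0)
>>> bool(['']), len([''])      # a list holding one empty string is truthy, length 1
(True, 1)
>>> bool([()]), len([()])      # a list holding one empty tuple is also truthy
(True, 1)
>>> bool('')                   # the empty string itself is falsy
False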

I don't know why the reference answer used several different expressions for the same test. Is the author deliberately testing us this way, or is he just being careless? A good programmer usually doesn't make this kind of mistake, if it even counts as a mistake; rather, it's an inconsistent habit. Building consistent habits is very important: commenting habits, for example, and indentation habits too. In Python, indentation is 4 spaces, and without it the program basically can't proceed, because sooner or later you will write a condition, a loop, or a function. As for me, I haven't settled into a spacing habit yet: sometimes I put spaces between operators and their operands and sometimes I don't, going entirely by feel. Sometimes I spread things way out, sometimes I cram them together. Usually, of course, none of this causes problems.

The reason I burned so much time on one exercise is, as I said above, that there were things I should have known but didn't. So I've drawn a conclusion: while reading Think Python 2, I should probably keep a Python reference at hand and consult it as I learn. Obviously Think Python 2 won't tell you every rule, because the authors want you to go learn on your own and master the things they consider essential.

I don't think spending two days to work one problem out thoroughly is a waste of time.

2020-04
24

Two Days Chewing on One Problem

By xrspook @ 20:32:30 Filed under: 扮IT

A moment ago I was congratulating myself that my script outran the reference answer, but on this problem I couldn't even tell what the reference answer was talking about. For a single word I can certainly enumerate every possibility that removing one letter produces, but how do you know which possibility at one level pairs up with which at the next??? I spent two days studying and digesting the answer, working out why it does what it does while also looking for an easier-to-absorb way to express it. What made this problem so agonizing is that, at root, I couldn't see what means I could use to implement it. Without a workable logic there can be no workable program.

Exercise 4: Here’s another Car Talk Puzzler (http://www.cartalk.com/content/puzzlers): What is the longest English word, that remains a valid English word, as you remove its letters one at a time? Now, letters can be removed from either end, or the middle, but you can’t rearrange any of the letters. Every time you drop a letter, you wind up with another English word. If you do that, you’re eventually going to wind up with one letter and that too is going to be an English word—one that’s found in the dictionary. I want to know what’s the longest word and how many letters does it have? I’m going to give you a little modest example: Sprite. Ok? You start off with sprite, you take a letter off, one from the interior of the word, take the r away, and we’re left with the word spite, then we take the e off the end, we’re left with spit, we take the s off, we’re left with pit, it, and I. Write a program to find all words that can be reduced in this way, and then find the longest one. This exercise is a little more challenging than most, so here are some suggestions: You might want to write a function that takes a word and computes a list of all the words that can be formed by removing one letter. These are the “children” of the word. Recursively, a word is reducible if any of its children are reducible. As a base case, you can consider the empty string reducible. The wordlist I provided, words.txt, doesn’t contain single letter words. So you might want to add “I”, “a”, and the empty string. To improve the performance of your program, you might want to memoize the words that are known to be reducible. Solution: http://thinkpython2.com/code/reducible.py.

In the end I think I've finally digested it, and I drew a mind map along the way to help people understand what degree of reduction counts as done and what state counts as failure. [''] and [] are two different things!!!!!!

is_reducible() is the most critical function. The memos are used there, and initializing them with known[''] = [''] is also crucial: this is a guardian pattern, and without the guard is_reducible() simply cannot work. Of the five functions in this script, all but the initial dictionary-building one can be tested on their own; dropping a fixed word into a scaffold (see the sketch after the listing) helps with understanding. cut_letter(), is_reducible() and all_reducible() all end up returning lists, and their shapes are all similar. I hope the comments from my own process of understanding help anyone who needs them. PS: the reference answer's printout is dizzying to read; my revised version's printout is beautiful :)

from time import time
def set_dict(fin):
    d = {}
    for line in fin:
        word = line.strip()
        d[word] = 0
    for word in ['a', 'i', '']:
        d[word] = 0
    return d
def cut_letter(word, d): # generate the child words; returns a list
    l = []
    for i in range(len(word)):
        new_word = word[:i] + word[i+1:]
        if new_word in d:
            l.append(new_word)
    return l # [''] has length 1 and [] has length 0; a word with no children returns [], while 'a' returns ['']
def is_reducible(word, d): # test whether a word can shed letters all the way down; returns a list
    if word in known: # guard pattern: the empty string sits in the initial memo dict; leave it out and recursion always bottoms out at [], fruitlessly
        return known[word]
    # if word == '': # without memos, this guard line is needed instead
    #     return ['']
    l = []
    for new_word in cut_letter(word, d):
        if len(is_reducible(new_word, d)) > 0:
            l.append(new_word)
    known[word] = l
    return l
def all_reducible(d): # collect every word that reduces all the way down; returns a list
    l = []
    for word in d:
        if len(is_reducible(word, d)) > 0: # a non-empty list means the word reduces all the way
            l.append((len(word), word)) # a list of 2-element tuples: first the word's letter count, then the word itself
    new_l = sorted(l, reverse=True) # shedding one letter at a time, the more letters a word has the more levels it can shed
    return new_l
def word_list(word): # print a word and its chain of child words
    if len(word) == 0: # the chain bottoms out at the empty string (the l[0] of ['']), which ends the printing
        return
    print(word)
    l = is_reducible(word, d) # the word has already been verified as a reducible dictionary word, so a non-empty child list is guaranteed
    word_list(l[0]) # when there are several children, just take the first
known = {} # the memos only act inside is_reducible(); besides boosting speed they double as the guard
known[''] = [''] # is_reducible() returns a list, so even for the empty string the value must be a list!
fin = open('words.txt')
start = time()
d = set_dict(fin) # a plain dict: word as key, 0 as value
words = all_reducible(d) # a list of 2-element tuples
for i in range(5):
    word_list(words[i][1]) # the 2nd element of the i-th tuple in the list
end = time()
print(end - start)
# complecting
# completing
# competing
# compting
# comping
# coping
# oping
# ping
# pig
# pi
# i
# twitchiest
# witchiest
# withiest
# withies
# withes
# wites
# wits
# its
# is
# i
# stranglers
# strangers
# stranger
# strange
# strang
# stang
# tang
# tag
# ta
# a
# staunchest
# stanchest
# stanches
# stances
# stanes
# sanes
# anes
# ane
# ae
# a
# restarting
# restating
# estating
# stating
# sating
# sting
# ting
# tin
# in
# i
# 0.6459996700286865
# without memos 1.5830001831054688, with memos 0.6459996700286865
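As mentioned above, each helper can be tested by dropping a fixed word into a scaffold. A minimal sketch, assuming the definitions above have been run and words.txt is present ('sprite' is just the example from the exercise):

known = {'': ['']}                  # reset the guard/memo before a fresh test
d = set_dict(open('words.txt'))
print(cut_letter('sprite', d))      # the children of 'sprite' that are real words
print(is_reducible('sprite', d))    # a non-empty list means 'sprite' reduces all the way
word_list('sprite')                 # prints one full reduction chain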
2020-04
24

The Panda Pulling the Toilet Paper

By xrspook @ 9:19:00 Filed under: 烂日记

Loops and recursion: to a passerby they are much the same thing, just going round and round. But although I've been acquainted with recursion for quite a while now, I'm still terribly afraid of it. The reason is that I find it hard to picture when the recursion will ever end, and I understand very well that until it reaches its end nothing gets returned and I don't get the answer I want. It's like a bottomless pit. So every time I meet recursion I shiver; to get over this I probably need a huge amount of recursion practice, until understanding it becomes a reflex.

I don't know whether recursion, this frightening thing, really gets used often in actual programming. In Think Python 2 recursion appears very early. I remember they combined it with the little turtle to torment people. Looking back, that was probably the right choice, because the turtle draws pictures and lets you see more intuitively what the recursion is doing. I still remember the branching tree and the snowflake they wanted us to understand. The tree was a fairly respectable picture; the snowflake made my scalp crawl. Whenever recursion comes up, I think of the toilet roll we had at home when I was little. On the yellow roll was a picture of a cartoon panda pulling at the paper, and on the roll it was pulling there was another panda pulling at another roll. Every time I saw that picture I would stare at it, and my mind would spin on and on: panda, paper roll, panda, paper roll... until it actually felt frightening. Where does it ever end?! It's not like Russian nesting dolls. I've always felt that however finely matryoshkas are made, they come to an end somewhere, but the paper-pulling panda is a pure nightmare to me. The panda pulling the paper is recursion, I think, and the nesting dolls are a loop.

Yesterday I spent the better part of a day checking some of my workplace's account books and turned up a basketful of problems. I don't know how earlier checks failed to find them. Most were idiotic mistakes, and the non-idiotic ones showed that people simply weren't using their brains while they worked. The inexplicable errors had nearly reached the point where everyone had some, without fail. Impressive! At root, it's because nobody ever standardized how they work. Everyone reads the same rules and understands them their own way, or doesn't read them at all and just copies the previous person, while understanding that person's method differently too. It's like a message whispered from one person to the next: pass it through enough people and the original story turns into a mess. I remember first noticing this phenomenon on 超级掌门人, the variety show on TVB Jade. This was the first time I seriously hunted down their mistakes, checking not only the core figures but also whether the formatting met the requirements. Sometimes I really don't understand them. Working like this, do they feel they've earned their wages? How do they manage to sleep at night? Put another way, why is there nothing to punish such slapdash work? Even if not a monetary penalty, at least something to make them feel bad, a reprimand, say. But perhaps reprimands are useless, like your mother scolding you: in one ear and out the other, just wind past the ears. If punishment won't work, could a reward system? In this workplace you can't even say what's so great about the people who climb up. No punishment and no reward: no wonder, then, that they are the way they are.

Not everyone can be self-disciplined, and with the undisciplined you must use an iron hand.
