2020-04
26

算算书里有多少单词

By xrspook @ 18:12:57 归类于: 扮IT

算算书里有多少单词应该是很大路简单的事,但实际上各种状况层出不穷。有些是你料到的,比如排版的用了全角的标点符号,程序默认会删掉标点符号,万一排版那个没有规范地使用空格呢?有些是你不会料到的,比如手误创造出奇葩字符串。很早以前我就发现Notepad++和Word里算的字数是不一致的,Notepad++通常算出来的数都会大一些。谁对谁错,随缘吧,知道大概差不多也就行了,毕竟高考的时候你写少几个字不到800也不会真扣你的分。

字典和列表的相爱相杀我体会得越来越深刻了。

words.txt在这里,emma.txt在这里。

Exercise 1: Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase. Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let’s see if we can make Python swear:
>>> import string
>>> string.punctuation
‘!”#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’
Also, you might consider using the string methods strip, replace and translate.

Exercise 2: Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. Then modify the program to count the total number of words in the book, and the number of times each word is used. Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 3: Modify the program from the previous exercise to print the 20 most frequently used words in the book.

Exercise 4: Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
import string
fin = open('words.txt')
mydict = {}
for line in fin:
    word = line.strip()
    mydict[word] = ''
file = open('emma.txt', encoding = 'utf-8')
essay = file.read().lower()
essay = essay.replace('-', ' ')
pun = {}
str_all = '“' + '”' + string.punctuation
for x in str_all: # 建立各种标点符号字符的字典
    pun[x] = ''
useless = essay.maketrans(pun) # maketrans必须被替换和替换等长,字典完美解决这个问题
l = essay.translate(useless).split() # 那些含-的单词会死得很惨,但仍然算是个单词
print('this book has', len(l), 'words')
book = {}
for item in l: # 读取文件为字符串,字符串转为单词列表,列表转为计数的字典,单词为键,次数为键值
    book[item] = book.get(item, 0) + 1
list_words1 = sorted(list(zip(book.values(), book.keys())), reverse = True) # 字典转为列表,键与键值换位
print('this book has', len(list_words1), 'different words')
print('times', 'word', sep='\t')
count = 1
word_len = 0 # 限制最小词长
for times, word in list_words1: # 打印大于某长度用得最多的20个词(不限制,3个字母及以下最最简单的会刷屏)
    if len(word) > word_len:
        print(times, word, sep='\t')
        count += 1
    if count > 20:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
print(count, 'words in book not in dict') # 结果惨不忍睹,合计590个
# this book has 164065 words
# this book has 7479 different words
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2365    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 590 words in book not in dict
# -----------------------------解法二----------------------------- 其实就是切单词方法有差异
import string
def set_book(fin1):
    useless = string.punctuation + string.whitespace + '“' + '”'
    d = {}
    for line in fin1:
        line = line.replace('-', ' ')
        for word in line.split():
            word = word.strip(useless)
            word = word.lower()
            d[word] = d.get(word, 0) + 1
    return d
def set_dict(fin2):
    d = {}
    for line in fin2:
        word = line.strip()
        d[word] = d.get(word, 0) + 1
    return d
fin1 = open('emma.txt', encoding='utf-8')
fin2 = open('words.txt')
book = set_book(fin1)
mydict = set_dict(fin2)
l = sorted(list(zip(book.values(), book.keys())), reverse=True)
count = 0
for key in book:
    count = count + book[key]
print('this book has', count, 'words')
print('this book has', len(book), 'different words')
num = 20
print(num, 'most common words in this book')
print('times', 'word', sep='\t')
for times, word in l:
    print(times, word, sep='\t')
    num -= 1
    if num < 1:
        break
count = 0
for word in book:
    if word not in mydict:
        # print(word, end=' ')
        count += 1
# print()
print(count, 'words in book not in dict')
# this book has 164120 words
# this book has 7531 different words
# 20 most common words in this book
# times   word
# 5379    the
# 5322    to
# 4965    and
# 4412    of
# 3191    i
# 3187    a
# 2544    it
# 2483    her
# 2401    was
# 2364    she
# 2246    in
# 2172    not
# 2069    you
# 1995    be
# 1815    that
# 1813    he
# 1626    had
# 1448    as
# 1446    but
# 1373    for
# 683 words in book not in dict
2020-04
21

字典真牛

By xrspook @ 12:03:27 归类于: 扮IT

字典用好了以后666得不知道该如何形容。创造字典,搜索词典,快得就像眨眼间。列表需要考虑的长度问题,字典里一个in就高效快捷了。

这是一道扯淡的习题,因为条件没固定,参考答案可以怎么参考呢?

Exercise 5: Two words are “rotate pairs” if you can rotate one of them and get the other (see rotate_word in Exercise 5). Write a program that reads a wordlist and finds all the rotate pairs. Solution: http://thinkpython2.com/code/rotate_pairs.py.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
def rotate_word(something, n): # a-z: 97-122, A-Z: 65-90
    newletter1 = ''
    for letter in something:
        if ord(letter) < ord('A') or ord('Z') < ord(letter) < ord('a') or ord('z') < ord(letter):
            newletter2 = newletter1 + letter
        else:
            if ord(letter) + n > ord('z'):
                newletter2 = newletter1 + chr(ord(letter) + n - 26)
            elif ord('a') > ord(letter) + n > ord('Z'):
                newletter2 = newletter1 + chr(ord(letter) + n - 26)
            else:
                newletter2 = newletter1 + chr(ord(letter) + n)
        newletter1 = newletter2
    return newletter2
def set_dict(fin):
    d = {}
    for line in fin:
        word = line.strip().lower()
        d[word] = 0
    return d
fin = open('words.txt')
count = 0
d = set_dict(fin)
for word in d:    
    for i in range(1, 14): # 8.13.5的习题没有说要移多少位,参考答案直接定义为1-14位,你猜我在想什么
        if rotate_word(word, i) in d:
            print(word, i, rotate_word(word, i))
            count += 1
print(count)
# ......
# zax 1 aby
# zax 4 deb
# zebu 7 glib
# zebu 10 jole
# zee 1 aff
# zee 9 inn
# zips 12 lube
# zoa 6 fug
# zoa 12 lam
# 1137
2020-04
20

更爽

By xrspook @ 12:00:26 归类于: 扮IT

战胜参考答案从昨晚开始貌似就成了我最大的快乐。互锁词的生成要比昨天的回文词复杂一些,因为这意味着搜索的次数更多了。虽然都是用二分法搜索的思路,但我就是要比参考答案快接近30倍肿么破。至于为什么会这样,我没有研究,或许我应该仔细研究一下。

10万条的单词表里:二词互锁1254条,我用时1.4秒,参考答案用时39.1秒;三词互锁991条,我用时1.5秒,参考答案用时43.4秒。一箩筐的成就感啊啊啊啊啊啊啊~~~

Exercise 12: Two words “interlock” if taking alternating letters from each forms a new word. For example, “shoe” and “cold” interlock to form “schooled”. Solution: http://thinkpython2.com/code/interlock.py. Credit: This exercise is inspired by an example at http://puzzlers.org. Write a program that finds all pairs of words that interlock. Hint: don’t enumerate all pairs! Can you find any words that are three-way interlocked; that is, every third letter forms a word, starting from the first, second or third?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
import time
def in_bisect(library, first, last, myword): # 二分法搜索,10万数据查询最多只需不到20步
    if first > last: # 这是一句拯救了我的条件
        return -1
    else:
        mid = (first + last)//2
        if myword == library[mid]:
            return mid
        elif library[mid] > myword:
            return in_bisect(library, first, mid-1, myword)
        else:
            return in_bisect(library, mid+1, last, myword)
def interlock(library, i, num):
    m = 0
    n = num
    while m < num:
        if in_bisect(library, 0, len(library)-1, library[i][m::n]) == -1:
            return False
        else:
            m += 1
            n -+ 1
    return True
count = 0
library = []
fin = open('words.txt')
for line in fin:
    word = line.strip()
    library.append(word)
library.sort()
start = time.time()
# for i in range(len(library)-1): # 二词互锁
#     if interlock(library, i, 2):
#         print(library[i], library[i][::2], library[i][1::2])
#         count += 1
for i in range(len(library)-1): # 三词互锁
    if interlock(library, i, 3):
        print(library[i], library[i][::3], library[i][1::3], library[i][2::3])
        count += 1
print(count)
end = time.time()
print(end - start)
# 1254, 1.3558001518249512 # 二词互锁 xrspook解法
# 1254, 39.10080027580261  # 二词互锁 参考答案
# 991, 1.4504001140594482  # 三词互锁 xrspook解法
# 991, 43.366000175476074  # 三次互锁 参考答案
2020-04
16

循环,循环

By xrspook @ 19:58:07 归类于: 扮IT

觉得自己虽然见过递归,但几乎不用,不逼着我我都不用,循环用得越来越遛。前两题我和参考答案得出的结论一致,最后一题,我觉得参考答案有问题。下面的都是我的脚本。下面要用到的words.txt在这里

Exercise 7:This question is based on a Puzzler that was broadcast on the radio program Car Talk: Give me a word with three consecutive double letters. I’ll give you a couple of words that almost qualify, but don’t. For example, the word committee, c-o-m-m-i-t-t-e-e. It would be great except for the ‘i’ that sneaks in there. Or Mississippi: M-i-s-s-i-s-s-i-p-p-i. If you could take out those i’s it would work. But there is a word that has three consecutive pairs of letters and to the best of my knowledge this may be the only word. Of course there are probably 500 more but I can only think of one. What is the word? Write a program to find it. Solution: http://thinkpython2.com/code/cartalk1.py.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
def double_letter(word):
    num = 0
    i = 0
    if len(word) >= 6:
        while i < len(word)-1:
            if word[i] == word[i+1]: 
                num = num + 1
                i = i + 2
            elif i > 2 and word[i-2] != word[i-3]:
                break
            else:
                i = i + 1
        if num == 3:
            print(word)
fin = open('words.txt')
n = 0
for line in fin:
    word = line.strip()
    double_letter(word)
# bookkeeper
# bookkeepers
# bookkeeping
# bookkeepings

Exercise 8: Here’s another Car Talk Puzzler: “I was driving on the highway the other day and I happened to notice my odometer. Like most odometers, it shows six digits, in whole miles only. So, if my car had 300,000 miles, for example, I’d see 3-0-0-0-0-0. “Now, what I saw that day was very interesting. I noticed that the last 4 digits were palindromic; that is, they read the same forward as backward. For example, 5-4-4-5 is a palindrome, so my odometer could have read 3-1-5-4-4-5. “One mile later, the last 5 numbers were palindromic. For example, it could have read 3-6-5-4-5-6. One mile after that, the middle 4 out of 6 numbers were palindromic. And you ready for this? One mile later, all 6 were palindromic! “The question is, what was on the odometer when I first looked?” Write a Python program that tests all the six-digit numbers and prints any numbers that satisfy these requirements. Solution: http://thinkpython2.com/code/cartalk2.py.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
def is_palindrome(word):
    if word[::-1] == word:
        return True 
def test_palindrome(number):
    if is_palindrome(str(number)[2:]):
        if is_palindrome(str(number+1)[1:]):
            if is_palindrome(str(number+2)[1:1]):
                if is_palindrome(str(number+3)):
                    return True
for number in range(100000, 999999):
    if test_palindrome(number):
        print(number)
# 198888
# 199999

Exercise 9: Here’s another Car Talk Puzzler you can solve with a search: “Recently I had a visit with my mom and we realized that the two digits that make up my age when reversed resulted in her age. For example, if she’s 73, I’m 37. We wondered how often this has happened over the years but we got sidetracked with other topics and we never came up with an answer. “When I got home I figured out that the digits of our ages have been reversible six times so far. I also figured out that if we’re lucky it would happen again in a few years, and if we’re really lucky it would happen one more time after that. In other words, it would have happened 8 times over all. So the question is, how old am I now?” Write a Python program that searches for solutions to this Puzzler. Hint: you might find the string method zfill useful. Solution: http://thinkpython2.com/code/cartalk3.py.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
year = 99
meet = int(input('how many times have we met?(1-8): '))
print('mom born me at', '\t','my age', '\t',"mon's age")
for i in range(10, 80): # 假设你妈生你的最低年龄是10,最高年龄是80
    n = 0
    for age in range(1, year):
        if age < int(str(age).zfill(2)[::-1]) and int(str(age).zfill(2)[::-1]) - age == i:            
            # print(i, '\t\t', age, '\t\t', str(age).zfill(2)[::-1])             
            n = n + 1
            if n == meet:
                print(i, '\t\t', age, '\t\t', str(age).zfill(2)[::-1])
 
# how many times have we met?(1-8): 6
# mom born me at   my age          mon's age
# 18               57              75
# 27               58              85
# 36               59              95
 
# how many times have we met?(1-8): 8
# mom born me at   my age          mon's age
# 18               79              97
 
# mom born me at   my age          mon's age
# 18               2               20
# 18               13              31
# 18               24              42
# 18               35              53
# 18               46              64
# 18               57              75
# 18               68              86
# 18               79              97
# 27               3               30
# 27               14              41
# 27               25              52
# 27               36              63
# 27               47              74
# 27               58              85
# 27               69              96
# 36               4               40
# 36               15              51
# 36               26              62
# 36               37              73
# 36               48              84
# 36               59              95
# 45               5               50
# 45               16              61
# 45               27              72
# 45               38              83
# 45               49              94
# 54               6               60
# 54               17              71
# 54               28              82
# 54               39              93
# 63               7               70
# 63               18              81
# 63               29              92
# 72               8               80
# 72               19              91
2012-07
25

飞吞现西一

By xrspook @ 16:16:46 归类于: 烂日记

下午不舒服,软绵绵且没事干很无聊,于是挖出了现西一看起来。一目十行,因为我都懂的,Rosetta Stone不是盖的,半年的时光也不是盖的。一个小时就可以飞看100页,神一般的速度,这不是小说,这是西班牙语专业课本啊!之所以可以那么快,因为里面的东西我都知道,我都学过,只是confirm一下而已。这样发展下去的话,估计回家的时候我要把现西二带上了。现西的前言说,在前三册里他们会大体把常用的语法都过一遍。RS学到现在,我都不知道我神马事态没见过了,很多固定搭配我肯定还不懂,但事态,估计差不多了,因为我觉得那些变化越来越少,不过因为有些我东西我一直是猜的,一直都只是觉得大概是这个样子,所以这就是我为什么现在要看传统西语教材了。

武器我有了,但却不怎么用,该怎么组装起来,但如果你在用的话,我会知道你在干什么。

这周我的RS学习速度不知为啥非常慢,从上周开始我就在琢磨要不要先停掉新课程,重温以前的核心课程。因为很多用法和单词我只能认出来不能随口说了,这是很恐怖的。就像学习语文到小学二、三年级开始提笔忘字一样。这是一个奇怪的状态,学语文的时候不知道从什么时候开始就没有了那个烦恼。西语,我很少去额外书写或键盘拼写,所以这种现象出现得更早。每一课通常会有30+个新单词,也就是每周要消化掉60个+,每周却只有4-5个小时去记忆,神仙都难记住啊。

记得小学的某个数学老师说过记忆是有一个规律的,神马现在记住,过多少个小时后重温,然后是多少天后重温,在关键时间点上控制好,基本上就不会忘记了。我没有对西语采取这种策略,哪怕我早上学完,晚上再读一读估计效果也会好很多。

懒,谁叫我是个懒人,没有测验,没有考试,没有压力,我只是在随心所欲,坚持着干某事。但事实证明,单是这种程度的坚持已经无法让我走得更远更好了,我必须主动付出更多。

浑身酸软,但没有发烧,不过我没有一点心情去干任何事了,坐姿也改变了N回。

人啊,难免偶尔会有这种不知所谓的时光。

© 2004 - 2024 我的天 | Theme by xrspook | Power by WordPress