python set 交集、并集、差集

2024-06-17 12:15| 来源: 网络整理| 查看: 265

交集(intersection) example： valid = set(['yellow', 'red', 'blue', 'green', 'black']) input_set = set(['red', 'brown']) print(input_set.intersection(valid)) ### 输出：set(['red']) # 方法一: >>> a=[2,3,4,5] >>> b=[2,5,8] >>> tmp = [val for val in a if val in b] >>> tmp [2, 5] # 方法二 >>> list(set(a).intersection(set(b))) [2, 5] # 方法三： >>>list(set(a) & set(b)) [2, 5] 字符串交集 # 方法一： ''.join(sorted(set(str1) & set(str2), key = str1.index)) # 方法二： def strIntersection(s1, s2): out = "" for c in s1: if c in s2 and not c in out: out += c return out # 方法三： >>> a='asdfasdfasfd' >>> b='qazwsxedc' >>> set(a).intersection(b) set(['a', 's', 'd']) # 方法四： def hasIntersection(a, b): return not set(a).isdisjoint(b) 最大交集

How to find all intersections (also called the longest common substrings) of two strings and their positions in both strings? For example: if S1=”never” and S2=”forever” then resulted intersection must be [“ever”] and its positions are [(1,3)]. If S1=”address” and S2=”oddness” then resulted intersections are [“dd”,”ess”] and their positions are [(1,1),(4,4)].

# 方法一： In [31]: import difflib In [32]: difflib.SequenceMatcher(None, "never", "forever").get_matching_blocks() Out[32]: [Match(a=1, b=3, size=4), Match(a=5, b=7, size=0)] In [33]: difflib.SequenceMatcher(None, "address", "oddness").get_matching_blocks() Out[33]: [Match(a=1, b=1, size=2), Match(a=4, b=4, size=3), Match(a=7, b=7, size=0)] # 方法二： import itertools def longest_common_substring(s1, s2): set1 = set(s1[begin:end] for (begin, end) in itertools.combinations(range(len(s1)+1), 2)) set2 = set(s2[begin:end] for (begin, end) in itertools.combinations(range(len(s2)+1), 2)) common = set1.intersection(set2) maximal = [com for com in common if sum((s.find(com) for s in common)) == -1 * (len(common)-1)] return [(s, s1.index(s), s2.index(s)) for s in maximal] >>> longest_common_substring('address', 'oddness') [('dd', 1, 1), ('ess', 4, 4)] >>> longest_common_substring('never', 'forever') [('ever', 1, 3)] >>> longest_common_substring('call', 'wall') [('all', 1, 1)] >>> longest_common_substring('abcd1234', '1234abcd') [('abcd', 0, 4), ('1234', 4, 0)] 并集(union) # 方法一： >>> list(set(a).union(set(b))) [2, 3, 4, 5, 8] # 方法二： >>> list(set(b) | (set(a))) [2, 3, 4, 5, 8] 差集(difference)

差集：找出无效的数据，相当于用一个集合减去另一个集合的数据。

# example： valid = set(['yellow', 'red', 'blue', 'green', 'black']) input_set = set(['red', 'brown']) print(input_set.difference(valid)) ### 输出: set(['brown']) # 方法一： >>> list(set(b).difference(set(a))) # b中有而a中没有的 [8] # 方法二： >>> list(set(b) - (set(a))) [8] 集合操作汇总 >>> x = set('abcde') >>> y = set('bdxyz') >>> x set(['a', 'c', 'b', 'e', 'd']) # 2.6 display format >>> 'e' in x # Membership 成员 True >>> x – y # Difference 差集 set(['a', 'c', 'e']) >>> x | y # Union 并集 set(['a', 'c', 'b', 'e', 'd', 'y', 'x', 'z']) >>> x & y # Intersection 交集 set(['b', 'd']) >>> x ^ y # Symmetric difference (XOR) 补集 set(['a', 'c', 'e', 'y', 'x', 'z']) >>> x > y, x < y # Superset, subset 父级，子级 (False, False) 巨型集合处理（数量在百万，千万甚至更大）

方法一：set 特点： - 速度快； - 内存消耗大，一个1万个元素的集合，其占用的内存远大于1万 * 每个元素的大小，因为整个set数据结构占用大量其他空间来存储索引之类的东西。

并集：s.union(t) 或者 s | t 交集：s.intersection(t) 或者 s & t 差集：s.difference(t) 或者 s - t

方法二：Numpy 特点： - 先把要操作的元素放在数组而不是set中，同样内容的数组占用的内存比set小的多；占用内存小于set的方式； - 速度接近set方式。

import numpy as np 并集： np.union1d(s, t) # 返回排序的、去重的两个list的合集交集： np.intersect1d(s, t, assume_unique=True) # 返回排序的、去重的两个list的交集，尽可能保证传入的两个list是去重的，这可以加快运算速度。差集： np.setdiff1d(s, t, assume_unique=True) # 返回排序的，去重的差集，assume_unique参数同上。

方法三：cmd 以上两种方法的缺点就是当集合足够大而内存又不够的时候，会MemoryError（在试验中2000万个长度为24的字符串在4G的内存中就报MemoryError了）；解决办法：使用linux 命令。特点： - 内存消耗小，会使用临时文件来避免内存问题； - 耗时长。

1.文件排序，使用sort命令： sort --buffer-size=1G --output=/path/to/output /path/to/src_file # --buffer-size在Debian上可用，其他平台未知，不是标准参数. 并集：sort -m /path/to/src1 /path/tosrc2 -u --output=/path/to/result # 注意src1, src2必须是已排序的文件，而且结果也是已排序的。交集：comm -12 file1 file2 > output # 使用comm命令，注意传入的文件必须都是已排序的。差集：comm -3 file1 file2 > output # 使用comm命令，注意传入的文件必须都是已排序的。

综上，三种方法依次对内存的依赖减小，耗时增加，可依据集合大小以及硬件环境来选择。

作者：Chihwei_hsu 来源：http://chihweihsu.com Github：https://github.com/HsuChihwei

【本文地址】

公司简介

联系我们