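Identifying a File's Real Type by Its Magic Number

Introduction

A file downloaded from the internet may look like a png but actually be an exe or some other disguised format, and opening it could get you infected. Or a download may look like a pdf but actually be a rar archive: a PDF reader will not open it, and you have to change the extension to rar and extract it before you can use it.

This post shows how to check a file's real type.

A file's magic number is a specific sequence of bytes stored at the start of the file that marks its type. By checking the magic number, you can identify the file type without relying on the file extension.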

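Viewing a file's magic number on Windows

Press Win+R, type cmd, and open a Command Prompt window.
Run the following command:

certutil -dump <file path> | more

[Figure: a file with a .pdf extension that is actually a RAR archive]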
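If certutil is not available, roughly the same first look at a file can be had from Python. This is a minimal sketch rather than part of the original workflow: `hex_preview` and `file_hex_preview` are names invented here, and the `bytes.hex(" ")` separator form assumes Python 3.8 or later.

```python
def hex_preview(data: bytes, count: int = 16) -> str:
    """Format the first `count` bytes as space-separated uppercase hex."""
    return data[:count].hex(" ").upper()


def file_hex_preview(path: str, count: int = 16) -> str:
    """Read the first `count` bytes of a file and format them as hex."""
    with open(path, "rb") as f:
        return hex_preview(f.read(count), count)


# A RAR archive starts with the ASCII characters "Rar!".
print(hex_preview(b"Rar!\x1a\x07"))
```

Calling `file_hex_preview` with the path of a suspicious download prints the bytes to compare against a signature list.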

Determine the file type from the magic number. Magic numbers of some common file types:

52 61 72 21 (Rar!): RAR archive.
50 4B 03 04: ZIP archive.
FF D8 FF: JPEG image.
89 50 4E 47: PNG image.
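As a rough illustration of the list above, the following sketch compares a file's leading bytes with those four signatures. The names `SIGNATURES`, `sniff`, and `sniff_file` are made up for this example.

```python
# Hypothetical lookup table built from the four signatures listed above.
SIGNATURES = {
    bytes.fromhex("52617221"): "RAR archive",
    bytes.fromhex("504B0304"): "ZIP archive",
    bytes.fromhex("FFD8FF"): "JPEG image",
    bytes.fromhex("89504E47"): "PNG image",
}


def sniff(header: bytes) -> str:
    """Return the label of the first signature that prefixes `header`."""
    for signature, label in SIGNATURES.items():
        if header.startswith(signature):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff(f.read(8))
```

A file named report.pdf for which `sniff_file` returns "RAR archive" is the kind of mismatch described in the introduction.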

For other types, see the list of file signatures on Wikipedia.

If Wikipedia is unreachable, see this (partial) list of file magic numbers.

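Looking it up with Python

Without third-party libraries

The signature table in the Python code below was taken from this GitHub project.

If GitHub is unreliable for you, the JSON data can be downloaded here.

The table entries can be customized.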

import tkinter as tk
from tkinter import filedialog
import sys

def format_hex(hex_string):
    # Insert a space between every two hex digits (one byte), e.g. "4D5A" -> "4D 5A".
    return ' '.join(hex_string[i:i+2] for i in range(0, len(hex_string), 2))

def get_file_type(file_path):
    magic_dict = {
        '00001A00051004': '123',
        '4D5A': 'cpl, dll, exe',
        'DCDC': 'cpl',
        '504B03040A000200': 'epub',
        '0001000000': 'ttf',
        '1F8B08': 'gz, tgz',
        '28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E48657820': 'hqx',
        '0D444F43': 'doc',
        'CF11E0A1B11AE100': 'doc',
        'D0CF11E0A1B11AE1': 'apr, doc, dot, pps, ppt, xla, xls',
        'DBA52D00': 'doc',
        'ECA5C100': 'doc',
        '060E2B34020501010D0102010102': 'mxf',
        '3C435472616E7354696D656C696E653E': 'mxf',
        '2D6C68': 'lha, lzh',
        'CAFEBABE': 'class',
        '000100005374616E64617264204A6574204442': 'img',
        '504943540008': 'img',
        '514649FB': 'img',
        '53434D49': 'img',
        '7E742C015070024D52010000000800000001000031000000310000004301FF0001000800010000007E742C01': 'img',
        'EB3C902A': 'img',
        '4344303031': 'iso',
        '4F67675300020000000000000000': 'oga, ogg, ogv, ogx',
        '504B0304': 'jar, kmz, kwd, odp, odt, ott, oxps, sxc, sxd, sxi, sxw, xpi, xps, zip',
        '25504446': 'ai, fdf, pdf',
        '64000000': 'p10',
        '5B706C61796C6973745D': 'pls',
        '252150532D41646F62652D332E3020455053462D332030': 'eps',
        'C5D0D3C6': 'eps',
        '7B5C72746631': 'rtf',
        '47': 'ts, tsa, tsv',
        '2F2F203C212D2D203C6D64623A6D6F726B3A7A': 'msf',
        '3C4D616B657246696C6520': 'fm, mif',
        '0020AF30': 'tpl',
        '6D7346696C7465724C697374': 'tpl',
        '00001A000210040000000000': 'wk4',
        '00001A000010040000000000': 'wk3',
        '0000020006040600080000000000': 'wk1',
        '1A0000040000': 'nsf',
        '4E45534D1A01': 'nsf',
        '1A0000': 'ntf',
        '30314F52444E414E43452053555256455920202020202020': 'ntf',
        '4E49544630': 'ntf',
        '414F4C564D313030': 'org',
        '576F726450726F': 'lwp',
        '5B50686F6E655D': 'sam',
        '56657273696F6E20': 'mif',
        '3C3F786D6C2076657273696F6E3D22312E30223F3E': 'xul',
        '3026B2758E66CF11A6D900AA0062CE6C': 'asf, wma, wmv',
        '49536328': 'cab, hdr',
        '4D534346': 'cab',
        '0908100000060500': 'pcx, xls',
        'FDFFFFFF04': 'ppt, xls',
        'FDFFFFFF20000000': 'xls',
        '49545346': 'chm',
        '006E1EF0': 'ppt',
        '0F00E803': 'ppt',
        'A0461DF0': 'ppt',
        '0E574B53': 'wks',
        'FF000200040405540200': 'wks',
        '4D6963726F736F66742057696E646F7773204D6564696120506C61796572202D2D20': 'wpl',
        '5B56657273696F6E': 'cif',
        '504B030414000600': 'docx, pptx, xlsx',
        '424F4F4B4D4F4249': 'prc',
        '74424D504B6E5772': 'prc',
        '000000000000000000000000000000000000000000000000': 'pdb',
        '4D2D5720506F636B6574204469637469': 'pdb',
        '4D6963726F736F667420432F432B2B20': 'pdb',
        '736D5F': 'pdb',
        '737A657A': 'pdb',
        'ACED0005737200126267626C69747A2E': 'pdb',
        '00004D4D585052': 'qxd',
        '526172211A0700': 'rar',
        '526172211A070100': 'rar',
        '4D4D4D440000': 'mmf',
        '52545353': 'cap',
        '58435000': 'cap',
        '4D444D5093A7': 'dmp',
        '5041474544553634': 'dmp',
        '5041474544554D50': 'dmp',
        'FF575043': 'wpd',
        '78617221': 'xar',
        '5350464900': 'spf',
        '0764743264647464': 'dtd',
        '504B030414000100630000000000': 'zip',
        '504B0708': 'zip',
        '504B4C495445': 'zip',
        '504B537058': 'zip',
        '57696E5A6970': 'zip',
        '2321414D52': 'amr',
        '2E736E64': 'au',
        '646E732E': 'au',
        '00000020667479704D344120': 'm4a',
        '667479704D344120': 'm4a',
        '494433': 'mp3',
        'FFFB': 'mp3',
        '52494646': 'avi, qcp, wav, webp',
        '49443303000000': 'koz',
        '424D': 'bmp, dib',
        '01000000': 'emf',
        '53494D504C4520203D202020202020202020202020202020202020202054': 'fits',
        '474946383961': 'gif',
        '0000000C6A5020200D0A': 'jp2',
        'FFD8': 'jfif, jpe, jpeg, jpg, mpeg, mpg',
        '89504E470D0A1A0A': 'png',
        '492049': 'tif, tiff',
        '49492A00': 'tif, tiff',
        '4D4D002A': 'tif, tiff',
        '4D4D002B': 'tif, tiff',
        '38425053': 'psd',
        '41433130': 'dwg',
        '00000100': 'ico, mpeg, mpg, spl',
        '4550': 'mdi',
        '233F52414449414E43450A': 'hdr',
        '010009000003': 'wmf',
        'D7CDC69A': 'wmf',
        '46726F6D3A20': 'eml',
        '52657475726E2D506174683A20': 'eml',
        '582D': 'eml',
        '4A47040E': 'art',
        '3C3F786D6C2076657273696F6E3D': 'manifest',
        '2A2A2A2020496E7374616C6C6174696F6E205374617274656420': 'log',
        '424547494E3A56434152440D0A': 'vcf',
        '444D5321': 'dms',
        '0000001466747970336770': '3g2, 3gp',
        '0000002066747970336770': '3g2, 3gp',
        '000000146674797069736F6D': 'mp4',
        '000000186674797033677035': 'mp4',
        '0000001C667479704D534E56012900464D534E566D703432': 'mp4',
        '6674797033677035': 'mp4',
        '667479704D534E56': 'mp4',
        '6674797069736F6D': 'mp4',
        '00000018667479706D703432': 'm4v',
        '00000020667479704D345620': 'flv, m4v',
        '667479706D703432': 'm4v',
        '000001BA': 'mpg',
        '00': 'mov',
        '000000146674797071742020': 'mov',
        '6674797071742020': 'mov',
        '6D6F6F76': 'mov',
        '4350543746494C45': 'cpt',
        '43505446494C45': 'cpt',
        '425A68': 'bz2',
        '454E5452595643440200000102001858': 'vcd',
        '6375736800000002000000': 'csh',
        '4A4152435300': 'jar',
        '504B0304140008000800': 'jar',
        '5F27A889': 'jar',
        'EDABEEDB': 'rpm',
        '435753': 'swf',
        '465753': 'swf',
        '5A5753': 'swf',
        '5349542100': 'sit',
        '5374756666497420286329313939372D': 'sit',
        '7573746172': 'tar',
        'FD377A585A00': 'xz',
        '4D546864': 'mid, midi',
        '464F524D00': 'aiff',
        '664C614300000022': 'flac',
        '727473703A2F2F': 'ram',
        '2E524D46': 'rm',
        '2E524D460000001200': 'ra',
        '2E7261FD00': 'ra',
        '50350A': 'pgm',
        '01DA01010003': 'rgb',
        '1A45DFA3': 'mkv, webm',
        '464C5601': 'flv',
        '3C': 'asx',
    }

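    # Read enough bytes to cover the longest signature in the table;
    # a fixed 32-byte read can never match the entries longer than that.
    header_len = max(len(magic) for magic in magic_dict) // 2
    with open(file_path, 'rb') as f:
        file_header = f.read(header_len)

    # The table keys are uppercase hex, so normalize the header the same way.
    file_header_hex = file_header.hex().upper()

    matched_type = "Unknown"
    matched_header = ""

    # Try longer signatures first so that the most specific match wins.
    for magic, file_type in sorted(magic_dict.items(), key=lambda x: len(x[0]), reverse=True):
        if file_header_hex.startswith(magic):
            matched_type = file_type
            matched_header = magic
            break

    return matched_type, format_hex(matched_header), format_hex(file_header_hex)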

def select_file():
    root = tk.Tk()
    root.withdraw()

    file_path = filedialog.askopenfilename(title="Select a file")
    if file_path:
        file_type, matched_header, file_header = get_file_type(file_path)
        print(f"File: {file_path}\nHeader: {file_header}\nMatched Header: {matched_header}\nType: {file_type}")
    else:
        print("No file selected.")
    input("Press Enter to exit...")
    sys.exit()

if __name__ == "__main__":
    select_file()

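Usage: make sure Python is installed, create a new text file, paste the code into it, change the extension to .py, and double-click it to run. After you pick a file, the script prints the result.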
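The lookup in get_file_type tries longer signatures before shorter ones, so a file starting with 50 4B 03 04 14 00 06 00 is reported as an Office document rather than a generic zip. The sketch below isolates that idea with a two-entry table; `longest_prefix_match` and `sample_table` are illustrative names, not part of the script above.

```python
def longest_prefix_match(header_hex: str, table: dict) -> str:
    """Return the value whose key is the longest prefix of `header_hex`."""
    for magic in sorted(table, key=len, reverse=True):
        if header_hex.startswith(magic):
            return table[magic]
    return "Unknown"


# Two overlapping signatures taken from the table in the script above.
sample_table = {
    "504B0304": "zip",
    "504B030414000600": "docx, pptx, xlsx",
}

print(longest_prefix_match("504B030414000600080000002100", sample_table))
print(longest_prefix_match("504B03040A0000000000", sample_table))
```

Without the sort, the result would depend on dictionary order, and the shorter 504B0304 entry could shadow the more specific one.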

Usage example:

[Screenshot: usage example]

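Using a third-party library

On Windows 10, installing python-magic-bin directly is recommended; it bundles the required dynamic link libraries.

pip install python-magic-bin

You can also install python-magic, but on Windows 10 you then need to download magic1.dll and configure it correctly.

pip install python-magic

Reference: "Python file type identification: python-magic" (CSDN blog)

Download magic1.dll

Download magic1.dll (Lanzou Cloud mirror)

After downloading, extract the archive and put magic1.dll in the D:\Python\Lib\site-packages\magic directory, substituting your own Python installation directory.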

import tkinter as tk
from tkinter import filedialog
import magic
import sys

magic_ins = magic.Magic(mime=True)

def get_file_type(file_path):

    try:
        with open(file_path, 'rb') as f:
            buffer = f.read(2048)

        file_type = magic_ins.from_buffer(buffer)

        file_header = buffer[:16].hex().upper()
        return file_type, file_header
    except FileNotFoundError:
        print("Error: File not found. Please check the file path.")
        return None, None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None, None

def select_file():

    root = tk.Tk()
    root.withdraw()

    file_path = filedialog.askopenfilename(title="Select a file")
    if file_path:
        file_type, file_header = get_file_type(file_path)
        if file_type and file_header:
            print(f"File: {file_path}\nHeader: {file_header}\nType: {file_type}")
        else:
            print("Failed to detect file type.")
    else:
        print("No file selected.")
    input("Press Enter to exit...")
    sys.exit()

if __name__ == "__main__":
    select_file()

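Usage: make sure Python is installed, create a new text file, paste the code into it, change the extension to .py, and double-click it to run. After you pick a file, the script prints the result.

Usage example:

[Screenshot: usage example]

Notes:

If you do not recognize the reported type, look it up in a search engine such as Baidu.

Newer Office files (.xlsx, .docx, .pptx) use the Open XML format and are essentially ZIP archives, so this method identifies them as zip.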
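Because an Open XML file is a ZIP container, its member names can hint at which Office format it is. The sketch below uses only the standard zipfile module and relies on the conventional package layout ([Content_Types].xml plus a word/, xl/, or ppt/ folder); that layout is an assumption of this example, and `ooxml_kind` is a hypothetical helper name.

```python
import io
import zipfile


def ooxml_kind(data: bytes) -> str:
    """Guess docx/xlsx/pptx from the member names of a ZIP container."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not a zip"
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    # Open XML packages carry a content-types part at the archive root.
    if "[Content_Types].xml" not in names:
        return "zip"
    for prefix, kind in (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx")):
        if any(name.startswith(prefix) for name in names):
            return kind
    return "zip"
```

To check a file on disk, read it in binary mode and pass the bytes, for example `ooxml_kind(open(path, "rb").read())`.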

Data processing workflow

Processing the JSON

Format of the data to process

The information is extracted from a JSON file. Part of the file looks like this:

{
  "123": {
    "signs": ["0,00001A00051004"],
    "mime": "application/vnd.lotus-1-2-3"
  },
  "cpl": {
    "signs": ["0,4D5A", "0,DCDC"],
    "mime": "application/cpl+xml"
  },
  "epub": {
    "signs": ["0,504B03040A000200"],
    "mime": "application/epub+zip"
  }
}

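Structure:
1. The outer level is a dictionary whose keys are file types (such as "123" and "cpl").
2. Each key maps to a dictionary containing:
  - signs: a list of strings in the form "offset,signature".
  - mime: the file's MIME type.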
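The offset in each signs entry suggests that a signature does not always sit at the very start of a file. The sketch below shows one way to honor it, assuming the offset is a decimal byte count from the start of the file (the excerpt above does not state this explicitly). `parse_sign` and `matches` are hypothetical helper names.

```python
def parse_sign(sign: str):
    """Split an "offset,signature" string into (offset, signature bytes)."""
    offset_text, signature_hex = sign.split(",", 1)
    # Assumption: the offset is written in decimal.
    return int(offset_text), bytes.fromhex(signature_hex)


def matches(data: bytes, sign: str) -> bool:
    """True if the signature bytes appear in `data` at the stated offset."""
    offset, signature = parse_sign(sign)
    return data[offset:offset + len(signature)] == signature


print(matches(b"MZ\x90\x00", "0,4D5A"))
```

A matcher that only checks the start of the file effectively treats every offset as 0, which is fine for entries like "0,4D5A" but not for signatures stored deeper in the file.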

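Download the complete JSON data and rename the file to input.json. Source of the JSON data: GitHub project.

Target output

The goal is to pair every signature in signs with its outer key and write the pairs to a text file in the following format: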

'00001A00051004': '123',
'4D5A': 'cpl',
'DCDC': 'cpl',
'504B03040A000200': 'epub',

'signature' is the signature value extracted from signs.
'123', 'cpl', and so on are the outer keys.

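Analysis

Processing logic:
- Read the JSON data: parse the file into a Python dictionary.
- Iterate over the dictionary: visit each key-value pair and process its signs field.
- Split the string: use split(',') to extract offset and signature; only signature is needed here.
- Format the output: combine signature and the outer key into a string and write it to the file.
Edge cases:
- If signs is missing or empty, skip the entry.
- Each string in signs must be in the form offset,signature; otherwise the code may raise an error.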
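The edge cases above can also be handled without raising. The generator below is a minimal sketch of one such approach: it skips entries with a missing or empty signs list and ignores strings that are not in "offset,signature" form. `iter_signatures` is a name chosen for this example.

```python
def iter_signatures(json_data: dict):
    """Yield (signature, file_type) pairs, skipping malformed entries."""
    for file_type, info in json_data.items():
        # A missing or empty "signs" list yields nothing for this entry.
        for sign in info.get("signs") or []:
            offset, sep, signature = sign.partition(",")
            if not sep or not signature:
                continue  # not in "offset,signature" form
            yield signature, file_type


sample = {
    "cpl": {"signs": ["0,4D5A", "0,DCDC"]},
    "no_signs": {"mime": "application/octet-stream"},
    "malformed": {"signs": ["4D5A"]},
}
print(list(iter_signatures(sample)))
```

Using partition instead of split means a string without a comma simply produces an empty separator rather than an unpacking error.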

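Syntax

The main syntax and features used:

File operations (the open function):
- Read a file: open(file_path, 'r', encoding='utf-8')
- Write a file: open(file_path, 'w', encoding='utf-8')
- Prefer with for managing the file so that it is closed automatically.
Parsing JSON (the json.load function):
- Converts JSON content into a Python dictionary.
- For example, the JSON object {"key": "value"} becomes the Python dict {"key": "value"}.

Dictionary iteration:

Use for key, value in dict.items() to iterate over a dictionary's key-value pairs.

Example: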

data = {"a": 1, "b": 2}
for key, value in data.items():
    print(key, value)

String splitting (the split method):

Splits a string into a list using the given separator.

Example:

text = "0,00001A00051004"
parts = text.split(",")
print(parts)  # ['0', '00001A00051004']

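Conditionals (the if statement):
- Make sure the data meets the processing conditions.
- For example, if 'signs' in value and value['signs']: checks that value has a signs entry and that it is not empty.

String formatting (f-strings):

Use f"{variable}" to insert a variable into a string.

Example: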

key = "123"
signature = "00001A00051004"
formatted = f"'{signature}': '{key}',"
print(formatted)  # Output: '00001A00051004': '123',

Complete code

The complete code for the logic above:

import json

# Path to the input JSON file
json_file_path = 'input.json'
# Path to the output TXT file
output_txt_path = 'output.txt'

# Read the JSON file
def read_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.load(file)

# Extract the data and write it to the TXT file
def extract_and_write(json_data, output_path):
    with open(output_path, 'w', encoding='utf-8') as file:
        for key, value in json_data.items():
            if 'signs' in value and value['signs']:
                for sign in value['signs']:
                    # Each entry is "offset,signature"; only the signature is written out
                    offset, signature = sign.split(',')
                    formatted_line = f"'{signature}': '{key}',\n"
                    file.write(formatted_line)

# Main function
def main():
    try:
        json_data = read_json(json_file_path)
        extract_and_write(json_data, output_txt_path)
        print(f"Data written to {output_txt_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == '__main__':
    main()

Result

After running, output.txt contains:

'00001A00051004': '123',
'4D5A': 'cpl',
'DCDC': 'cpl',
'504B03040A000200': 'epub',

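Deduplication

Deduplicate the output.txt file produced above. The script below reads its input from input.txt, so rename or copy the file first.

Part of the file looks like this: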

'504B0304': 'odp',
'504B0304': 'odt',
'504B0304': 'ott',
'504B030414000600': 'pptx',
'504B030414000600': 'xlsx',
'504B030414000600': 'docx',

Result after processing:

'504B0304': 'odp, odt, ott',
'504B030414000600': 'docx, pptx, xlsx',

Complete code

# Read and process the txt file, then write the result to a new txt file

def process_txt(input_file, output_file):
    # Read the file contents
    with open(input_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Process the data
    merged_result = {}
    for line in lines:
        # Extract the key-value pair
        line = line.strip().strip(',')
        if not line:
            continue
        key, value = line.split(': ')
        key = key.strip("'")
        value = value.strip("'")
        # Merge the values
        if key in merged_result:
            merged_result[key].add(value)
        else:
            merged_result[key] = {value}

    # Format the result
    final_result = {key: ', '.join(sorted(values)) for key, values in merged_result.items()}

    # Write to the new file
    with open(output_file, 'w', encoding='utf-8') as file:
        for key, values in final_result.items():
            file.write(f"'{key}': '{values}',\n")

# Example call
input_file = 'input.txt'  # input file path
output_file = 'output.txt'  # output file path
process_txt(input_file, output_file)
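The same merge can be done in memory, without the intermediate text file, by grouping (signature, file type) pairs directly. This is an alternative sketch rather than the method used above; `merge_pairs` is a hypothetical name and collections.defaultdict stands in for the explicit if/else.

```python
from collections import defaultdict


def merge_pairs(pairs):
    """Group file types by signature and join them in sorted order."""
    grouped = defaultdict(set)
    for signature, file_type in pairs:
        grouped[signature].add(file_type)  # the set drops duplicates
    return {sig: ", ".join(sorted(types)) for sig, types in grouped.items()}


sample_pairs = [
    ("504B0304", "odp"),
    ("504B0304", "odt"),
    ("504B0304", "ott"),
    ("504B030414000600", "pptx"),
    ("504B030414000600", "xlsx"),
    ("504B030414000600", "docx"),
]
print(merge_pairs(sample_pairs))
```

The returned dictionary has the same shape as final_result in the script, so it can be written out with the same loop.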

Reading the file contents

lines = file.readlines()

This reads all lines; lines then holds:

[
    "'504B0304': 'odp',\n",
    "'504B0304': 'odt',\n",
    "'504B0304': 'ott',\n",
    "'504B030414000600': 'pptx',\n",
    "'504B030414000600': 'xlsx',\n",
    "'504B030414000600': 'docx',\n"
]

Processing the data line by line

for line in lines:
    line = line.strip().strip(',')
    if not line:
        continue
    key, value = line.split(': ')
    key = key.strip("'")
    value = value.strip("'")
    if key in merged_result:
        merged_result[key].add(value)
    else:
        merged_result[key] = {value}

Processing line 1: '504B0304': 'odp',

line = "'504B0304': 'odp'"
key, value = line.split(': ') -> key = "'504B0304'", value = "'odp'"
key.strip("'") -> key = "504B0304"
value.strip("'") -> value = "odp"

merged_result after the update:

{
    '504B0304': {'odp'}
}

Processing line 2: '504B0304': 'odt',

line = "'504B0304': 'odt'"
key = "504B0304", value = "odt"
key already exists in merged_result, so value is added to its set:

{
    '504B0304': {'odp', 'odt'}
}

Processing line 3: '504B0304': 'ott',

line = "'504B0304': 'ott'"
key = "504B0304", value = "ott"
key already exists in merged_result, so value is added to its set:

{
    '504B0304': {'odp', 'odt', 'ott'}
}

Processing line 4: '504B030414000600': 'pptx',

line = "'504B030414000600': 'pptx'"
key = "504B030414000600", value = "pptx"
key does not exist in merged_result yet, so a new set is created holding the value:

{
    '504B0304': {'odp', 'odt', 'ott'},
    '504B030414000600': {'pptx'}
}

Processing line 5: '504B030414000600': 'xlsx',

line = "'504B030414000600': 'xlsx'"
key = "504B030414000600", value = "xlsx"
key already exists in merged_result, so value is added to its set:

{
    '504B0304': {'odp', 'odt', 'ott'},
    '504B030414000600': {'pptx', 'xlsx'}
}

Processing line 6: '504B030414000600': 'docx',

line = "'504B030414000600': 'docx'"
key = "504B030414000600", value = "docx"
key already exists in merged_result, so value is added to its set:

{
    '504B0304': {'odp', 'odt', 'ott'},
    '504B030414000600': {'pptx', 'xlsx', 'docx'}
}

Final result

{
    '504B0304': {'odp', 'odt', 'ott'},
    '504B030414000600': {'pptx', 'xlsx', 'docx'}
}

Merging and sorting the values

final_result = {key: ', '.join(sorted(values)) for key, values in merged_result.items()}

Each set is converted to a comma-separated string, sorted alphabetically:

{
    '504B0304': 'odp, odt, ott',
    '504B030414000600': 'docx, pptx, xlsx'
}

Writing the new file

with open(output_file, 'w', encoding='utf-8') as file:
    for key, values in final_result.items():
        file.write(f"'{key}': '{values}',\n")

Content written to output.txt:

'504B0304': 'odp, odt, ott',
'504B030414000600': 'docx, pptx, xlsx',

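Syntax:

with open()
Syntax:
with open(file, mode, encoding=encoding) as f:
Purpose:
- Opens a file and makes sure it is closed automatically once the block finishes.
- mode specifies how the file is opened, for example:
  - 'r': read (the default).
  - 'w': write (overwrites existing content).
  - 'a': append.
- encoding specifies the text encoding, commonly utf-8. Pass it as a keyword argument, because the third positional parameter of open is buffering.
Example:
with open('example.txt', 'r', encoding='utf-8') as file:
    content = file.read()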

readlines()

Syntax:

lines = file.readlines()

Purpose:

Reads the whole file line by line and returns a list containing every line.

Example: suppose example.txt contains:

line 1
line 2
line 3

Run the following code:

with open('example.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
print(lines)

Output:

['line 1\n', 'line 2\n', 'line 3\n']

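strip()
Syntax:
string.strip([chars])
Purpose:
- Removes the given characters from both ends of a string (by default, whitespace such as spaces and newlines).
Example:
line = " 'key': 'value',\n "
line = line.strip()
print(line)  # Output: 'key': 'value',
Common forms:
- strip(','): removes commas from both ends.
- rstrip(): strips only the right side.
- lstrip(): strips only the left side.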
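One detail worth seeing in action: the argument to strip is treated as a set of characters to remove from both ends, not as a substring. The sketch below uses that to clean a quoted token in one call; `unquote` is a name invented for this example.

```python
def unquote(text: str) -> str:
    """Remove surrounding whitespace, then any leading/trailing ' or , characters."""
    return text.strip().strip("',")


print(unquote(" 'odp',\n"))
print(unquote("'a,b'"))
```

Characters in the middle of the string are left alone, which is why the comma inside 'a,b' is preserved.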

split()

Syntax:

string.split(separator, maxsplit)

Purpose:

Splits a string into a list on the given separator.

Example:

line = "'key': 'value'"
key, value = line.split(': ')
print(key)  # Output: 'key'
print(value)  # Output: 'value'

With no arguments, split() splits on whitespace.
maxsplit limits the number of splits.
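To show what maxsplit buys you: unpacking into exactly two names fails if the separator appears more than once, and limiting the split to one avoids that. A small sketch, with `split_pair` as a hypothetical helper:

```python
def split_pair(line: str):
    """Split "'key': 'value'" into (key, value), splitting at most once."""
    key, value = line.split(": ", 1)
    return key.strip("'"), value.strip("'")


print(split_pair("'504B0304': 'odp'"))
# The value itself contains ": ", but maxsplit=1 keeps it in one piece.
print(split_pair("'k': 'a: b'"))
```

With a plain split(": "), the second call would produce three parts and the two-name unpacking would raise a ValueError.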

The if statement

Syntax:

if condition:
    # runs when the condition is true
else:
    # runs when the condition is false

Example:

key = '504B0304'
merged_result = {}

if key in merged_result:
    print("Key exists.")
else:
    print("Key does not exist.")

Output:

Key does not exist.

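Dictionary operations

Syntax:

Creating a dictionary:

dictionary = {'key1': 'value1', 'key2': 'value2'}

Checking whether a key exists:

if key in dictionary:
    print("Key exists.")

Adding a key-value pair:

dictionary[key] = value

Merging values (as in this code):

if key in dictionary:
    dictionary[key].add(value)
else:
    dictionary[key] = {value}

In this code:

If the key exists, the value is added to its set (which prevents duplicates).
If the key does not exist, a new set is created holding the value.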

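Sets (set)
Purpose:
- An unordered collection of unique elements, well suited to deduplication.
Common operations:
- Creating a set:
my_set = {'a', 'b'}
- Adding an element:
my_set.add('c')  # {'a', 'b', 'c'}
- Deduplicating:
my_list = [1, 2, 2, 3]
unique = set(my_list)  # {1, 2, 3}

Dictionary comprehensions
Syntax:
{key: value for key, value in iterable}
Purpose:
- A concise way to build a dictionary.
In this code:
final_result = {key: ', '.join(sorted(values)) for key, values in merged_result.items()}
Breakdown:
- merged_result.items(): gets the dictionary's key-value pairs.
- sorted(values): sorts the values in the set and returns a list.
- ', '.join(...): joins the list into a single comma-separated string.

Writing to a file (write)
Syntax:
file.write(content)
Purpose:
- Writes content to the file. content must be a string, and write adds nothing of its own, so include any formatting such as the trailing newline yourself.
Example:
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write("'key': 'value',\n")

Formatted strings
Syntax:
f"text {variable}"
Purpose:
- Inserts a variable into a string.
In this code:
file.write(f"'{key}': '{values}',\n")
This formats each key-value pair of the dictionary and writes it to the file.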