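Two Ways to Parse Markdown Files
I write a lot in Markdown. On larger projects, such as a book with many chapters and recurring sections, I sometimes need Python to automate part of the work: marking the difficult characters in every nursery rhyme, processing every image referenced in the files, collecting statistics on a section that is scattered across many places, and so on. All of this requires parsing the Markdown file into a tree (a token tree or an element tree) first. Below are two ways to do that. The first is more reliable; the second will feel familiar to anyone who has written a web scraper.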
import mistune
from lxml import etree
from IPython.display import display
# Raw string, so the LaTeX backslashes (\pm, \sqrt, \over) are kept literally
# without invalid-escape warnings.
markdown_data = r"""
# Title1-1
TEXT after title1-1
TEXT after title1-2
* title1-1 li1
* title1-1 li2
1. title1-1 li1
1. title1-1 li2
TEXT before image  TEXT after image
[GitHub](https://github.com/gera2ld/markmap)
**inline** ~~text~~ *styles*
`inline code`
Katex - $x = {-b \pm \sqrt{b^2-4ac} \over 2a}$
```js
console.log('code block');
```
"""
</div>
</div>
<div class="cell border-box-sizing text_cell rendered" markdown="1">
<div class="inner_cell" markdown="1">
<div class="text_cell_render border-box-sizing rendered_html" markdown="1">
# Getting a token tree by extending mistune.Renderer
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered" markdown="1">
<div class="input">
```python
# Written against mistune 0.8.x, where mistune.Renderer exists (it was removed in 2.0).
class TokenTreeRenderer(mistune.Renderer):
    # options is required
    options = {}

    def placeholder(self):
        # Start every output buffer as a list, so rendered pieces are appended as tokens.
        return []

    def __getattribute__(self, name):
        """Saves the arguments to each Markdown handling method."""
        found = TokenTreeRenderer.__dict__.get(name)
        if found is not None:
            return object.__getattribute__(self, name)

        # Any other rendering method (header, paragraph, link, ...) just records its call.
        def fake_method(*args, **kwargs):
            return [(name, args, kwargs)]
        return fake_method


markdown = mistune.Markdown(renderer=TokenTreeRenderer())
tokenTree = markdown(markdown_data)  # tokenTree = markdown.render(markdown_data)
display(tokenTree)
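# Added sketch (not part of the original notebook): walking the token tree.
# The renderer above returns nested lists of (name, args, kwargs) tuples, with
# child tokens stored as lists inside args. iter_tokens is a hypothetical helper
# that flattens such a tree depth-first. sample_tree is hand-written to imitate
# that shape, so real mistune output may differ in detail.
def iter_tokens(nodes):
    """Yield every (name, args, kwargs) tuple in a nested token tree, depth-first."""
    for name, args, kwargs in nodes:
        yield name, args, kwargs
        for arg in args:
            # A list argument holds the child tokens of this node.
            if isinstance(arg, list):
                yield from iter_tokens(arg)


sample_tree = [
    ("header", ([("text", ("Title1-1",), {})], 1, "Title1-1"), {}),
    ("paragraph", ([
        ("text", ("see ",), {}),
        ("link", ("https://github.com/gera2ld/markmap", None,
                  [("text", ("GitHub",), {})]), {}),
    ],), {}),
]
token_names = [name for name, _args, _kwargs in iter_tokens(sample_tree)]
print(token_names)
# Collect every link target, the kind of query the introduction has in mind.
link_urls = [args[0] for name, args, _kwargs in iter_tokens(sample_tree) if name == "link"]
print(link_urls)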
# Alternatively, get an element tree with lxml
# html = mistune.markdown(markdown_data)
markdown = mistune.Markdown()
html = markdown(markdown_data)
print(html)
# Element tree: etree.HTML() returns the <html> root, so root[0] is <body> here
root = etree.HTML(html)
print(root)
# Element tags
tags = [x.tag for x in root[0]]
print(tags)
# Element text
print(root[0][1].text)
# Serialize to a string (bytes, because an encoding is given)
print(etree.tostring(root, encoding='utf-8'))
print(etree.tostring(root, method="text", encoding='utf-8', pretty_print=True))
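# Added sketch (not part of the original notebook): bytes vs. str output.
# With encoding='utf-8', etree.tostring() returns bytes, which is why the prints
# above show a b'...' literal. Passing encoding='unicode' returns a str instead.
# demo_doc is a hypothetical stand-in for the parsed Markdown.
from lxml import etree

demo_doc = etree.HTML("<p>caf\u00e9</p>")
as_bytes = etree.tostring(demo_doc, encoding="utf-8")
as_str = etree.tostring(demo_doc, encoding="unicode")
print(type(as_bytes).__name__, type(as_str).__name__)
print(as_bytes.decode("utf-8") == as_str)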
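# Find nodes with XPath
display(root.xpath("string()")) # concatenated text # lxml.etree only!
display(root.xpath("//text()")) # list of text nodes # lxml.etree only!
# Same as above, with a precompiled XPath
build_text_list = etree.XPath("//text()")
path = build_text_list(root)
print(path)
# Get the parent element
print(path[0])
print(path[0].getparent().tag)
print(path[0].is_text) # True if this string is an element's .text
print(path[1].is_text)
print(path[1].is_tail) # True if this string is an element's .tail (text after its closing tag)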
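# Added sketch (not part of the original notebook): .text vs. .tail.
# In the hypothetical fragment below, 'Hello' and 'bold' are the .text of <p>
# and <b>. 'tail' comes after </b>, so lxml stores it as the .tail of <b>:
# its is_tail is True and getparent() returns <b>, not <p>.
from lxml import etree

demo_p = etree.HTML("<p>Hello<b>bold</b>tail</p>")
demo_strings = demo_p.xpath("//text()")
text_info = [(str(s), s.getparent().tag, s.is_text, s.is_tail) for s in demo_strings]
for row in text_info:
    print(row)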
# Iterate over the whole tree
for e in root.iter():
    print(f"{e.tag} - {e.text}")
# Filter by tag
for e in root.iter("h1"):
    print(f"{e.tag} - {e.text}")
# Filter by several tags (matches either one)
for e in root.iter("h1", "p"):
    print(f"{e.tag} - {e.text}")
# Find child elements
print(root.find("h2")) # direct children only; find() returns None when nothing matches
print(root.find(".//h2")) # search at any depth
print([ b for b in root.iterfind(".//h2") ]) # iterate over the matches
print(root.findall(".//h2")) # all matches; an empty list when there are none
print(root.findall(".//h1[@x]")) # h1 elements that have an attribute named x
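# Added sketch (not part of the original notebook): what the lookups return on
# a miss, and how attribute predicates select elements. demo_tree is a
# hypothetical document with two <h1> elements, only one of which has attribute x.
from lxml import etree

demo_tree = etree.HTML('<h1 x="1">A</h1><h1>B</h1><p>C</p>')
missing_one = demo_tree.find(".//h2")       # no match: None
missing_all = demo_tree.findall(".//h2")    # no match: empty list
with_attr = [e.text for e in demo_tree.findall(".//h1[@x]")]        # attribute present
with_value = [e.text for e in demo_tree.findall(".//h1[@x='1']")]   # attribute equals '1'
print(missing_one, missing_all, with_attr, with_value)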