Admin-Tools
Editing file: clean.cpython-38.pyc
Source file: /opt/hc_python/lib64/python3.8/site-packages/lxml/html/clean.py

A cleanup tool for HTML. Removes unwanted tags and content. See the `Cleaner` class for details.

Imported names: urlsplit, unquote_plus, etree, defs, fromstring, XHTML_NAMESPACE, xhtml_to_html, _transform_result

Public names: clean_html, clean, Cleaner, autolink, autolink_html, word_break, word_break_html

Regular expressions defined at module level:
    expression\s*\(.*?\)
    @\s*import
    </?[a-zA-Z]+|\son[a-zA-Z]+\s*=
    data:image/(.+);base64,
    (javascript|jscript|livescript|vbscript|data|about|mocha):
    (xml|svg)
    [\s\x00-\x08\x0B\x0C\x0E-\x19]+
    \[if[\s\n\r]+.*?][\s\n\r]*>

XPath expressions defined at module level (the ``x`` prefix is bound through the ``namespaces`` argument):
    descendant-or-self::*[@style]
    descendant-or-self::a [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']

class Cleaner

    Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.

    ``scripts``:
        Removes any ``<script>`` tags.

    ``javascript``:
        Removes any Javascript, like an ``onclick`` attribute. Also removes stylesheets as they could contain Javascript.

    ``comments``:
        Removes any comments.

    ``style``:
        Removes any style tags.

    ``inline_style``:
        Removes any style attributes.
        Defaults to the value of the ``style`` option.

    ``links``:
        Removes any ``<link>`` tags.

    ``meta``:
        Removes any ``<meta>`` tags.

    ``page_structure``:
        Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

    ``processing_instructions``:
        Removes any processing instructions.

    ``embedded``:
        Removes any embedded objects (flash, iframes).

    ``frames``:
        Removes any frame-related tags.

    ``forms``:
        Removes any form tags.

    ``annoying_tags``:
        Tags that aren't *wrong*, but are annoying. ``<blink>`` and ``<marquee>``.

    ``remove_tags``:
        A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.

    ``kill_tags``:
        A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.

    ``allow_tags``:
        A list of tags to include (default include all).

    ``remove_unknown_tags``:
        Remove any tags that aren't standard parts of HTML.

    ``safe_attrs_only``:
        If true, only include 'safe' attributes (specifically the list from the feedparser HTML sanitisation web site).

    ``safe_attrs``:
        A set of attribute names to override the default list of attributes considered 'safe' (when safe_attrs_only=True).

    ``add_nofollow``:
        If true, then any <a> tags will have ``rel="nofollow"`` added to them.

    ``host_whitelist``:
        A list or set of hosts that you can use for embedded content (for content like ``<object>``, ``<link rel="stylesheet">``, etc). You can also implement/override the method ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) ``embedded``.

        Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.

        Note that you may also need to set ``whitelist_tags``.

    ``whitelist_tags``:
        A set of tags that can be included with ``host_whitelist``.
        The default is ``iframe`` and ``embed``; you may wish to include other tags like ``script``, or you may want to implement ``allow_embedded_url`` for more control. Set to None to include all tags.

    This modifies the document *in place*.

    Constructor: keyword arguments are stored as attributes of the instance. Error messages present in the compiled code:
        TypeError: Expected a collection, got str: <name>=<value>
        TypeError: Unknown parameter: <name>=<value>
        ValueError: It does not make sense to pass in both allow_tags and remove_unknown_tags

    Calling an instance on a document: "Cleans the document."
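The module-level patterns above include one for image data URLs, one for image types, and one for risky URL schemes, and the compiled body of `_has_javascript_scheme` appears to combine them by counting. The following is a minimal standard-library sketch of that idea, not the library's actual source: the three pattern strings are copied from the constants in the file, while the `re.I` flag, the helper names without a leading underscore, and the exact control flow are assumptions made for illustration.

```python
import re

# Pattern strings as they appear among the module constants.
# Assumption: all three are compiled case-insensitively.
find_image_dataurls = re.compile(r"data:image/(.+);base64,", re.I).findall
is_unsafe_image_type = re.compile(r"(xml|svg)", re.I).search
possibly_malicious_schemes = re.compile(
    r"(javascript|jscript|livescript|vbscript|data|about|mocha):", re.I
).findall


def has_javascript_scheme(s: str) -> bool:
    """Return True if `s` holds a risky scheme that is not a safe image data URL.

    Hypothetical helper: every base64 image data URL with a harmless type
    "pays for" one match of the scheme pattern (its own ``data:`` prefix).
    Any scheme match left over, or any SVG/XML image type, counts as unsafe.
    """
    safe_image_urls = 0
    for image_type in find_image_dataurls(s):
        if is_unsafe_image_type(image_type):
            return True  # SVG and XML images can carry script
        safe_image_urls += 1
    return len(possibly_malicious_schemes(s)) > safe_image_urls


if __name__ == "__main__":
    for url in (
        "javascript:alert(1)",
        "data:image/png;base64,AAAA",
        "data:image/svg+xml;base64,AAAA",
        "http://example.com/",
    ):
        print(url, has_javascript_scheme(url))
```

The counting approach lets an ordinary embedded PNG or JPEG through even though its URL begins with `data:`, while a `data:` URL of any other kind still trips the check.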