Metadata-Version: 2.4
Name: sre_yield
Version: 1.2
Summary: Expands a regular expression to its possible matches
Home-page: https://github.com/google/sre_yield
Author: Alex Perry
Author-email: alex.perry@google.com
License: Apache
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Intended Audience :: Developers
Classifier: Development Status :: 5 - Production/Stable
Classifier: Operating System :: OS Independent
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: summary

=========
sre_yield
=========

Quick Start
===========

The goal of ``sre_yield`` is to efficiently generate all values that can match a
given regular expression, or count possible matches efficiently.  It uses the
parsed regular expression, so you get a much more accurate result than trying
to just split strings.

.. code-block:: pycon

    >>> s = 'foo|ba[rz]'
    >>> s.split('|')  # bad
    ['foo', 'ba[rz]']

    >>> import sre_yield
    >>> list(sre_yield.AllStrings(s))  # better
    ['foo', 'bar', 'baz']

It does this by walking the tree as constructed by ``sre_parse`` (same thing
used internally by the ``re`` module), and constructing chained/repeating
iterators as appropriate.  There may be duplicate results, depending on your
input string though -- these are cases that ``sre_parse`` did not optimize.

.. code-block:: pycon

    >>> import sre_yield
    >>> list(sre_yield.AllStrings('.|a', charset='ab'))
    ['a', 'b', 'a']

...and happens in simpler cases too:

.. code-block:: pycon

    >>> list(sre_yield.AllStrings('a|a'))
    ['a', 'a']


Quirks
======

The membership check, ``'abc' in values_obj`` is by necessity fullmatch -- it
must cover the entire string.  Imagine that it has ``^(...)$`` around it.
Because ``re.search`` can match anywhere in an arbitrarily string, emulating
this would produce a large number of junk matches -- probably not what you
want.  (If that is what you want, add a ``.*`` on either side.)

Here's a quick example, using the presidents regex from http://xkcd.com/1313/

.. code-block:: pycon

    >>> s = 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'

    >>> import re
    >>> re.search(s, 'kennedy') is not None  # note .search
    True
    >>> v = sre_yield.AllStrings(s)
    >>> v.__len__()
    23
    >>> 'bu' in v
    True
    >>> v[:5]
    ['bu', 'rt', 'nt', 'ce', 'oe']

If you do want to emulate search, you end up with a large number of matches
quickly.  Limiting the repetition a bit helps, but it's still a very large
number.

.. code-block:: pycon

    >>> v2 = sre_yield.AllStrings('.{,30}(' + s + ').{,30}')
    >>> el = v2.__len__()  # too big for int
    >>> print(str(el).rstrip('L'))
    57220492262913872576843611006974799576789176661653180757625052079917448874638816841926032487457234703154759402702651149752815320219511292208238103
    >>> 'kennedy' in v2
    True


Capturing Groups
================

If you're interested in extracting what would match during generation of a
value, you can use AllMatches instead to get Match objects.

.. code-block:: pycon

    >>> v = sre_yield.AllMatches(r'a(\d)b')
    >>> m = v[0]
    >>> m.group(0)
    'a0b'
    >>> m.group(1)
    '0'

This even works for simplistic backreferences, in this case to have matching quotes.

.. code-block:: pycon

    >>> v = sre_yield.AllMatches(r'(["\'])([01]{3})\1')
    >>> m = v[0]
    >>> m.group(0)
    '"000"'
    >>> m.groups()
    ('"', '000')
    >>> m.group(1)
    '"'
    >>> m.group(2)
    '000'


Reporting Bugs, etc.
====================

We welcome bug reports -- see our issue tracker on `GitHub
<https://github.com/google/sre_yield/issues>`_ to see if it's been reported before.
If you'd like to discuss anything, we have a `Google Group
<https://groups.google.com/group/sre_yield>`_ as well.


Related Modules
===============

We're aware of three similar modules, but each has a different goal.


xeger
-----

Xeger was originally written `in Java <https://code.google.com/p/xeger/>`_ and
ported `to Python <https://bitbucket.org/leapfrogdevelopment/rstr>`_.  This
generates random entries, which may suffice if you want to get just a few
matching values.  This module and ``xeger`` differ statistically in the way
they handle repetitions:

.. code-block:: pycon

    >>> import random
    >>> v = sre_yield.AllStrings('[abc]{1,4}')
    >>> len(v)
    120

    # Now random.choice(v) has a 3/120 chance of choosing a single letter.
    >>> len([x for x in v if len(x) == 1])
    3

    # xeger(v) has ~25% chance of choosing a single letter, because the length
    and match are chosen independently.
    # (This doesn't run, so the doctests don't require xeger)
    > from rstr import xeger
    > sum([1 if len(xeger('[abc]{1,4}')) == 1 else 0 for _ in range(120)])
    26

In addition, ``xeger`` differs in the default matching of ``'.'`` is for
printable characters (which you can get by setting ``charset=string.printable``
in ``sre_yield``, if desired).


sre_dump
--------

Another module that walks ``sre_parse``'s tree is ``sre_dump``, although it
does not generate matches, only reconstructs the string pattern (useful
primarily if you hand-generate a tree).  If you're interested in the space,
it's a good read.  http://www.dalkescientific.com/Python/sre_dump.html


jpetkau1
--------

Can find matches by using randomization, so sort of handles anchors.  Not
guaranteed though, but another good look at internals.
http://web.archive.org/web/20071024164712/http://www.uselesspython.com/jpetkau1.py
(and slightly older version in the announcement on `python-list
<https://mail.python.org/pipermail/python-list/2001-August/104757.html>`_).


Differences between sre_yield and the re module
===============================================

There are certainly valid regular expressions which ``sre_yield`` does not
handle.  These include things like lookarounds, backreferences, but also a few
other exceptions:

- The maximum value for repeats is system-dependant -- CPython's ``sre`` module
  there's a special value which is treated as infinite (either 2**16-1 or
  2**32-1 depending on build).  In sre_yield, this is taken as a literal,
  rather than infinite, thus (on a 2**16-1 platform):

  .. code-block:: pycon

      >>> len(sre_yield.AllStrings('a*')[-1])
      65535
      >>> import re
      >>> len(re.match('.*', 'a' * 100000).group(0))
      100000

- The ``re`` `module docs <http://docs.python.org/2/library/re.html#regular-expression-syntax>`_
  say "Regular expression pattern strings may not contain null bytes"
  yet this appears to work fine.
- Order does not depend on greediness.
- The regex is treated as fullmatch.
- ``sre_yield`` is confused by complex uses of anchors, but support simple ones:

  .. code-block:: pycon

      >>> list(sre_yield.AllStrings('foo$'))
      ['foo']
      >>> list(sre_yield.AllStrings('^$'))
      ['']
      >>> list(sre_yield.AllStrings('.\\b.'))  # doctest: +IGNORE_EXCEPTION_DETAIL
      Traceback (most recent call last):
      ...
      ParseError: Non-end-anchor None found at END state
