For Regex I like lens-regex-pcre > import Control.Regex.Lens.Text > "Foo, bar" ^...

HelloNurse · on Sept 13, 2024

You are matching ASCII letters? Cute. What about Unicode character classes like \p{Spacing_Combining_Mark} and non-BMP characters?

Can you translate the examples at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... to Haskell? This Control.Regex.Lens.Text library doesn't seem to believe in documenting the supported syntax, options, etc.

tome · on Sept 13, 2024

"Cute" comes across as very dismissive. I'm not sure if you intended that. lens-regex-pcre is just a wrapper around PCRE, so anything that works in PCRE will work, for example, from your Mozilla reference:

    ghci> "California rolls $6.99\nCrunchy rolls $8.49\nShrimp tempura $10.99" ^.. [regex|\p{Sc}\s*[\d.,]+|] . match
    ["$6.99","$8.49","$10.99"]

"Spacing combining mark" seems to be "Mc" so this works:

https://unicode.org/reports/tr18/#General_Category_Property

    ghci> "foo bar \x093b baz" ^.. [regex|\p{Mc}|] . match
    ["\2363"]

(U+093b is a spacing combining mark, according to https://graphemica.com/categories/spacing-combining-mark)

I think in general that Haskellers would probably move to parser combinators in preference to regex when things get this complicated. I mean, who wants to read "\p{Sc}\s*[\d.,]+" in any case?
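To make the parser-combinator suggestion concrete, here is a minimal sketch of the same price pattern using `Text.ParserCombinators.ReadP` from base. It is an illustration only, not something posted in the thread: the names `price` and `prices` are hypothetical, `isSpace` only approximates PCRE's `\s`, and the scanning strategy (try the parser at each position, then skip past a match) is an assumption about how one might mimic a regex search.

```haskell
import Data.Char (GeneralCategory (CurrencySymbol), generalCategory, isDigit, isSpace)
import Text.ParserCombinators.ReadP (ReadP, munch, munch1, readP_to_S, satisfy)

-- Hypothetical parser mirroring \p{Sc}\s*[\d.,]+ :
-- one currency symbol, optional whitespace, then ASCII digits, dots, or commas.
price :: ReadP String
price = do
  sym <- satisfy ((== CurrencySymbol) . generalCategory)
  ws <- munch isSpace
  amount <- munch1 (\c -> isDigit c || c `elem` ".,")
  pure (sym : ws ++ amount)

-- Hypothetical helper: collect every non-overlapping match, scanning left to right.
prices :: String -> [String]
prices [] = []
prices s@(_ : rest) =
  case readP_to_S price s of
    -- No match starting here: move one character forward.
    [] -> prices rest
    -- Match found: keep it and continue after the consumed input.
    ((m, remaining) : _) -> m : prices remaining
```

In GHCi, `prices "California rolls $6.99\nCrunchy rolls $8.49\nShrimp tempura $10.99"` should give `["$6.99","$8.49","$10.99"]`, the same matches as the regex version.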

HelloNurse · on Sept 13, 2024

U+093b is still in the BMP. By the way, what text encodings for source files are supported by GHC? Escaping everything isn't fun.

And I am not sold on lens-regex-pcre documentation; "anything that works in PCRE will work" comes across as very dismissive. What string-like types are supported? What version of PCRE or PCRE2 does it use?
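For readers unsure what "in the BMP" means: the Basic Multilingual Plane covers code points U+0000 through U+FFFF, so a non-BMP character is one above U+FFFF. A minimal sketch with base's `Data.Char` shows the distinction. The helper names `isNonBMP`, `isSpacingMark`, and `nonBmpSpacingMarks` are hypothetical, and the categories come from the Unicode tables shipped with GHC's base, which may not match the tables in whichever PCRE is installed.

```haskell
import Data.Char (GeneralCategory (SpacingCombiningMark), generalCategory, ord)

-- True when the code point lies outside the Basic Multilingual Plane (above U+FFFF).
isNonBMP :: Char -> Bool
isNonBMP c = ord c > 0xFFFF

-- True for Unicode general category Mc, according to base's tables.
isSpacingMark :: Char -> Bool
isSpacingMark c = generalCategory c == SpacingCombiningMark

-- Keep only the characters that are both non-BMP and spacing combining marks.
nonBmpSpacingMarks :: String -> String
nonBmpSpacingMarks = filter (\c -> isNonBMP c && isSpacingMark c)
```

Under these definitions `isNonBMP '\x093b'` is `False`, while a character such as U+1D11E (MUSICAL SYMBOL G CLEF) is non-BMP. Characters found with `nonBmpSpacingMarks` would be candidates to try against `[regex|\p{Mc}|]`; whether lens-regex-pcre matches them is not verified here.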

tome · on Sept 13, 2024

> U+093b is still in the BMP

I'm sorry, I don't know what that means. If you have a specific character you'd like me to try then please tell me what it is. My Unicode expertise is quite limited.

> I am not sold on lens-regex-pcre documentation

Nor me. It seems to leave a lot to be desired. In fact, I don't see the point of this lens approach to regex.

> "anything that works in PCRE will work" comes across as very dismissive

Noted, thanks, and apologies. That was not my intention. I was trying to make a statement of fact in response to your question.

> By the way, what text encodings for source files are supported by GHC?

UTF-8 I think. For example, pasting that character into GHC yields:

    ghci> mapM_ T.putStr ("foo bar ः baz" ^.. [regex|\p{Mc}|] . match)
    ः

> What string-like types are supported?

ByteString (raw byte arrays) and Text (Unicode, internal representation UTF-8), as you can see from:

https://hackage.haskell.org/package/lens-regex-pcre

> What version of PCRE or PCRE2 does it use?

Whatever your system version is. For me on Debian it's:

    Package: libpcre3-dev
    Source: pcre3
    Version: 2:8.39-15

iso8859-1 · on Sept 13, 2024

> version of PCRE

It uses https://hackage.haskell.org/package/pcre-light , which seems to link with the system version. So it depends on what you install. With Nix, it will be part of your system expression, of course.

Tarean · on Sept 13, 2024

Either Hacker News or autocorrect ate the p; it was supposed to be \p{L}, which is a Unicode character class.

As the other comment mentioned, PCRE-compatible regexes are a standard, though the PCRE spec isn't super readable. Some projects, like MariaDB and PHP, have more readable docs, but it doesn't really make sense to repeat the spec in library docs: https://www.php.net/manual/en/regexp.reference.unicode.php

There are libraries for PCRE2 or GNU regex syntax with the same API, if you prefer those.