Split on filename and path with formula in Calc - is it possible?

To get the full filename with path, this formula give the correct result.
Something similar to this .: (c:\Users … \TstCalc.ods)
=MID(CELL(“filename”); SEARCH("///";CELL(“filename”))+3; SEARCH("’#"; CELL(“filename”)) - SEARCH("///";CELL(“filename”))-3)
My wish is to be able to split path and file name to c:\Users … and TstCalc.ods
Is it possible?
How?

For the filename:

=REGEX(A1; "[^\\]+$")

For the directory path:

=REGEX(A1; "^([^\]+\\)+")

where your full path is assumed to be in A1 and the path separator is reverse solidus \. If your path separator is slash /, formulas are changed to:

=REGEX(A1; "[^/]+$")
=REGEX(A1; "^([^/]+/)+")

i.e. no need to double the slash in the second formula because this is not the escape character.

1 Like

gives Err:502 invalid argument, because in [^\]+$ the \] yields a literal ] closing bracket and thus the character set is not closed.
Also not that the string to be processed is the result of CELL("filename") and thus contains an URI that does not contain \ backslash but / forward slash also enclosed in '...' and followed by the sheet fragment.

includes the sheet fragment, like filename.ods'#$Sheet1

results just in 'file:/

For these CELL URIs instead use
=REGEX(A1;"^'(.+/)[^/]+'#\$.+";"$1")
for scheme and path,
=REGEX(A1;"^.+://(.+/)[^/]+'#\$.+";"$1")
for only path,
=REGEX(A1;"^'.+/([^/]+)'#\$.+";"$1")
for file name.

Indeed, my formulas are valid only for a full filename expressed as a simple string. A full URI is quite different and needs elimination of the scheme, quotes and fragment.

why not simply:

=REGEX(A1;".+/") 'for the path?

This will work for a simple string. If the “path” comes from CELL(), this will keep the initial quote (and the scheme, but this is probably not important). But if there is a slash in the the fragment (sheet name), then your regexp will “bloat” beyond the final quote of the path and include also # and a part of the sheet name up to the slash.

Calc doesn’t allow slashes in the Sheetname !
with me on:

Version: 7.6.3.2 (AARCH64) / LibreOffice Community
Build ID: 29d686fea9f6705b262d369fede658f824154cc0
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: de-DE (de_DE.UTF-8); UI: en-US
Flatpak
Calc: threaded

Yes and no… the UI does not allow it for Excel interoperability and Calc questionably “sanitizes” such names when loading a file so it should not occur, but the ODF standard allows any character.

Thank you!
I try to use / understand / test the result.
My first reflection is that the result of the REGEX is not updated immediately (always).
Sometimes I have to restart the spreadsheet to see the update.
Should it be so?

The formula in the test spreadsheet - Cell A1 .: =CELL("filename")
The result in A1 .: file:///C:/Users/…/TestCalc 2023.ods’#$Blad1 (the path has been shortened with “…”)

Test1a
I copy the formula =REGEX(A1; "[^\\]+$") to the Cell B3
The same result as input is shown .: ‘file:///C:/Users/…/TestCalc 2023.ods’#$Blad1
Is the result correct?

Test1b
I used the filenames below as test strings on https://regex101.com/
’file:///C:/temp/Test 2023.ods’#$‘52’.Q21
’file:///C:/temp/asdf/Test 2023.ods’#$‘52’.Q21
’file:///C:/Test 2023.ods’#$‘52’.Q21

When I copied the following pattern [^\\]+$ to regex101.com
I got the same result as input as Match 1.

  • . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ .

Test2 (for scheme and path)
=REGEX(A1;"^'(.+/)[^/]+'#\$.+";"$1") give the result file:///C:/Users/…/ (in Calc)

In regex101 i tried the following pattern ^'(.+\/)[^\/]+'#\$.+
Group 1 give the same result as Calc

  • . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ .

Test 3 (for only path - modified)
=REGEX(A1;"^.+:///(.+/)[^/]+'#\$.+";"$1") give the result C:/Users/.../

In regex101 i tried the following pattern ^.+:\/\/\/(.+\/)[^\/]+'#\$.+
Group 1 give the same result as Calc, eg.

  • C:/temp/
  • C:/temp/asdf/
  • C:/
  • . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ .

Test 4 (for filename)
=REGEX(A1;"^'.+/([^/]+)'#\$.+";"$1") give the result TestCalc 2023.ods

In regex101 i tried the following pattern ^'.+\/([^\/]+)'#\$.+
Group 1 give the same result as Calc

So far, it seems possible to test the RegEx pattern using regex101.com

  • . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ .

But!
Suppose I want to read the year in the file name (between a space " " and “.ods”). In this example 2023 from the filenamne TestCalc 2023.ods

In regex101.com the pattern (\d{4}).ods'#\$ seems to produce the desired result 2023.
But not in Calc.
The formula in Calc =REGEX(A1;"(\d{4}).ods'#\$")
Give the following result .: 2023.ods’#$ (Not 2023 as in regex101)

The formula in Calc =REGEX(A1;"(\d{4}).ods'#\$";"$1")
give the following result .: 'file:///C:/Users/…/TestCalc 2023.odsBlad1 (Not 2023 as in regex101)
Why it’s not the same?
is it possible to select specific group (as shown in regex101)?
I don’t understand replacement = "$1"

From the documentation

REGEX( Text ; Expression [ ; [ Replacement ] [ ; Flags|Occurrence ] ] )

In what situation does it not update?

Yes. There is no \ backslash in the file URI so the greedy [^\\]+$ expression matches the entire string.

Yes. A small caveat though, regex101 allows to choose from different flavours in the left sidebar, the default is PCRE2 (PHP >= 7.3). While most patterns work the same, some are not compatible with the ICU implementation, which bases on the Java regexp. If in doubt, select Java but bear in mind that ICU differs in some details, see Differences with Java Regular Expressions.

In fact for 'file:///C:/temp/Test 2023.ods'#$Blad1 it displays two things:

  • Match 1 22-33 2023.ods'#$
  • Group 1 22-26 2023

At least please give correct examples… for the input string of
file:///C:/temp/Test 2023.ods'#$Blad1
the result is ’file:///C:/temp/Test 2023Blad1 not ’file:///C:/temp/Test 2023.odsBlad1 (note the .ods does not occur in the actual result).

Because you replace the match (2023.ods'#$) with the group (2023).

To replace the entire input string you need to match the entire string:

=REGEX(A1;".*(\d{4}).ods'#\$.*";"$1")

yields 2023

Thank you!

Experience that it is on slightly different occasions.
Example of one occasion.:
In the Cell A1, the formula is =Cell("filename")
The result as shown .: ‘file:///C:/Users/…/TestCalc 2023.ods’#$Blad1 (the path is shorted)
The name of the tab is now Blad1
If the tab-name is changed to 52 - nothing happens in Cell A1
the spreadsheet is saved - nothing happens in Cell A1
But if the spreadsheet has been closed and opened, the Cell A1 has been changed to 'file:///C:/Users/.../TestCalc 2023.ods'#$52
The same thing happens if the worksheet is moved to another directory (using Save As)
If the formula is rewritten after e.g. spreadsheet has been moved (SaveAs), the result will be correct in that cell (but still wrong in the first cell)
This missing update might not be so problematic in my case - but…

Example of another occasion.:
This example is more difficult to reproduce.
If a formula (eg. =REGEX(A1;".*(\d{4}).ods'#\$.*";"$1") ) is entered into a cell, the result is usually displayed immediately.
If a new RegEx pattern is copied from this forum / regex101 / or …, the result of the change may not appear before the spreadsheet is saved / closed and reopened.
Even if I first copy to a text editor (which can’t handle control characters) and then copy into calk, it seems to occur.

last example
Have experienced shady happenings at locked cells. The value from another tab is not retrieved.
If the worksheet is unlocked, the value will be retrieved and if the worksheet/cell is locked again, the “Error” message will appear again.
Sometimes the problem goes away if the spreadsheet is saved, closed and opened - sometimes the problem persists. But, this case is more difficult to reproduce.
I have only experienced this problem with the latest versions of LO.
It is very frustrating when you think the formula has been correct and it still gives an error message - until a restart is performed.

  • . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ . - . _ .

=REGEX(A1;".*(\d{4}).ods'#\$.*";"$1") ==>
Maybe this structure is more general and better?
=REGEX(A1;".*(\d{4}).*'#\$.*";"$1")
But I don’t understand “replace” = “$1”
What is $1 mean?

Assume a working RegEx pattern in regex101.
Correct result is shown in e.g. Match1, Group1, Match2, Group2, Match3, Group3 etc.
Is it possible to control, if the result to be shown in LO, is in e.g. Group2 or Match3 or…?

That’s because the CELL() function does not listen to sheet name change events to update. tdf#85986. As is, you’ll have to force a recalculation, either manually (with active cell A1 press F9, or Shift+Ctrl+F9 to hard recalculate everything) or by saving/reloading the file, or by using the formula

=CELL("filename")&T(RAND())

where RAND() is a volatile function forcing recalculation on every change, and T() of a number produces an empty string so the actual result is not altered.

Why do you think so? The .ods'#\$ variant (that better should be \.ods'#\$ for a literal . dot) ensures that the match is indeed the file name extension followed by the sheet fragment, whereas .*'#\$ can match anything where a literal '#$ occurs, including a part of the file name. Yes it is unlikely that a file name contains that sequence, but not impossible.

It references the captured data of the first capturing parentheses group in the expression, here (\d{4}). Same as in Find&Replace dialog, see List of Regular Expressions (…) for a mention.

I don’t get what you’re up to, but in general yes, if you have several capturing parentheses groups in an expression then for a match you can use $1, $2, … for different captured data references.

1 Like