Remove duplicates -- "filter" is hard to understand

default_abuser · October 15, 2020, 9:03pm

I frequently have marketing lists of customers. Then a separate dataset that is always updating contains a subset of those customers who made a transaction. My routine task is to remove customers who appear on the transaction sheet from the the marketing sheet so we can keep marketing only to the accounts that have not made a transaction. Simple, right?

In excel, I merely copy the account #s of the customers on the transaction list; take it over to the marketing list and paste it at the bottom of the corresponding column of accounts#; conditional format that column of account numbers to highlight the duplicates; then sort by that formatting so the duplicates are all in one chunk at the top; then delete them; then delete the account numbers that I pasted from the transaction sheet and now I have a fresh marketing sheet that contains only accounts that have not made a transaction. Libre Office Calc can duplicate this step-for-step except sort by formatting. So if I have 200 transactions scattered across a marketing list of 10,000 lines, it’s a laborious task to delete them one by one, but I can’t seem to figure out how to get them to group up for easy deletion.

If your answer is “use Filter” – I have searched for this answer before posting this question – then please explain to me like I’m 5 how to use that because I have tried, and tried and tried – and even followed a youtube tutorial – and the “filter” concept and execution is still extremely puzzling to me.

Thanks!

edit: I agree “filter” is pretty stragithforward for other applications, but to use it to remove duplicates, it’s a really odd duck of a tool.

edit2: Conditional Formatting also identifies duplicates that the Filter function does not. Indeed, when I choose to copy the results of the filter, some of the duplicates have been removed but others are still there and, get this, still Conditionally Formatted. Which means the Conditional Formatting function identified duplicates that Filter for some reason did not. Is that worthy of a bug report?

edit3: removed the “Answered” green check due to this: Frequent super lag when sort by column of calculations

newbie-02 · October 22, 2020, 10:02pm

filter (for kids): [data - more filters - standard filter - select column - options - no duplicates - copy output to] just try and play on sandbox data, is nice, but!:
the solutions by @Frants and @LeroyG are much better, less work, easy … try that!

default_abuser · October 22, 2020, 10:50pm

My primary task usually involves removing the duplicate AND the initial instance as well – anything with a dupe is itself a dupe, in my mind; there’s not like “original” and “dupe” – they are both dupes of one another and both need to be removed. Filter only removes one duplicate, and leaves behind one instance. I’m sure there are other people who want that functionality, but it does nothing for me. I’m not sure how to use filter to do what I need.

newbie-02 · October 23, 2020, 12:14pm

@default_abuser: you may want to nail something with a screwdriver, but that’s nonsense, what you really want is to remove the records from your marketing list which have made transactions and thus also appear in the transactions list …
not! first to inject duplicates to then remove them together with the original …
thus mark em with the vlookup method proposed by Frants and @LeroyG, filter to see only rows which have the marker, autofilter is sufficient for that, delete them … straightforward and! ‘dynamic’, whenever you add customers to the transaction list you see them markered in the marketing list …

default_abuser · October 23, 2020, 4:20pm

The dogpile on VLOOKUP is getting pretty big so I guess that’s the only solution. I’ll spend a couple hours on it today and see how I feel about it – I was really holding out hope someone else would come in here with an easier solution.It’s just such a simple concept – see these accounts on this little sheet? remove them wherever you find them on this much bigger sheet. The VLOOKUP calculation is really codey for such a simple concept, seems to me. To anyone reading htis, I’d still love to find a less codey solution.

Frants · October 23, 2020, 6:10pm

The data structure on the sheet in my example is not suitable for this task (quickly done just to demonstrate VLOOKUP). The document “sample file” suggested by the participant LeroyG is better suited. Data with transactions should be placed on a separate sheet of the marketing document.

Frants · October 16, 2020, 9:07am

It might be worth trying to change the method of data processing. You have two lists, Customers and Marketing. Add a calculated field to the Customers list with a VLOOKUP function that looks up the account number in the Marketing list. For customers who are not in the Marketing list, the function will return the # N / A result. After that, you can sort the range of the Clients list by this column. Customers with a transaction will be at the top of the list. An exemplary form of the formula = VLOOKUP (A2; E: F; 1; 0)
For example Clients_and_transaction.ods
Best regards.

default_abuser · October 16, 2020, 6:25pm

I’ve never done anything like that. I looked at your sheet; I understand the concept; I do not understand one iota “=VLOOKUP(A22,E:F,1,0)”. I’ve never used a comma in a calculation before lol and I have no idea what it “means”. I trust your suggestion does do what I’m asking for – I trust that you have offered a valid solution – but I’m not sure how far I’d have to backup from your suggestion to have the foundation to even begin to implement your idea. And that’s pretty UGH when I’m trying to accomplish something I’ve done for literally years, have always thought of as simple and straight forward.

Frants · October 17, 2020, 4:37pm

I am sorry that my answer did not help you. Still, let me say something. When I look at the formula in my document, I see that =VLOOKUP(A22; E: F; 1; 0). There is no comma here. Maybe it’s because of the national settings of the application. I have them for the Russian region. Either way, the formula contains parameters and they are separated (in your case) by a comma. You can refer to the application help and read more about it (maybe the separators need to be corrected according to your region). In my example, this function does the following. Takes a value from A22 (customer account number) and searches for it in the range delimited by columns E and F (list of transactions). The search is performed in the first (parameter value is 1) column of this range (customer account with transaction). A parameter with a value of 0 means that the range is not sorted and the match must be exact. I agree with you that the use of new methods creates anxiety and involves risks.

Frants · October 17, 2020, 4:40pm

But, you can use this method for a while in parallel with your usual method.

default_abuser · October 19, 2020, 7:31pm

My boss won’t have much tolerance for mistakes I make ramping up the learning curve–it’s the sort of penalty that makes learning new things not worth it. I don’t think they realize that LO lacks features that Office 2010 had (sort by color/format) and will not be forgiving of errors I make as I ramp up the learning curve on VLOOKUP. I can understand their confusion/frustration; why would/does LO still lack features that were in Excel 10 years ago (I no longer have access to Excel) (and have been requested by LO users for 6 years?)? They may realize it’s not my fault but will not have anyone else to blame, either. I may or may not experiment with VLOOKUP it in my personal time, but in the meantime at work I’ll just be searching for and manually deleting the 200 lines that are intermixed in the other thousands or tens of thousands of lines based on conditional formatting. It won’t be efficient, but it’ll be accurate.

Frants · October 20, 2020, 7:36am

Your position is quite understandable. If the proposed solution does not inspire confidence and cannot be considered reliable, then it could help solve at least part of the problem (the rest of it manually). Can we think about this task together with you? Can you link to an example of your document with at least a couple of lines of fictional data? It would help to understand the data structure and think. There is another option, use the Base application. The SQL query tool is very effective. Best regards!

default_abuser · October 22, 2020, 8:35pm

I would love to think through this task together. I can not however figure out how to upload a file and link to it as you did. Also I’m not sure how to create a fake document that gets the point across since the real documents are troves of real sensitive customer data. Two spreadsheets, each several columns of customer data, each row a new customer. I just copy the entire column of account numbers from the sales sheet, paste all those account numbers at the bottom of the column of account numbers on the marketing sheet, then highlight the dupes. Now all the customers who have made a transaction are highlighted so I can remove them. Then on the marketing sheet I remove the account numbers I had (temporarily) pasted from the Sales sheet, and presto – clean marketing sheet containing only accounts that are not on the sales sheet. And it only takes like 2 minutes – unless you can’t sort by color/format, in which case you’re hunting and pecking forever.

Frants · October 23, 2020, 5:25am

Before that, prepare fake documents.

Make copies of your original documents and keep them in a safe place.
Create a folder for fake documents.
Save each of the documents with a new name (Save As) in this folder. The name should be very different so as not to confuse you in the future. For example, TempClients123, TempMarketing456.
Open each of the fake documents and remove all data lines from it. Correct the column names, if they contain any confidential data. And, make sure filtering is disabled. Leave only rows with column headings.
Look again at all Sheets in the fake documents. Only the column names should remain.
Close the application with documents.
Do Upload as described below.

LeroyG · October 23, 2020, 3:59pm

@Frants, Answers are not intended for comments or adding information.

Every user can upload ODF files to his/her question. See AskLibO: How many "karma" are required for …? 2020

Frants · October 23, 2020, 4:06pm

Thanks! There is a lot written there, but I could not figure it out, sorry. Just help, give step-by-step instructions on what to do to the user to upload documents. Best regards!
I think I’m starting to understand …

Frants · October 23, 2020, 4:14pm

To upload your documents:

In the form of your question, click the Edit button.
Buttons will appear above the form, click the “attachment” button. A dialog box will appear to select a file. Repeat this for each document.

Don’t forget to click Save.

LeroyG · October 23, 2020, 4:49pm

@Frants, I’m not angry, I just wanted to avoid confusion. I just changed a word in step 7 of your instructions to match the current position of the comments. Thanks for the help given. I’m learning too. Regards!

default_abuser · October 23, 2020, 6:31pm

It works. I don’t like it one bit, it took too long to figure out, and I fear it represents a lot of technical debt for me, but it works.

default_abuser · October 30, 2020, 3:45pm

update: It doesn’t work: Frequent super lag when sort by column of calculations