• Lucid Dreaming - Dream Views





    Thread: Java - Finding duplicate string values in over ten thousand element array

    1. #1
      Member Rakjavik's Avatar
      Join Date
      Nov 2007
      Gender
      Location
      USA
      Posts
      462
      Likes
      7

      Java - Finding duplicate string values in over ten thousand element array

      There's got to be a quicker way to do these comparisons.

for (int x = 0; x < subjects.size(); x++)
{
    for (int y = 0; y < subjects.size(); y++)
    {
        System.out.println(contents.size() - x + " more to go");
        try
        {
            if (subjects.get(x).equals(subjects.get(y)))
            {
                if (contents.get(x).equals(contents.get(y)))
                {
                    if (x != y)
                    {
                        System.out.println("Match found - " + subjects.get(x) + " Contents - " + contents.get(x));
                        matches++;
                    }
                }
            }
        }
        catch (NullPointerException ex)
        {
        }
    }
}

System.out.println("Total matched - " + matches);

    2. #2
      ├┼┼┼┼┤
      Join Date
      Jun 2006
      Gender
      Location
      Equestria
      Posts
      6,315
      Likes
      1191
      DJ Entries
      1
Use MySQL?

Java is generally pretty quick when doing stuff like that, since it does some pretty smart optimization, but I think using an actual database language would be much more efficient. I don't know how to make it work with Java, but there's probably a way.

      ---------
      Lost count of how many lucid dreams I've had
      ---------

    3. #3
      Member Rakjavik's Avatar
      Join Date
      Nov 2007
      Gender
      Location
      USA
      Posts
      462
      Likes
      7
      Good idea. Now I just need to find out how to build a query to do that :/

    4. #4
      Achievements:
      Veteran First Class 5000 Hall Points
      reci's Avatar
      Join Date
      Feb 2008
      LD Count
      18
      Gender
      Location
      -
      Posts
      380
      Likes
      90
      If the list of strings is sorted, it should be relatively quick to find duplicates.
      Tutorial: How to Fall Asleep Faster
You are dreaming. Do a reality check.

    5. #5
      ├┼┼┼┼┤
      Join Date
      Jun 2006
      Gender
      Location
      Equestria
      Posts
      6,315
      Likes
      1191
      DJ Entries
      1
      SQL Tutorial

I believe SELECT DISTINCT automatically filters out duplicates.

      ---------
      Lost count of how many lucid dreams I've had
      ---------

    6. #6
      Member Achievements:
      Created Dream Journal Referrer Bronze 5000 Hall Points Tagger First Class Populated Wall Veteran First Class
      Arra's Avatar
      Join Date
      Jan 2011
      Posts
      3,838
      Likes
      3887
      DJ Entries
      50
If I understand this correctly, you're going through a single list and searching for duplicates. But there are actually two arrays you're going through, because subjects[3] corresponds to contents[3], etc., and you need to make sure both the subjects and the contents of the two indices you're comparing are the same.
You want it to find a match at indices 3 and 5 only if subjects[3] = subjects[5] and contents[3] = contents[5].

      If that's right, I'd say you could double the efficiency by changing this line:

      for (int y = 0; y < subjects.size(); y++)
      to this:
      for (int y = x; y < subjects.size(); y++)

since in your version, when y runs from 0 to x-1, you re-examine pairs you have already checked.

      And as someone else said, if the arrays are sorted in some way, you could do it a lot more efficiently, but it would be a bit more complicated to code.
      Last edited by Dianeva; 09-30-2011 at 10:41 PM.
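Arra's halved double loop can be sketched in Java like this. The subjects/contents names follow the snippet in the first post (a sketch, not the OP's actual code), and starting y at x + 1 also removes the need for the x != y check:

```java
import java.util.Arrays;
import java.util.List;

public class TriangularScan {
    // Counts index pairs (x, y) with x < y where both the subject AND the
    // content match. subjects.get(i) is assumed to pair with contents.get(i).
    static int countMatches(List<String> subjects, List<String> contents) {
        int matches = 0;
        for (int x = 0; x < subjects.size(); x++) {
            for (int y = x + 1; y < subjects.size(); y++) { // x + 1: skip self and already-seen pairs
                if (subjects.get(x).equals(subjects.get(y))
                        && contents.get(x).equals(contents.get(y))) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> subjects = Arrays.asList("re: hi", "news", "re: hi");
        List<String> contents = Arrays.asList("hello", "story", "hello");
        System.out.println(countMatches(subjects, contents)); // prints 1
    }
}
```

This is still O(n²) comparisons overall; it just does half the work of the original version.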

    7. #7
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
Wow, so far the advice is terrible.

Putting it in a database? It's still going to be an n-squared routine in there, plus you have the overhead of the database.

The double-loop advice is just as bad, because it's still an n-squared routine; it only halves the constant factor.

Sort the array using a quicksort.
Iterate the array and check each item only against the item preceding it.
That is O(n log n) + O(n), which is a hell of a lot better than O(n^2).

      Sorting it, then searching it is much better than brute force searching it.
      Last edited by ninja9578; 10-01-2011 at 01:04 AM.
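A minimal Java sketch of the sort-then-scan idea on a single list of strings (keeping the subject/content pairing intact would mean sorting a combined key, which this sketch glosses over). Note that the JDK's Collections.sort is a merge-sort variant rather than literally a quicksort, but the O(n log n) bound is the same:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortScan {
    // Sort a copy, then compare each element only to its predecessor:
    // O(n log n) for the sort plus one O(n) pass.
    static int countAdjacentDuplicates(List<String> items) {
        List<String> sorted = new ArrayList<>(items);
        Collections.sort(sorted);
        int matches = 0;
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).equals(sorted.get(i - 1))) {
                matches++; // counts each repeat occurrence once
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(countAdjacentDuplicates(List.of("b", "a", "b", "c", "a"))); // prints 2
    }
}
```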

    8. #8
      ├┼┼┼┼┤
      Join Date
      Jun 2006
      Gender
      Location
      Equestria
      Posts
      6,315
      Likes
      1191
      DJ Entries
      1
That's a way better solution; do that. Not sure why I didn't think of just using some sorting technique. Makes everything simpler.

      ---------
      Lost count of how many lucid dreams I've had
      ---------

    9. #9
      Member Achievements:
      Created Dream Journal Referrer Bronze 5000 Hall Points Tagger First Class Populated Wall Veteran First Class
      Arra's Avatar
      Join Date
      Jan 2011
      Posts
      3,838
      Likes
      3887
      DJ Entries
      50
I suppose that is a better method if you have a lot of elements and will be using the program a lot. I'm disappointed in myself for not thinking of it. But if not, the extra time and effort it would take to write the quicksort and then the search might not be worth it.

    10. #10
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
Actually, I'm sure Java has a sort built in for you; outside of college there's no reason to write a quicksort yourself.
      Dianeva likes this.

    11. #11
      Member Achievements:
      1 year registered Veteran First Class 5000 Hall Points

      Join Date
      Sep 2004
      Gender
      Location
      Seattle, WA
      Posts
      2,503
      Likes
      217
      wtf - just put it into a HashSet. 10K is not that big anyway.
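Replicon's suggestion really is about three lines in practice. A sketch, assuming duplicates just need to be reported; it relies on Set.add returning false when the element is already present, so the whole scan is a single O(n) pass:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetDups {
    // Single pass over the input; every repeat occurrence is reported.
    static List<String> findDuplicates(List<String> items) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String s : items) {
            if (!seen.add(s)) { // add() is false if s was already in the set
                dups.add(s);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates(List.of("a", "b", "a", "c", "b", "a")));
        // prints [a, b, a]
    }
}
```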

    12. #12
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
Replicon, I'm assuming this is homework, and homework rarely gives you the choice of containers. I would have suggested a binary tree myself for only 10K items.

    13. #13
      Member Achievements:
      1 year registered Veteran First Class 5000 Hall Points

      Join Date
      Sep 2004
      Gender
      Location
      Seattle, WA
      Posts
      2,503
      Likes
      217
Unless they specifically said "no containers," I'd say they're fair game. If you don't have access to the basic containers, you probably don't get to quicksort either. Meh, I guess the OP will have to clarify the rules. If it doesn't specify, then I don't see why he'd want to do anything other than the obvious three-line solution with a hash set.

Also, loving the "stick it into a database and use MySQL" thing - is there a "Poe's law" for tech advice?

    14. #14
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
Oh... whoa, where did I get the idea that he was using an array? O.o Yeah, a hash is the best. In fact, the best way to do it, I think, would be to implement part of a radix sort: keep bucketing the items recursively until you have two items in a bucket, then just compare those.

Since this sounds like homework, I'm not giving you Java code, but for the challenge, I decided to do it in C++. I think this algorithm is about as good as you can get. Replicon, see any room for improvement? (In the algorithm, not the code.) Very, very few string comparisons actually have to happen due to the bucketing.

      Code:
      #include <iostream>
      #include <string>
      #include <vector>
      #include <map>
      using namespace std;
      void find_dups(const vector<string> & bucket, size_t pos){
      	vector<string>::const_iterator i = bucket.begin();
      	vector<string>::const_iterator end = bucket.end();
      	map< char, vector<string> > buckets;
      	
      	//fill in the buckets
      	for(; i != end; ++i){
      		if (i -> length() > pos){
      			char key = (*i)[pos];
      			if (buckets.find(key) == buckets.end()){
      				buckets[key] = vector<string>();
      				buckets[key].push_back(*i);
      			} else if (buckets[key].at(0) == (*i)){
      				cout << "Found duplicate: " << *i << " at search depth " << pos << endl;
      			} else {
      				buckets[key].push_back(*i);
      			}
      		}
      	}
      	
      	//go through each bucket and see if we can ignore them, or have to recurse
      	map< char, vector<string> >::iterator i2 = buckets.begin();
      	while(i2 != buckets.end()){
		//if it's two, I know they aren't equal or the above check would have caught it
      		if (i2 -> second.size() > 2){
      			//recurse on this bucket
      			find_dups(i2 -> second, pos + 1);
      		}
      		buckets.erase(i2);  //destroy the memory to make it RAM efficient too
      		i2 = buckets.begin();
      	}
      }
      
      int main(){
      	vector<string> list;
      	list.push_back("12 Bar Original");
      	list.push_back("A Beginning");
      	list.push_back("A Day In The Life");
      	list.push_back("Act Naturally");
      	list.push_back("Ain't She Sweet");
      	list.push_back("All I've Got To Do");
      	list.push_back("All My Loving");
      	list.push_back("Across The Universe");
      	list.push_back("All Things Must Pass");
      	list.push_back("And I Love Her");
      	list.push_back("And Your Bird Can Sing");
      	list.push_back("Anna (Go To Him)");
      	list.push_back("Another Girl");
      	list.push_back("A Hard Day's Night");
      	list.push_back("Across The Universe");
      	list.push_back("All Together Now");
      	list.push_back("All You Need Is Love");
      	list.push_back("A Shot Of Rhythm and Blues");
      	list.push_back("A Taste Of Honey");
      	list.push_back("Across The Universe");
      	list.push_back("All Together Now");
      	list.push_back("A Hard Day's Night");
      	list.push_back("All You Need Is Love");
      	list.push_back("Any Time At All");
      	list.push_back("Ask Me Why");
      	find_dups(list, 0);
      	return 0;
      }
      Last edited by ninja9578; 10-04-2011 at 02:28 AM.

    15. #15
      Member Rakjavik's Avatar
      Join Date
      Nov 2007
      Gender
      Location
      USA
      Posts
      462
      Likes
      7
      With how many people said this sounds like homework, it's making me embarrassed as hell. I'm actually a Junior Java Developer (very entry level). I have no education in Java and only did it as a hobby for a while and this new team gave me a shot at a junior position.

      Thanks everyone for your advice. The more I learn the better!

    16. #16
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
Oh, we're sorry - we didn't mean to embarrass you. You gotta admit, though, it does sound like something a professor would assign.

If it's for a production product, use my code, it's the fastest.

    17. #17
      Member Achievements:
      1 year registered Veteran First Class 5000 Hall Points

      Join Date
      Sep 2004
      Gender
      Location
      Seattle, WA
      Posts
      2,503
      Likes
      217
      Quote Originally Posted by ninja9578 View Post
      If it's for production product, use my code, it's the fastest.
      I beg to differ. Not so much with the "fastest" comment, but rather that it would be a good idea to use that in production.

      By the way, this is not meant to be any kind of peacocking or slap in the face, but I think there is a really great lesson here that our budding junior engineer would be good to learn early on in his career.

      I had some free time this evening, so I decided to put your "fastest" claim to the test. I took your code, as-is, and compiled and ran it. Then, I changed it extremely slightly (not the algo, mind you). My changes consist of:

      1) Instead of the bunch of list.push_backs, I changed it to read strings from stdin:

      Code:
              string input;
              while(cin >> input) {
                  list.push_back(input);
              }
      2) I shortened your output line to be:

      Code:
      cout << "Found duplicate: " << *i << endl;
      You will see why later.

      Then, I put together a simple hash table based implementation, using the unordered_set library that comes pretty standard with any C++ distribution these days, and find_dups came out looking like this (I kept the vector around for consistency, but really, I'd want to use a list here):

      Code:
      void find_dups(const vector<string> & list){
          unordered_set<string> stringSet;
          for(vector<string>::const_iterator i = list.begin(); i != list.end(); ++i) {
              if(stringSet.count(*i) > 0) {
                  cout << "Found duplicate: " << *i << endl;
              } else {
                  stringSet.insert(*i);
              }
          }
      }
Note that my output format is exactly the same as the one I changed ninja.cpp to use.

      Now, we need a decent source of input data. I thought, what better input data to use, than my all time favourite work of fiction: the king james bible! (oh snap!)

      So let's run our benchmarks.

      First, the "do nothing" which is just to time reading the biggish input file:

      Code:
      time cat kjv12.txt | ./donothing 
      Reading input...
      Done reading 821160 strings.
      
      real	0m0.807s
      user	0m0.748s
      sys	0m0.044s
      Ok great, basically takes about .8 seconds on my laptop.

      Now the two solutions:

      Code:
      time cat kjv12.txt | ./ninja | wc -l
      628031
      
      real	0m4.144s
      user	0m2.916s
      sys	0m1.172s
      Code:
      time cat kjv12.txt | ./replicon | wc -l
      787572
      
      real	0m3.251s
      user	0m1.612s
      sys	0m1.616s
      So looks like my short-and-sweet solution beat yours, time-wise. At first, I wasn't using unordered_set. I started with a regular map, which as you know, is an rbtree under the hood, and the two were performing just about the same.

      Then, just to be fair, I tried compiling with -O4, to squeeze all the sweet optimization out of it. In that scenario, the two were a bit faster, and it was quite a bit closer. In fact, ninja.cpp beat replicon.cpp by almost 0.2 seconds consistently (0m2.943s to my 0m3.120s), so it would not come as a surprise that your algorithm is, in fact, the fastest way, or at least, better-approaching the asymptotic, theoretical fastest for this problem.

      But there's one little peculiarity. Notice that the output is different. We should be finding pretty much the same duplicates, so why the difference? Let's investigate more closely, with just the first 100 lines of the holy buy-bull:

      Code:
      head kjv12.txt -n100 | ./ninja | sort | uniq | wc -l
      72
      
      head kjv12.txt -n100 | ./replicon | sort | uniq | wc -l
      81
      Odd, these REALLY should be the same! Let's investigate further:

      Code:
      head kjv12.txt -n100 | ./ninja | sort | uniq > ninja_out
      
      head kjv12.txt -n100 | ./replicon | sort | uniq > replicon_out
      
      diff ninja_out replicon_out 
      2a3
      > Found duplicate: a
      4a6,7
      > Found duplicate: And
      > Found duplicate: be
      13a17
      > Found duplicate: divide
      16a21
      > Found duplicate: earth
      27a33
      > Found duplicate: he
      41a48
      > Found duplicate: light
      52a60
      > Found duplicate: seed
      65a74
      > Found duplicate: waters
      Hmm, it looks like my code is finding duplicates that yours is not, right now.

      Sure enough, when I open the first 100 lines of the bible, I see the word "waters" appear 8 times. "divide" appears twice, "seed" 4 times, etc.

      Now to be fair, if ninja were coding this for a production system and not some random DV thread, he would have tested it better and found the bugs and fixed them, and it would have been wickedly fast. But the lesson is this:

      USE THE LIBRARIES YOU HAVE AT YOUR DISPOSAL!!! That's why they're there! If you're writing fairly general purpose software, there is absolutely no reason to roll your own here. Unless you're writing a real-time missile guidance system, low level video game blitting code, or software that must fit into a tiny amount of memory (like, 1kb and such), your best bet really is to use the libraries that are there.

      Your code will be readable (remember: others have to maintain it), and less likely to be error-prone, because the standard libraries are thoroughly tested by the community, and bugs are identified and fixed aggressively.

      So yeah, in most production systems, you probably wouldn't want to hand-code something that gives you almost no noticeable performance gain, while taking a HUGE hit in terms of maintenance and support burden.

    18. #18
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
I didn't think that the unordered_set would be as fast or memory efficient. I also haven't used Java in ages and don't really know if all the STL containers have Java equivalents :/ Looks like they were very close. Yeah, if there are bugs in the code it's because I didn't test it; I just wanted to show the hashing algorithm.

      What is -O4? I don't see any documentation on it, are you using something other than GNU?

    19. #19
      Member Achievements:
      1 year registered Veteran First Class 5000 Hall Points

      Join Date
      Sep 2004
      Gender
      Location
      Seattle, WA
      Posts
      2,503
      Likes
      217
      The O is for "optimizer" and the 4 is the level (I think 4 is the highest). I am not sure exactly what the optimizer does, and how it varies per level. I imagine for large software projects, the compile time makes a huge difference, so they iterate/develop on unoptimized, and then the release build is optimized. Like I know at places like Discreet Logic, they do their dev work on some MS compiler, because of speed, but then the final release is built with the better-optimized Intel compiler (or at least, that's how it was 8 years ago when I interviewed to intern there).

      I read a while ago in a Game Programming Gems book that even in the video game industry, the STL is quite well optimized for the vast majority of tasks, and it's only in the really high-performance, tight-loopy code that it's worth rolling your own.

    20. #20
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
      I know what -Ox is, but -O3 is the highest on GCC. And the other compilers I know use different optimization flags.

    21. #21
      Member Achievements:
      1 year registered Veteran First Class 5000 Hall Points

      Join Date
      Sep 2004
      Gender
      Location
      Seattle, WA
      Posts
      2,503
      Likes
      217
      Haha sorry question was vague haha. Yeah when I google, it seems O3 is generally the highest, though older incarnations sometimes supported up to O9.

    22. #22
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
O.o Good god. -O3 can take hours; I don't want to think about an -O9.

    23. #23
      dsr
      dsr is offline
      我是老外,可是我會說一點中文。
      Join Date
      Nov 2006
      Gender
      Location
      my mind
      Posts
      374
      Likes
      1
      If worst-case performance is relevant here, one could do a full LSD radix sort. The running time of a counting sort on a list of n ASCII characters would have the asymptotic upper bound of O(n + 256), which for large values of n is essentially O(n). So if you know your input strings only contain ASCII characters, a radix sort implemented by performing counting sort on each character in a string, from right to left (keeping in mind that unlike with integers, strings of different lengths are aligned on the left, not the right) would have a running time of O(nk) where k is the maximum length of an input string. If you know the maximum length of your input strings and if that value isn't too large, this approach is probably the fastest sort for large values of n. I don't know the nature of the OP's input, but it shouldn't be hard to determine the k value if your corpus is a known text like the King James Bible.
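The LSD radix sort dsr describes can be sketched in Java (for consistency with the thread) as one stable counting sort per character position, right to left. This is an illustrative sketch assuming single-byte characters; positions past a string's end sort as a key below any real character, so shorter strings land first among shared prefixes:

```java
import java.util.Arrays;
import java.util.List;

public class LsdRadixSort {
    // LSD radix sort for ASCII strings of varying length, O(n*k) where k is
    // the longest string: one stable counting sort per character position.
    static List<String> sort(List<String> input) {
        int k = 0;
        for (String s : input) k = Math.max(k, s.length());
        String[] a = input.toArray(new String[0]);
        String[] aux = new String[a.length];
        final int R = 257; // 256 possible byte values + 1 slot for "past end of string"
        for (int pos = k - 1; pos >= 0; pos--) {
            int[] count = new int[R + 1];
            for (String s : a) count[key(s, pos) + 1]++;      // tally each key
            for (int r = 0; r < R; r++) count[r + 1] += count[r]; // prefix sums -> start offsets
            for (String s : a) aux[count[key(s, pos)]++] = s;  // stable distribute
            System.arraycopy(aux, 0, a, 0, a.length);
        }
        return Arrays.asList(a);
    }

    // Key for position pos: 0 if the string is too short, else the character shifted up by one.
    private static int key(String s, int pos) {
        return pos < s.length() ? s.charAt(pos) + 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of("waters", "seed", "divide", "seed", "a")));
        // prints [a, divide, seed, seed, waters]
    }
}
```

With the list sorted, duplicates are found by the same adjacent-comparison pass discussed earlier in the thread.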

    24. #24
      Banned
      Join Date
      Apr 2007
      Location
      Out Chasing Rabbits
      Posts
      15,193
      Likes
      935
Holy crap, it's dsr! Been a long time.
